Running An Apache Spark Application on Amazon Elastic MapReduce

This is a series of guided screenshots on how to run an AWS EMR Spark application. Last time we wrote a Spark count application that found the list of channels with more than 24 hours of programming. This time we will run that same application on EMR instead of the local Hadoop VM.

The general steps are:

  1. Load the application jar onto Amazon S3
  2. Log into the AWS Console, create a cluster with Spark, point it at the jar file, and hit Create Cluster
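Step 1 can also be done from the AWS CLI instead of the console. A minimal sketch, assuming the bucket name `mybucket` and the jar path are placeholders you would swap for your own:

```shell
# Create the bucket (skip if it already exists) and upload the application jar.
aws s3 mb s3://mybucket
aws s3 cp target/scala-2.11/spark-count-app.jar s3://mybucket/jars/
```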

### Modifying the Application to Look at S3

The application from last time looks at a local HDFS source. This time we'll have to modify the input and output.

Recall last time it was:

val dataFile = sc.textFile("/user/hue/BT8626/*.tab")

We'll change that to:

val dataFile = sc.textFile("s3://mybucket/input/*.tab")

And change the output location accordingly:

val reduced = channels_with_x_greater_than_24
reduced.saveAsTextFile("s3://mybucket/output")

I removed the coalescing so that we can see each reducer's individual output file.

### Upload Input Files to S3

Upload tab files to S3
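If you prefer the CLI over the console uploader, the same upload can be done with `aws s3 cp`. A sketch, assuming the `.tab` files sit in the current directory and `mybucket` is the example bucket from above:

```shell
# Copy all local .tab input files into the bucket's input/ prefix.
aws s3 cp . s3://mybucket/input/ --recursive --exclude "*" --include "*.tab"
```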

### Create AWS EMR Cluster

Software Configuration (Advanced Setup Option):
AWS EMR Software Configuration
Point the Spark application to look at the custom jar file and the main class to execute:
AWS EMR Spark Application Configuration
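For the step configuration to work, the jar needs a main class for EMR to invoke. A hypothetical skeleton of what that class looks like (the object name `ChannelRuntime` and the S3 paths are placeholders; the aggregation itself is the one from last time):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ChannelRuntime {
  def main(args: Array[String]): Unit = {
    // On EMR, master and memory settings come from the cluster, not the code.
    val sc = new SparkContext(new SparkConf().setAppName("ChannelRuntime"))

    val dataFile = sc.textFile("s3://mybucket/input/*.tab")
    // ... same per-channel duration aggregation as last time,
    // producing channels_with_x_greater_than_24 ...
    // channels_with_x_greater_than_24.saveAsTextFile("s3://mybucket/output")

    sc.stop()
  }
}
```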
Hardware Configuration: We're going to use a 3-node cluster: 1 master and 2 workers.
AWS EMR Hardware configuration
AWS EMR General Settings
Finally, the security settings. You can remote into the master node to run spark-shell just like you would with the local VM. For this you need to set up a key pair for ssh in the EC2 KeyPair settings menu. Once the master node is up and running you can ssh into it with:
ssh -i /path/to/pemfile.pem hadoop@instance_public_dns_or_ip
The default user for the Amazon AMI is hadoop.
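Once on the master, you can sanity-check that Spark can see the S3 input before relying on the step. A sketch, assuming the example bucket path from above:

```shell
# On the EMR master node: open an interactive Spark shell...
spark-shell
# ...then, at the scala> prompt, count the input records:
#   sc.textFile("s3://mybucket/input/*.tab").count()
```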
AWS EMR Security
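The console steps above can also be expressed as a single CLI call. A hedged sketch with `aws emr create-cluster` (the cluster name, release label, instance type, key name, main class, and jar path are all placeholders to adjust):

```shell
aws emr create-cluster \
  --name "spark-count" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m4.large \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles \
  --steps Type=Spark,Name="SparkCount",ActionOnFailure=CONTINUE,Args=[--class,ChannelRuntime,s3://mybucket/jars/spark-count-app.jar]
```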

Once you've launched the cluster you can check out the status of the cluster by expanding the Hardware arrow. You can also click on the master node's EC2 instance details to get the public DNS from here as well.
AWS EMR Provisioning Cluster
Notice that the Spark application is the step we specified earlier. We wait for it to show Completed, then check the output in the S3 bucket:

AWS EMR Spark Application Output S3 Bucket

Make sure that you don't write to the same output location twice. If the output path already exists, the application will fail.
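To avoid that failure, delete (or rotate) the output prefix before each run. A sketch, assuming the example output path from above:

```shell
# Remove a previous run's output so the next run can write to the same path.
aws s3 rm s3://mybucket/output/ --recursive
```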

In the future I'll put together a proper pipeline / workflow so that it's automated instead of running jobs through the AWS Console. Stay tuned!