This is a series of guided screenshots on how to run an AWS EMR Spark application. Last time we wrote a Spark application that found the channels with more than 24 hours of programming. This time we will run that same application on EMR instead of the local Hadoop VM.
The general steps are:
- Upload the application jar to Amazon S3
- Log onto the AWS Console, create a cluster with Spark installed, point it at the jar file, and hit Create Cluster
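If you prefer the command line, the jar upload (step one) can also be done with the AWS CLI. This is a sketch, assuming the CLI is installed and configured with your credentials; the bucket and jar names are placeholders:

```shell
# Upload the application jar to S3 (bucket and jar paths are placeholders)
aws s3 cp target/scala-2.10/channel-count.jar s3://mybucket/jars/channel-count.jar
```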
Modifying the Application to Look at S3###
The application from last time looks at a local HDFS source. This time we'll have to modify the input and output.
Recall last time it was:
val dataFile = sc.textFile("/user/hue/BT8626/*.tab")
We'll change that to:
val dataFile = sc.textFile("s3://mybucket/input/*.tab")
And change the output location accordingly:
val reduced = channels_with_x_greater_than_24.reduceByKey(_ + _)
reduced.saveAsTextFile("s3://mybucket/output")
I removed the coalescing so that we can see the reducer's individual outputs.
Upload Input Files to S3###
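The screenshots below use the S3 console, but the upload can also be scripted. A minimal sketch, again assuming a configured AWS CLI; the bucket name and local path are placeholders:

```shell
# Create the bucket (if it doesn't already exist)
aws s3 mb s3://mybucket
# Upload only the .tab input files from the local data directory
aws s3 cp /local/data/ s3://mybucket/input/ --recursive --exclude "*" --include "*.tab"
```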
Create AWS EMR Cluster###
Software Configuration (Advanced Setup Option):
Point the Spark step at the custom jar file and specify the main class to execute:
Hardware Configuration: We're going to use a 3-node cluster, 1 master and 2 workers.
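As a rough command-line equivalent of the console setup above, the whole cluster (Spark, the jar step, and the 3-node hardware) can be described in a single AWS CLI call. This is a sketch only; the release label, instance type, key name, main class, and jar path are all assumptions you would replace with your own values:

```shell
aws emr create-cluster \
  --name "spark-channel-count" \
  --release-label emr-4.2.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=mykeypair \
  --use-default-roles \
  --steps Type=Spark,Name="ChannelCount",ActionOnFailure=CONTINUE,Args=[--class,com.example.ChannelCount,s3://mybucket/jars/channel-count.jar]
```

With `--instance-count 3`, EMR allocates 1 master and 2 core nodes, matching the hardware configuration above.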
Finally, the security settings. You can remote into the master node and run spark-shell just like you would with the local VM. For this you need to set up a key pair for SSH in the EC2 KeyPair settings menu. Once the master node is up and running you can SSH into it with:
ssh -i /path/to/pemfile.pem hadoop@instance_public_dns_or_ip
The default user for the Amazon AMI is hadoop.
Once you've launched the cluster you can check out the status of the cluster by expanding the Hardware arrow. You can also click on the master node's EC2 instance details to get the public DNS from here as well.
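Cluster and step status can also be polled from the CLI; the cluster ID below (`j-XXXXXXXXXXXXX`) is a placeholder you would copy from the console:

```shell
# Overall cluster state, plus the master node's public DNS name
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX
# Status of the Spark step we specified at creation time
aws emr list-steps --cluster-id j-XXXXXXXXXXXXX
```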
Notice that the Spark application is the step we specified earlier. We wait for it to say completed and we check the output in the S3 Bucket:
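Since the coalescing was removed, each reducer writes its own part file; listing the output prefix should show them (bucket name is a placeholder):

```shell
# Expect a _SUCCESS marker plus one part-xxxxx file per reducer
aws s3 ls s3://mybucket/output/
```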
Make sure you don't write to an output location that already exists; if it does, the application will fail.
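Before re-running the job, clear out the old output. A quick way to do that with the AWS CLI (bucket name is a placeholder):

```shell
# Delete everything under the previous output prefix
aws s3 rm s3://mybucket/output --recursive
```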
In the future I'll put together a proper pipeline / workflow so that it's automated instead of running jobs through the AWS Console. Stay tuned!