AWS EMR

EMR

AWS EMR is a managed AWS service for running Spark, HDFS, Hive, and other selected software.

Protip: Start the EMR cluster only after you have your project set up, to avoid unnecessary cost.

We will use EMR to run our Spark and HDFS cluster.

  1. Go to AWS Service -> EMR

  2. Click on Create Cluster

  3. Click on Go to advanced options

  4. Select the options shown and paste the config below into the Edit software settings section

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]

This config sets PYSPARK_PYTHON so that PySpark on the EMR cluster uses python3.
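
Before pasting, it can help to check that the JSON is well formed and really sets the interpreter. The snippet below is a minimal sketch of such a check; the helper name `pyspark_python` is made up for this example and is not part of EMR or any AWS tool.

```python
import json

# The same settings as above, kept as text exactly as it would be pasted.
CONFIG_JSON = """
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
"""


def pyspark_python(config_text):
    """Return the PYSPARK_PYTHON value from the settings JSON, or None.

    Hypothetical helper for this walkthrough. json.loads raises an error
    if the pasted text is not valid JSON.
    """
    for entry in json.loads(config_text):
        if entry.get("Classification") != "spark-env":
            continue
        for nested in entry.get("Configurations", []):
            if nested.get("Classification") == "export":
                value = nested.get("Properties", {}).get("PYSPARK_PYTHON")
                if value is not None:
                    return value
    return None


print(pyspark_python(CONFIG_JSON))  # /usr/bin/python3
```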

  5. This example uses a c4.large machine to keep costs low. You can choose any machine type you like, but keeping costs low is recommended unless a larger machine is absolutely necessary. You can also choose between On-demand and Spot instances; Spot instances are not available by default, so you might have to ask AWS for some. Set the core count to at least 2, then click Next.

  6. Type in a name for your cluster.

  7. Choose the key pair you created in the 1. AWS Account section above and press Create Cluster.
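
If you would rather script these console steps, the same choices can be written down as a request for the EMR `run_job_flow` call in boto3. This is only a sketch under several assumptions that the walkthrough does not state: the release label, the cluster and key pair names, the Spark and Hadoop application list, a single master node, reading "core count" as the number of core nodes, and the default EMR roles already existing in your account. The helper names are made up for this example.

```python
# The python3 setting from the Edit software settings step.
SPARK_ENV = [{
    "Classification": "spark-env",
    "Configurations": [{
        "Classification": "export",
        "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"},
    }],
}]


def build_cluster_request(name, key_pair, release_label, configurations,
                          instance_type="c4.large", core_count=2,
                          core_market="ON_DEMAND"):
    """Assemble keyword arguments for the EMR run_job_flow call."""
    if core_count < 2:
        raise ValueError("core_count should be at least 2")
    if core_market not in ("ON_DEMAND", "SPOT"):
        raise ValueError("core_market must be 'ON_DEMAND' or 'SPOT'")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Assumption: Spark and Hadoop (for HDFS) are the options to select.
        "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
        "Configurations": configurations,
        "Instances": {
            "InstanceGroups": [
                # Assumption: one on-demand master node.
                {"Name": "Master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": instance_type,
                 "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "Market": core_market, "InstanceType": instance_type,
                 "InstanceCount": core_count},
            ],
            "Ec2KeyName": key_pair,
            # Keep the cluster up when it has no steps to run.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def create_cluster(request):
    """Send the request with boto3 and return the new cluster's EMR ID."""
    import boto3  # third-party, imported lazily; needs AWS credentials

    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


request = build_cluster_request(
    name="toy-cluster",          # hypothetical cluster name
    key_pair="my-key-pair",      # hypothetical key pair name
    release_label="emr-5.30.0",  # assumed release, pick the one you need
    configurations=SPARK_ENV,
)
```

Building the request is kept separate from sending it, so you can inspect the dictionary first and only call `create_cluster(request)` once you are ready to start paying for the cluster.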

  8. You will now see the cluster starting. You can note your EMR ID here. Then click on the master security group.

  9. This takes you to the master security group section. Press the add inbound rule button and add an SSH rule allowing access from anywhere (DO NOT DO THIS IN REAL LIFE). We set this only because we are building a toy project, not a real-life one (companies usually have their own VPC).
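
For reference, the inbound rule from this step can also be expressed in code. The sketch below builds the rule in the shape expected by the boto3 EC2 call `authorize_security_group_ingress`; the helper names are made up for this example, and the default CIDR `0.0.0.0/0` mirrors the "from anywhere" rule above, so the same warning applies.

```python
import ipaddress


def ssh_ingress_rule(cidr="0.0.0.0/0"):
    """Build an inbound rule that allows SSH (TCP port 22) from `cidr`.

    The default matches the toy setup in this walkthrough (anywhere).
    """
    # Raises ValueError if the text is not a valid network in CIDR form.
    network = ipaddress.ip_network(cidr)
    if network.version != 4:
        raise ValueError("this sketch only handles IPv4 ranges")
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": str(network), "Description": "SSH access"}],
    }


def allow_ssh(security_group_id, cidr="0.0.0.0/0"):
    """Add the rule to the master security group with boto3."""
    import boto3  # third-party, imported lazily; needs AWS credentials

    boto3.client("ec2").authorize_security_group_ingress(
        GroupId=security_group_id,
        IpPermissions=[ssh_ingress_rule(cidr)],
    )
```

Passing a narrower range, for example a single address written as `203.0.113.7/32` (a documentation address, not a real one), produces the same rule limited to that address.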

  10. You can now ssh into your cluster.
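
As a rough guide, the ssh command is usually your key file plus the `hadoop` user at the master node's public DNS name. The sketch below just assembles that command; the key path and DNS name are placeholders for this example, so substitute the values from your own cluster page.

```python
import shlex


def ssh_command(key_path, master_dns, user="hadoop"):
    """Return the ssh command for the EMR master node as an argument list."""
    return ["ssh", "-i", key_path, f"{user}@{master_dns}"]


command = ssh_command(
    key_path="my-key-pair.pem",  # hypothetical key file
    master_dns="ec2-203-0-113-10.compute-1.amazonaws.com",  # placeholder
)
# Print the command in a form that can be pasted into a terminal.
print(shlex.join(command))
```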

  11. The cluster takes a few minutes to start. Wait until its status shows Waiting before you begin work.

  12. Once your work is complete, select the EMR cluster and press the Terminate button at the top to stop your cluster.
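
The last two steps can be scripted as well. Here is a minimal sketch, assuming `emr` is a boto3 EMR client (`boto3.client("emr")`) and `cluster_id` is the EMR ID you noted earlier; the function names, poll interval, and poll limit are arbitrary choices for this example.

```python
import time


def wait_until_waiting(emr, cluster_id, poll_seconds=30, max_polls=60,
                       sleep=time.sleep):
    """Poll the cluster until its state is WAITING.

    `emr` is expected to behave like a boto3 EMR client.
    """
    failed_states = ("TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS")
    for _ in range(max_polls):
        response = emr.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state == "WAITING":
            return state
        if state in failed_states:
            raise RuntimeError(f"cluster {cluster_id} is {state}")
        sleep(poll_seconds)
    raise TimeoutError(f"cluster {cluster_id} did not reach WAITING")


def terminate(emr, cluster_id):
    """Ask EMR to terminate the cluster so it stops incurring cost."""
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
```

Taking the client as an argument keeps the functions easy to try out with a stand-in object before pointing them at a real, billable cluster.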