Big Data Management Hadoop Installation MapReduce Examples Jake Karnes These slides are based on materials from Cloudera.com, Amazon.com, and Prof. P. Zadrozny's slides.
Prerequisites You must have an Amazon Web Services account before you begin. You can sign up for an account at http://aws.amazon.com by clicking the Sign Up button. You will have to provide a credit card number when signing up. You will likely incur some charges, but we can take steps to minimize these. Although Amazon has a Free Tier, we will require more computing power than it provides.
Prerequisites This tutorial requires accessing a remote server through SSH. UNIX-based operating systems (Mac and Linux distros) have this functionality available from their terminals. Windows does not, so you'll need to download and install an SSH client. PuTTY should work for our purposes, though I have not personally tested it. A working understanding of Java is also expected when we discuss code examples for MapReduce.
Create EC2 Instances
Your First Server First log into your Amazon Web Console. Go to EC2 (It should be in the upper left). You should see this screen.
Creating a Security Group
Click Security Groups in the left menu. Click the Create Security Group button. Provide a name and description when prompted. In the bottom panel, go to the Inbound tab.
Authorize all TCP communications.
Authorize SSH access on port 22.
Authorize ICMP (Echo Reply).
Click the button underneath the rule definition to apply the changes.
Creating a Security Group
Creating a Security Group
Creating SSH Keys
Click Key Pairs in the left menu. Click the Create Key Pair button. Provide a name for your key pair. Your private key <keypair-name>.pem will be downloaded automatically. AWS does not store the private keys; if you lose this file, you won't be able to SSH into instances you provision with this key pair. Copy the .pem file into your ~/.ssh directory.
Creating SSH Keys
Launch an EC2 Instance
Click Instances in the left menu. Click the Launch Instance button.
Choose Ubuntu 12.04 LTS, 64-bit.
Go to the General Purpose tab and select m1.large.
In Step 3, choose to create 4 instances.
In Step 4, allocate 20 GB to the root drive.
Continue past Step 5.
In Step 6 (Configure Security Group), choose the group that you created earlier. Ignore warnings about the security group.
Choose the key pair you created earlier. Launch the instances!
Launch an EC2 Instance
Launch an EC2 Instance
Launch an EC2 Instance
Launch an EC2 Instance
Launch an EC2 Instance
Launch an EC2 Instance
Connect to your server
Click Instances in the left menu. Choose one of the instances you just created and copy its public DNS.
Ex: ec2-54-193-59-36.us-west-1.compute.amazonaws.com
Connect to your server
Open a terminal on your local computer. Enter the following command to ensure your private key isn't publicly viewable:
chmod 400 ~/.ssh/<my-key-pair>.pem
Enter the following command to connect to your Amazon instance:
ssh -i ~/.ssh/<my-key-pair>.pem ubuntu@<public DNS>
EX: ssh -i ~/.ssh/hadoopkey.pem ubuntu@ec2-54-193-59-36.us-west-1.compute.amazonaws.com
Accept the fingerprint. You are now connected!
Install Cloudera & Hadoop
What's Cloudera Manager? Cloudera was the first, and is currently the leading, provider and supporter of Apache Hadoop for enterprise users. We will be using Cloudera Manager, an administrative tool for installing and maintaining Hadoop and many other tools in the Hadoop ecosystem. CDH is Cloudera's open source distribution of Apache Hadoop.
Installing Cloudera Manager
After you've connected to your instance, enter the following command to download the Cloudera installer:
wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
Execute the installer with these commands:
sudo su
chmod +x cloudera-manager-installer.bin
./cloudera-manager-installer.bin
Accept the licenses and wait for the installer to finish.
Installing Cloudera Manager
Installing Cloudera Manager
Installing Cloudera Manager
Troubleshooting If the installation pauses at any one step for more than 5 minutes, something has gone wrong. First try to cancel the installation by using CTRL+C. Exit the installer and re-execute the .bin file. If you cannot exit using CTRL+C, close the terminal window, reconnect to the server, and relaunch the installer.
Using Cloudera Manager
Afterwards, point your browser to: http://<public DNS>:7180
EX: http://ec2-54-193-92-102.us-west-1.compute.amazonaws.com:7180
Log in using User=admin and Pass=admin.
Using Cloudera Manager Select the free version and continue.
Using Cloudera Manager Launch the Classic Wizard.
Using Cloudera Manager Continue.
Using Cloudera Manager Enter the Public DNS for each of your instances. Click Search. Ensure that all instances are selected and continue.
Using Cloudera Manager Unselect Impala and Solr. We won't use them.
Using Cloudera Manager Enter ubuntu as the user. Upload the .pem file that was downloaded earlier.
Using Cloudera Manager Installing...
Using Cloudera Manager Done with this part!
Using Cloudera Manager More installing...
Using Cloudera Manager Looking good so far.
Using Cloudera Manager The system passes inspection.
Using Cloudera Manager We'll only need Core Hadoop for now.
Using Cloudera Manager Use embedded databases. Test the connection and then continue.
Using Cloudera Manager Leave the defaults here. Continue.
Using Cloudera Manager Starting services.
Using Cloudera Manager We now have Hadoop running on our cluster!
Using Hadoop and MapReduce
Getting Test Data
Download the following tar.gz file to your local machine: https://drive.google.com/file/d/0b9fmxvd4btedqwzstegyaue5ctg/edit?usp=sharing
Upload the file to your EC2 instance with the following command:
scp -i ~/.ssh/<KEY FILE NAME>.pem <LOCAL PATH TO FILE>/shakespeare.tar.gz ubuntu@<public DNS>:~
EX: scp -i ~/.ssh/hadoopkey.pem /home/jake/desktop/cs157b/shakes/output/shakespeare.tar.gz ubuntu@ec2-54-193-79-16.us-west-1.compute.amazonaws.com:~
Log into the same EC2 instance. Unzip the file with these commands:
mkdir shakes
gzip -dc shakespeare.tar.gz | tar -xf - -C ~/shakes
All of Shakespeare's Works You now have all of Shakespeare's written works. Typically Hadoop works better with larger files, but these will still work for our purposes.
Deploying Test Data into HDFS
Run the following command to make a home directory for our user on HDFS:
sudo -u hdfs hadoop fs -mkdir /user/ubuntu
The next command changes the ownership of the newly created directory to our user (ubuntu):
sudo -u hdfs hadoop fs -chown -R ubuntu /user/ubuntu
Create an input directory:
hadoop fs -mkdir /user/ubuntu/input
Load our test text files into HDFS:
hadoop fs -put ~/shakes/* /user/ubuntu/input
Our files are now replicated and distributed across our cluster!
Word Count
Let's count how many times each word is used. The data has been normalized to remove punctuation and case sensitivity.
Download the WordCount.java file to your EC2 instance:
cd ~
wget cs.cmu.edu/~abeutel/wordcount.java
mv wordcount.java WordCount.java   (Java requires the file name to match the public class name)
Let's compile the code into a jar with these commands:
mkdir wordcount_classes
javac -classpath /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar -d wordcount_classes WordCount.java
jar -cvf ~/wordcount.jar -C wordcount_classes/ .
Let's run it!
hadoop jar ~/wordcount.jar WordCount /user/ubuntu/input /user/ubuntu/output
What Did We Just Do?
We've just run our first MapReduce job, counting how many times each word appears. To check the output, run the following command:
hadoop fs -cat /user/ubuntu/output/part-00000
On the left side are the individual words; on the right is the number of times each appeared in all of Shakespeare's works!
Let's Look at Code (Finally)
You can download the WordCount.java file by going here: cs.cmu.edu/~abeutel/wordcount.java
At a high level, we'll see a class called WordCount. It contains:
Two static inner classes, Map and Reduce, each defining a single method.
One main method.
The Map Class
LongWritable key = the byte offset of the line.
Text value = a single line of text.
OutputCollector<Text, IntWritable> = the collection of KV pairs that will be sent to a Reducer once all Mappers are finished.
Text = a single word.
IntWritable = the integer one (1).
Map Method I/O
Input: <LongWritable, Text> = the byte offset of a line and the line of text itself.
Output: <Text, IntWritable> = a word and the integer 1.
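The full code is in the downloaded WordCount.java. The following is a minimal sketch of what a Map class with the signatures above typically looks like, using the classic org.apache.hadoop.mapred API; names and details are illustrative, not copied from the file. It needs imports from java.io, java.util, org.apache.hadoop.io, and org.apache.hadoop.mapred.

    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        // key = byte offset of the line (ignored); value = one line of text
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);  // emit (word, 1) for every word on the line
        }
      }
    }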
The Reduce Class
Text key = a single word.
Iterator<IntWritable> values = an iterator over all of the 1 values associated with the given key (word).
OutputCollector<Text, IntWritable> = the collection of KV pairs that will be written to the job's output.
Text = the same word.
IntWritable = the number of occurrences of that word, i.e. the sum of the ones.
Reduce Method I/O
Input: <Text, Iterator<IntWritable>> = a word and all of the 1s emitted for it.
Output: <Text, IntWritable> = the word and the sum of those 1s (its total count).
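A matching sketch of the Reduce class, again using the classic mapred API and again illustrative rather than a verbatim copy:

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();  // add up all of the 1s for this word
        }
        output.collect(key, new IntWritable(sum));  // emit (word, total count)
      }
    }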
Main Method
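The main method wires the two classes into a job. A minimal sketch, assuming the standard JobConf-based driver used with this API:

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);          // key type emitted by Map and Reduce
      conf.setOutputValueClass(IntWritable.class); // value type emitted by Map and Reduce

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);  // the classic example also sums locally on each mapper
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      // args[0] and args[1] are the HDFS input and output paths passed on the command line
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }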
Inverted Index
Let's count how many times each word is used in total and how many times it's used per file!
Download the InvertedIndex.java file to your local machine: https://drive.google.com/file/d/0b9fmxvd4btedqwszytjmmtzftxc/edit?usp=sharing
Move the file to your EC2 instance:
scp -i ~/.ssh/hadoopkey.pem <LOCAL PATH TO FILE>/InvertedIndex.java ubuntu@<public DNS>:~
Log into your EC2 instance. Let's compile the code into a jar with these commands:
mkdir invertedindex_classes
javac -classpath /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar -d invertedindex_classes InvertedIndex.java
jar -cvf ~/invertedindex.jar -C invertedindex_classes/ .
Let's run it! (MapReduce won't overwrite an existing output directory, so remove the old one first.)
hadoop fs -rm -r /user/ubuntu/output
hadoop jar ~/invertedindex.jar InvertedIndex /user/ubuntu/input /user/ubuntu/output
Results
What did we change? Only minor changes were needed to turn our WordCount program into the InvertedIndex program. You should already have the InvertedIndex.java file downloaded to your computer if you want to open it and inspect it yourself.
The New Map Class
LongWritable key = the byte offset of the line.
Text value = a single line of text.
OutputCollector<Text, Text> = the collection of KV pairs that will be sent to a Reducer once all Mappers are finished.
Text (key) = a single word.
Text (value) = the name of the file containing this line.
New Map Method I/O
Input: <LongWritable, Text> = the byte offset of a line and the line of text itself.
Output: <Text, Text> = a word and the name of the file containing it.
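A sketch of the modified Map class. The key change is asking the framework which file the current split came from and emitting that filename as the value; as before, this is illustrative, not a verbatim copy of InvertedIndex.java:

    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private Text word = new Text();
      private Text location = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Look up the name of the file this input split belongs to.
        FileSplit split = (FileSplit) reporter.getInputSplit();
        location.set(split.getPath().getName());

        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, location);  // emit (word, filename)
        }
      }
    }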
The New Reduce Class
Text key = a single word.
Iterator<Text> values = an iterator over all of the filenames emitted for the given key (word).
OutputCollector<Text, Text> = the collection of KV pairs that will be written to the job's output.
Text (key) = the same word.
Text (value) = the total number of occurrences of that word AND how many times it appears in each file.
New Reduce Method I/O
Input: <Text, Iterator<Text>> = a word and all of the filenames emitted for it.
Output: <Text, Text> = the word and a summary of its total count and per-file counts.
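A sketch of the modified Reduce class, tallying how many times each filename appears for a word; the exact output formatting here is an assumption, not necessarily what InvertedIndex.java produces:

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Count how many times each filename was emitted for this word.
        HashMap<String, Integer> perFile = new HashMap<String, Integer>();
        int total = 0;
        while (values.hasNext()) {
          String file = values.next().toString();
          Integer count = perFile.get(file);
          perFile.put(file, count == null ? 1 : count + 1);
          total++;
        }
        // Emit (word, "total {file1=n1, file2=n2, ...}").
        output.collect(key, new Text(total + " " + perFile.toString()));
      }
    }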
Main Method
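The driver barely changes from WordCount. A sketch of the lines that differ, under the same JobConf assumptions as before:

    JobConf conf = new JobConf(InvertedIndex.class);
    conf.setJobName("invertedindex");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);  // was IntWritable in WordCount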
Retrieving files
Now that we're done with MapReduce, let's get our files from HDFS to our local machines. Begin by logging into your EC2 instance. Get the files out of HDFS:
hadoop fs -get /user/ubuntu/output/part* ~
You now have 2 new files in the home directory of the EC2 instance. Verify this by running: ls
To download these to your local machine, log out of the EC2 instance by entering ~. (tilde, period). Your terminal will be returned to controlling your local machine. Run this command to download the output part files:
scp -i ~/.ssh/<keyfile>.pem ubuntu@<public DNS>:~/part* ~/Desktop/
You can now open the new files on your desktop in a text editor.
Terminate Your Instances After you're done using Hadoop, you want to terminate your EC2 instances. If you don't, you will continue to be charged per hour (even if you aren't actively using them)! When you terminate your instances, you will lose ALL data/customizations, so always download any necessary files to your local machine before terminating. From the AWS console, click Instances in the left menu. Mark the checkbox for each of your instances on the left side. Click Actions, then choose Terminate. You will then see your instances shutting down. They will disappear from the list after a few hours.