Installation Guide: Setting Up and Testing Hadoop on Mac
By Ryan Tabora, Think Big Analytics



Table of Contents

Overview
Prerequisites
Java
Passwordless SSH
Installing Hadoop
Downloading Apache Hadoop
Setting Up Environment Variables
Hadoop Configuration Files
Formatting the NameNode
Starting Pseudo-Distributed Hadoop
Verifying Your Install
Viewing the Web Interfaces
Running a MapReduce Job
Stopping Hadoop
Debugging MapReduce Jobs in Hadoop
What Is Standalone Mode?
Using Standalone (Local) Mode
Setting Up Eclipse's Remote Debugger
Importing the Project in Eclipse
Creating a Remote Debugger
Attaching the MapReduce Job to the Remote Debugger

Overview

One of the first roadblocks many developers face when trying to learn about Hadoop is simply getting a local installation of Hadoop working that they can use for testing. In this post, I will show you how to install Hadoop on a Mac so you can start playing with Hadoop today. I'll also share some tips and tricks for debugging and testing the distributed MapReduce jobs you create. Note that this setup is for developers only; it is not a setup you would deploy in a production cluster.

Prerequisites

Java

To check whether Java is installed, type java -version in the Terminal. If Java is not installed, this command will prompt you to install it. Once Java is installed, you should see a response like this:

$ java -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b M3909)
Java HotSpot(TM) 64-Bit Server VM (build b01-434, mixed mode)
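A side note for Mac users (this is my own tip, not part of the original instructions): rather than hard-coding the JDK path when we export JAVA_HOME later on, you can ask OS X where the active JDK lives:

$ /usr/libexec/java_home

The path it prints can be reused anywhere this guide sets JAVA_HOME.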

Passwordless SSH

First, make sure that SSH is enabled on your computer. You can turn SSH on by navigating to System Preferences, then Sharing, and enabling Remote Login.

Figure 1 - Enabling SSH in Mac OS X

Now that SSH is enabled, we must set up passwordless SSH. This allows the Hadoop processes to talk to each other without requiring you to type in a password. On a computer where passwordless SSH is set up, you should be able to run ssh localhost with no prompting:

$ ssh localhost
$

If it asks for a password, follow the next steps.

First, generate a passwordless public/private key pair:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Then add the public key to your authorized keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Finally, add the key to the Mac OS X keychain:

ssh-add -K ~/.ssh/id_rsa

Now verify that you can ssh into localhost without a password. If this is your first time, you might have to enter your password once. Subsequent ssh localhost commands should not require a password.
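As an optional check (my own sketch), you can tell ssh to fail instead of prompting; if passwordless SSH is working you will see the echoed message rather than a password prompt:

$ ssh -o BatchMode=yes localhost echo "passwordless ssh works"
passwordless ssh works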

Installing Hadoop

Downloading Apache Hadoop

The first thing you should know about downloading Hadoop is that there are many different versions and distributions available. The Hadoop ecosystem includes many different components, including HDFS, MapReduce, HBase, Hive, Pig, and more. For this reason, it can be very confusing to figure out exactly what you need to download in the first place. Essentially, you have two options:

1. Download the individual components directly from their respective Apache websites.
2. Download a complete distribution from a third-party website.

The advantage of downloading a distribution from a third-party vendor like MapR, Cloudera, or Hortonworks is that the distribution is supported and the components within it are guaranteed to work with each other. You will also find that some of these distributions have different feature sets (which may be the topic of a future blog post), and this can be confusing if you're not exactly sure what you want. If you know your requirements and you know you need Hadoop, it is important that you do your research and determine which distribution works best for you.

For the purpose of vendor neutrality, we will go through installing the HDFS and MapReduce components from the Apache website. You will want to find the latest stable release.

1. Navigate to the Apache Hadoop website and follow the 'Download a release now!' link.
2. Choose any mirror from the list of mirrors.
3. You should see a list of folders that contain different versions of Hadoop; choose the folder labeled stable. This folder contains the latest stable version of Hadoop that you can download.
4. Choose the download in the format hadoop-x.x.x.tar.gz.

Figure 2 - Downloading the Stable Apache Hadoop Release

As of late January 2013, the latest stable release is hadoop-1.0.4.tar.gz. I am going to install Hadoop in the directory where I keep all of my code, ~/code/apache. Note that you need write permissions in the directory where you are going to install Hadoop. So the first thing I am going to do is move the download to that folder and un-tar it:

$ mv hadoop-1.0.4.tar.gz ~/code/apache/
$ cd ~/code/apache/
$ tar -xvzf hadoop-1.0.4.tar.gz
x hadoop-1.0.4/
x hadoop-1.0.4/bin/
x hadoop-1.0.4/c++/
x hadoop-1.0.4/c++/linux-amd64-64/
...
x hadoop-1.0.4/src/contrib/ec2/bin/launch-hadoop-cluster
x hadoop-1.0.4/src/contrib/ec2/bin/launch-hadoop-master
x hadoop-1.0.4/src/contrib/ec2/bin/launch-hadoop-slaves
x hadoop-1.0.4/src/contrib/ec2/bin/list-hadoop-clusters
x hadoop-1.0.4/src/contrib/ec2/bin/terminate-hadoop-cluster

Then I create a soft link to it so I can easily reference it in the future:

$ ln -s hadoop-1.0.4 hadoop

Create the name and data directories:

$ cd ~/code/apache
$ mkdir dfs
$ mkdir ./dfs/name
$ mkdir ./dfs/data

Setting Up Environment Variables

I like to create some environment variables in my ~/.bash_profile file so that when I start a new terminal I can easily reference my Hadoop installation. Open or create your ~/.bash_profile file and add the following lines to it.

# Generic
export CODE_BASE=~/code

# Apache
export APACHE_INSTALL_BASE=$CODE_BASE/apache

# Install Base
export INSTALL_BASE=$APACHE_INSTALL_BASE

# Hadoop
export HADOOP_INSTALL=$INSTALL_BASE/hadoop
export HADOOP_PREFIX=$HADOOP_INSTALL
export PATH=$PATH:$HADOOP_INSTALL/bin

# Java
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home

This additionally gives me flexibility when I have multiple versions of Hadoop set up on one computer; I can simply change a few variables to point to a different installation of Hadoop.
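To confirm the new variables are picked up (an optional sanity check, assuming the paths above), open a new terminal or re-source your profile and ask Hadoop for its version:

$ source ~/.bash_profile
$ echo $HADOOP_INSTALL
/Users/ryantabora/code/apache/hadoop
$ hadoop version
Hadoop 1.0.4
...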

Hadoop Configuration Files

Hadoop comes with configuration files you can use to change how Hadoop runs. With these configuration files, you can run Hadoop in Standalone, Pseudo-Distributed, or Fully Distributed operation. In Standalone mode, Hadoop runs as a single process and uses the local file system to store files. In Pseudo-Distributed mode, Hadoop runs as several processes and uses HDFS to store files on a single node. In Fully Distributed mode, Hadoop runs as several processes and uses HDFS to store files across many nodes. We will set up Hadoop using Pseudo-Distributed operation.

The files I am describing are located in the hadoop/conf directory. We will work on four of these files: core-site.xml, hadoop-env.sh, hdfs-site.xml, and mapred-site.xml.

$ cd hadoop/conf

Figure 3 - Hadoop Configuration Files

We are going to create a set of local and pseudo-distributed configuration files so we can switch between local and pseudo-distributed mode with a simple script.

First let's rename the default configuration files.

$ mv hdfs-site.xml hdfs-site-local.xml
$ mv core-site.xml core-site-local.xml
$ mv mapred-site.xml mapred-site-local.xml

Now let's edit these files. Edit core-site-local.xml so that it contains the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>

Verify that hdfs-site-local.xml contains the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>

Edit mapred-site-local.xml so that it contains the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>

Create a file called hdfs-site-pseudo.xml and write the following as its contents (make sure to replace /Users/ryantabora/code/apache with wherever you created the dfs/name and dfs/data directories):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/Users/ryantabora/code/apache/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/Users/ryantabora/code/apache/dfs/data</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

Create a file called mapred-site-pseudo.xml and write the following as its contents:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1536m</value>
  </property>
</configuration>

Create a file called core-site-pseudo.xml and write the following as its contents:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
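Because a stray character in any of these files will keep Hadoop from starting, it is worth checking that each one is well-formed XML. This is simply a convenience check of my own; xmllint ships with Mac OS X and prints nothing when the files are valid:

$ xmllint --noout core-site-pseudo.xml hdfs-site-pseudo.xml mapred-site-pseudo.xml
$ echo $?
0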

Now rename the hadoop-env.sh script to hadoop-env-pseudo.sh:

$ mv hadoop-env.sh hadoop-env-pseudo.sh

Add the following lines to the hadoop-env-pseudo.sh script (you can put them anywhere, but I put them as the first two lines in the script).

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

Create a hadoop-env-local.sh script that contains only the following:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
export HADOOP_OPTS="$HADOOP_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
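One thing to be aware of: the -agentlib:jdwp option above uses suspend=y, so in local mode every hadoop command will pause at startup until a debugger attaches on port 8000. If you would rather opt in to that behavior only when you actually want to debug, one variation (my own, not part of the original setup) is to guard the line with an environment variable in hadoop-env-local.sh:

# Only enable the debug agent when HADOOP_DEBUG=1 is set.
if [ "$HADOOP_DEBUG" = "1" ]; then
  export HADOOP_OPTS="$HADOOP_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
fi

With that in place, HADOOP_DEBUG=1 hadoop jar ... waits for the debugger, while plain hadoop commands run normally.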

Next we will create a script that lets us switch between local and pseudo-distributed mode. Create a file called configure.sh and write the following as its contents:

#!/bin/bash
ARG=$1
if [ "$ARG" == 'local' ]; then
  rm -f mapred-site.xml
  rm -f core-site.xml
  rm -f hdfs-site.xml
  rm -f hadoop-env.sh
  ln -s mapred-site-local.xml mapred-site.xml
  ln -s core-site-local.xml core-site.xml
  ln -s hdfs-site-local.xml hdfs-site.xml
  ln -s hadoop-env-local.sh hadoop-env.sh
  chmod 755 hadoop-env.sh
  exit 0;
elif [ "$ARG" == 'pseudo' ]; then
  rm -f ./mapred-site.xml
  rm -f ./core-site.xml
  rm -f ./hdfs-site.xml
  rm -f hadoop-env.sh
  ln -s mapred-site-pseudo.xml mapred-site.xml
  ln -s core-site-pseudo.xml core-site.xml
  ln -s hdfs-site-pseudo.xml hdfs-site.xml
  ln -s hadoop-env-pseudo.sh hadoop-env.sh
  chmod 755 hadoop-env.sh
  exit 0;
else
  echo "Enter either pseudo or local"
  exit 1;
fi

Now make sure the script is executable:

$ chmod 755 configure.sh

To set the configuration files to local mode, run:

$ ./configure.sh local

To set the configuration files to pseudo-distributed mode, run:

$ ./configure.sh pseudo

To start, let's set them to pseudo-distributed mode.

$ ./configure.sh pseudo
Ryan-Taboras-MacBook-Air:conf ryantabora$ ls -l
total 232
-rw-r--r--@ 1 ryantabora staff  7457 Oct  2 22:17 capacity-scheduler.xml
-rw-r--r--@ 1 ryantabora staff   535 Oct  2 22:17 configuration.xsl
-rwxr-xr-x  1 ryantabora staff   726 Jan 22 09:49 configure.sh
-rw-r--r--@ 1 ryantabora staff   259 Jan 22 08:29 core-site-local.xml
-rw-r--r--  1 ryantabora staff   273 Jan 21 11:16 core-site-pseudo.xml
lrwxr-xr-x  1 ryantabora staff    20 Jan 22 09:56 core-site.xml -> core-site-pseudo.xml
-rw-r--r--@ 1 ryantabora staff   327 Oct  2 22:17 fair-scheduler.xml
-rw-r--r--  1 ryantabora staff  2537 Jan 22 09:43 hadoop-env-backup.sh
-rwxr-xr-x  1 ryantabora staff   735 Jan 22 09:42 hadoop-env-local.sh
-rwxr-xr-x  1 ryantabora staff   634 Jan 22 09:43 hadoop-env-pseudo.sh
lrwxr-xr-x  1 ryantabora staff    20 Jan 22 09:56 hadoop-env.sh -> hadoop-env-pseudo.sh
-rw-r--r--@ 1 ryantabora staff  1488 Oct  2 22:17 hadoop-metrics2.properties
-rw-r--r--@ 1 ryantabora staff  4644 Oct  2 22:17 hadoop-policy.xml
-rw-r--r--@ 1 ryantabora staff   178 Oct  2 22:17 hdfs-site-local.xml
-rw-r--r--  1 ryantabora staff   549 Jan 21 11:16 hdfs-site-pseudo.xml
lrwxr-xr-x  1 ryantabora staff    20 Jan 22 09:56 hdfs-site.xml -> hdfs-site-pseudo.xml
-rw-r--r--@ 1 ryantabora staff  4441 Oct  2 22:17 log4j.properties
-rw-r--r--@ 1 ryantabora staff  2033 Oct  2 22:17 mapred-queue-acls.xml

-rw-r--r--@ 1 ryantabora staff   259 Jan 22 08:59 mapred-site-local.xml
-rw-r--r--  1 ryantabora staff   357 Jan 21 11:16 mapred-site-pseudo.xml
lrwxr-xr-x  1 ryantabora staff    22 Jan 22 09:56 mapred-site.xml -> mapred-site-pseudo.xml
-rw-r--r--@ 1 ryantabora staff    10 Oct  2 22:17 masters
-rw-r--r--@ 1 ryantabora staff    10 Oct  2 22:17 slaves
-rw-r--r--@ 1 ryantabora staff  1243 Oct  2 22:17 ssl-client.xml.example
-rw-r--r--@ 1 ryantabora staff  1195 Oct  2 22:17 ssl-server.xml.example
-rw-r--r--@ 1 ryantabora staff   382 Oct  2 22:17 taskcontroller.cfg

Formatting the NameNode

Now open a new terminal window (so that the variables set in ~/.bash_profile are picked up) and run the following command to format the NameNode. You should only need to run this command once. However, if you ever need to do this again, be aware that you will lose all of the data you currently have in HDFS.

$ hadoop namenode -format
13/01/21 10:25:46 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host =
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version =
STARTUP_MSG:   build =  -r ; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
Re-format filesystem in /Users/ryantabora/code/apache/dfs/name ? (Y or N) Y
13/01/21 10:25:49 INFO util.GSet: VM type = 64-bit
13/01/21 10:25:49 INFO util.GSet: 2% max memory = MB
13/01/21 10:25:49 INFO util.GSet: capacity = 2^21 = entries
13/01/21 10:25:49 INFO util.GSet: recommended= , actual=
13/01/21 10:25:49 INFO namenode.FSNamesystem: fsOwner=
13/01/21 10:25:49 INFO namenode.FSNamesystem: supergroup=supergroup
13/01/21 10:25:49 INFO namenode.FSNamesystem: isPermissionEnabled=false
13/01/21 10:25:49 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/01/21 10:25:49 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/01/21 10:25:49 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/01/21 10:25:50 INFO common.Storage: Image file of size 116 saved in 0 seconds.
13/01/21 10:25:50 INFO common.Storage: Storage directory /Users/ryantabora/code/apache/dfs/name has been successfully formatted.
13/01/21 10:25:50 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at
************************************************************/
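If you want to confirm that the format actually wrote something (an optional check of my own; the exact file names can vary by Hadoop version), look inside the name directory you configured earlier. You should see something like a current directory containing VERSION, fsimage, edits, and fstime files:

$ ls ~/code/apache/dfs/name/current/
VERSION  edits  fsimage  fstime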

Starting Pseudo-Distributed Hadoop

You can use the start-all.sh script to start all of the HDFS and MapReduce processes at once.

$ ~/code/apache/hadoop/bin/start-all.sh
starting namenode, logging to /Users/ryantabora/code/apache/hadoop-1.0.4/libexec/../logs/
localhost: starting datanode, logging to /Users/ryantabora/code/apache/hadoop-1.0.4/libexec/../logs/
localhost: starting secondarynamenode, logging to /Users/ryantabora/code/apache/hadoop-1.0.4/libexec/../logs/
starting jobtracker, logging to /Users/ryantabora/code/apache/hadoop-1.0.4/libexec/../logs/
localhost: starting tasktracker, logging to /Users/ryantabora/code/apache/hadoop-1.0.4/libexec/../logs/

Verifying Your Install

First run jps to verify that all five services are running.

$ jps
NameNode
SecondaryNameNode
2592 Jps
TaskTracker
DataNode
JobTracker

List the root directory in HDFS.

$ hadoop fs -ls /
Found 1 items
drwxr-xr-x   - ryantabora supergroup          :11 /tmp

Make a directory called test in HDFS.

$ hadoop fs -mkdir /test

Make sure the test directory exists.

$ hadoop fs -ls /
Found 2 items
drwxr-xr-x   - ryantabora supergroup          :18 /test
drwxr-xr-x   - ryantabora supergroup          :11 /tmp
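As an extra end-to-end check (my own sketch; the file name and contents are arbitrary), copy a small local file into HDFS and read it back:

$ echo "hello hadoop" > /tmp/hello.txt
$ hadoop fs -put /tmp/hello.txt /test/
$ hadoop fs -cat /test/hello.txt
hello hadoop
$ hadoop fs -rm /test/hello.txt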

Viewing the Web Interfaces

NameNode

You can learn a little bit about the state of HDFS by visiting the NameNode UI, which by default runs at http://localhost:50070/.

Figure 4 - NameNode UI

JobTracker

You can monitor the MapReduce jobs you are running by visiting the JobTracker UI, which by default runs at http://localhost:50030/.

Figure 5 - JobTracker UI

TaskTracker

Finally, you can monitor the TaskTracker by visiting the TaskTracker UI, which by default runs at http://localhost:50060/.

Figure 6 - TaskTracker UI
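If you prefer the terminal, a quick loop over the three default web UI ports (a convenience sketch of my own; adjust the ports if you changed them) confirms that each interface is answering. Expect an HTTP 200 from each:

$ for port in 50070 50030 50060; do echo -n "localhost:$port -> "; curl -s -L -o /dev/null -w "%{http_code}\n" "http://localhost:$port/"; done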

Running a MapReduce Job

Now let's run an example MapReduce job to verify that everything works. The following job is included in the Hadoop download; it uses MapReduce to estimate the value of pi. Check out the JobTracker UI while it is running, or watch the Terminal to view the job's progress.

$ hadoop jar ~/code/apache/hadoop/hadoop-examples-1.0.4.jar pi 10 100
Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
13/01/21 09:19:01 INFO mapred.FileInputFormat: Total input paths to process : 10
13/01/21 09:19:02 INFO mapred.JobClient: Running job: job_ _
13/01/21 09:19:03 INFO mapred.JobClient:  map 0% reduce 0%
13/01/21 09:19:16 INFO mapred.JobClient:  map 20% reduce 0%

13/01/21 09:19:25 INFO mapred.JobClient:  map 40% reduce 0%
13/01/21 09:19:28 INFO mapred.JobClient:  map 40% reduce 3%
13/01/21 09:19:31 INFO mapred.JobClient:  map 60% reduce 6%
13/01/21 09:19:34 INFO mapred.JobClient:  map 60% reduce 13%
13/01/21 09:19:37 INFO mapred.JobClient:  map 80% reduce 13%
13/01/21 09:19:43 INFO mapred.JobClient:  map 100% reduce 20%
13/01/21 09:19:55 INFO mapred.JobClient:  map 100% reduce 100%
13/01/21 09:20:00 INFO mapred.JobClient: Job complete: job_ _
13/01/21 09:20:00 INFO mapred.JobClient: Counters: 27
13/01/21 09:20:00 INFO mapred.JobClient:   Job Counters
13/01/21 09:20:00 INFO mapred.JobClient:     Launched reduce tasks=1
13/01/21 09:20:00 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=
13/01/21 09:20:00 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/21 09:20:00 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/01/21 09:20:00 INFO mapred.JobClient:     Launched map tasks=10
13/01/21 09:20:00 INFO mapred.JobClient:     Data-local map tasks=10
13/01/21 09:20:00 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=
13/01/21 09:20:00 INFO mapred.JobClient:   File Input Format Counters
13/01/21 09:20:00 INFO mapred.JobClient:     Bytes Read=
13/01/21 09:20:00 INFO mapred.JobClient:   File Output Format Counters
13/01/21 09:20:00 INFO mapred.JobClient:     Bytes Written=97
13/01/21 09:20:00 INFO mapred.JobClient:   FileSystemCounters
13/01/21 09:20:00 INFO mapred.JobClient:     FILE_BYTES_READ=226
13/01/21 09:20:00 INFO mapred.JobClient:     HDFS_BYTES_READ=
13/01/21 09:20:00 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=
13/01/21 09:20:00 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
13/01/21 09:20:00 INFO mapred.JobClient:   Map-Reduce Framework
13/01/21 09:20:00 INFO mapred.JobClient:     Map output materialized bytes=280
13/01/21 09:20:00 INFO mapred.JobClient:     Map input records=10
13/01/21 09:20:00 INFO mapred.JobClient:     Reduce shuffle bytes=252
13/01/21 09:20:00 INFO mapred.JobClient:     Spilled Records=40
13/01/21 09:20:00 INFO mapred.JobClient:     Map output bytes=180
13/01/21 09:20:00 INFO mapred.JobClient:     Total committed heap usage (bytes)=
13/01/21 09:20:00 INFO mapred.JobClient:     Map input bytes=240
13/01/21 09:20:00 INFO mapred.JobClient:     Combine input records=0
13/01/21 09:20:00 INFO mapred.JobClient:     SPLIT_RAW_BYTES=
13/01/21 09:20:00 INFO mapred.JobClient:     Reduce input records=20
13/01/21 09:20:00 INFO mapred.JobClient:     Reduce input groups=20
13/01/21 09:20:00 INFO mapred.JobClient:     Combine output records=0
13/01/21 09:20:00 INFO mapred.JobClient:     Reduce output records=0
13/01/21 09:20:00 INFO mapred.JobClient:     Map output records=20
Job Finished in  seconds
Estimated value of Pi is

As you can see, the job completed successfully. Congratulations! At this point you have a fully functional Pseudo-Distributed Hadoop installation running on your Mac. Now let's figure out how to debug some MapReduce jobs.

Stopping Hadoop

You can stop Hadoop by running the stop-all.sh script provided in the bin directory of the Hadoop install.

$ cd $HADOOP_PREFIX/bin
$ ./stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
$ jps
9510 Jps

Debugging MapReduce Jobs in Hadoop

Debugging distributed jobs can be a very difficult process. In a Pseudo-Distributed or Fully Distributed Hadoop cluster, each Map and Reduce task runs in its own JVM (unless you configure JVM reuse, in which case you still have more than one JVM). This makes debugging difficult because debuggers generally only attach to a single JVM.

What Is Standalone Mode?

Unlike Pseudo-Distributed and Fully Distributed modes, which run Hadoop as a set of processes in many JVMs, Standalone mode runs everything in a single JVM as you invoke Hadoop commands. This means you will not start Standalone mode the same way, and you will not see the DataNode, NameNode, or other processes when you run jps. The fact that everything runs in a single JVM is great for debugging, because we can attach a remote debugger in an IDE like Eclipse and step through our MapReduce code line by line.

Using Standalone (Local) Mode

To avoid the problem of multiple JVMs, we can run Hadoop in Standalone mode. We set up the configuration files earlier so that we could easily switch between Pseudo-Distributed and Standalone mode. First make sure that Pseudo-Distributed Hadoop is shut down.

$ jps
9510 Jps

Now switch the configuration to local mode.

$ cd $HADOOP_PREFIX/conf
$ ./configure.sh local
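A quick way to see the difference (a check of my own): in local mode, hadoop fs commands operate against your Mac's filesystem instead of HDFS. Note that with the hadoop-env-local.sh shown earlier, every command first prints 'Listening for transport dt_socket at address: 8000' and waits for a debugger to attach, so temporarily comment out the -agentlib line (or use the optional HADOOP_DEBUG guard described above) if you just want to poke around:

$ hadoop fs -ls /

You should see your ordinary local directories (/Applications, /Users, and so on) rather than the /test and /tmp HDFS paths from earlier.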

You can confirm that the symlinks now point at the local configuration files:

$ ls -l
total 232
-rw-r--r--@ 1 ryantabora staff  7457 Oct  2 22:17 capacity-scheduler.xml
-rw-r--r--@ 1 ryantabora staff   535 Oct  2 22:17 configuration.xsl
-rwxr-xr-x  1 ryantabora staff   726 Jan 22 09:49 configure.sh
-rw-r--r--@ 1 ryantabora staff   259 Jan 22 08:29 core-site-local.xml
-rw-r--r--  1 ryantabora staff   273 Jan 21 11:16 core-site-pseudo.xml
lrwxr-xr-x  1 ryantabora staff    19 Jan 22 10:17 core-site.xml -> core-site-local.xml
-rw-r--r--@ 1 ryantabora staff   327 Oct  2 22:17 fair-scheduler.xml
-rw-r--r--  1 ryantabora staff  2537 Jan 22 09:43 hadoop-env-backup.sh
-rwxr-xr-x  1 ryantabora staff   735 Jan 22 09:42 hadoop-env-local.sh
-rwxr-xr-x  1 ryantabora staff   634 Jan 22 09:43 hadoop-env-pseudo.sh
lrwxr-xr-x  1 ryantabora staff    19 Jan 22 10:17 hadoop-env.sh -> hadoop-env-local.sh
-rw-r--r--@ 1 ryantabora staff  1488 Oct  2 22:17 hadoop-metrics2.properties
-rw-r--r--@ 1 ryantabora staff  4644 Oct  2 22:17 hadoop-policy.xml
-rw-r--r--@ 1 ryantabora staff   178 Oct  2 22:17 hdfs-site-local.xml
-rw-r--r--  1 ryantabora staff   549 Jan 21 11:16 hdfs-site-pseudo.xml
lrwxr-xr-x  1 ryantabora staff    19 Jan 22 10:17 hdfs-site.xml -> hdfs-site-local.xml
-rw-r--r--@ 1 ryantabora staff  4441 Oct  2 22:17 log4j.properties
-rw-r--r--@ 1 ryantabora staff  2033 Oct  2 22:17 mapred-queue-acls.xml
-rw-r--r--@ 1 ryantabora staff   259 Jan 22 08:59 mapred-site-local.xml
-rw-r--r--  1 ryantabora staff   357 Jan 21 11:16 mapred-site-pseudo.xml
lrwxr-xr-x  1 ryantabora staff    21 Jan 22 10:17 mapred-site.xml -> mapred-site-local.xml
-rw-r--r--@ 1 ryantabora staff    10 Oct  2 22:17 masters
-rw-r--r--@ 1 ryantabora staff    10 Oct  2 22:17 slaves
-rw-r--r--@ 1 ryantabora staff  1243 Oct  2 22:17 ssl-client.xml.example
-rw-r--r--@ 1 ryantabora staff  1195 Oct  2 22:17 ssl-server.xml.example
-rw-r--r--@ 1 ryantabora staff   382 Oct  2 22:17 taskcontroller.cfg

Setting Up Eclipse's Remote Debugger

I will assume you've already got Eclipse downloaded; I'm running the J2EE Juno version. In this blog post I've included an example WordCount project you can use to test out the remote debugging feature of Eclipse.

Importing the Project in Eclipse

First let's import the project into Eclipse. Select File->Import.

Figure 7 - Importing a Project

Select General->Existing Projects into Workspace and click Next.

Figure 8 - Selecting an Existing Project

Then browse to the downloaded exercise folder, select MapReduce-WordCount as the project you want to import, and click Finish.

Figure 9 - Importing the MapReduce-WordCount Project

Now you should be able to see all of the project's files on the left-hand side of the Eclipse application.

Figure 10 - Project Contents

Creating a Remote Debugger

Let's find the Mapper and place a breakpoint there. Find the SimpleWordCountMapper.java class and place a breakpoint on one of its lines. I put a breakpoint next to:

String[] tokens = line.toString().split("\\s+");

Figure 11 - Placing a Breakpoint

Now let's set up the remote debugger. Go to the Run menu and select Debug Configurations.

Figure 12 - Debug Configurations Menu in Eclipse

Then create a new Remote Java Application debugger by right-clicking Remote Java Application and selecting New.

Figure 13 - Creating a Remote Debugger

Make sure the remote debugger has connection type Standard (Socket Attach) and points to host localhost and port 8000.

Figure 14 - Setting up the Remote Debugger

Attaching the MapReduce Job to the Remote Debugger

First we must build the jar for the project. Navigate to the project directory and execute ant. You should see a successful build, and it will create an executable jar called wordcount.jar in the ./bin/jar directory.

$ cd ~/Desktop/exercise/MapReduce-WordCount/
$ ant
Buildfile: /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/build.xml

junit.clean:

clean:
   [delete] Deleting directory /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin

build-dirs:
    [mkdir] Created dir: /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin
    [mkdir] Created dir: /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin/classes

compile:
    [javac] Compiling 3 source files to /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin/classes
    [javac] Compiling 1 source file to /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin/classes

jar:
    [mkdir] Created dir: /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin/jar
      [jar] Building jar: /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin/jar/wordcount.jar

junit:
    [junit] Running com.thinkbiganalytics.ex.wordcount.SimpleWordCountTest
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: sec

junit.clean:

test:

main:

BUILD SUCCESSFUL
Total time: 2 seconds

$ ls ./bin/jar/
wordcount.jar
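If you want to double-check what went into the jar before running it (a convenience check of my own), list its contents and look for the Mapper:

$ jar tf ./bin/jar/wordcount.jar | grep -i mapper

You should see the SimpleWordCountMapper class listed.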

Now let's run the MapReduce job. We do so with the following format:

hadoop jar <path_to_jar> <input> <output>

You will see that the job waits for the debugger on port 8000. Make sure that the output folder you specify does not exist yet.

$ hadoop jar ~/Desktop/exercise/MapReduce-WordCount/bin/jar/wordcount.jar beowulf.txt ~/Desktop/output
Listening for transport dt_socket at address: 8000

Now go back to Eclipse -> Run -> Debug Configurations -> Remote Java Application -> SimpleWordCount and choose Debug. The MapReduce job will start to run. Once it hits the breakpoint in the Mapper, Eclipse will open the Debug perspective and you can start debugging.

Figure 15 - Debugging MapReduce

Congratulations! You have now debugged a MapReduce job! As you step through, you should see the console display the Map and Reduce task status. Eventually the job will finish and you can view your results.

13/01/22 11:23:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/01/22 11:23:07 WARN snappy.LoadSnappy: Snappy native library not loaded
13/01/22 11:23:07 INFO mapred.FileInputFormat: Total input paths to process : 1
13/01/22 11:23:07 INFO mapred.JobClient: Running job: job_local_
13/01/22 11:23:07 INFO mapred.Task: Using ResourceCalculatorPlugin : null
13/01/22 11:23:07 INFO mapred.MapTask: numReduceTasks: 1
13/01/22 11:23:07 INFO mapred.MapTask: io.sort.mb =
13/01/22 11:23:08 INFO mapred.MapTask: data buffer = /
13/01/22 11:23:08 INFO mapred.MapTask: record buffer = /
13/01/22 11:23:08 INFO mapred.MapTask: Starting flush of map output
13/01/22 11:23:08 INFO mapred.JobClient:  map 0% reduce 0%
13/01/22 11:23:08 INFO mapred.MapTask: Finished spill 0
13/01/22 11:23:08 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

13/01/22 11:23:10 INFO mapred.LocalJobRunner: file:/Users/ryantabora/Desktop/exercise/MapReduce-WordCount/beowulf.txt:
13/01/22 11:23:10 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/01/22 11:23:10 INFO mapred.Task: Using ResourceCalculatorPlugin : null
13/01/22 11:23:10 INFO mapred.LocalJobRunner:
13/01/22 11:23:10 INFO mapred.Merger: Merging 1 sorted segments
13/01/22 11:23:10 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: bytes
13/01/22 11:23:10 INFO mapred.LocalJobRunner:
13/01/22 11:23:11 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/01/22 11:23:11 INFO mapred.LocalJobRunner:
13/01/22 11:23:11 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/01/22 11:23:11 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/Users/ryantabora/Desktop/output
13/01/22 11:23:11 INFO mapred.JobClient:  map 100% reduce 0%
13/01/22 11:23:13 INFO mapred.LocalJobRunner: reduce > reduce
13/01/22 11:23:13 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/01/22 11:23:14 INFO mapred.JobClient:  map 100% reduce 100%
13/01/22 11:23:14 INFO mapred.JobClient: Job complete: job_local_
13/01/22 11:23:14 INFO mapred.JobClient: Counters: 18
13/01/22 11:23:14 INFO mapred.JobClient:   File Input Format Counters
13/01/22 11:23:14 INFO mapred.JobClient:     Bytes Read=
13/01/22 11:23:14 INFO mapred.JobClient:   File Output Format Counters
13/01/22 11:23:14 INFO mapred.JobClient:     Bytes Written=
13/01/22 11:23:14 INFO mapred.JobClient:   FileSystemCounters
13/01/22 11:23:14 INFO mapred.JobClient:     FILE_BYTES_READ=
13/01/22 11:23:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=
13/01/22 11:23:14 INFO mapred.JobClient:   Map-Reduce Framework
13/01/22 11:23:14 INFO mapred.JobClient:     Map output materialized bytes=
13/01/22 11:23:14 INFO mapred.JobClient:     Map input records=
13/01/22 11:23:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/01/22 11:23:14 INFO mapred.JobClient:     Spilled Records=
13/01/22 11:23:14 INFO mapred.JobClient:     Map output bytes=
13/01/22 11:23:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=
13/01/22 11:23:14 INFO mapred.JobClient:     Map input bytes=
13/01/22 11:23:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=124
13/01/22 11:23:14 INFO mapred.JobClient:     Combine input records=0
13/01/22 11:23:14 INFO mapred.JobClient:     Reduce input records=
13/01/22 11:23:14 INFO mapred.JobClient:     Reduce input groups=
13/01/22 11:23:14 INFO mapred.JobClient:     Combine output records=0
13/01/22 11:23:14 INFO mapred.JobClient:     Reduce output records=
13/01/22 11:23:14 INFO mapred.JobClient:     Map output records=42089

$ head ~/Desktop/output/part-00000
"'tis 2
"_excellently 1
"_i 1

27 "_nobly 1 "art 1 "ask 1 "beowulf 2 "brilliant" 1 "defects," 1 "folk 1


More information

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters Table of Contents Introduction... Hardware requirements... Recommended Hadoop cluster

More information

Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide. Rev: A03

Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide. Rev: A03 Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide Rev: A03 Use of Open Source This product may be distributed with open source code, licensed to you in accordance with the applicable open source

More information

Getting Started with

Getting Started with Chapter 1 Getting Started with Hadoop Core Applications frequently require more resources than are available on an inexpensive machine. Many organizations nd themselves with business processes that no

More information

Spectrum Scale HDFS Transparency Guide

Spectrum Scale HDFS Transparency Guide Spectrum Scale Guide Spectrum Scale BDA 2016-1-5 Contents 1. Overview... 3 2. Supported Spectrum Scale storage mode... 4 2.1. Local Storage mode... 4 2.2. Shared Storage Mode... 4 3. Hadoop cluster planning...

More information

Hadoop Basics with InfoSphere BigInsights

Hadoop Basics with InfoSphere BigInsights An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Unit 4: Hadoop Administration An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Integration Of Virtualization With Hadoop Tools

Integration Of Virtualization With Hadoop Tools Integration Of Virtualization With Hadoop Tools Aparna Raj K aparnaraj.k@iiitb.org Kamaldeep Kaur Kamaldeep.Kaur@iiitb.org Uddipan Dutta Uddipan.Dutta@iiitb.org V Venkat Sandeep Sandeep.VV@iiitb.org Technical

More information

H2O on Hadoop. September 30, 2014. www.0xdata.com

H2O on Hadoop. September 30, 2014. www.0xdata.com H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms

More information

map/reduce connected components

map/reduce connected components 1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains

More information

Using BAC Hadoop Cluster

Using BAC Hadoop Cluster Using BAC Hadoop Cluster Bodhisatta Barman Roy January 16, 2015 1 Contents 1 Introduction 3 2 Daemon locations 4 3 Pre-requisites 5 4 Setting up 6 4.1 Using a Linux Virtual Machine................... 6

More information

Hadoop Basics with InfoSphere BigInsights

Hadoop Basics with InfoSphere BigInsights An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Part: 1 Exploring Hadoop Distributed File System An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

INTRODUCTION TO HADOOP

INTRODUCTION TO HADOOP Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)

More information

SAS Marketing Automation 4.4. Unix Install Instructions for Hot Fix 44MA10

SAS Marketing Automation 4.4. Unix Install Instructions for Hot Fix 44MA10 SAS Marketing Automation 4.4 Unix Install Instructions for Hot Fix 44MA10 Introduction This document describes the steps necessary to install and deploy the SAS Marketing Automation 4.4 Hot fix Release

More information

Running Knn Spark on EC2 Documentation

Running Knn Spark on EC2 Documentation Pseudo code Running Knn Spark on EC2 Documentation Preparing to use Amazon AWS First, open a Spark launcher instance. Open a m3.medium account with all default settings. Step 1: Login to the AWS console.

More information

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding

More information

Running Hadoop On Ubuntu Linux (Multi-Node Cluster) - Michael G...

Running Hadoop On Ubuntu Linux (Multi-Node Cluster) - Michael G... Go Home About Contact Blog Code Publications DMOZ100k06 Photography Running Hadoop On Ubuntu Linux (Multi-Node Cluster) From Michael G. Noll Contents 1 What we want to do 2 Tutorial approach and structure

More information

Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved.

Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved. Hue 2 User Guide Important Notice (c) 2010-2013 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this document

More information

Cloudera Manager Training: Hands-On Exercises

Cloudera Manager Training: Hands-On Exercises 201408 Cloudera Manager Training: Hands-On Exercises General Notes... 2 In- Class Preparation: Accessing Your Cluster... 3 Self- Study Preparation: Creating Your Cluster... 4 Hands- On Exercise: Working

More information

RDMA for Apache Hadoop 0.9.9 User Guide

RDMA for Apache Hadoop 0.9.9 User Guide 0.9.9 User Guide HIGH-PERFORMANCE BIG DATA TEAM http://hibd.cse.ohio-state.edu NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY Copyright (c)

More information