Basic Hadoop Programming Skills

Basic commands of Ubuntu

Open the file explorer.
Open a terminal.
Open new tabs in the terminal. Typically, use one tab for compiling source code and one tab for running Hadoop.

Basic shell commands (in the terminal)

List a directory:
    hadoop@ubuntu-virtualbox:~$ ls
Create a directory:
    hadoop@ubuntu-virtualbox:~$ mkdir project
Change into a directory:
    hadoop@ubuntu-virtualbox:~$ cd project
Download a file:
    hadoop@ubuntu-virtualbox:~/project$ wget http://www.comp.nus.edu.sg/~shilei/download/cs5344-examples.zip

Basic commands of Ubuntu: extract the downloaded zip file

Start/stop Hadoop

Re-format the HDFS (all data will be deleted):
    hadoop@ubuntu-virtualbox:~$ hadoop namenode -format
Start Hadoop:
    hadoop@ubuntu-virtualbox:~$ start-all.sh
Check whether Hadoop is running:
    hadoop@ubuntu-virtualbox:~$ jps
    hadoop@ubuntu-virtualbox:~$ hadoop dfsadmin -report
Stop Hadoop (when you are done):
    hadoop@ubuntu-virtualbox:~$ stop-all.sh

Hadoop Web Interfaces

Open the following URLs in a web browser:
    HDFS status (NameNode): http://localhost:50070
    Hadoop job status (JobTracker): http://localhost:50030

Basic Hadoop Commands

HDFS shell commands:
Create/remove a folder: hadoop fs -mkdir/-rmr FOLDER_NAME
    hadoop@ubuntu-virtualbox:~$ hadoop fs -mkdir /data
    hadoop@ubuntu-virtualbox:~$ hadoop fs -mkdir /data/input
List a folder: hadoop fs -ls PATH
    hadoop@ubuntu-virtualbox:~$ hadoop fs -ls /data

Basic Hadoop Commands

HDFS shell commands:
Data transfer: hadoop fs -cp/-mv/-put/-get SRC DEST
    hadoop@ubuntu-virtualbox:~$ hadoop fs -put project/cs5344-examples/txt/* /data/input

Compile source code

Compile Hadoop code:
    javac -classpath `hadoop classpath` -d destination_dir source_dir/filename.java
(the quotation mark around hadoop classpath is the backquote, the key above the Tab key)
Generate a jar file:
    jar -cvf WordCount.jar -C destination_dir .
(note the space and the dot at the end)

Compile WordCount example

Change to the source code directory:
    hadoop@ubuntu-virtualbox:~$ cd project/cs5344-examples/src
Create a directory to store the compiled classes:
    hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ mkdir classes
Compile the WordCount code:
    hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ javac -classpath `hadoop classpath` -d classes ./WordCount.java
(the quotation mark around hadoop classpath is the backquote, the key above the Tab key)
Generate a jar file:
    hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ jar -cvf WordCount.jar -C classes/ .
(note the space and the dot at the end)

Basic Hadoop Commands

Launch a job: hadoop jar PATH_TO_JAR_FILE CLASS_NAME PARAMETERS
E.g., launch the WordCount compiled above:
    hadoop@ubuntu-virtualbox:~/project/cs5344-examples/src$ hadoop jar WordCount.jar myhadoop.WordCount /data/input /data/output
Display the job results:
    hadoop@ubuntu-virtualbox:~$ hadoop fs -ls /data/output
    hadoop@ubuntu-virtualbox:~$ hadoop fs -cat /data/output/part-r-00000
Note: Hadoop refuses to start a job whose output folder already exists, so if you run a job multiple times you need to delete the output folder each time before you launch the job:
    hadoop@ubuntu-virtualbox:~$ hadoop fs -rmr /data/output

Customize number of reducers

Edit WordCount.java in the Text Editor:
    job.setNumReduceTasks(1) -> job.setNumReduceTasks(4)
Compile, generate the jar, launch the job again, and display the results (there will be 4 output files, part-r-00000 to part-r-00003, one per reducer).
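
How words end up in the 4 output files can be sketched in plain Java. The formula in partitionFor is the one used by Hadoop's default HashPartitioner; everything else (the class name PartitionSketch, the sample words, and the use of String.hashCode() in place of the key type's own hashCode()) is an assumption made for illustration, so the actual assignment in a real job will differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not part of the cs5344 examples.
class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the number of reducers.
    static int partitionFor(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups words by the index of the reducer that would receive them.
    // Assumption: String.hashCode() stands in for the key's hashCode().
    static Map<Integer, List<String>> assign(List<String> words, int numReduceTasks) {
        Map<Integer, List<String>> byReducer = new TreeMap<Integer, List<String>>();
        for (int r = 0; r < numReduceTasks; r++) {
            byReducer.put(r, new ArrayList<String>());
        }
        for (String word : words) {
            byReducer.get(partitionFor(word.hashCode(), numReduceTasks)).add(word);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "map", "reduce", "hdfs", "job", "combiner");
        for (Map.Entry<Integer, List<String>> e : assign(words, 4).entrySet()) {
            // Reducer i writes its result to the file part-r-0000i.
            System.out.println(String.format("part-r-%05d", e.getKey()) + " <- " + e.getValue());
        }
    }
}
```

Every occurrence of a given word maps to the same reducer, so each word appears in exactly one of the output files.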

Adding combiner to map

The combiner is already included in the WordCount.java example:
    job.setCombinerClass(IntSumReducer.class);
This combiner uses the same class as the reducer, because sum is an associative (and commutative) aggregation: partial sums computed on the map side can be summed again by the reducer.
The job result is the same as without a combiner.
We cannot tell the difference in performance (running time) here because the input data and the shuffled data are not large enough.
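
A minimal plain-Java sketch of why reusing the sum as a combiner does not change the result. The class name CombinerSketch and the sample counts are assumptions for illustration; each inner list stands for the 1s that one map task emitted for the same word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the cs5344 examples.
class CombinerSketch {

    // The aggregation used by both the combiner and the reducer.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // No combiner: every single count is shuffled to the reducer.
    static int withoutCombiner(List<List<Integer>> perMapCounts) {
        List<Integer> shuffled = new ArrayList<Integer>();
        for (List<Integer> counts : perMapCounts) {
            shuffled.addAll(counts);
        }
        return sum(shuffled);
    }

    // With combiner: each map task sends one partial sum instead.
    static int withCombiner(List<List<Integer>> perMapCounts) {
        List<Integer> shuffled = new ArrayList<Integer>();
        for (List<Integer> counts : perMapCounts) {
            shuffled.add(sum(counts));
        }
        return sum(shuffled);
    }

    public static void main(String[] args) {
        List<List<Integer>> perMapCounts = Arrays.asList(
                Arrays.asList(1, 1, 1), Arrays.asList(1), Arrays.asList(1, 1));
        System.out.println("without combiner: " + withoutCombiner(perMapCounts));
        System.out.println("with combiner:    " + withCombiner(perMapCounts));
    }
}
```

In this sketch the reducer receives 3 values instead of 6 when the combiner is used; with large inputs that reduction in shuffled data is where the running-time benefit would come from.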

Map-only jobs

Edit WordCount.java in the Text Editor:
    job.setNumReduceTasks(1) -> job.setNumReduceTasks(0)
Compile, generate the jar, launch the job again, and display the results (there will be 3 output files, which correspond to the 3 map tasks run on the 3 input files; i.e., the results are not aggregated by any reducer).
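
What a map-only run produces can be sketched in plain Java. This assumes the WordCount mapper emits one (word, 1) pair per whitespace-separated token, as in the standard WordCount; the class name MapOnlySketch and the sample inputs are hypothetical. With zero reducers, map output files are named part-m-NNNNN rather than part-r-NNNNN, and, to my understanding, the combiner is not applied either, so repeated words appear as repeated records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the cs5344 examples.
class MapOnlySketch {

    // Mimics a WordCount mapper: one "word<TAB>1" record per token.
    static List<String> map(String line) {
        List<String> records = new ArrayList<String>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                records.add(token + "\t1");
            }
        }
        return records;
    }

    // Output file name written by the map task with the given index.
    static String outputName(int mapTaskIndex) {
        return String.format("part-m-%05d", mapTaskIndex);
    }

    public static void main(String[] args) {
        // Assumed sample data: one short line per input file.
        String[] inputs = {"the cat the dog", "the bird", "a cat"};
        for (int i = 0; i < inputs.length; i++) {
            // Each map task writes its own file; nothing merges the counts.
            System.out.println(outputName(i) + ":");
            for (String record : map(inputs[i])) {
                System.out.println("  " + record);
            }
        }
    }
}
```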

Work with different input formats

Write 2 MR jobs:
WordCountSO is the word count, but it writes its results in the SequenceFile output format (see the sample code in WordCountSO.java):
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
WordCountSort takes the above output (SequenceFile) as input and sorts the words by frequency (see the sample code in WordCountSort.java):
    job.setInputFormatClass(SequenceFileInputFormat.class);
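
One common way to sort by frequency in MapReduce is to swap each (word, count) pair into (count, word), so that the framework's sort on keys orders the records by count. The plain-Java sketch below imitates that idea; it is not necessarily how WordCountSort.java does it, and the class name FrequencySortSketch and the sample counts are assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the cs5344 examples.
class FrequencySortSketch {

    // "Map" step: swap (word, count) into (count, word).
    // "Shuffle" step: sort by the new key, ascending, as integer keys sort by default.
    static List<Map.Entry<Integer, String>> sortByFrequency(Map<String, Integer> counts) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<Map.Entry<Integer, String>>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            swapped.add(new AbstractMap.SimpleEntry<Integer, String>(e.getValue(), e.getKey()));
        }
        swapped.sort(Map.Entry.<Integer, String>comparingByKey());
        return swapped;
    }

    public static void main(String[] args) {
        // Assumed sample word counts, standing in for the first job's output.
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        counts.put("the", 3);
        counts.put("a", 1);
        counts.put("hadoop", 2);
        for (Map.Entry<Integer, String> e : sortByFrequency(counts)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

A global order across the whole output only holds when a single reducer is used, since each reducer sorts only the keys it receives.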

Work with different input formats

Compile WordCountSO.java and WordCountSort.java.
Generate the WordCount.jar file.
Execute the 2 MR jobs sequentially:
    hadoop jar WordCount.jar myhadoop.WordCountSO /data/input /data/output
    hadoop jar WordCount.jar myhadoop.WordCountSort /data/output/part-r-00000 /data/sortoutput
Display the final results:
    hadoop fs -cat /data/sortoutput/part-r-00000