Basic Hadoop Programming Skills

Maximilian King


2 Basic commands of Ubuntu Open file explorer

3 Basic commands of Ubuntu Open terminal

4 Basic commands of Ubuntu
Open new tabs in the terminal. Typically:
    one tab for compiling source code
    one tab for running Hadoop

5 Basic shell commands (in the terminal)
List directory:
    ls
Create directory:
    mkdir project
Browse into directory:
    cd project
Download file:
    wget

6 Basic commands of Ubuntu Extract the downloaded zip files

7 Start/stop Hadoop
Re-format HDFS (all data will be deleted):
    hadoop namenode -format
Start Hadoop:
    start-all.sh
Check that Hadoop is running:
    jps
    hadoop dfsadmin -report
Stop Hadoop (when you are done):
    stop-all.sh

8 Hadoop Web Interfaces
Open the following in a web browser:
HDFS status:
Hadoop job status:

9 Basic Hadoop Commands
HDFS shell commands:
Create/remove a folder: hadoop fs -mkdir / -rmr FOLDER_NAME
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -mkdir /data
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -mkdir /data/input
List a folder: hadoop fs -ls PATH
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -ls /data

10 Basic Hadoop Commands
HDFS shell commands:
Data transfer: hadoop fs -cp / -mv / -put / -get src dest
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -put project/cs5344-examples/txt/* /data/input

11 Compile source code
Compile Hadoop code:
    javac -classpath `hadoop classpath` -d destination_dir source_dir/filename.java
(the marks around hadoop classpath are backticks, the key above Tab, not single quotes)
Generate a jar file:
    jar -cvf WordCount.jar -C destination_dir .
(note the space and the dot at the end)

12 Compile WordCount example
Browse to the source code directory:
    hadoop@ubuntu-VirtualBox:~$ cd project/cs5344-examples/src
Create a directory to store the compiled classes:
    hadoop@ubuntu-VirtualBox:~/project/cs5344-examples/src$ mkdir classes
Compile the WordCount code:
    hadoop@ubuntu-VirtualBox:~/project/cs5344-examples/src$ javac -classpath `hadoop classpath` -d classes ./WordCount.java
(the marks around hadoop classpath are backticks, the key above Tab, not single quotes)
Generate a jar file:
    hadoop@ubuntu-VirtualBox:~/project/cs5344-examples/src$ jar -cvf WordCount.jar -C classes/ .
(note the space and the dot at the end)

13 Basic Hadoop Commands
Launch job commands: hadoop jar PATH_TO_JAR_FILE classname parameters
E.g., launch the WordCount compiled above:
    hadoop@ubuntu-VirtualBox:~/project/cs5344-examples/src$ hadoop jar WordCount.jar myhadoop.WordCount /data/input /data/output
Display the job results:
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -ls /data/output
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -cat /data/output/part-r
Note: if you run a job multiple times, you need to delete the output folder every time before you launch the job, because the job will not start if the output folder already exists:
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -rmr /data/output

14 Customize number of reducers
Edit WordCount.java in the Text Editor:
    job.setNumReduceTasks(1) -> job.setNumReduceTasks(4)
Compile, generate the jar, launch the job again, and display the results (there will be 4 output files, one per reducer).
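To see why 4 reducers give 4 output files, it can help to look at how keys are assigned to reducers. The following is a minimal plain-Java sketch, not part of the course code: the class name ReducerPartitionSketch and the sample words are hypothetical. The formula mirrors the one used by Hadoop's default HashPartitioner, but the sketch hashes String keys, whereas a real job would hash Text keys, so the actual assignment in Hadoop will differ.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; no Hadoop classes are used.
class ReducerPartitionSketch {

    // Clear the sign bit of the hash, then take the remainder,
    // so the result is always in [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Group keys into one bucket per reducer; each bucket corresponds
    // to the input of one reducer and hence to one output file.
    static List<List<String>> assign(List<String> keys, int numReduceTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, numReduceTasks)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Hypothetical sample keys.
        List<String> words = List.of("hadoop", "map", "reduce", "hdfs", "job", "combiner");
        List<List<String>> buckets = assign(words, 4);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("reducer " + i + " -> " + buckets.get(i));
        }
    }
}
```

Every occurrence of a given word lands in the same bucket, which is why each word appears in exactly one of the output files.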

15 Adding combiner to map
The combiner is already included in the WordCount.java example:
    job.setCombinerClass(IntSumReducer.class);
This combiner uses the same class as the reducer, because summation is an associative and commutative aggregation: summing partial sums gives the same total as summing the original values.
The job result is therefore the same as without a combiner.
We cannot see a difference in performance (running time) here because the input data and shuffled data are not large enough.
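The claim that a sum combiner does not change the result can be checked with a small plain-Java simulation. This is a sketch under stated assumptions, not the course's WordCount.java: the class CombinerSketch, its method names, and the sample input are hypothetical, and each string in the list stands in for one input split.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; no Hadoop classes are used.
class CombinerSketch {

    // Map step: emit (word, 1) for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Sum the values per key; this plays the role of the sum reducer.
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    // No combiner: every (word, 1) pair is shuffled to the reducer.
    static Map<String, Integer> withoutCombiner(List<String> splits) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            shuffled.addAll(map(split));
        }
        return sum(shuffled);
    }

    // With combiner: each split is pre-summed locally, so only
    // one (word, partialSum) pair per word and split is shuffled.
    static Map<String, Integer> withCombiner(List<String> splits) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            shuffled.addAll(sum(map(split)).entrySet());
        }
        return sum(shuffled);
    }

    public static void main(String[] args) {
        // Hypothetical input splits.
        List<String> splits = List.of("a b a", "b c", "a");
        System.out.println(withoutCombiner(splits).equals(withCombiner(splits)));
    }
}
```

The two paths differ only in how many pairs cross the shuffle, which is where a combiner would save time on larger inputs.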

16 Map-only jobs
Edit WordCount.java in the Text Editor:
    job.setNumReduceTasks(1) -> job.setNumReduceTasks(0)
Compile, generate the jar, launch the job again, and display the results (there will be 3 output files, one for each of the 3 map tasks run on the 3 input files; i.e., the results are not aggregated by any reducer).
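What "not aggregated" means can be sketched in plain Java. This is an illustration only, assuming the mapper emits (word, 1) per token as in the usual WordCount; the class MapOnlySketch and the sample file contents are hypothetical. With zero reducers, neither the reducer nor the combiner runs, so each map task writes its raw pairs to its own output.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; no Hadoop classes are used.
class MapOnlySketch {

    // One map task: emit a "word<TAB>1" line per token, with no summing.
    static List<String> mapFile(String contents) {
        List<String> out = new ArrayList<>();
        for (String token : contents.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token + "\t1");
            }
        }
        return out;
    }

    // One output per input file; nothing merges them afterwards.
    static List<List<String>> run(List<String> files) {
        List<List<String>> outputs = new ArrayList<>();
        for (String contents : files) {
            outputs.add(mapFile(contents));
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Hypothetical contents of 3 input files.
        List<List<String>> outputs = run(List.of("a b a", "b", "c a"));
        for (int i = 0; i < outputs.size(); i++) {
            System.out.println("map output " + i + ": " + outputs.get(i));
        }
    }
}
```

A word that occurs twice in a file shows up as two separate lines, and the same word can appear in several outputs.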

17 Work with different input format
Write 2 MR jobs:
WordCountSO is the word count, but it writes its results in the SequenceFile output format (see the sample code in WordCountSO.java):
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
WordCountSort takes the above output (SequenceFile) as input and sorts the words by frequency (see the sample code in WordCountSort.java):
    job.setInputFormatClass(SequenceFileInputFormat.class);

18 Work with different input format
Compile WordCountSO.java and WordCountSort.java.
Generate the WordCount.jar file.
Execute the 2 MR jobs sequentially:
    hadoop jar WordCount.jar myhadoop.WordCountSO /data/input /data/output
    hadoop jar WordCount.jar myhadoop.WordCountSort /data/output/part-r /data/sortoutput
Display the final results:
    hadoop fs -cat /data/sortoutput/part-r-00000
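The source of WordCountSort.java is not shown here, so the following is only a guess at the idea, written in plain Java. One common way to sort by frequency in MapReduce is to swap each (word, count) pair into (count, word), so that the framework's sort on keys orders the records by count. The class FrequencySortSketch, the tie-break on the word, and the sample counts are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; no Hadoop classes are used.
class FrequencySortSketch {

    // Swap (word, count) into (count, word), then sort by count ascending.
    // Ties are broken alphabetically by word (an assumption of this sketch).
    static List<Map.Entry<Integer, String>> sortByFrequency(Map<String, Integer> counts) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            swapped.add(Map.entry(e.getValue(), e.getKey()));
        }
        swapped.sort(Comparator
                .comparing((Map.Entry<Integer, String> e) -> e.getKey())
                .thenComparing(e -> e.getValue()));
        return swapped;
    }

    public static void main(String[] args) {
        // Hypothetical word counts, standing in for the first job's output.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5);
        counts.put("hadoop", 2);
        counts.put("job", 1);
        for (Map.Entry<Integer, String> e : sortByFrequency(counts)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In a real job the sort itself would be done by the shuffle phase rather than by an explicit call, and a single reducer would be needed to get one globally ordered output file.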

Hadoop (pseudo-distributed) installation and configuration

1. Operating systems. Unix-like systems are preferred, e.g., Ubuntu or Mac OS X.
2. Install Java. For Linux, you should download JDK 8 under