Hadoop Tutorial
Group 7 - Tools For Big Data
Indian Institute of Technology Bombay
Dipojjwal Ray, Sandeep Prasad

1 Introduction

In the installation manual we listed the steps for installing hadoop-1.0.3 and hadoop-1.0.4. In this report we present various examples run on hadoop. After installation is complete, any of the examples mentioned below can be run on hadoop as a check of proper installation. The examples explained in this report are:

1. wordcount: listing the words that occur in a given file along with their occurrence frequency [1]
2. pi: calculating the value of pi [2]
3. pagerank:
4. inverted indexing:
5. indexing wikipedia: in this section we will index the entire English wikipedia

2 Wordcount

The wordcount example counts and sorts the words in a given single file or group of files. Files of various sizes were used for this example. The 1st set of experiments was conducted using single files and the 2nd set was conducted using groups of files. For the 1st set of experiments, 5 files were used; their details, along with the time required to run wordcount, are given in table 1. For the 2nd set, combinations of files from the 1st set were used; their details can be found in table 2. The figures given below are for line 3 of table 2, with 3 files in the gutenberg directory in /tmp. Figure 1 shows the command given in Listing 1 executed on my machine. It is assumed that the files are located in the /tmp directory under an appropriate name (in my case the directory name is /tmp/gutenberg).

1 $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
2 $ bin/hadoop dfs -ls /user/hduser/gutenberg

Listing 1: Copying files from the user machine to hadoop's file system
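Before looking at the timings, it may help to see what the job actually computes. The map and reduce steps behind wordcount can be sketched in plain Python as below; this is a simplified stand-in for the Java example that ships with Hadoop, and the function names are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]

def reduce_counts(pairs):
    """Reduce step: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Simulate the job on two small in-memory "files"
lines = ["the quick brown fox", "the lazy dog and the fox"]
pairs = [p for line in lines for p in map_words(line)]
print(reduce_counts(pairs))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'and': 1}
```

Hadoop distributes exactly this logic: each map task processes a split of the input, and the framework groups the (word, 1) pairs by key before the reduce tasks sum them.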
1st set of experiments

file name      size      cpu time required (ms)
pg20417.txt    674.6 KB  3380
pg2243.txt     137.3 KB  2270
pg28885.txt    177.4 KB  2520
pg4300.txt     1.6 MB    4090
pg5000.txt     1.4 MB    3700

Table 1: Time required to count words in single files

2nd set of experiments

file names                                                    total size  cpu time required (ms)
pg4300.txt, pg5000.txt                                        3.0 MB      6860
pg4300.txt, pg5000.txt, pg20417.txt                           3.7 MB      9580
pg2243.txt, pg5000.txt, pg20417.txt, pg28885.txt              2.4 MB      9090
pg2243.txt, pg4300.txt, pg5000.txt, pg20417.txt, pg28885.txt  4.0 MB      11410

Table 2: Time required to count words in multiple files

Line 1 in Listing 1 copies files from /tmp/gutenberg on the local machine to hadoop's file system, into the directory /user/hduser/gutenberg. Line 2 in Listing 1 lists (checks) the files just copied to /user/hduser/gutenberg.

Figure 1: copy files to dfs

The command to run wordcount is given in Listing 2, and the command as executed on my machine is given in Listing 3. The files from /user/hduser/gutenberg are used and the output is stored in /user/hduser/gutenberg-output.

1 $ bin/hadoop jar hadoop-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Listing 2: Command to run wordcount on the gutenberg files

1  hduser@ada-desktop:/usr/local/hadoop$ bin/hadoop jar hadoop-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
2  Warning: $HADOOP_HOME is deprecated.
3
4  13/07/29 14:20:57 INFO input.FileInputFormat: Total input paths to process : 3
5  13/07/29 14:20:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
6  13/07/29 14:20:57 WARN snappy.LoadSnappy: Snappy native library not loaded
7  13/07/29 14:20:57 INFO mapred.JobClient: Running job: job_201307291349_0001
8  13/07/29 14:20:58 INFO mapred.JobClient:  map 0% reduce 0%
9  13/07/29 14:21:13 INFO mapred.JobClient:  map 66% reduce 0%
10 13/07/29 14:21:19 INFO mapred.JobClient:  map 100% reduce 0%
11 13/07/29 14:21:22 INFO mapred.JobClient:  map 100% reduce 22%
12 13/07/29 14:21:31 INFO mapred.JobClient:  map 100% reduce 100%
13 13/07/29 14:21:36 INFO mapred.JobClient: Job complete: job_201307291349_0001
14 13/07/29 14:21:36 INFO mapred.JobClient: Counters: 29
15 13/07/29 14:21:36 INFO mapred.JobClient:   Job Counters
16 13/07/29 14:21:36 INFO mapred.JobClient:     Launched reduce tasks=1
17 13/07/29 14:21:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=20523
18 13/07/29 14:21:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
19 13/07/29 14:21:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
20 13/07/29 14:21:36 INFO mapred.JobClient:     Launched map tasks=3
21 13/07/29 14:21:36 INFO mapred.JobClient:     Data-local map tasks=3
22 13/07/29 14:21:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16245
23 13/07/29 14:21:36 INFO mapred.JobClient:   File Output Format Counters
24 13/07/29 14:21:36 INFO mapred.JobClient:     Bytes Written=880838
25 13/07/29 14:21:36 INFO mapred.JobClient:   FileSystemCounters
26 13/07/29 14:21:36 INFO mapred.JobClient:     FILE_BYTES_READ=2214875
27 13/07/29 14:21:36 INFO mapred.JobClient:     HDFS_BYTES_READ=3671884
28 13/07/29 14:21:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3775583
29 13/07/29 14:21:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880838
30 13/07/29 14:21:36 INFO mapred.JobClient:   File Input Format Counters
31 13/07/29 14:21:36 INFO mapred.JobClient:     Bytes Read=3671523
32 13/07/29 14:21:36 INFO mapred.JobClient:   Map-Reduce Framework
33 13/07/29 14:21:36 INFO mapred.JobClient:     Map output materialized bytes=1474367
34 13/07/29 14:21:36 INFO mapred.JobClient:     Map input records=77931
35 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce shuffle bytes=1207341
36 13/07/29 14:21:36 INFO mapred.JobClient:     Spilled Records=255966
37 13/07/29 14:21:36 INFO mapred.JobClient:     Map output bytes=6076101
38 13/07/29 14:21:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=586285056
39 13/07/29 14:21:36 INFO mapred.JobClient:     CPU time spent (ms)=9580
40 13/07/29 14:21:36 INFO mapred.JobClient:     Combine input records=629172
41 13/07/29 14:21:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=361
42 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce input records=102324
43 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce input groups=82335
44 13/07/29 14:21:36 INFO mapred.JobClient:     Combine output records=102324
45 13/07/29 14:21:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=625811456
46 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce output records=82335
47 13/07/29 14:21:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1897635840
48 13/07/29 14:21:36 INFO mapred.JobClient:     Map output records=629172
49 hduser@ada-desktop:/usr/local/hadoop$

Listing 3: wordcount executed on /user/hduser/gutenberg

In case the system is not able to locate the jar file, the following error message is received:

1 Exception in thread "main" java.io.IOException: Error opening job jar: hadoop-examples.jar at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
2 Caused by: java.util.zip.ZipException: error in opening zip file

In such cases use the complete name of the jar file (hadoop-examples-1.0.3.jar instead of hadoop*examples*.jar) and run the command again.
As mentioned, the output is stored in /user/hduser/gutenberg-output. To check that the files exist, run the command given in line 2 of Listing 1, replacing gutenberg with gutenberg-output in the command. Figure 2 shows the files present on my system.

Figure 2: checking the files produced by wordcount

Figure 3 shows the retrieved output, which can be checked by importing the results back to the local system. Notice -getmerge in line 2 of Listing 4; it merges everything present in the gutenberg-output folder into a single local file.

1 $ mkdir /tmp/gutenberg-output
2 $ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
3 $ head /tmp/gutenberg-output/gutenberg-output

Listing 4: Checking wordcount results after importing results to the local system

Figure 3: Checking wordcount results

Results can also be retrieved without importing them to the local system; just use the command given in Listing 5.

1 $ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000

Listing 5: Checking wordcount results without importing results to the local system
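The merged output file consists of one tab-separated "word<TAB>count" pair per line, which is easy to post-process locally. As a rough sketch (the sample data below is made up, and the real file would be read from the path used in Listing 4), one might pull out the most frequent words like this:

```python
# Parse wordcount output of the form "<word>\t<count>" per line and
# report the top-N most frequent words. sample_output stands in for the
# contents of the merged local file produced by -getmerge.
sample_output = "the\t4182\nof\t2290\nand\t1940\nfox\t12\n"

def top_words(text, n=3):
    rows = []
    for line in text.splitlines():
        word, count = line.split("\t")
        rows.append((word, int(count)))
    # Sort descending by count; wordcount itself sorts by word, not frequency
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:n]

print(top_words(sample_output))
# [('the', 4182), ('of', 2290), ('and', 1940)]
```

Note that the job's own output is sorted by key (the word), so a frequency ranking like this has to be done as a separate pass.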
3 Value of PI

Hadoop can be used to calculate the value of pi (pi is approximately 3.14159 [a]). In this example the value of pi is estimated using a quasi-Monte Carlo method, via the command in Listing 6. Two values are given after pi: the first, x, is the number of maps, and the second, y, is the number of samples per map. The results of some experiments are given in table 3.

1 $ bin/hadoop jar hadoop-examples.jar pi 10 100

Listing 6: command to calculate value of pi

x    y        Time required (secs)  Value calculated
10   100      60.53                 3.148
10   200      53.53                 3.144
10   400      55.58                 3.14
10   1000000  54.45                 3.1415844
50   100      178.82                3.1418

Table 3: Time required to calculate value of PI for different x and y

References

[1] Michael G. Noll. Running Hadoop on Ubuntu Linux (single-node cluster). http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/.

[2] Cloud9: A MapReduce library for Hadoop - Getting started in standalone mode. http://lintool.github.io/cloud9/docs/content/start-standalone.html.

[a] http://en.wikipedia.org/wiki/Pi
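Appendix: the estimator behind the pi example can be sketched in a few lines of Python. This sketch uses plain pseudo-random sampling, whereas Hadoop's example uses a quasi-random (low-discrepancy) sequence, and it runs sequentially rather than splitting the samples across map tasks; the two arguments mirror the x (maps) and y (samples per map) values in Listing 6:

```python
import random

def estimate_pi(num_maps, samples_per_map, seed=0):
    """Monte Carlo estimate of pi: the fraction of uniform random points
    in the unit square that fall inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    total = num_maps * samples_per_map
    for _ in range(total):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / total

print(estimate_pi(10, 100000))  # typically prints a value close to 3.14
```

As table 3 shows, accuracy improves with the total number of samples x*y, while the wall-clock time is dominated by the number of map tasks launched, not by the sample count.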