Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay



Dipojjwal Ray, Sandeep Prasad

1 Introduction

In the installation manual we listed the steps for installing Hadoop. In this report we present several examples run on Hadoop. Once installation is complete, any of the examples below can be run as a check that Hadoop was installed properly. The examples covered in this report are:

1. wordcount: lists the words that occur in a given file along with their frequency of occurrence [1]
2. pi: calculates the value of pi [2]
3. pagerank
4. inverted indexing
5. indexing wikipedia: indexes the entire English Wikipedia

2 Wordcount

The wordcount example counts and sorts the words in a single file or a group of files. Files of various sizes were used for this example. The 1st set of experiments used single files and the 2nd set used groups of files. For the 1st set, 5 files were used; their details, along with the time required to run wordcount, are given in Table 1. The 2nd set used combinations of files from the 1st set; the details are in Table 2.

The figures below are for line 3 of Table 2, with 3 files in the gutenberg directory under /tmp. Figure 1 shows the commands of Listing 1 executed on my machine. The files are assumed to be in a suitably named directory under /tmp (in my case /tmp/gutenberg).

    1 $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
    2 $ bin/hadoop dfs -ls /user/hduser/gutenberg

Listing 1: Copying files from the user's machine to Hadoop's file system
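The counting that the wordcount example performs can be sketched locally as a map step followed by a reduce step. The code below is a minimal, single-process illustration and not the Hadoop implementation: the function names (map_phase, reduce_phase, wordcount) are hypothetical, and splitting on whitespace is an assumption about how words are tokenized.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token in the line."""
    return [(word, 1) for word in line.split()]


def reduce_phase(pairs):
    """Sum the counts for each word and return (word, total) sorted by word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def wordcount(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_phase(line))
    return reduce_phase(pairs)


if __name__ == "__main__":
    # Made-up sample text standing in for the Gutenberg files.
    sample = ["the quick brown fox", "the lazy dog", "the end"]
    for word, count in wordcount(sample):
        print(f"{word}\t{count}")
```

On a cluster the map calls run in parallel over blocks of the input and the pairs are grouped by word before reaching the reducers; here both steps run serially in one process.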


    1st set of experiments
    file name      size        cpu time required (ms)
    pg20417.txt    [...] KB    3380
    pg2243.txt     [...] KB    2270
    pg28885.txt    [...] KB    2520
    pg4300.txt     1.6 MB      4090
    pg5000.txt     1.4 MB      3700

Table 1: Time required to count words in single files

    2nd set of experiments
    file names                                                      total size    cpu time required (ms)
    pg4300.txt, pg5000.txt                                          3.0 MB        6860
    pg4300.txt, pg5000.txt, pg20417.txt                             3.7 MB        9580
    pg2243.txt, pg5000.txt, pg20417.txt, pg28885.txt                2.4 MB        9090
    pg2243.txt, pg4300.txt, pg5000.txt, pg20417.txt, pg28885.txt    4.0 MB        [...]

Table 2: Time required to count words in multiple files

Line 1 of Listing 1 copies the files from /tmp/gutenberg on the local machine to the directory /user/hduser/gutenberg in Hadoop's file system. Line 2 of Listing 1 lists the files just copied to /user/hduser/gutenberg, as a check.

Figure 1: Copying files to dfs

The command to run wordcount is given in Listing 2, and the command as executed on my machine is shown in Listing 3. Files from /user/hduser/gutenberg are used as input and the output is stored in /user/hduser/gutenberg-output.

    $ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Listing 2: Command to run wordcount

    hduser@ada-desktop:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
    Warning: $HADOOP_HOME is deprecated.

    13/07/29 14:20:57 INFO input.FileInputFormat: Total input paths to process : [...]
    13/07/29 14:20:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    13/07/29 14:20:57 WARN snappy.LoadSnappy: Snappy native library not loaded
    13/07/29 14:20:57 INFO mapred.JobClient: Running job: job_[...]
    13/07/29 14:20:58 INFO mapred.JobClient:  map 0% reduce 0%
    13/07/29 14:21:13 INFO mapred.JobClient:  map 66% reduce 0%
    13/07/29 14:21:19 INFO mapred.JobClient:  map 100% reduce 0%
    13/07/29 14:21:22 INFO mapred.JobClient:  map 100% reduce 22%
    13/07/29 14:21:31 INFO mapred.JobClient:  map 100% reduce 100%
    13/07/29 14:21:36 INFO mapred.JobClient: Job complete: job_[...]
    13/07/29 14:21:36 INFO mapred.JobClient: Counters: [...]
    13/07/29 14:21:36 INFO mapred.JobClient:   Job Counters
    13/07/29 14:21:36 INFO mapred.JobClient:     Launched reduce tasks=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Launched map tasks=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Data-local map tasks=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:   File Output Format Counters
    13/07/29 14:21:36 INFO mapred.JobClient:     Bytes Written=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:   FileSystemCounters
    13/07/29 14:21:36 INFO mapred.JobClient:     FILE_BYTES_READ=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     HDFS_BYTES_READ=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:   File Input Format Counters
    13/07/29 14:21:36 INFO mapred.JobClient:     Bytes Read=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:   Map-Reduce Framework
    13/07/29 14:21:36 INFO mapred.JobClient:     Map output materialized bytes=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Map input records=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Reduce shuffle bytes=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Spilled Records=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Map output bytes=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     CPU time spent (ms)=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Combine input records=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Reduce input records=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Reduce input groups=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Combine output records=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Reduce output records=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=[...]
    13/07/29 14:21:36 INFO mapred.JobClient:     Map output records=[...]
    hduser@ada-desktop:/usr/local/hadoop$

Listing 3: wordcount executed on /user/hduser/gutenberg

If the system cannot find the jar file, the following error message is shown:

    Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
            at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
    Caused by: java.util.zip.ZipException: error in opening zip file

In that case, use the complete, versioned name of the hadoop-examples jar file instead of hadoop*examples*.jar and run the command again.
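The cpu times in Tables 1 and 2 presumably come from the "CPU time spent (ms)" counter that a job prints on completion, as in Listing 3. The sketch below pulls name=value counters out of JobClient log text. It is an illustration under stated assumptions: parse_counters and COUNTER_RE are hypothetical names, the expected line layout is the one shown in Listing 3, and the numbers in SAMPLE_LOG are made up, not taken from the report.

```python
import re

# Matches lines such as
#   13/07/29 14:21:36 INFO mapred.JobClient:     CPU time spent (ms)=1234
# and captures the counter name and its integer value.
COUNTER_RE = re.compile(
    r"INFO mapred\.JobClient:\s+(?P<name>[^=]+?)\s*=\s*(?P<value>\d+)\s*$"
)


def parse_counters(log_text):
    """Return a dict mapping counter names to integer values."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


# Hypothetical log excerpt; the counter values are invented for illustration.
SAMPLE_LOG = """\
13/07/29 14:21:31 INFO mapred.JobClient:  map 100% reduce 100%
13/07/29 14:21:36 INFO mapred.JobClient:   Job Counters
13/07/29 14:21:36 INFO mapred.JobClient:     Launched map tasks=3
13/07/29 14:21:36 INFO mapred.JobClient:     CPU time spent (ms)=1234
"""

if __name__ == "__main__":
    for name, value in parse_counters(SAMPLE_LOG).items():
        print(f"{name}: {value}")
```

Progress lines and group headings contain no "=" followed by a number, so they are skipped and only the counters remain.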


As mentioned, the output is stored in /user/hduser/gutenberg-output. To check that the files exist, run the command in line 2 of Listing 1 with gutenberg replaced by gutenberg-output. Figure 2 shows the files present on my system.

Figure 2: Checking the files produced by wordcount

Figure 3 shows the retrieved output, which can be checked by importing the results back to the local system. Note the -getmerge option in line 2 of Listing 4: it merges everything present in the gutenberg-output folder.

    1 $ mkdir /tmp/gutenberg-output
    2 $ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
    3 $ head /tmp/gutenberg-output/gutenberg-output

Listing 4: Checking wordcount results after importing the results to the local system

Figure 3: Checking wordcount results

The results can also be read without importing them, using the command in Listing 5.

    $ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-[...]

Listing 5: Checking wordcount results without importing the results to the local system
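Once the results are on the local system, a few lines of code can go further than head. The sketch below assumes each output line has the form word, a tab, then the count; the function names (parse_wordcount_output, top_words) are hypothetical and the sample lines are made up.

```python
def parse_wordcount_output(lines):
    """Parse 'word<TAB>count' lines into a dict, skipping malformed lines."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so a word containing a tab cannot break parsing.
        word, sep, count = line.rpartition("\t")
        if not sep or not count.isdigit():
            continue
        counts[word] = counts.get(word, 0) + int(count)
    return counts


def top_words(counts, n=10):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    # Made-up lines standing in for /tmp/gutenberg-output/gutenberg-output.
    sample = ["the\t3\n", "fox\t1\n", "dog\t3\n", "\n"]
    for word, count in top_words(parse_wordcount_output(sample), n=2):
        print(f"{word}\t{count}")
```

To run it on real results, pass an open file object for the merged file in place of the sample list.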


3 Value of PI

Hadoop can be used to calculate the value of pi. In this example the value is estimated using a quasi-Monte Carlo method, with the command in Listing 6. Two values are given after pi: the first, x, is the number of maps and the second, y, is the number of samples per map. Results of some of the experiments conducted are given in Table 3.

    $ bin/hadoop jar hadoop*examples*.jar pi x y

Listing 6: Command to calculate the value of pi

    x    y    Time required (secs)    Value calculated
    [...]

Table 3: Time required to calculate the value of pi for different x and y

References

[1] Michael G. Noll. Running Hadoop on Ubuntu Linux (Single-Node Cluster).
[2] Cloud9. Cloud9: A MapReduce library for Hadoop. Getting started in standalone mode.
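The quasi-Monte Carlo estimate of pi from section 3 can be illustrated with a short serial sketch. It assumes a two-dimensional Halton sequence (bases 2 and 3) as the quasi-random point source, which is one common choice; the report itself only says "quasi-Monte Carlo", so this is not necessarily what the Hadoop example does internally. The names halton and estimate_pi are hypothetical, and the parameters mirror x (number of maps) and y (samples per map) only to show that x * y points are used in total.

```python
import math


def halton(index, base):
    """Return element `index` (1-based) of the Halton sequence in `base`."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def estimate_pi(num_maps, samples_per_map):
    """Estimate pi from num_maps * samples_per_map quasi-random points.

    Points fall in the unit square; the fraction landing inside the
    inscribed circle of radius 0.5 approximates pi / 4.
    """
    total = num_maps * samples_per_map
    inside = 0
    for i in range(1, total + 1):
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:
            inside += 1
    return 4.0 * inside / total


if __name__ == "__main__":
    estimate = estimate_pi(10, 1000)
    print(f"estimate = {estimate}, error = {abs(estimate - math.pi)}")
```

On a cluster each map would handle its own share of the points and a reducer would add up the inside counts; here a single loop covers all of them. Raising either x or y increases the number of points and generally improves the estimate.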

