ETH Zurich Department of Computer Science Networked Information Systems - Spring Tutorial #1: Hadoop and MapReduce.

ETH Zurich
Department of Computer Science
Networked Information Systems - Spring 2008

Tutorial #1: Hadoop and MapReduce
March 17,

1 Introduction

Hadoop¹ is an open-source, Java-based software platform developed by the Apache Software Foundation [1]. It lets one easily write and run distributed applications that process vast amounts of data on large computer clusters (see Figure 1). Hadoop implements Google's MapReduce programming model [7, 9, 10] on top of a distributed file system called the Hadoop Distributed File System (HDFS) [8]. MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability and places them on compute nodes around the cluster. MapReduce can then process the data where it is located.

¹ Hadoop was named after its creator Doug Cutting's child's stuffed elephant.

Figure 1: Hadoop Architecture

HDFS has a master/slave architecture (see Figure 2). An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on.

HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode [8].
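The block and replica bookkeeping described above can be illustrated with a small Python sketch. It is not HDFS code: the function names, the block size, the DataNode names, and the round-robin placement are all assumptions made for illustration, and the real NameNode uses its own placement policy.

```python
from itertools import cycle, islice


def split_into_blocks(file_size, block_size):
    """Return the byte length of each block of a file of `file_size` bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full, rest = divmod(file_size, block_size)
    # All blocks are full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication):
    """Toy block map: block index -> DataNodes holding a replica.

    Round-robin placement is an assumption for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    block_map = {}
    for b in range(num_blocks):
        start = b % len(datanodes)
        # Take `replication` consecutive, distinct nodes starting at `start`.
        block_map[b] = list(islice(cycle(datanodes), start, start + replication))
    return block_map


# Hypothetical 150-byte file, 64-byte blocks, three DataNodes, two replicas.
blocks = split_into_blocks(file_size=150, block_size=64)
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
print(blocks)      # prints [64, 64, 22]
print(block_map)   # prints {0: ['dn1', 'dn2'], 1: ['dn2', 'dn3'], 2: ['dn3', 'dn1']}
```

The point of the sketch is the division of labour: the block map lives in one place (the NameNode's role), while the listed nodes only store and serve the blocks assigned to them (the DataNodes' role).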

Figure 2: HDFS Architecture

MapReduce was developed at Google as a mechanism for processing large amounts of raw data, such as crawled documents or web request logs. This data is so large that it must be distributed across thousands of machines in order to be processed in a reasonable time. This distribution implies parallel computing, since the same computations are performed on each CPU, but with a different dataset. MapReduce is an abstraction that allows Google engineers to express simple computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance.

Map, written by a user of the MapReduce library, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. Reduce, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling a job's component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.

Minimally, applications specify the input and output locations and supply map and reduce functions via implementations of the appropriate interfaces and/or abstract classes. These and other job parameters comprise the job configuration. The Hadoop job client then submits the job (jar, executable, etc.) and its configuration to the JobTracker, which assumes responsibility for distributing the software and configuration to the slaves, scheduling and monitoring tasks, and providing status and diagnostic information to the job client [3].

2 MapReduce WordCount Example

Let's see a simple MapReduce example.
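As a first, minimal sketch of the Map, group, and Reduce contract described above, the following plain Python program counts words in a single process. It does not use the Hadoop API: the names map_fn, reduce_fn, and run_job and the two input records are hypothetical, chosen only to make the data flow visible.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one input pair (line number, line text), emit <word, 1> pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all values of one intermediate key into a single count."""
    return key, sum(values)


def run_job(records, map_fn, reduce_fn):
    """Run map, group the intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Reduce is called once per intermediate key, in increasing key order.
    return [reduce_fn(ikey, groups[ikey]) for ikey in sorted(groups)]


records = [(0, "to be or"), (1, "not to be")]
print(run_job(records, map_fn, reduce_fn))
# prints [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In Hadoop the same three steps run on many machines at once, and the grouping step is performed by the framework rather than by user code.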
The goal of this example is to count how many times each distinct word occurs in a given text. To implement it, Mapper, Reducer, and Executer classes are needed. map is the only function that must be overridden in a Mapper class, and reduce is the only function that must be overridden in a Reducer class. Besides these, a few other useful functions are inherited from MapReduceBase. For complete details, see MapReduceBase on the Hadoop API page. The source code of WordCount can be found in src/examples/org/apache/hadoop/examples/WordCount.java in the Hadoop distribution.

With this background in place, let's return to the word count example. Suppose we are given this quote from a high school football coach as our input dataset:

    We are not what we want to be, but at least we are not what we used to be.

The Map code outputs a pair <word, 1> for each word, yielding the set of pairs:

    <we, 1> <are, 1> <not, 1> <what, 1> <we, 1> <want, 1> <to, 1> <be, 1> <but, 1> ...

For now we can think of the <key, value> pairs as a simple linear list, but in reality the Hadoop process runs in parallel on many machines. Each process has a small part of the overall Map input (called a map shard) and maintains its own local cache of the Map output.

After the Map phase produces the intermediate <key, value> pairs, the Hadoop system automatically and efficiently groups them by key in preparation for the Reduce phase (this grouping is known as the shuffle phase of a MapReduce). For the above example, this means that all the we pairs are grouped together, all the are pairs are grouped together, and so on, as shown below (one group per line):

    <we, 1> <we, 1> <we, 1> <we, 1>
    <are, 1> <are, 1>
    <not, 1> <not, 1>
    <what, 1> <what, 1>
    <want, 1>
    <to, 1> <to, 1>
    <be, 1> <be, 1>
    <but, 1>
    <at, 1>
    <least, 1>
    <used, 1>

The Reducer class contains a reduce function, which is then called once for each key: one reduce call for we, one for are, and so on. Each reduce call looks at all the values for its key and outputs a summary value for that key in the final output. In the above example, reduce is called once for the we key and is passed the values the Mapper output for it: 1, 1, 1, and 1. (Note that the values going into reduce are not in any particular order.) Suppose reduce computes a summary value consisting of the number of values the Mapper output for the given key. The Reduce phase on the above pairs then produces the pairs shown below. The Reduce phase also sorts the output <key, value> pairs into increasing order by key:

    <are, 2>
    <at, 1>

    <be, 2>
    <but, 1>
    <least, 1>
    <not, 2>
    <to, 2>
    <used, 1>
    <want, 1>
    <we, 4>
    <what, 2>

Like Map, Reduce is also run in parallel on a group of machines. Each machine is assigned a subset of the keys to work on (known as a reduce shard) and outputs its results into a separate file.

3 Let's Try

3.1 Installing Hadoop

1. First, you need to get a Hadoop distribution. Download a recent stable release from the Hadoop webpage [4] and unpack it in your home directory.

2. Once done, edit the file conf/hadoop-env.sh in the distribution to define JAVA_HOME.

3. Try the following command:

    bin/hadoop

   This will display the usage documentation for the hadoop script.

The following two sections describe the configuration and an example run on a pseudo-distributed cluster.

3.2 Configuring Hadoop (Pseudo Distributed Cluster)

Hadoop is designed to run on large computer clusters, but it can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process. (For more information and/or installation on a fully distributed cluster, see [5].)

1. First, use the following conf/hadoop-site.xml:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>localhost:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

2. Set up passphraseless ssh. Check that you can ssh to localhost without a passphrase:

    ssh localhost

   If you cannot ssh to localhost without a passphrase, execute the following commands:

    ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

3. Make the HDFS ready. Format a new distributed file system with the following command:

    bin/hadoop namenode -format

4. Start the Hadoop daemons:

    bin/start-all.sh

   The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

Browse the web interfaces for the NameNode and the JobTracker. By default they are available at:

    NameNode -
    JobTracker -

When you are done, stop the daemons with:

    bin/stop-all.sh

3.3 Running MapReduce WordCount Example

To run a Hadoop job, simply ssh into any of the JobTracker nodes on the cluster.

1. Copy the input data files into the distributed file system. If the data files are in the localinput/ directory, this is accomplished by executing:

    ./bin/hadoop dfs -put localinput dfsinput

   The files will then be copied into the DFS directory dfsinput. It is important to copy files into a well-named directory that is unique. These files can be viewed with the following command:

    ./bin/hadoop dfs -ls dir

   where dir is the name of the directory to be viewed. You can also use the following:

    ./bin/hadoop dfs -lsr dir

   to view the directories recursively. Note that all relative paths are resolved against your DFS home directory, that is, /user/$user/[dir]. Make sure that the DFS output directory does not already exist; otherwise you will be presented with an error and your job will not run. (This prevents the accidental overwriting of data, but can be overridden.)

   The data is now available to all of the worker machines.

2. Execute the job from a local jar file:

    bin/hadoop jar hadoop-*-examples.jar wordcount dfsinput wordcount-result

   The job is run across the worker machines, with input and intermediate data copied as needed. The output of the reduce stage is left in the wordcount-result directory.

3. Copy the result files to the directory localoutput on your local machine with:

    ./bin/hadoop dfs -get wordcount-result localoutput

The same example can be run on a single node with:

    bin/hadoop jar hadoop-*-examples.jar wordcount localinput localoutput

3.4 Seeing Job Progress

When you submit your job to run, a line will be printed saying:

    Running job: job_12345

where job_12345 stands for whatever name your job has been given. Further status information will be printed in that terminal as the job progresses. However, it is also possible to monitor a job, given its name, from any node in the cluster. This is done with the command:

    ./bin/hadoop job -status job_12345

Jobs can also be killed if necessary with:

    ./bin/hadoop job -kill job_
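When jobs are submitted from scripts, the job name has to be picked out of the client output before it can be passed to the status or kill command. The following Python sketch shows one way to do that. The helper names extract_job_id and job_command are hypothetical, and the pattern assumes the name always appears in the form "Running job: job_..." as in the line quoted above.

```python
import re

# Assumed format of the line printed at submission time (taken from the text above).
JOB_LINE = re.compile(r"Running job: (job_\w+)")


def extract_job_id(client_output):
    """Return the first job name announced in the client output, or None."""
    match = JOB_LINE.search(client_output)
    return match.group(1) if match else None


def job_command(action, job_id, hadoop="./bin/hadoop"):
    """Build the argument list for `hadoop job -status` or `hadoop job -kill`."""
    if action not in ("status", "kill"):
        raise ValueError("action must be 'status' or 'kill'")
    return [hadoop, "job", "-" + action, job_id]


output = "Running job: job_12345\n"
job_id = extract_job_id(output)
print(job_command("status", job_id))
# prints ['./bin/hadoop', 'job', '-status', 'job_12345']
```

The returned list can be handed to subprocess.run from the Hadoop installation directory. Building an argument list rather than a shell string avoids quoting problems if the job name ever contains unexpected characters.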

3.5 Eclipse Plug-In

While you are welcome to develop on other platforms, we recommend using Eclipse with IBM's Eclipse MapReduce plug-in.

1. The plug-in can be found under the conf/ directory of Hadoop.
2. If Eclipse is open, close it before proceeding.
3. Open Eclipse.
4. Place hadoop-eclipse-plug-in.jar directly into the plugins/ directory of Eclipse.
5. To use the MapReduce perspective, go to: Window > Open Perspective > Other... > Map/Reduce.
6. To enable the MapReduce servers window, go to: Window > Show View > Other... > Map Reduce Tools > Map Reduce Servers.

Once you have the plug-in working with Eclipse, you can add a new Hadoop server by:

1. Enabling the MapReduce Server view.
2. Clicking the blue elephant in the top right.
3. Completing the form in the New Hadoop Server Location window.

4 Further Information

For further information on Hadoop, please make sure to check out the Hadoop core web page [2] and the Hadoop wiki [6].

Happy coding!

References

[1] Hadoop.
[2] Hadoop Core.
[3] Hadoop MapReduce Tutorial. tutorial.html.
[4] Hadoop Releases.
[5] Hadoop Setup.
[6] Hadoop Wiki.
[7] MapReduce.
[8] The Hadoop Distributed File System: Architecture and Design. docs/current/hdfs design.html.
[9] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Sixth Symposium on Operating System Design and Implementation (OSDI '04), San Francisco, CA, December
[10] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1),