Using BAC Hadoop Cluster
Bodhisatta Barman Roy
January 16, 2015
Contents

1 Introduction
2 Daemon locations
3 Pre-requisites
4 Setting up
  4.1 Using a Linux Virtual Machine
  4.2 Ensuring access to cluster
  4.3 Setting up the necessary software/tools on your laptop
5 Writing your first Map/Reduce program: WordCount
  5.1 Fiddling with HDFS
  5.2 Writing a Map/Reduce program: WordCount
6 Other links
1 Introduction

The BAC cluster has been set up with Apache Hadoop 1.2.1. Please contact Prof. Tan Kian-Lee to seek approval to use MRv2.

This cluster has 7 nodes, bacn[0-6].comp.nus.edu.sg. bacn[0-5] are running CentOS 6.5, while bacn6 is running CentOS 7. Each node in the cluster has 40 GB of RAM and 2 units of 6-core 2.1 GHz Intel Xeon E5-2620, along with 4x1 TB [1] 7200 RPM SATA drives. The total usable space for the Hadoop Distributed File System (HDFS) is roughly 24 TB [2].

To give you an idea of how much data 24 TB can hold: the string "caffeine" uses 8 bytes (8 B), 1 byte for each character. The word "caffeine" can therefore be stored in the BAC HDFS about 3 trillion times [3]. For comparison, the Oxford English Dictionary has about 350 million printed words [4].

[1] TB = terabyte.
[2] http://bacn0.comp.nus.edu.sg:50070
[3] 24 TB / 8 B = (24 x 10^12 B) / (8 B) = 3 x 10^12.
[4] http://public.oed.com/history-of-the-oed/dictionary-facts/
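The capacity estimate above is easy to verify; a minimal back-of-the-envelope check in plain Java (using decimal terabytes, as HDFS status pages do):

```java
// Back-of-the-envelope check of the capacity figure above.
public class CapacityCheck {
    public static void main(String[] args) {
        long totalBytes = 24L * 1_000_000_000_000L; // 24 TB, decimal (10^12) terabytes
        long wordBytes = "caffeine".length();       // 8 bytes, 1 byte per ASCII character
        long copies = totalBytes / wordBytes;       // how many copies of the word fit
        System.out.println(copies);                 // 3000000000000, i.e. 3 trillion
    }
}
```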
2 Daemon locations

NN (NameNode), SNN (Secondary NameNode), JT (JobTracker): bacn0.comp.nus.edu.sg
DN (DataNode), TT (TaskTracker): bacn[0-6].comp.nus.edu.sg

A UNIX user "hadoop" runs all the daemons necessary to execute Map/Reduce programs on the cluster.
3 Pre-requisites

The cluster has been tested with the following:

1. Eclipse Kepler: Eclipse IDE for Java Developers (http://eclipse.org/downloads/packages/release/kepler/sr2)
2. Apache Hadoop 1.2.1 (https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/). You may download the archive named hadoop-1.2.1.tar.gz.
3. Eclipse HDFS plugin for MapReduce (https://spideroak.com/share/OB2WE3DJMNPXG5DVMZTA/so_stuff/home/jacob/SpiderOak%20Hive/public_stuff/hadoop-eclipse-plugin-1.2.1.jar)
4. Mac OS X or Ubuntu 14.04 x86-64.
5. SoC UNIXID. If you have a SoC UNIXID, you should have received a corresponding email address like <UNIXID>@comp.nus.edu.sg.
6. Drop an email to Bodhi <bodhi@comp.nus.edu.sg> with your UNIXID requesting access to HDFS.
4 Setting up

This section explains how to set up the software for writing MRv1 programs.

NOTE: Please ensure that you are within the SoC wireless/wired network. This means you are physically in Computing 1, Computing 2, or ICube level 3. Otherwise, you have to connect via the SoC WebVPN. The documentation for setting up the VPN is available at https://docs.comp.nus.edu.sg/node/5065 and is outside the scope of this tutorial.

4.1 Using a Linux Virtual Machine

The Hadoop cluster works under the assumption that you are using your SoC UNIX ID to access the cluster. To ensure this, it is advisable to use a Linux virtual machine; please take care that the userid on the VM is the same as your SoC UNIX ID. You can use either VMware Player or VirtualBox to set up a Linux virtual machine, with any Linux distro of your choice. The following links provide instructions for installing Ubuntu, a popular Linux distribution:

http://www.wikihow.com/install-ubuntu-on-virtualbox
http://wiki.opencog.org/w/setting_up_ubuntu_in_vmware_for_noobs

Please send your UNIX ID to Bodhi <bodhi@comp.nus.edu.sg> so that you can access the cluster.

4.2 Ensuring access to cluster

1. Mac and *nix users can open Terminal or a similar application and type ssh <UNIXID>@bacn0.comp.nus.edu.sg. Windows users can download an SSH client like PuTTY from www.putty.org [5]. In the Host Name text-field, type bacn0.comp.nus.edu.sg and click on Open.
2. Accept the SSH key fingerprint by typing yes or clicking on Yes, respectively.
3. For Mac/*nix users, type in your <UNIXID> password at the prompt and hit Enter. For Windows users, at the "login as:" prompt, type your UNIXID and hit Enter. The password is your <UNIXID> password.

[5] Something to lose sleep over: http://bit.ly/1xuaaq5
On successful login, you should see a prompt like [<UNIXID>@bacn0 ~]$. You can log out by closing the respective application.

4.3 Setting up the necessary software/tools on your laptop

1. Extract the Eclipse and Apache Hadoop archives onto your laptop. Note: if you are on Windows, you might consider downloading 7-Zip, a free utility to extract .tar.gz archives.
2. Copy the Eclipse plugin into the <Eclipse-directory>/plugins directory.
3. Start the Eclipse IDE by running the <Eclipse-directory>/eclipse executable.
4. Click on Window > Open Perspective > Other.
5. Click on Map/Reduce and then press OK.
6. On the bottom half of the screen, click on Map/Reduce Locations.
7. Click on the blue elephant icon on the bottom right (on hover, it says "New Hadoop Location").
8. Put in the following entries:
   (a) Location Name: any name of your choice, e.g. BAC Cluster.
   (b) Map/Reduce Master host: bacn0.comp.nus.edu.sg, port: 9001. This is the JobTracker address.
   (c) DFS Master: bacn0.comp.nus.edu.sg, port: 9000. This is the HDFS address.
   (e) User name: your SoC UNIXID.
   (e) Click on Finish.

You should be able to see the directories in HDFS on expanding the DFS Locations tree.
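The two "Master" entries above correspond to the standard Hadoop 1.x configuration keys fs.default.name (an hdfs:// URI) and mapred.job.tracker (a bare host:port pair). A small plain-Java sketch, just to make the distinction between the two endpoints concrete:

```java
import java.net.URI;

// The two endpoints configured in the Eclipse plugin, written out explicitly.
// Host and ports are the ones given in the steps above.
public class ClusterEndpoints {
    public static void main(String[] args) {
        // DFS Master -> fs.default.name (core-site.xml): a full hdfs:// URI
        URI dfs = URI.create("hdfs://bacn0.comp.nus.edu.sg:9000");
        // Map/Reduce Master -> mapred.job.tracker (mapred-site.xml): host:port, no scheme
        String jobTracker = "bacn0.comp.nus.edu.sg:9001";

        System.out.println(dfs.getHost() + " " + dfs.getPort());
        System.out.println(jobTracker);
    }
}
```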
5 Writing your first Map/Reduce program: WordCount

This section describes how to write your very first Map/Reduce program on the cluster. The objective is to count the number of occurrences of each word in a file. We shall download a file from the internet, upload it to HDFS, write the WordCount program, and run it on the cluster. The file in question is Shakespeare's Hamlet, available at http://www.gutenberg.org/ebooks/2265.txt.utf-8. Please note that the file is in plain text; marked-up formats like EPUB, Mobi etc. are out of scope for this tutorial.

5.1 Fiddling with HDFS

1. Download Shakespeare's Hamlet from the aforementioned link. Save it as hamlet.txt.
2. From Eclipse, expand the DFS location into your user directory. It should be /user/<UNIXID>.
3. Right-click on the directory, select Create new directory, provide the name input, and click OK.
4. Right-click on the directory and select Refresh. You should now see the directory input.
5. Right-click on input and select Upload files to DFS... Select hamlet.txt by navigating to the directory where it was saved. Right-click on input and select Refresh. On expanding input, the file hamlet.txt can be seen.

5.2 Writing a Map/Reduce program: WordCount

1. In Eclipse, create a new Map/Reduce project (from the menu bar on top) by clicking File > New > Other, selecting Map/Reduce project, and then clicking Next.
2. Type WordCount as the name of the project and then click on Finish. The project WordCount can be seen in the Project Explorer on the left. On expanding WordCount, observe that all the JARs necessary to write a Map/Reduce program have been added to the project automatically.
3. Select the directory named src, right-click it, then press New > Class. Type WordCount in the text-field marked Name and click on Finish.
4. Remove the contents of the file WordCount.java.
5. Copy the source code from http://pastie.org/9799572 and paste it inside WordCount.java.
6. From the menu bar on top, click on Run and then Run Configurations...
7. Click the tab named Arguments, and in the text-field named Program Arguments type hdfs://bacn0.comp.nus.edu.sg:9000/user/<unixid>/input/hamlet.txt hdfs://bacn0.comp.nus.edu.sg:9000/user/<unixid>/output. Please note that the two arguments are separated by a space character: the first gives the location of the input file, the second the output directory where the results of the program will be stored. Click on Apply and then Close.
8. In the Project Explorer, right-click on WordCount.java, go to Run as..., and select 2. Run on Hadoop. The program should now start compiling; a red square in the Console tab becomes active, and in a few moments some text appears in the Console. What happens in the background is this:
   (a) The classes are compiled and packaged into a single JAR called WordCount.jar, which is copied to the target cluster (the BAC cluster in our case) and executed there.
   (b) The output of the program is stored in HDFS inside /user/<UNIXID>/output.
9. Please wait until the red square in the Console tab becomes inactive (turns grey).
10. Refresh HDFS by right-clicking on /user/<UNIXID> and selecting Refresh. A new directory by the name output can be observed.
11. On expanding output, two new files can be seen: _SUCCESS and part-r-00000.
12. The result of the program can be seen in part-r-00000. In this case, the result is the number of occurrences of every word in the file hamlet.txt.

NOTE: If you wish to run this program again, ensure that the output directory doesn't exist in HDFS; otherwise, Hadoop will throw an error. To delete it, right-click on output > Delete and click on OK.
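The program you pasted follows the classic Hadoop WordCount pattern: the mapper emits a (word, 1) pair per token, the framework groups pairs by key (the shuffle/sort phase), and the reducer sums the counts for each key into the part-r-00000 output. The logic can be illustrated without the cluster by a plain-Java sketch (the class and variable names here are illustrative, not those of the actual pastie source):

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the Map/Reduce logic behind WordCount:
// the split/loop plays the role of the mapper emitting (word, 1); the
// in-memory TreeMap stands in for Hadoop's shuffle/sort; merge() with
// Integer::sum is the reducer's per-key summation.
public class WordCountSketch {
    public static void main(String[] args) {
        String text = "to be or not to be";
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.split("\\s+")) {  // map: one (word, 1) per token
            counts.merge(word, 1, Integer::sum);  // reduce: sum the 1s per key
        }
        // Print in the key-sorted, tab-separated form part-r-00000 files use
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Running on the cluster differs only in scale: the same per-key summation is spread across the TaskTrackers instead of a single in-memory map.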
6 Other links

The following links show the statuses of HDFS and M/R in the BAC cluster:

1. Hadoop M/R JobTracker state: http://bacn0.comp.nus.edu.sg:50030
2. Hadoop M/R TaskTracker state: http://bacn0.comp.nus.edu.sg:50060
3. HDFS state: http://bacn0.comp.nus.edu.sg:50070