Installation Guide Setting Up and Testing Hadoop on Mac By Ryan Tabora, Think Big Analytics


Think Big Analytics, San Antonio Rd, Suite 210, Mt. View, CA (650)

Table of Contents

OVERVIEW
PREREQUISITES
JAVA
PASSWORDLESS SSH
INSTALLING HADOOP
DOWNLOADING APACHE HADOOP
SETTING UP ENVIRONMENT VARIABLES
HADOOP CONFIGURATION FILES
FORMATTING THE NAMENODE
STARTING PSEUDO-DISTRIBUTED HADOOP
VERIFYING YOUR INSTALL
VIEWING THE WEB INTERFACES
RUNNING A MAPREDUCE JOB
STOPPING HADOOP
DEBUGGING MAPREDUCE JOBS IN HADOOP
WHAT IS STANDALONE MODE?
USING STANDALONE (LOCAL) MODE
SETTING UP ECLIPSE'S REMOTE DEBUGGER
IMPORTING THE PROJECT IN ECLIPSE
CREATING A REMOTE DEBUGGER
ATTACHING THE MAPREDUCE JOB TO THE REMOTE DEBUGGER

Overview

One of the first roadblocks many developers face when trying to learn about Hadoop is simply getting an installation of Hadoop working locally that they can use to test. In this post, I will show you how to install Hadoop on a Mac so you can start playing with Hadoop today. I'll also show you some tips and tricks on how to debug and test the distributed MapReduce jobs you create. Note that this setup is for development only; it is not a setup you would deploy in a production cluster.

Prerequisites

Java

To check whether Java is installed, type java -version in the Terminal. If you do not have Java installed, this will prompt the installation. Once Java is installed, you should see a response like this:

$ java -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b M3909)
Java HotSpot(TM) 64-Bit Server VM (build b01-434, mixed mode)

Passwordless SSH

First, make sure that SSH is enabled on your computer. You can turn SSH on by navigating to the System Preferences menu, then to Sharing, then to the Remote Login tab.

Figure 1 - Enabling SSH in Mac OS X

Now that SSH is enabled, we must set up passwordless SSH. This allows the Hadoop processes to talk to each other without requiring you to type in a password. On a computer where passwordless SSH is set up, you should be able to run ssh localhost with no prompting.

$ ssh localhost
$

If it requires a password, follow the next steps.
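The "no prompting" check can also be scripted. The following is a minimal sketch, assuming an OpenSSH client: the BatchMode option makes ssh fail instead of asking for a password, so the exit status shows whether passwordless login works. The function name can_ssh_without_password is made up for this illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of Hadoop or the original guide).
# Returns 0 if a key-based, prompt-free login to the host succeeds.
# BatchMode=yes tells OpenSSH to fail rather than prompt for a password.
can_ssh_without_password() {
    local host="${1:-localhost}"
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null
}
```

Running `can_ssh_without_password localhost && echo "passwordless SSH works"` prints the message only when no password is needed. On a first connection the check may fail until the host key has been accepted once interactively.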

First, generate a passwordless public/private key pair:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Then add the public key to the authorized keys:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Finally, add the key to the Mac OS X keychain:

$ ssh-add -K ~/.ssh/id_rsa

Now verify that you can ssh into localhost without a password. If this is your first time, you might have to enter your password once. Subsequent ssh localhost commands should not require a password.

Installing Hadoop

Downloading Apache Hadoop

The first thing you should know about downloading Hadoop is that there are many different versions and distributions available. The Hadoop ecosystem includes many different components, including HDFS, MapReduce, HBase, Hive, Pig, and more. For this reason, it can be very confusing to figure out exactly what you need to download in the first place. Essentially, you have two options:

1. Download the individual components directly from their respective Apache websites.
2. Download a complete distribution from a third-party website.

The advantage of downloading a distribution from a third-party vendor like MapR, Cloudera, or Hortonworks is that the distribution is supported and the components within it are guaranteed to work with each other. You will also find that some of these distributions have different feature sets (which may be the topic of a future blog post), and this can be confusing if you're not exactly sure what you want. If you know your requirements and you know you need Hadoop, it is important that you do your research and determine which distribution works best for you.

For the purpose of vendor neutrality, we will go through installing the HDFS and MapReduce components from the Apache website. You will want to find the latest stable release.

1. Navigate to the Apache Hadoop website and follow the link 'Download a release now!'
2. Choose any mirror from the list of mirrors.

3. You should see a list of folders that contain different versions of Hadoop. Choose the folder labeled stable. This folder contains the latest stable version of Hadoop that you can download.
4. Choose the download in the format hadoop-x.x.x.tar.gz.

Figure 2 - Downloading the Stable Apache Hadoop Release

As of late January 2013, the latest stable release is labeled hadoop tar.gz.

I am going to install Hadoop in the directory where I keep all of my code: ~/code/apache

Note that you need write permissions in the directory where you install Hadoop. The first thing I am going to do is move the download to that folder and un-tar it:

$ mv hadoop tar.gz ~/code/apache/
$ cd ~/code/apache/
$ tar -xvzf hadoop tar.gz
x hadoop-1.0.4/
x hadoop-1.0.4/bin/
x hadoop-1.0.4/c++/
x hadoop-1.0.4/c++/linux-amd64-64/
...
x hadoop-1.0.4/src/contrib/ec2/bin/launch-hadoop-cluster
x hadoop-1.0.4/src/contrib/ec2/bin/launch-hadoop-master
x hadoop-1.0.4/src/contrib/ec2/bin/launch-hadoop-slaves
x hadoop-1.0.4/src/contrib/ec2/bin/list-hadoop-clusters
x hadoop-1.0.4/src/contrib/ec2/bin/terminate-hadoop-cluster

Then I create a soft link to it so I can easily reference it in the future:

$ ln -s hadoop hadoop

Create Name and Data Directories:

$ cd ~/code/apache
$ mkdir dfs
$ mkdir ./dfs/name
$ mkdir ./dfs/data

Setting Up Environment Variables

I like to create some environment variables in my ~/.bash_profile file so that when I start a new terminal I can easily reference my Hadoop installation. Open or create your ~/.bash_profile file and add the following lines to it.

# Generic
export CODE_BASE=~/code
# Apache
export APACHE_INSTALL_BASE=$CODE_BASE/apache
# Install Base
export INSTALL_BASE=$APACHE_INSTALL_BASE
# Hadoop
export HADOOP_INSTALL=$INSTALL_BASE/hadoop
export HADOOP_PREFIX=$HADOOP_INSTALL
export PATH=$PATH:$HADOOP_INSTALL/bin
# Java
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
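After opening a new terminal, it can be worth confirming that the variables resolved as intended. The function below is a small sketch for that purpose; the name check_hadoop_env and its messages are made up for this illustration, and it only checks the two variables the later steps depend on most.

```shell
#!/bin/bash
# Hypothetical sanity check (not part of Hadoop): prints one line per
# problem found and returns 0 only when the environment looks usable.
check_hadoop_env() {
    local status=0
    if [ -z "$HADOOP_INSTALL" ]; then
        echo "HADOOP_INSTALL is not set"
        status=1
    elif [ ! -d "$HADOOP_INSTALL/bin" ]; then
        echo "no bin directory under $HADOOP_INSTALL"
        status=1
    fi
    if [ -z "$JAVA_HOME" ]; then
        echo "JAVA_HOME is not set"
        status=1
    fi
    return $status
}
```

Calling `check_hadoop_env` in a fresh terminal should print nothing if ~/.bash_profile was picked up and the soft link points at a real Hadoop directory.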

Defining these environment variables also gives me flexibility when I have multiple versions of Hadoop set up on my computer. I can simply change a few variables to point to a different installation of Hadoop.

Hadoop Configuration Files

Hadoop comes with configuration files you can use to change how Hadoop runs. With these configuration files, you can run Hadoop in Standalone, Pseudo-Distributed, or Fully Distributed operation. In Standalone, Hadoop runs as a single process and uses the local file system to store files. In Pseudo-Distributed, Hadoop runs as several processes and uses HDFS to store files on a single node. In Fully Distributed, Hadoop runs as several processes and uses HDFS to store files across many nodes. We will set up Hadoop using Pseudo-Distributed operation.

The files I am describing are located in the hadoop/conf directory. We will work on four of these files: core-site.xml, hadoop-env.sh, hdfs-site.xml, and mapred-site.xml.

$ cd hadoop/conf

Figure 3 - Hadoop Configuration Files

We are going to create one set of local configuration files and one set of pseudo-distributed configuration files so that we can switch between the two modes with a simple script.

First let's rename the default configuration files.

$ mv hdfs-site.xml hdfs-site-local.xml
$ mv core-site.xml core-site-local.xml
$ mv mapred-site.xml mapred-site-local.xml

Now let's edit these files. Edit core-site-local.xml so that it contains the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>

Verify hdfs-site-local.xml contains the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>

Edit mapred-site-local.xml so that it contains the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>

Create a file called hdfs-site-pseudo.xml and write the following as its contents (make sure to replace /Users/ryantabora/code/apache with wherever you created the dfs/data and dfs/name directories):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/Users/ryantabora/code/apache/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/Users/ryantabora/code/apache/dfs/data</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

Create a file called mapred-site-pseudo.xml and write the following as its contents:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1536m</value>
  </property>
</configuration>

Create a file called core-site-pseudo.xml and write the following as its contents:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

Now rename the hadoop-env.sh script to hadoop-env-pseudo.sh:

$ mv hadoop-env.sh hadoop-env-pseudo.sh

Add the following lines to the hadoop-env-pseudo.sh script (you can put them anywhere, but I put them as the first two lines in the script).

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

Create a hadoop-env-local.sh script that contains only the following:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
export HADOOP_OPTS="$HADOOP_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"

Next we will create a script that allows us to change between local and pseudo-distributed mode. Create a file called configure.sh and write the following as its contents:

#!/bin/bash
# Switch the active Hadoop configuration by re-pointing symlinks.
ARG=$1
if [ "$ARG" == 'local' ]; then
    # Remove the current links, then link the local-mode files.
    rm -f mapred-site.xml
    rm -f core-site.xml
    rm -f hdfs-site.xml
    rm -f hadoop-env.sh
    ln -s mapred-site-local.xml mapred-site.xml
    ln -s core-site-local.xml core-site.xml
    ln -s hdfs-site-local.xml hdfs-site.xml
    ln -s hadoop-env-local.sh hadoop-env.sh
    chmod 755 hadoop-env.sh
    exit 0
elif [ "$ARG" == 'pseudo' ]; then
    # Remove the current links, then link the pseudo-distributed files.
    rm -f ./mapred-site.xml
    rm -f ./core-site.xml
    rm -f ./hdfs-site.xml
    rm -f hadoop-env.sh
    ln -s mapred-site-pseudo.xml mapred-site.xml
    ln -s core-site-pseudo.xml core-site.xml
    ln -s hdfs-site-pseudo.xml hdfs-site.xml
    ln -s hadoop-env-pseudo.sh hadoop-env.sh
    chmod 755 hadoop-env.sh
    exit 0
else
    # Unknown or missing argument: report usage and signal failure.
    echo "Enter either pseudo or local"
    exit 1
fi

Now make sure that the script is executable:

$ chmod 755 configure.sh

To set the configuration files to local mode, run

$ ./configure.sh local

To set the configuration files to pseudo-distributed mode, run

$ ./configure.sh pseudo

To start, let's set them to pseudo-distributed mode.

$ ./configure.sh pseudo
Ryan-Taboras-MacBook-Air:conf ryantabora$ ls -l
total
ryantabora staff 7457 Oct 2 22:17 capacity-scheduler.xml
1 ryantabora staff 535 Oct 2 22:17 configuration.xsl
-rwxr-xr-x 1 ryantabora staff 726 Jan 22 09:49 configure.sh
1 ryantabora staff 259 Jan 22 08:29 core-site-local.xml
-rw-r--r-- 1 ryantabora staff 273 Jan 21 11:16 core-site-pseudo.xml
lrwxr-xr-x 1 ryantabora staff 20 Jan 22 09:56 core-site.xml -> core-site-pseudo.xml
1 ryantabora staff 327 Oct 2 22:17 fair-scheduler.xml
-rw-r--r-- 1 ryantabora staff 2537 Jan 22 09:43 hadoop-env-backup.sh
-rwxr-xr-x 1 ryantabora staff 735 Jan 22 09:42 hadoop-env-local.sh
-rwxr-xr-x 1 ryantabora staff 634 Jan 22 09:43 hadoop-env-pseudo.sh
lrwxr-xr-x 1 ryantabora staff 20 Jan 22 09:56 hadoop-env.sh -> hadoop-env-pseudo.sh
1 ryantabora staff 1488 Oct 2 22:17 hadoop-metrics2.properties
1 ryantabora staff 4644 Oct 2 22:17 hadoop-policy.xml
1 ryantabora staff 178 Oct 2 22:17 hdfs-site-local.xml
-rw-r--r-- 1 ryantabora staff 549 Jan 21 11:16 hdfs-site-pseudo.xml
lrwxr-xr-x 1 ryantabora staff 20 Jan 22 09:56 hdfs-site.xml -> hdfs-site-pseudo.xml
1 ryantabora staff 4441 Oct 2 22:17 log4j.properties
1 ryantabora staff 2033 Oct 2 22:17 mapred-queue-acls.xml

1 ryantabora staff 259 Jan 22 08:59 mapred-site-local.xml
-rw-r--r-- 1 ryantabora staff 357 Jan 21 11:16 mapred-site-pseudo.xml
lrwxr-xr-x 1 ryantabora staff 22 Jan 22 09:56 mapred-site.xml -> mapred-site-pseudo.xml
1 ryantabora staff 10 Oct 2 22:17 masters
1 ryantabora staff 10 Oct 2 22:17 slaves
1 ryantabora staff 1243 Oct 2 22:17 ssl-client.xml.example
1 ryantabora staff 1195 Oct 2 22:17 ssl-server.xml.example
1 ryantabora staff 382 Oct 2 22:17 taskcontroller.cfg

Formatting the NameNode

Now open a new terminal window (so that the variables set by ~/.bash_profile are updated) and run the following command to format the NameNode. You should only need to run this command once. If you ever need to run it again, be aware that you will lose all of the data you currently have in HDFS.

$ hadoop namenode -format
13/01/21 10:25:46 INFO namenode.namenode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host =
STARTUP_MSG: args = [-format]
STARTUP_MSG: version =
STARTUP_MSG: build = -r ; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
Re-format filesystem in /Users/ryantabora/code/apache/dfs/name? (Y or N) Y
13/01/21 10:25:49 INFO util.gset: VM type = 64-bit
13/01/21 10:25:49 INFO util.gset: 2% max memory = MB
13/01/21 10:25:49 INFO util.gset: capacity = 2^21 = entries
13/01/21 10:25:49 INFO util.gset: recommended= , actual=
13/01/21 10:25:49 INFO namenode.fsnamesystem: fsowner=
13/01/21 10:25:49 INFO namenode.fsnamesystem: supergroup=supergroup
13/01/21 10:25:49 INFO namenode.fsnamesystem: ispermissionenabled=false
13/01/21 10:25:49 INFO namenode.fsnamesystem: dfs.block.invalidate.limit=100
13/01/21 10:25:49 INFO namenode.fsnamesystem: isaccesstokenenabled=false accesskeyupdateinterval=0 min(s), accesstokenlifetime=0 min(s)
13/01/21 10:25:49 INFO namenode.namenode: Caching file names occuring more than 10 times
13/01/21 10:25:50 INFO common.storage: Image file of size 116 saved in 0 seconds.
13/01/21 10:25:50 INFO common.storage: Storage directory /Users/ryantabora/code/apache/dfs/name has been successfully formatted.
13/01/21 10:25:50 INFO namenode.namenode: SHUTDOWN_MSG:
/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at
************************************************************/

Starting Pseudo-Distributed Hadoop

You can use the start-all script to start all of the HDFS and MapReduce processes at once.

$ ~/code/apache/hadoop/bin/start-all.sh
starting namenode, logging to /Users/ryantabora/code/apache/hadoop /libexec/../logs/
localhost: starting datanode, logging to /Users/ryantabora/code/apache/hadoop-1.0.4/libexec/../logs
localhost: starting secondarynamenode, logging to /Users/ryantabora/code/apache/hadoop-1.0.4/libexec/../logs/
starting jobtracker, logging to /Users/ryantabora/code/apache/hadoop /libexec/../logs/
localhost: starting tasktracker, logging to /Users/ryantabora/code/apache/hadoop-1.0.4/libexec/../logs/

Verifying Your Install

First run jps to verify that all five services are running.

$ jps
NameNode
SecondaryNameNode
2592 Jps
TaskTracker
DataNode
JobTracker

List the root directory in HDFS.

$ hadoop fs -ls /
Found 1 items
drwxr-xr-x - ryantabora supergroup :11 /tmp

Make a directory called test in HDFS.

$ hadoop fs -mkdir /test

Make sure the test directory exists.

$ hadoop fs -ls /
Found 2 items
drwxr-xr-x - ryantabora supergroup :18 /test
drwxr-xr-x - ryantabora supergroup :11 /tmp

Viewing the Web Interfaces

NameNode

You can learn a little bit about the state of HDFS by visiting the NameNode UI at the following URL:

Figure 4 - NameNode UI

JobTracker

You can monitor the MapReduce jobs you are running by visiting the JobTracker UI at the following link:

Figure 5 - JobTracker UI

TaskTracker

Finally, you can monitor the TaskTrackers by visiting the TaskTracker UI at the following link:

Figure 6 - TaskTracker UI

Running a MapReduce job

Now let's run an example MapReduce job to verify that it works. The following is a MapReduce job included in the Hadoop download. It uses MapReduce to estimate the value of pi. Check out the JobTracker UI while it is running, or monitor the Terminal to view the job's progress.

$ hadoop jar ~/code/apache/hadoop/hadoop-examples jar pi
Number of Maps = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
13/01/21 09:19:01 INFO mapred.fileinputformat: Total input paths to process : 10
13/01/21 09:19:02 INFO mapred.jobclient: Running job: job_ _
13/01/21 09:19:03 INFO mapred.jobclient: map 0% reduce 0%
13/01/21 09:19:16 INFO mapred.jobclient: map 20% reduce 0%
13/01/21 09:19:25 INFO mapred.jobclient: map 40% reduce 0%
13/01/21 09:19:28 INFO mapred.jobclient: map 40% reduce 3%
13/01/21 09:19:31 INFO mapred.jobclient: map 60% reduce 6%
13/01/21 09:19:34 INFO mapred.jobclient: map 60% reduce 13%
13/01/21 09:19:37 INFO mapred.jobclient: map 80% reduce 13%
13/01/21 09:19:43 INFO mapred.jobclient: map 100% reduce 20%
13/01/21 09:19:55 INFO mapred.jobclient: map 100% reduce 100%
13/01/21 09:20:00 INFO mapred.jobclient: Job complete: job_ _
13/01/21 09:20:00 INFO mapred.jobclient: Counters: 27
13/01/21 09:20:00 INFO mapred.jobclient: Job Counters
13/01/21 09:20:00 INFO mapred.jobclient: Launched reduce tasks=1
13/01/21 09:20:00 INFO mapred.jobclient: SLOTS_MILLIS_MAPS=
13/01/21 09:20:00 INFO mapred.jobclient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/21 09:20:00 INFO mapred.jobclient: Total time spent by all maps waiting after reserving slots (ms)=0
13/01/21 09:20:00 INFO mapred.jobclient: Launched map tasks=10
13/01/21 09:20:00 INFO mapred.jobclient: Data-local map tasks=10
13/01/21 09:20:00 INFO mapred.jobclient: SLOTS_MILLIS_REDUCES=
13/01/21 09:20:00 INFO mapred.jobclient: File Input Format Counters
13/01/21 09:20:00 INFO mapred.jobclient: Bytes Read=
13/01/21 09:20:00 INFO mapred.jobclient: File Output Format Counters
13/01/21 09:20:00 INFO mapred.jobclient: Bytes Written=97
13/01/21 09:20:00 INFO mapred.jobclient: FileSystemCounters
13/01/21 09:20:00 INFO mapred.jobclient: FILE_BYTES_READ=226
13/01/21 09:20:00 INFO mapred.jobclient: HDFS_BYTES_READ=
13/01/21 09:20:00 INFO mapred.jobclient: FILE_BYTES_WRITTEN=
13/01/21 09:20:00 INFO mapred.jobclient: HDFS_BYTES_WRITTEN=215
13/01/21 09:20:00 INFO mapred.jobclient: Map-Reduce Framework
13/01/21 09:20:00 INFO mapred.jobclient: Map output materialized bytes=280
13/01/21 09:20:00 INFO mapred.jobclient: Map input records=10
13/01/21 09:20:00 INFO mapred.jobclient: Reduce shuffle bytes=252
13/01/21 09:20:00 INFO mapred.jobclient: Spilled Records=40
13/01/21 09:20:00 INFO mapred.jobclient: Map output bytes=180
13/01/21 09:20:00 INFO mapred.jobclient: Total committed heap usage (bytes)=
13/01/21 09:20:00 INFO mapred.jobclient: Map input bytes=240
13/01/21 09:20:00 INFO mapred.jobclient: Combine input records=0
13/01/21 09:20:00 INFO mapred.jobclient: SPLIT_RAW_BYTES=
13/01/21 09:20:00 INFO mapred.jobclient: Reduce input records=20
13/01/21 09:20:00 INFO mapred.jobclient: Reduce input groups=20
13/01/21 09:20:00 INFO mapred.jobclient: Combine output records=0
13/01/21 09:20:00 INFO mapred.jobclient: Reduce output records=0
13/01/21 09:20:00 INFO mapred.jobclient: Map output records=20
Job Finished in seconds
Estimated value of Pi is

As you can see, the job completed successfully. Congratulations, at this point you have a fully functional Pseudo-Distributed Hadoop installation running on your Mac! Now, to figure out how to debug some MapReduce jobs.
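For intuition about what the pi example computes, here is a single-process sketch of the underlying idea: sample points in the unit square and count how many fall inside the quarter circle. This is an illustration only, not the bundled example's code. It uses awk's plain pseudo-random generator rather than whatever sampling scheme the Hadoop job uses, and the function name estimate_pi and its seed argument are made up for this sketch.

```shell
#!/bin/bash
# Hypothetical illustration: Monte Carlo estimate of pi in one process.
# Usage: estimate_pi SAMPLES [SEED]
estimate_pi() {
    local samples="${1:-100000}"
    local seed="${2:-42}"
    awk -v n="$samples" -v seed="$seed" 'BEGIN {
        srand(seed)
        inside = 0
        for (i = 0; i < n; i++) {
            x = rand()
            y = rand()
            # Count points that land inside the unit quarter circle.
            if (x * x + y * y <= 1) inside++
        }
        # Quarter-circle area is pi/4, so scale the hit ratio by 4.
        printf "%.4f\n", 4 * inside / n
    }'
}
```

Calling `estimate_pi 100000` prints an approximation that gets closer to pi as the sample count grows. In the Hadoop job, the sampling work is split across the map tasks and the counts are combined in the reduce step.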

Stopping Hadoop

You can stop Hadoop by running the stop-all.sh script provided in the bin directory of the Hadoop install.

$ cd $HADOOP_PREFIX/bin
$ ./stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
$ jps
9510 Jps

Debugging MapReduce jobs in Hadoop

Debugging distributed jobs can be a very difficult process. In a Pseudo-Distributed or Distributed Hadoop cluster, each Map and Reduce task runs in its own JVM (unless you configure JVM reuse, in which case you still have more than one JVM). This makes debugging difficult because debuggers generally attach to only a single JVM.

What Is Standalone Mode?

Unlike Pseudo-Distributed and Distributed modes, which run Hadoop as a set of processes in many JVMs, Standalone mode runs everything in a single JVM as you invoke Hadoop commands. This means that you do not start Standalone mode in the same way, and you will not see the DataNode, NameNode, or other processes when you run jps. Running everything in a single JVM is great for debugging, because we can attach a remote debugger in an IDE like Eclipse and step through our MapReduce code line by line.

Using Standalone (Local) Mode

To solve the issue of multiple JVMs, we can run Hadoop in Standalone mode. We set up some configuration files earlier so that we could easily switch between Pseudo-Distributed and Standalone mode. First make sure that Hadoop in Pseudo-Distributed mode is shut down.

$ jps
9510 Jps

Now switch the configuration to local mode.

$ cd $HADOOP_PREFIX/conf
$ ./configure.sh local
$ ls -l
total
ryantabora staff 7457 Oct 2 22:17 capacity-scheduler.xml
1 ryantabora staff 535 Oct 2 22:17 configuration.xsl
-rwxr-xr-x 1 ryantabora staff 726 Jan 22 09:49 configure.sh
1 ryantabora staff 259 Jan 22 08:29 core-site-local.xml
-rw-r--r-- 1 ryantabora staff 273 Jan 21 11:16 core-site-pseudo.xml
lrwxr-xr-x 1 ryantabora staff 19 Jan 22 10:17 core-site.xml -> core-site-local.xml
1 ryantabora staff 327 Oct 2 22:17 fair-scheduler.xml
-rw-r--r-- 1 ryantabora staff 2537 Jan 22 09:43 hadoop-env-backup.sh
-rwxr-xr-x 1 ryantabora staff 735 Jan 22 09:42 hadoop-env-local.sh
-rwxr-xr-x 1 ryantabora staff 634 Jan 22 09:43 hadoop-env-pseudo.sh
lrwxr-xr-x 1 ryantabora staff 19 Jan 22 10:17 hadoop-env.sh -> hadoop-env-local.sh
1 ryantabora staff 1488 Oct 2 22:17 hadoop-metrics2.properties
1 ryantabora staff 4644 Oct 2 22:17 hadoop-policy.xml
1 ryantabora staff 178 Oct 2 22:17 hdfs-site-local.xml
-rw-r--r-- 1 ryantabora staff 549 Jan 21 11:16 hdfs-site-pseudo.xml
lrwxr-xr-x 1 ryantabora staff 19 Jan 22 10:17 hdfs-site.xml -> hdfs-site-local.xml
1 ryantabora staff 4441 Oct 2 22:17 log4j.properties
1 ryantabora staff 2033 Oct 2 22:17 mapred-queue-acls.xml
1 ryantabora staff 259 Jan 22 08:59 mapred-site-local.xml
-rw-r--r-- 1 ryantabora staff 357 Jan 21 11:16 mapred-site-pseudo.xml
lrwxr-xr-x 1 ryantabora staff 21 Jan 22 10:17 mapred-site.xml -> mapred-site-local.xml
1 ryantabora staff 10 Oct 2 22:17 masters
1 ryantabora staff 10 Oct 2 22:17 slaves
1 ryantabora staff 1243 Oct 2 22:17 ssl-client.xml.example
1 ryantabora staff 1195 Oct 2 22:17 ssl-server.xml.example
1 ryantabora staff 382 Oct 2 22:17 taskcontroller.cfg

Setting Up Eclipse's Remote Debugger

I will assume you've already got Eclipse downloaded. I'm running the J2EE Juno version. In this blog post I've included an example WordCount project you can use to test out the remote debugging feature of Eclipse.

Importing the Project in Eclipse

First let's import the project into Eclipse. Select File -> Import.

Figure 7 - Importing a Project

Select General -> Existing Projects into Workspace, then click Next.

Figure 8 - Selecting an Existing Project

Then browse to the downloaded exercise folder. Select MapReduce-WordCount as the project you want to import, then click Finish.

Figure 9 - Importing the MapReduce-WordCount Project

Now you should be able to see all of the files on the left-hand side of the Eclipse application.

Figure 10 - Project Contents

Creating a Remote Debugger

Let's find the Mapper and place a breakpoint there. Find the SimpleWordCountMapper.java class and place a breakpoint on one of the lines. I put a breakpoint next to

String[] tokens = line.toString().split("\\s+");

Figure 11 - Placing a Breakpoint

Now let's set up the remote debugger. Go to the Run menu and select Debug Configurations.

Figure 12 - Debug Configurations Menu in Eclipse

Then create a new Remote Java Application debugger by right-clicking Remote Java Application and selecting New.

Figure 13 - Creating a Remote Debugger

Make sure the remote debugger has connection type Standard (Socket Attach) and points to host localhost and port 8000.

Figure 14 - Setting up the Remote Debugger

Attaching the MapReduce Job to the Remote Debugger

First we must build the jar for the project. Navigate to the project directory and execute ant. You should see a successful build, which creates an executable jar called wordcount.jar in the ./bin/jar directory.

$ cd ~/Desktop/exercise/MapReduce-WordCount/
$ ant
Buildfile: /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/build.xml
junit.clean:
clean:
[delete] Deleting directory /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin
build-dirs:
[mkdir] Created dir: /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin
[mkdir] Created dir: /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin/classes
compile:
[javac] Compiling 3 source files to /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin/classes
[javac] Compiling 1 source file to /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin/classes
jar:
[mkdir] Created dir: /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin/jar
[jar] Building jar: /Users/ryantabora/Desktop/exercise/MapReduce-WordCount/bin/jar/wordcount.jar
junit:
[junit] Running com.thinkbiganalytics.ex.wordcount.simplewordcounttest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: sec
junit.clean:
test:
main:
BUILD SUCCESSFUL
Total time: 2 seconds
$ ls ./bin/jar/
wordcount.jar

Now, let's run the MapReduce job. We do so by calling hadoop in the following format:

hadoop jar <path_to_jar> <input> <output>

You will see that it waits for the debugger to attach on port 8000. Make sure that the folder you specify as the output does not exist yet.
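The requirement that the output folder must not already exist is easy to forget between runs. One way to guard against it is a small wrapper that refuses to launch the job when the path is already present. This is a sketch only; the function name run_if_output_absent is made up for this illustration and is not part of Hadoop.

```shell
#!/bin/bash
# Hypothetical wrapper: runs the given command only if OUTPUT does not exist.
# Usage: run_if_output_absent OUTPUT COMMAND [ARGS...]
run_if_output_absent() {
    local output="$1"
    shift
    if [ -e "$output" ]; then
        echo "refusing to run: $output already exists" >&2
        return 1
    fi
    # Run the remaining arguments as the command, unchanged.
    "$@"
}
```

For example, `run_if_output_absent ~/Desktop/output hadoop jar <path_to_jar> <input> ~/Desktop/output` starts the job only when ~/Desktop/output is absent, and otherwise prints a message and returns a non-zero status.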

$ hadoop jar ~/Desktop/exercise/MapReduce-WordCount/bin/jar/wordcount.jar beowulf.txt ~/Desktop/output
Listening for transport dt_socket at address: 8000

Now go back to Eclipse -> Run -> Debug Configurations -> Remote Java Application -> SimpleWordCount and choose Debug. The MapReduce job will start to run. Once it hits the breakpoint in the Mapper, Eclipse will open the Debug perspective and you can start debugging.

Figure 15 - Debugging MapReduce

Congratulations! You have now debugged a MapReduce job! As you step through, you should see the console display the Map and Reduce task status. Eventually it will finish and you can view your results.

13/01/22 11:23:07 WARN util.nativecodeloader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/01/22 11:23:07 WARN snappy.loadsnappy: Snappy native library not loaded
13/01/22 11:23:07 INFO mapred.fileinputformat: Total input paths to process : 1
13/01/22 11:23:07 INFO mapred.jobclient: Running job: job_local_
13/01/22 11:23:07 INFO mapred.task: Using ResourceCalculatorPlugin : null
13/01/22 11:23:07 INFO mapred.maptask: numreducetasks: 1
13/01/22 11:23:07 INFO mapred.maptask: io.sort.mb =
13/01/22 11:23:08 INFO mapred.maptask: data buffer = /
13/01/22 11:23:08 INFO mapred.maptask: record buffer = /
13/01/22 11:23:08 INFO mapred.maptask: Starting flush of map output
13/01/22 11:23:08 INFO mapred.jobclient: map 0% reduce 0%
13/01/22 11:23:08 INFO mapred.maptask: Finished spill 0
13/01/22 11:23:08 INFO mapred.task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

13/01/22 11:23:10 INFO mapred.localjobrunner: file:/users/ryantabora/desktop/exercise/mapreduce-WordCount/beowulf.txt:
13/01/22 11:23:10 INFO mapred.task: Task 'attempt_local_0001_m_000000_0' done.
13/01/22 11:23:10 INFO mapred.task: Using ResourceCalculatorPlugin : null
13/01/22 11:23:10 INFO mapred.localjobrunner:
13/01/22 11:23:10 INFO mapred.merger: Merging 1 sorted segments
13/01/22 11:23:10 INFO mapred.merger: Down to the last merge-pass, with 1 segments left of total size: bytes
13/01/22 11:23:10 INFO mapred.localjobrunner:
13/01/22 11:23:11 INFO mapred.task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/01/22 11:23:11 INFO mapred.localjobrunner:
13/01/22 11:23:11 INFO mapred.task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/01/22 11:23:11 INFO mapred.fileoutputcommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/users/ryantabora/desktop/output
13/01/22 11:23:11 INFO mapred.jobclient: map 100% reduce 0%
13/01/22 11:23:13 INFO mapred.localjobrunner: reduce > reduce
13/01/22 11:23:13 INFO mapred.task: Task 'attempt_local_0001_r_000000_0' done.
13/01/22 11:23:14 INFO mapred.jobclient: map 100% reduce 100%
13/01/22 11:23:14 INFO mapred.jobclient: Job complete: job_local_
13/01/22 11:23:14 INFO mapred.jobclient: Counters: 18
13/01/22 11:23:14 INFO mapred.jobclient: File Input Format Counters
13/01/22 11:23:14 INFO mapred.jobclient: Bytes Read=
13/01/22 11:23:14 INFO mapred.jobclient: File Output Format Counters
13/01/22 11:23:14 INFO mapred.jobclient: Bytes Written=
13/01/22 11:23:14 INFO mapred.jobclient: FileSystemCounters
13/01/22 11:23:14 INFO mapred.jobclient: FILE_BYTES_READ=
13/01/22 11:23:14 INFO mapred.jobclient: FILE_BYTES_WRITTEN=
13/01/22 11:23:14 INFO mapred.jobclient: Map-Reduce Framework
13/01/22 11:23:14 INFO mapred.jobclient: Map output materialized bytes=
13/01/22 11:23:14 INFO mapred.jobclient: Map input records=
13/01/22 11:23:14 INFO mapred.jobclient: Reduce shuffle bytes=0
13/01/22 11:23:14 INFO mapred.jobclient: Spilled Records=
13/01/22 11:23:14 INFO mapred.jobclient: Map output bytes=
13/01/22 11:23:14 INFO mapred.jobclient: Total committed heap usage (bytes)=
13/01/22 11:23:14 INFO mapred.jobclient: Map input bytes=
13/01/22 11:23:14 INFO mapred.jobclient: SPLIT_RAW_BYTES=124
13/01/22 11:23:14 INFO mapred.jobclient: Combine input records=0
13/01/22 11:23:14 INFO mapred.jobclient: Reduce input records=
13/01/22 11:23:14 INFO mapred.jobclient: Reduce input groups=
13/01/22 11:23:14 INFO mapred.jobclient: Combine output records=0
13/01/22 11:23:14 INFO mapred.jobclient: Reduce output records=
13/01/22 11:23:14 INFO mapred.jobclient: Map output records=42089

$ head ~/Desktop/output/part
"'tis 2
"_excellently 1
"_i 1
"_nobly 1
"art 1
"ask 1
"beowulf 2
"brilliant" 1
"defects," 1
"folk 1

Hadoop Installation Tutorial (Hadoop 1.x)

Hadoop Installation Tutorial (Hadoop 1.x) Contents Download and install Java JDK... 1 Download the Hadoop tar ball... 1 Update $HOME/.bashrc... 3 Configuration of Hadoop in Pseudo Distributed Mode... 4 Format the newly created cluster to create

More information

Hadoop (pseudo-distributed) installation and configuration

Hadoop (pseudo-distributed) installation and configuration Hadoop (pseudo-distributed) installation and configuration 1. Operating systems. Linux-based systems are preferred, e.g., Ubuntu or Mac OS X. 2. Install Java. For Linux, you should download JDK 8 under

More information

HADOOP - MULTI NODE CLUSTER

HADOOP - MULTI NODE CLUSTER HADOOP - MULTI NODE CLUSTER http://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm Copyright tutorialspoint.com This chapter explains the setup of the Hadoop Multi-Node cluster on a distributed

More information

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g.

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g. Big Data Computing Instructor: Prof. Irene Finocchi Master's Degree in Computer Science Academic Year 2013-2014, spring semester Installing Hadoop Emanuele Fusco (fusco@di.uniroma1.it) Prerequisites You

More information

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications.

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications. Lab 9: Hadoop Development The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications. Introduction Hadoop can be run in one of three modes: Standalone

More information

Easily parallelize existing application with Hadoop framework Juan Lago, July 2011

Easily parallelize existing application with Hadoop framework Juan Lago, July 2011 Easily parallelize existing application with Hadoop framework Juan Lago, July 2011 There are three ways of installing Hadoop: Standalone (or local) mode: no deamons running. Nothing to configure after

More information

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster Setup Hadoop On Ubuntu Linux ---Multi-Node Cluster We have installed the JDK and Hadoop for you. The JAVA_HOME is /usr/lib/jvm/java/jdk1.6.0_22 The Hadoop home is /home/user/hadoop-0.20.2 1. Network Edit

More information

Hands-on Exercises with Big Data

Hands-on Exercises with Big Data Hands-on Exercises with Big Data Lab Sheet 1: Getting Started with MapReduce and Hadoop The aim of this exercise is to learn how to begin creating MapReduce programs using the Hadoop Java framework. In

More information

HSearch Installation

HSearch Installation To configure HSearch you need to install Hadoop, Hbase, Zookeeper, HSearch and Tomcat. 1. Add the machines ip address in the /etc/hosts to access all the servers using name as shown below. 2. Allow all

More information

Running Kmeans Mapreduce code on Amazon AWS

Running Kmeans Mapreduce code on Amazon AWS Running Kmeans Mapreduce code on Amazon AWS Pseudo Code Input: Dataset D, Number of clusters k Output: Data points with cluster memberships Step 1: for iteration = 1 to MaxIterations do Step 2: Mapper:

More information

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has

More information

Kognitio Technote Kognitio v8.x Hadoop Connector Setup

Kognitio Technote Kognitio v8.x Hadoop Connector Setup Kognitio Technote Kognitio v8.x Hadoop Connector Setup For External Release Kognitio Document No Authors Reviewed By Authorised By Document Version Stuart Watt Date Table Of Contents Document Control...

More information

IDS 561 Big data analytics Assignment 1

IDS 561 Big data analytics Assignment 1 IDS 561 Big data analytics Assignment 1 Due Midnight, October 4th, 2015 General Instructions The purpose of this tutorial is (1) to get you started with Hadoop and (2) to get you acquainted with the code

More information

HADOOP. Installation and Deployment of a Single Node on a Linux System. Presented by: Liv Nguekap And Garrett Poppe

HADOOP. Installation and Deployment of a Single Node on a Linux System. Presented by: Liv Nguekap And Garrett Poppe HADOOP Installation and Deployment of a Single Node on a Linux System Presented by: Liv Nguekap And Garrett Poppe Topics Create hadoopuser and group Edit sudoers Set up SSH Install JDK Install Hadoop Editting

More information

Installing Hadoop. Hortonworks Hadoop. April 29, 2015. Mogulla, Deepak Reddy VERSION 1.0

Installing Hadoop. Hortonworks Hadoop. April 29, 2015. Mogulla, Deepak Reddy VERSION 1.0 April 29, 2015 Installing Hadoop Hortonworks Hadoop VERSION 1.0 Mogulla, Deepak Reddy Table of Contents Get Linux platform ready...2 Update Linux...2 Update/install Java:...2 Setup SSH Certificates...3

More information

Hadoop Installation. Sandeep Prasad

Hadoop Installation. Sandeep Prasad Hadoop Installation Sandeep Prasad 1 Introduction Hadoop is a system to manage large quantity of data. For this report hadoop- 1.0.3 (Released, May 2012) is used and tested on Ubuntu-12.04. The system

More information

Hadoop Training Hands On Exercise

Hadoop Training Hands On Exercise Hadoop Training Hands On Exercise 1. Getting started: Step 1: Download and Install the Vmware player - Download the VMware- player- 5.0.1-894247.zip and unzip it on your windows machine - Click the exe

More information

Setting up Hadoop with MongoDB on Windows 7 64-bit

Setting up Hadoop with MongoDB on Windows 7 64-bit SGT WHITE PAPER Setting up Hadoop with MongoDB on Windows 7 64-bit HCCP Big Data Lab 2015 SGT, Inc. All Rights Reserved 7701 Greenbelt Road, Suite 400, Greenbelt, MD 20770 Tel: (301) 614-8600 Fax: (301)

More information

研 發 專 案 原 始 程 式 碼 安 裝 及 操 作 手 冊. Version 0.1

研 發 專 案 原 始 程 式 碼 安 裝 及 操 作 手 冊. Version 0.1 102 年 度 國 科 會 雲 端 計 算 與 資 訊 安 全 技 術 研 發 專 案 原 始 程 式 碼 安 裝 及 操 作 手 冊 Version 0.1 總 計 畫 名 稱 : 行 動 雲 端 環 境 動 態 群 組 服 務 研 究 與 創 新 應 用 子 計 畫 一 : 行 動 雲 端 群 組 服 務 架 構 與 動 態 群 組 管 理 (NSC 102-2218-E-259-003) 計

More information

This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download.

This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download. AWS Starting Hadoop in Distributed Mode This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download. 1) Start up 3

More information

Hadoop Tutorial. General Instructions

Hadoop Tutorial. General Instructions CS246: Mining Massive Datasets Winter 2016 Hadoop Tutorial Due 11:59pm January 12, 2016 General Instructions The purpose of this tutorial is (1) to get you started with Hadoop and (2) to get you acquainted

More information

TP1: Getting Started with Hadoop

TP1: Getting Started with Hadoop TP1: Getting Started with Hadoop Alexandru Costan MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development of web

More information

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2. EDUREKA Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.0 Cluster edureka! 11/12/2013 A guide to Install and Configure

More information

Single Node Setup. Table of contents

Single Node Setup. Table of contents Table of contents 1 Purpose... 2 2 Prerequisites...2 2.1 Supported Platforms...2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster... 3 5 Standalone

More information

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop

More information

Set JAVA PATH in Linux Environment. Edit.bashrc and add below 2 lines $vi.bashrc export JAVA_HOME=/usr/lib/jvm/java-7-oracle/

Set JAVA PATH in Linux Environment. Edit.bashrc and add below 2 lines $vi.bashrc export JAVA_HOME=/usr/lib/jvm/java-7-oracle/ Download the Hadoop tar. Download the Java from Oracle - Unpack the Comparisons -- $tar -zxvf hadoop-2.6.0.tar.gz $tar -zxf jdk1.7.0_60.tar.gz Set JAVA PATH in Linux Environment. Edit.bashrc and add below

More information

CS 455 Spring 2015. Word Count Example

CS 455 Spring 2015. Word Count Example CS 455 Spring 2015 Word Count Example Before starting, make sure that you have HDFS and Yarn running, using sbin/start-dfs.sh and sbin/start-yarn.sh Download text copies of at least 3 books from Project

More information

Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters

Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters CONNECT - Lab Guide Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters Hardware, software and configuration steps needed to deploy Apache Hadoop 2.4.1 with the Emulex family

More information

Single Node Hadoop Cluster Setup

Single Node Hadoop Cluster Setup Single Node Hadoop Cluster Setup This document describes how to create Hadoop Single Node cluster in just 30 Minutes on Amazon EC2 cloud. You will learn following topics. Click Here to watch these steps

More information

HDFS Installation and Shell

HDFS Installation and Shell 2012 coreservlets.com and Dima May HDFS Installation and Shell Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses

More information

1. GridGain In-Memory Accelerator For Hadoop. 2. Hadoop Installation. 2.1 Hadoop 1.x Installation

1. GridGain In-Memory Accelerator For Hadoop. 2. Hadoop Installation. 2.1 Hadoop 1.x Installation 1. GridGain In-Memory Accelerator For Hadoop GridGain's In-Memory Accelerator For Hadoop edition is based on the industry's first high-performance dual-mode in-memory file system that is 100% compatible

More information

Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay

Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay Dipojjwal Ray Sandeep Prasad 1 Introduction In installation manual we listed out the steps for hadoop-1.0.3 and hadoop-

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

Tableau Spark SQL Setup Instructions

Tableau Spark SQL Setup Instructions Tableau Spark SQL Setup Instructions 1. Prerequisites 2. Configuring Hive 3. Configuring Spark & Hive 4. Starting the Spark Service and the Spark Thrift Server 5. Connecting Tableau to Spark SQL 5A. Install

More information

Installation and Configuration Documentation

Installation and Configuration Documentation Installation and Configuration Documentation Release 1.0.1 Oshin Prem October 08, 2015 Contents 1 HADOOP INSTALLATION 3 1.1 SINGLE-NODE INSTALLATION................................... 3 1.2 MULTI-NODE

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Qloud Demonstration 15 319, spring 2010 3 rd Lecture, Jan 19 th Suhail Rehman Time to check out the Qloud! Enough Talk! Time for some Action! Finally you can have your own

More information

CDH 5 Quick Start Guide

CDH 5 Quick Start Guide CDH 5 Quick Start Guide Important Notice (c) 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this

More information

Hadoop Lab - Setting a 3 node Cluster. http://hadoop.apache.org/releases.html. Java - http://wiki.apache.org/hadoop/hadoopjavaversions

Hadoop Lab - Setting a 3 node Cluster. http://hadoop.apache.org/releases.html. Java - http://wiki.apache.org/hadoop/hadoopjavaversions Hadoop Lab - Setting a 3 node Cluster Packages Hadoop Packages can be downloaded from: http://hadoop.apache.org/releases.html Java - http://wiki.apache.org/hadoop/hadoopjavaversions Note: I have tested

More information

Hadoop Installation MapReduce Examples Jake Karnes

Hadoop Installation MapReduce Examples Jake Karnes Big Data Management Hadoop Installation MapReduce Examples Jake Karnes These slides are based on materials / slides from Cloudera.com Amazon.com Prof. P. Zadrozny's Slides Prerequistes You must have an

More information

HADOOP CLUSTER SETUP GUIDE:

HADOOP CLUSTER SETUP GUIDE: HADOOP CLUSTER SETUP GUIDE: Passwordless SSH Sessions: Before we start our installation, we have to ensure that passwordless SSH Login is possible to any of the Linux machines of CS120. In order to do

More information

Hadoop 2.6 Configuration and More Examples

Hadoop 2.6 Configuration and More Examples Hadoop 2.6 Configuration and More Examples Big Data 2015 Apache Hadoop & YARN Apache Hadoop (1.X)! De facto Big Data open source platform Running for about 5 years in production at hundreds of companies

More information

Hadoop 2.2.0 MultiNode Cluster Setup

Hadoop 2.2.0 MultiNode Cluster Setup Hadoop 2.2.0 MultiNode Cluster Setup Sunil Raiyani Jayam Modi June 7, 2014 Sunil Raiyani Jayam Modi Hadoop 2.2.0 MultiNode Cluster Setup June 7, 2014 1 / 14 Outline 4 Starting Daemons 1 Pre-Requisites

More information

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 (Using HDFS on Discovery Cluster for Discovery Cluster Users email n.roy@neu.edu if you have questions or need more clarifications. Nilay

More information

Hadoop Distributed File System and Map Reduce Processing on Multi-Node Cluster

Hadoop Distributed File System and Map Reduce Processing on Multi-Node Cluster Hadoop Distributed File System and Map Reduce Processing on Multi-Node Cluster Dr. G. Venkata Rami Reddy 1, CH. V. V. N. Srikanth Kumar 2 1 Assistant Professor, Department of SE, School Of Information

More information

Install Hadoop on Ubuntu and run as standalone

Install Hadoop on Ubuntu and run as standalone Welcome, this document is a record of my installation of Hadoop for study purpose. Version Version Date Content and Change 1.0 2013 Dec Initialize study Hadoop Install basic environment, run first word

More information

Integrating SAP BusinessObjects with Hadoop. Using a multi-node Hadoop Cluster

Integrating SAP BusinessObjects with Hadoop. Using a multi-node Hadoop Cluster Integrating SAP BusinessObjects with Hadoop Using a multi-node Hadoop Cluster May 17, 2013 SAP BO HADOOP INTEGRATION Contents 1. Installing a Single Node Hadoop Server... 2 2. Configuring a Multi-Node

More information

Hadoop Multi-node Cluster Installation on Centos6.6

Hadoop Multi-node Cluster Installation on Centos6.6 Hadoop Multi-node Cluster Installation on Centos6.6 Created: 01-12-2015 Author: Hyun Kim Last Updated: 01-12-2015 Version Number: 0.1 Contact info: hyunk@loganbright.com Krish@loganbriht.com Hadoop Multi

More information

Hadoop Installation Guide

Hadoop Installation Guide Hadoop Installation Guide Hadoop Installation Guide (for Ubuntu- Trusty) v1.0, 25 Nov 2014 Naveen Subramani Hadoop Installation Guide (for Ubuntu - Trusty) v1.0, 25 Nov 2014 Hadoop and the Hadoop Logo

More information

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved. Data Analytics CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL All rights reserved. The data analytics benchmark relies on using the Hadoop MapReduce framework

More information

Deploying MongoDB and Hadoop to Amazon Web Services

Deploying MongoDB and Hadoop to Amazon Web Services SGT WHITE PAPER Deploying MongoDB and Hadoop to Amazon Web Services HCCP Big Data Lab 2015 SGT, Inc. All Rights Reserved 7701 Greenbelt Road, Suite 400, Greenbelt, MD 20770 Tel: (301) 614-8600 Fax: (301)

More information

Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box

Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box By Kavya Mugadur W1014808 1 Table of contents 1.What is CDH? 2. Hadoop Basics 3. Ways to install CDH 4. Installation and

More information

2.1 Hadoop a. Hadoop Installation & Configuration

2.1 Hadoop a. Hadoop Installation & Configuration 2. Implementation 2.1 Hadoop a. Hadoop Installation & Configuration First of all, we need to install Java Sun 6, and it is preferred to be version 6 not 7 for running Hadoop. Type the following commands

More information

Hadoop in Action. Justin Quan March 15, 2011

Hadoop in Action. Justin Quan March 15, 2011 Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop

More information

CactoScale Guide User Guide. Athanasios Tsitsipas (UULM), Papazachos Zafeirios (QUB), Sakil Barbhuiya (QUB)

CactoScale Guide User Guide. Athanasios Tsitsipas (UULM), Papazachos Zafeirios (QUB), Sakil Barbhuiya (QUB) CactoScale Guide User Guide Athanasios Tsitsipas (UULM), Papazachos Zafeirios (QUB), Sakil Barbhuiya (QUB) Version History Version Date Change Author 0.1 12/10/2014 Initial version Athanasios Tsitsipas(UULM)

More information

Hadoop Lab Notes. Nicola Tonellotto November 15, 2010

Hadoop Lab Notes. Nicola Tonellotto November 15, 2010 Hadoop Lab Notes Nicola Tonellotto November 15, 2010 2 Contents 1 Hadoop Setup 4 1.1 Prerequisites........................................... 4 1.2 Installation............................................

More information

Centrify Server Suite 2015.1 For MapR 4.1 Hadoop With Multiple Clusters in Active Directory

Centrify Server Suite 2015.1 For MapR 4.1 Hadoop With Multiple Clusters in Active Directory Centrify Server Suite 2015.1 For MapR 4.1 Hadoop With Multiple Clusters in Active Directory v1.1 2015 CENTRIFY CORPORATION. ALL RIGHTS RESERVED. 1 Contents General Information 3 Centrify Server Suite for

More information

Tutorial- Counting Words in File(s) using MapReduce

Tutorial- Counting Words in File(s) using MapReduce Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually

More information

How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1

How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1 How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

Big Data Operations Guide for Cloudera Manager v5.x Hadoop

Big Data Operations Guide for Cloudera Manager v5.x Hadoop Big Data Operations Guide for Cloudera Manager v5.x Hadoop Logging into the Enterprise Cloudera Manager 1. On the server where you have installed 'Cloudera Manager', make sure that the server is running,

More information

Using The Hortonworks Virtual Sandbox

Using The Hortonworks Virtual Sandbox Using The Hortonworks Virtual Sandbox Powered By Apache Hadoop This work by Hortonworks, Inc. is licensed under a Creative Commons Attribution- ShareAlike3.0 Unported License. Legal Notice Copyright 2012

More information

MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce. Tushar B. Kute, http://tusharkute.com MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

More information

hadoop Running hadoop on Grid'5000 Vinicius Cogo vielmo@lasige.di.fc.ul.pt Marcelo Pasin pasin@di.fc.ul.pt Andrea Charão andrea@inf.ufsm.

hadoop Running hadoop on Grid'5000 Vinicius Cogo vielmo@lasige.di.fc.ul.pt Marcelo Pasin pasin@di.fc.ul.pt Andrea Charão andrea@inf.ufsm. hadoop Running hadoop on Grid'5000 Vinicius Cogo vielmo@lasige.di.fc.ul.pt Marcelo Pasin pasin@di.fc.ul.pt Andrea Charão andrea@inf.ufsm.br Outline 1 Introduction 2 MapReduce 3 Hadoop 4 How to Install

More information

Hadoop Setup. 1 Cluster

Hadoop Setup. 1 Cluster In order to use HadoopUnit (described in Sect. 3.3.3), a Hadoop cluster needs to be setup. This cluster can be setup manually with physical machines in a local environment, or in the cloud. Creating a

More information

Basic Hadoop Programming Skills

Basic Hadoop Programming Skills Basic Hadoop Programming Skills Basic commands of Ubuntu Open file explorer Basic commands of Ubuntu Open terminal Basic commands of Ubuntu Open new tabs in terminal Typically, one tab for compiling source

More information

RHadoop and MapR. Accessing Enterprise- Grade Hadoop from R. Version 2.0 (14.March.2014)

RHadoop and MapR. Accessing Enterprise- Grade Hadoop from R. Version 2.0 (14.March.2014) RHadoop and MapR Accessing Enterprise- Grade Hadoop from R Version 2.0 (14.March.2014) Table of Contents Introduction... 3 Environment... 3 R... 3 Special Installation Notes... 4 Install R... 5 Install

More information

Revolution R Enterprise 7 Hadoop Configuration Guide

Revolution R Enterprise 7 Hadoop Configuration Guide Revolution R Enterprise 7 Hadoop Configuration Guide The correct bibliographic citation for this manual is as follows: Revolution Analytics, Inc. 2014. Revolution R Enterprise 7 Hadoop Configuration Guide.

More information

HADOOP MOCK TEST HADOOP MOCK TEST II

HADOOP MOCK TEST HADOOP MOCK TEST II http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Distributed Filesystems

Distributed Filesystems Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls

More information

E6893 Big Data Analytics: Demo Session for HW I. Ruichi Yu, Shuguan Yang, Jen-Chieh Huang Meng-Yi Hsu, Weizhen Wang, Lin Haung.

E6893 Big Data Analytics: Demo Session for HW I. Ruichi Yu, Shuguan Yang, Jen-Chieh Huang Meng-Yi Hsu, Weizhen Wang, Lin Haung. E6893 Big Data Analytics: Demo Session for HW I Ruichi Yu, Shuguan Yang, Jen-Chieh Huang Meng-Yi Hsu, Weizhen Wang, Lin Haung 1 Oct 2, 2014 2 Part I: Pig installation and Demo Pig is a platform for analyzing

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Ankush Cluster Manager - Hadoop2 Technology User Guide

Ankush Cluster Manager - Hadoop2 Technology User Guide Ankush Cluster Manager - Hadoop2 Technology User Guide Ankush User Manual 1.5 Ankush User s Guide for Hadoop2, Version 1.5 This manual, and the accompanying software and other documentation, is protected

More information

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ.

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ. Hadoop Distributed Filesystem Spring 2015, X. Zhang Fordham Univ. MapReduce Programming Model Split Shuffle Input: a set of [key,value] pairs intermediate [key,value] pairs [k1,v11,v12, ] [k2,v21,v22,

More information

About this Tutorial. Audience. Prerequisites. Copyright & Disclaimer

About this Tutorial. Audience. Prerequisites. Copyright & Disclaimer About this Tutorial Apache Mahout is an open source project that is primarily used in producing scalable machine learning algorithms. This brief tutorial provides a quick introduction to Apache Mahout

More information

Hadoop 2.6.0 Setup Walkthrough

Hadoop 2.6.0 Setup Walkthrough Hadoop 2.6.0 Setup Walkthrough This document provides information about working with Hadoop 2.6.0. 1 Setting Up Configuration Files... 2 2 Setting Up The Environment... 2 3 Additional Notes... 3 4 Selecting

More information

Hadoop Basics with InfoSphere BigInsights

Hadoop Basics with InfoSphere BigInsights An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Part: 1 Exploring Hadoop Distributed File System An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government

More information

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters Table of Contents Introduction... Hardware requirements... Recommended Hadoop cluster

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Dynamic Hadoop Clusters

Dynamic Hadoop Clusters Dynamic Hadoop Clusters Steve Loughran Julio Guijarro Slides: http://wiki.smartfrog.org/wiki/display/sf/dynamic+hadoop+clusters 2009 Hewlett-Packard Development Company, L.P. The information contained

More information

Symantec Enterprise Solution for Hadoop Installation and Administrator's Guide 1.0

Symantec Enterprise Solution for Hadoop Installation and Administrator's Guide 1.0 Symantec Enterprise Solution for Hadoop Installation and Administrator's Guide 1.0 The software described in this book is furnished under a license agreement and may be used only in accordance with the

More information

Getting Started with

Getting Started with Chapter 1 Getting Started with Hadoop Core Applications frequently require more resources than are available on an inexpensive machine. Many organizations nd themselves with business processes that no

More information

Integration Of Virtualization With Hadoop Tools

Integration Of Virtualization With Hadoop Tools Integration Of Virtualization With Hadoop Tools Aparna Raj K aparnaraj.k@iiitb.org Kamaldeep Kaur Kamaldeep.Kaur@iiitb.org Uddipan Dutta Uddipan.Dutta@iiitb.org V Venkat Sandeep Sandeep.VV@iiitb.org Technical

More information

CDH installation & Application Test Report

CDH installation & Application Test Report CDH installation & Application Test Report He Shouchun (SCUID: 00001008350, Email: she@scu.edu) Chapter 1. Prepare the virtual machine... 2 1.1 Download virtual machine software... 2 1.2 Plan the guest

More information

Spectrum Scale HDFS Transparency Guide

Spectrum Scale HDFS Transparency Guide Spectrum Scale Guide Spectrum Scale BDA 2016-1-5 Contents 1. Overview... 3 2. Supported Spectrum Scale storage mode... 4 2.1. Local Storage mode... 4 2.2. Shared Storage Mode... 4 3. Hadoop cluster planning...

More information

Hadoop Basics with InfoSphere BigInsights

Hadoop Basics with InfoSphere BigInsights An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Unit 4: Hadoop Administration An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted

More information

H2O on Hadoop. September 30, 2014. www.0xdata.com

H2O on Hadoop. September 30, 2014. www.0xdata.com H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms

More information

Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide. Rev: A03

Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide. Rev: A03 Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide Rev: A03 Use of Open Source This product may be distributed with open source code, licensed to you in accordance with the applicable open source

More information

Deploying Apache Hadoop with Colfax and Mellanox VPI Solutions

Deploying Apache Hadoop with Colfax and Mellanox VPI Solutions WHITE PAPER March 2013 Deploying Apache Hadoop with Colfax and Mellanox VPI Solutions In collaboration with Colfax International Background...1 Hardware...1 Software Requirements...3 Installation...3 Scalling

More information

Cloudera Manager Training: Hands-On Exercises

Cloudera Manager Training: Hands-On Exercises 201408 Cloudera Manager Training: Hands-On Exercises General Notes... 2 In- Class Preparation: Accessing Your Cluster... 3 Self- Study Preparation: Creating Your Cluster... 4 Hands- On Exercise: Working

More information

Web Crawling and Data Mining with Apache Nutch Dr. Zakir Laliwala Abdulbasit Shaikh

Web Crawling and Data Mining with Apache Nutch Dr. Zakir Laliwala Abdulbasit Shaikh Web Crawling and Data Mining with Apache Nutch Dr. Zakir Laliwala Abdulbasit Shaikh Chapter No. 3 "Integration of Apache Nutch with Apache Hadoop and Eclipse" In this package, you will find: A Biography

More information

Apache Flume and Apache Sqoop Data Ingestion to Apache Hadoop Clusters on VMware vsphere SOLUTION GUIDE

Apache Flume and Apache Sqoop Data Ingestion to Apache Hadoop Clusters on VMware vsphere SOLUTION GUIDE Apache Flume and Apache Sqoop Data Ingestion to Apache Hadoop Clusters on VMware vsphere SOLUTION GUIDE Table of Contents Apache Hadoop Deployment Using VMware vsphere Big Data Extensions.... 3 Big Data

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information
