USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2


(Using HDFS on Discovery Cluster, for Discovery Cluster users. Contact [email protected] if you have questions or need more clarification. Nilay K. Roy, Ph.D.)

To use HDFS on Discovery Cluster, log in to any of the login nodes as usual. Make sure that your .bashrc loads the required modules; module whatis hadoop will give you the prerequisites. After login, type module list to make sure the proper modules are loaded.

Next, using the hadoop-10g queue, get an interactive node. Every user is restricted to no more than 10 cores. The interactive nodes are the Hadoop file system data nodes. Generally you do not need X-windows forwarding, so

bsub -Is -q hadoop-10g -n 1 /bin/bash

will suffice. If you do need X-windows forwarding, add the -XF flag:

bsub -Is -XF -q hadoop-10g -n 1 /bin/bash

Here -n is the number of cores requested.

If you have a bad ssh connection and want a persistent one, use screen, which is available on the two login nodes discovery2 and discovery4. This way you can log out of the cluster while your job runs and log in later without killing anything. Remember to detach from the screen session before logging out of the login nodes; you can reattach to the relevant screen after login. Further details are out of scope here.

Please note that there is no time restriction on this queue. So when you are done, exit from the compute node back to the login node and the LSF interactive session will automatically be killed. Check it again using bjobs -w, and if it is still running use bkill <job_id>. Similarly, exit all screen sessions in which you are not doing anything.

Using HDFS on Discovery Cluster consists of seven steps:

1) Compile your Hadoop code using Java and get the *.jar file.
2) Create your data directory on the HDFS file system, following the Discovery Cluster best-practice procedure as in /scratch.
3) Move (stage) the input data to the HDFS file system, in your top-level data folder.
4) Run your code.
5) Move your output and other data back to your /home/<my_neu_id> or /scratch/<my_neu_id> folder.
6) Exit your interactive session. Check that your interactive job is killed on exit using bjobs -w.
7) Exit screen if you are using it.

1. Compile Hadoop code: Example test1

Download the example file here - Now upload it to your /home/<my_neu_id> or /scratch/<my_neu_id>, extract it, and compile with javac as shown below. Make a directory for the classes; in this example it is called wordcount_classes. Before compiling, run

export CLASSPATH=$(hadoop classpath):$CLASSPATH

where hadoop will be in your path if you have loaded the modules correctly as described above. You can ignore the deprecated-API note. In the example Java source file you will see import org.apache.hadoop.filecache.DistributedCache;. This is for a fully distributed Hadoop implementation with multi-level replication and a fully distributed cache, as in this case. Here the replication and distributed cache span three data nodes: compute-2-004, compute-2-005, and compute-...

[nilay.roy@compute ~]$ tar -zxvf NKR_hadoop_example.tar.gz
hadoop_test/
hadoop_test/test1/
hadoop_test/test1/WordCount.java
hadoop_test/test1/file02
hadoop_test/test1/file01
[nilay.roy@compute ~]$ cd hadoop_test/
[nilay.roy@compute hadoop_test]$ cd test1
[nilay.roy@compute test1]$ ls -la
total 83
drwxr-xr-x 2 nilay.roy GID_nilay.roy   80 Aug 18 15:24 .
drwxr-xr-x 3 nilay.roy GID_nilay.roy   23 Aug 15 11:50 ..
-rw-r--r-- 1 nilay.roy GID_nilay.roy   24 Aug 15 19:32 file01
-rw-r--r-- 1 nilay.roy GID_nilay.roy   33 Aug 15 19:32 file02
-rw-r--r-- 1 nilay.roy GID_nilay.roy 4544 Aug 15 12:12 WordCount.java
[nilay.roy@compute test1]$ mkdir wordcount_classes
[nilay.roy@compute test1]$ export CLASSPATH=$(hadoop classpath):$CLASSPATH
[nilay.roy@compute test1]$ javac -d wordcount_classes WordCount.java
Note: WordCount.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
[nilay.roy@compute test1]$ cd wordcount_classes/org/myorg
[nilay.roy@compute myorg]$ ls -la
total 109
drwxr-xr-x 2 nilay.roy GID_nilay.roy  156 Aug 18 15:44 .
drwxr-xr-x 3 nilay.roy GID_nilay.roy   23 Aug 18 15:44 ..
-rw-r--r-- 1 nilay.roy GID_nilay.roy 2671 Aug 18 15:44 WordCount.class
-rw-r--r-- 1 nilay.roy GID_nilay.roy 4661 Aug 18 15:44 WordCount$Map.class
-rw-r--r-- 1 nilay.roy GID_nilay.roy  983 Aug 18 15:44 WordCount$Map$Counters.class
-rw-r--r-- 1 nilay.roy GID_nilay.roy 1611 Aug 18 15:44 WordCount$Reduce.class
[nilay.roy@compute myorg]$
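The WordCount.java source itself is not reproduced in this guide. For orientation, the mapper and reducer in it look roughly like the sketch below, which follows the classic Hadoop WordCount example and matches the class names in the listing above (WordCount$Map, WordCount$Map$Counters, WordCount$Reduce). Treat it as an approximation, not the exact file from the tarball.

package org.myorg;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // Mapper: emits (word, 1) for every token on every input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        // INPUT_WORDS shows up in the job counters later in this guide.
        static enum Counters { INPUT_WORDS }

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
                reporter.incrCounter(Counters.INPUT_WORDS, 1);
            }
        }
    }

    // Reducer: sums the per-word counts.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}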

2. Create your HDFS top-level directory:

The rule for the top-level HDFS directory is that it is named <my_neu_id> - the same ID that you use to log in to the cluster. You create it as follows (here my_neu_id is nilay.roy):

hadoop fs -mkdir hdfs://discovery3:9000/tmp/nilay.roy
14/08/18 17:04:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Ignore the native-hadoop stack guard warnings; these will be fixed in later versions. We use a global Hadoop install via modules, not a local per-node install, hence the WARN util.NativeCodeLoader warning shown above.

You can now list the directory and also view it in the GUI. For the GUI, go to and click the Utilities tab on top; from the drop-down select Browse the file system. From the command line, see below:

hdfs dfs -ls hdfs://discovery3:9000/tmp/
14/08/18 17:13:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwxrwx---   - hadoopuser supergroup          0 2014-08-18   :34 hdfs://discovery3:9000/tmp/hadoop-yarn
drwxr-xr-x   - nilay.roy  supergroup          0 2014-08-18 17:04 hdfs://discovery3:9000/tmp/nilay.roy

3. Move (stage) the input data to the HDFS file system in your top-level data folder:

Now we can move the input files into the input directory. Create a local input directory, put the files there, and then copy the whole hadoop_test folder over. This is shown below:

[nilay.roy@compute ~]$ hdfs dfs -put hadoop_test/ hdfs://discovery3:9000/tmp/nilay.roy/.
14/08/18 17:20:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[nilay.roy@compute ~]$ hdfs dfs -lsr hdfs://discovery3:9000/tmp/nilay.roy
lsr: DEPRECATED: Please use 'ls -R' instead.
14/08/18 17:20:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drwxr-xr-x   - nilay.roy supergroup          0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test
drwxr-xr-x   - nilay.roy supergroup          0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1
-rw-r--r--   3 nilay.roy supergroup       4544 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/WordCount.java
-rw-r--r--   3 nilay.roy supergroup         24 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/file01
-rw-r--r--   3 nilay.roy supergroup         33 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/file02
drwxr-xr-x   - nilay.roy supergroup          0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/input
-rw-r--r--   3 nilay.roy supergroup         24 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/input/file01
-rw-r--r--   3 nilay.roy supergroup         33 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/input/file02
drwxr-xr-x   - nilay.roy supergroup          0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes
drwxr-xr-x   - nilay.roy supergroup          0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org
drwxr-xr-x   - nilay.roy supergroup          0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org/myorg
-rw-r--r--   3 nilay.roy supergroup        983 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org/myorg/WordCount$Map$Counters.class
-rw-r--r--   3 nilay.roy supergroup       4661 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org/myorg/WordCount$Map.class
-rw-r--r--   3 nilay.roy supergroup       1611 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org/myorg/WordCount$Reduce.class
-rw-r--r--   3 nilay.roy supergroup       2671 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org/myorg/WordCount.class
[nilay.roy@compute ~]$
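As an aside, staging can also be done programmatically through Hadoop's FileSystem API rather than the hdfs dfs client. The following is a minimal sketch, not part of the original example; adjust the paths to your own <my_neu_id>.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the namenode used throughout this guide.
        conf.set("fs.defaultFS", "hdfs://discovery3:9000");
        FileSystem fs = FileSystem.get(conf);
        // Equivalent of: hdfs dfs -put hadoop_test/ /tmp/nilay.roy/.
        fs.copyFromLocalFile(new Path("/home/nilay.roy/hadoop_test"),
                             new Path("/tmp/nilay.roy/hadoop_test"));
        fs.close();
    }
}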

4. Run your code:

Now create the jar file to run, as shown below:

[nilay.roy@compute test1]$ jar -cvf wordcount.jar -C wordcount_classes/ .
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/WordCount$Map.class(in = 4661) (out= 2217)(deflated 52%)
adding: org/myorg/WordCount.class(in = 2671) (out= 1289)(deflated 51%)
adding: org/myorg/WordCount$Map$Counters.class(in = 983) (out= 504)(deflated 48%)
adding: org/myorg/WordCount$Reduce.class(in = 1611) (out= 648)(deflated 59%)
[nilay.roy@compute test1]$ ls -la
total 114
drwxr-xr-x 4 nilay.roy GID_nilay.roy  169 Aug 18 17:27 .
drwxr-xr-x 3 nilay.roy GID_nilay.roy   23 Aug 15 11:50 ..
-rw-r--r-- 1 nilay.roy GID_nilay.roy   24 Aug 15 19:32 file01
-rw-r--r-- 1 nilay.roy GID_nilay.roy   33 Aug 15 19:32 file02
drwxr-xr-x 2 nilay.roy GID_nilay.roy   48 Aug 18 17:18 input
drwxr-xr-x 3 nilay.roy GID_nilay.roy   21 Aug 18 15:44 wordcount_classes
-rw-r--r-- 1 nilay.roy GID_nilay.roy 5799 Aug 18 17:27 wordcount.jar
-rw-r--r-- 1 nilay.roy GID_nilay.roy 4544 Aug 15 12:12 WordCount.java

Then run it on the HDFS file system using:

hadoop jar /home/nilay.roy/hadoop_test/test1/wordcount.jar org.myorg.WordCount hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/input hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/output

Although this looks intimidating, we are simply giving the full path to the *.jar to run, the compiled class to execute (org.myorg.WordCount), and the full HDFS paths for input and output. When you run this you will see output like the following:

14/08/18 17:35:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/18 17:35:44 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/08/18 17:35:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/08/18 17:35:44 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
14/08/18 17:35:45 INFO mapred.FileInputFormat: Total input paths to process : 2
14/08/18 17:35:45 INFO mapreduce.JobSubmitter: number of splits:2
14/08/18 17:35:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local _0001
:::::::::: LOT OF OUTPUT OMITTED FOR CONCISENESS ::::::::::
        FILE: Number of write operations=0
        HDFS: Number of bytes read=147
        HDFS: Number of bytes written=67
        HDFS: Number of read operations=25
        Reduce output records=8
    File Input Format Counters
        Bytes Read=57
    File Output Format Counters
        Bytes Written=67
    org.myorg.WordCount$Map$Counters
        INPUT_WORDS=9
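The driver (the main method) inside WordCount.java is likewise not listed in this guide. Since the log above shows the old mapred API (mapred.FileInputFormat), the driver plausibly looks like the following sketch. The stand-alone class name WordCountDriver is illustrative only; in the actual file the main() would be a method of org.myorg.WordCount itself.

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCount.Map.class);
        conf.setCombinerClass(WordCount.Reduce.class);
        conf.setReducerClass(WordCount.Reduce.class);

        // args[0] and args[1] are the full HDFS input and output paths
        // passed on the hadoop jar command line above.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}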

5. Move data out from HDFS to your home directory:

Browse your output data using the command shown below, then move it over using the hdfs dfs -get command. The cat example is shown here; the get is left as an exercise. You can then remove your data with the hdfs dfs -rmr command; this is also left as an exercise for the user.

hdfs dfs -cat hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/output/part-00000
14/08/18 17:54:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Bye        1
Goodbye    1
Hadoop,    1
Hello      2
World!     1
World,     1
hadoop.    1
to         1

6. Exit your interactive session:

Exit your interactive session and check that your interactive job is killed on exit using bjobs -w.

The entire online Hadoop User Guide is here:
The File System Shell guide is here:

If you are not sure what you are doing, have questions, encounter problems, or require help with Hadoop, Java, Python, or any other issues, contact ITS-Research Computing at [email protected]. Alternatively, email me at [email protected].

Please be mindful that this 50 TB Hadoop cluster is a shared resource and must not be used for data storage. Once you are done with your work, please clean up everything under your top-level HDFS directory, which must be hdfs://discovery.neu.edu:9000/tmp/<my_neu_id>, where <my_neu_id> is your Discovery Cluster login ID - the same ID that you use to log in to myNEU. DO NOT USE any other paths here.
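If you prefer to script the retrieval and cleanup rather than type the hdfs dfs commands, the FileSystem API calls mirror the staging sketch earlier. Again this is illustrative, with assumed paths you would adjust to your own <my_neu_id>:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetrieveAndCleanUp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://discovery3:9000");
        FileSystem fs = FileSystem.get(conf);
        // Equivalent of hdfs dfs -get: copy the job output back home...
        fs.copyToLocalFile(new Path("/tmp/nilay.roy/hadoop_test/test1/output"),
                           new Path("/home/nilay.roy/test1_output"));
        // ...and the equivalent of hdfs dfs -rmr: recursively delete the
        // staged data, since the cluster must not be used for storage.
        fs.delete(new Path("/tmp/nilay.roy/hadoop_test"), true);
        fs.close();
    }
}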

Another Example - test2

Download the example file here - This is a more advanced example of secondary sort that shows off the power of Hadoop Map-Reduce. Here we sort the values coming into the reducer of a Hadoop Map/Reduce (MR) job. We extend the Java API for our use and do the following:

1. Use a composite key.
2. Extend org.apache.hadoop.mapreduce.Partitioner.
3. Extend org.apache.hadoop.io.WritableComparator.

The main part of the code to focus on is the use of the M/R API org.apache.hadoop.mapreduce.*.

The problem

Imagine we have stock data that looks like the following. Each line represents the value of a stock at a particular time, and each value in a line is delimited by a comma. The first value is the stock symbol (e.g. GOOG), the second value is the timestamp (the number of milliseconds since January 1, 1970, 00:00:00 GMT), and the third value is the stock's price. The data below is a toy data set. As you can see, there are 3 stock symbols: a, b, and c. The timestamps are also simple: 1, 2, 3, 4. The prices are fake as well: 1.0, 2.0, 3.0, and 4.0.

a, 1, 1.0
b, 1, 1.0
c, 1, 1.0
a, 2, 2.0
b, 2, 2.0
c, 2, 2.0
a, 3, 3.0
b, 3, 3.0
c, 3, 3.0
a, 4, 4.0
b, 4, 4.0
c, 4, 4.0

Let's say we want, for each stock symbol (the reducer's input key, or alternatively the mapper's output key), the values ordered descending by timestamp when they come into the reducer. How do we sort the timestamps descending? This problem is known as secondary sorting. Hadoop's M/R platform sorts the keys, but not the values. (Note: Google's M/R platform explicitly supports secondary sorting; see Lin and Dyer 2010.)

A solution for secondary sorting

Use a composite key

A solution for secondary sorting involves doing multiple things. First, instead of simply emitting the stock symbol as the key from the mapper, we need to emit a composite key - a key that has multiple parts. The key will have the stock symbol and the timestamp. The process for an M/R job is as follows.

(K1,V1) -> Map -> (K2,V2)
(K2,List[V2]) -> Reduce -> (K3,V3)

In the toy data above, K1 will be of type LongWritable and V1 will be of type Text. Without secondary sorting, K2 would be of type Text and V2 of type DoubleWritable (we simply emit the stock symbol and price from the mapper to the reducer). So K2=symbol and V2=price, or (K2,V2) = (symbol, price). However, if we emit such an intermediary key-value pair, secondary sorting is not possible. We have to emit a composite key, K2 = {symbol, timestamp}. The intermediary key-value pair is then (K2,V2) = ({symbol, timestamp}, price). Note that a composite data structure, such as the composite key, is held within curly braces. Our reducer simply outputs a K3 of type Text and a V3 of type Text; (K3,V3) = (symbol, price). The complete M/R job with the new composite key is shown below.

(LongWritable,Text) -> Map -> ({symbol,timestamp}, price)
({symbol,timestamp}, List[price]) -> Reduce -> (symbol, price)

K2 is a composite key, but inside it the symbol part/component is referred to as the natural key. It is the key by which values will be grouped.

Use a composite key comparator

The composite key comparator is where the secondary sorting takes place. It compares composite keys by symbol ascending and by timestamp descending. It is shown below. Notice that here we sort on both symbol and timestamp: all the components of the composite key are considered.

public class CompositeKeyComparator extends WritableComparator {

    protected CompositeKeyComparator() {
        super(StockKey.class, true);
    }

    public int compare(WritableComparable w1, WritableComparable w2) {
        StockKey k1 = (StockKey) w1;
        StockKey k2 = (StockKey) w2;

        int result = k1.getSymbol().compareTo(k2.getSymbol());
        if (0 == result) {
            result = -1 * k1.getTimestamp().compareTo(k2.getTimestamp());
        }
        return result;
    }
}

Use a natural key grouping comparator

The natural key grouping comparator groups values together according to the natural key. Without this component, each K2 = {symbol, timestamp} and its associated V2 = price may go to different reducers. Notice that here we only consider the natural key.

public class NaturalKeyGroupingComparator extends WritableComparator {

    protected NaturalKeyGroupingComparator() {
        super(StockKey.class, true);
    }

    public int compare(WritableComparable w1, WritableComparable w2) {
        StockKey k1 = (StockKey) w1;
        StockKey k2 = (StockKey) w2;
        return k1.getSymbol().compareTo(k2.getSymbol());
    }
}
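Both comparators cast their arguments to a StockKey, the composite key class, which this guide uses but never lists. The following is a minimal sketch of what such a class must look like, consistent with the getSymbol() and getTimestamp() accessors used above; the field layout and the set() helper are assumptions.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite key: the natural key (symbol) plus the timestamp to sort on.
public class StockKey implements WritableComparable<StockKey> {

    private final Text symbol = new Text();
    private final LongWritable timestamp = new LongWritable();

    public Text getSymbol() { return symbol; }
    public LongWritable getTimestamp() { return timestamp; }

    public void set(String sym, long ts) {
        symbol.set(sym);
        timestamp.set(ts);
    }

    public void write(DataOutput out) throws IOException {
        symbol.write(out);
        timestamp.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        symbol.readFields(in);
        timestamp.readFields(in);
    }

    // Default ordering: symbol ascending, timestamp descending,
    // matching the CompositeKeyComparator above.
    public int compareTo(StockKey other) {
        int result = symbol.compareTo(other.symbol);
        if (0 == result) {
            result = -1 * timestamp.compareTo(other.timestamp);
        }
        return result;
    }
}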

Use a natural key partitioner

The natural key partitioner uses the natural key to partition the data to the reducer(s). Again, note that here we only consider the natural key.

public class NaturalKeyPartitioner extends Partitioner<StockKey, DoubleWritable> {

    @Override
    public int getPartition(StockKey key, DoubleWritable val, int numPartitions) {
        // Mask the sign bit so the partition index is never negative.
        int hash = key.getSymbol().hashCode() & Integer.MAX_VALUE;
        int partition = hash % numPartitions;
        return partition;
    }
}

The M/R Job

Once we define the Mapper, Reducer, natural key grouping comparator, natural key partitioner, composite key comparator, and composite key in Hadoop's M/R API, we may configure the Job as follows.

public class SsJob extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new SsJob(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "secondary sort");

        job.setJarByClass(SsJob.class);
        job.setPartitionerClass(NaturalKeyPartitioner.class);
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
        job.setSortComparatorClass(CompositeKeyComparator.class);

        job.setMapOutputKeyClass(StockKey.class);
        job.setMapOutputValueClass(DoubleWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(SsMapper.class);
        job.setReducerClass(SsReducer.class);

        job.waitForCompletion(true);
        return 0;
    }
}

Comments

You need at least 4 new classes. The composite key class holds the natural key and the other data you will sort on. The composite key comparator performs the sorting of the keys (and thus of the values). The natural key grouping comparator groups values based on the natural key. The natural key partitioner sends values with the same natural key to the same reducer. A sketch of the mapper and reducer follows.
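The mapper and reducer themselves (SsMapper.java and SsReducer.java in the compile command below) are referenced but not listed in this guide. Here is a rough sketch of what they would look like, under the same StockKey assumptions as above; the exact output formatting in the real SsReducer may differ.

// Shared imports for both source files below.
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// --- SsMapper.java ---
// Parses "symbol, timestamp, price" lines into ({symbol,timestamp}, price) pairs.
public class SsMapper extends Mapper<LongWritable, Text, StockKey, DoubleWritable> {

    private final StockKey outKey = new StockKey();
    private final DoubleWritable outVal = new DoubleWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        outKey.set(parts[0].trim(), Long.parseLong(parts[1].trim()));
        outVal.set(Double.parseDouble(parts[2].trim()));
        context.write(outKey, outVal);
    }
}

// --- SsReducer.java ---
// Receives, per natural key, the prices already sorted by timestamp
// descending, and emits (symbol, comma-separated prices).
public class SsReducer extends Reducer<StockKey, DoubleWritable, Text, Text> {

    @Override
    protected void reduce(StockKey key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder prices = new StringBuilder();
        for (DoubleWritable v : values) {
            if (prices.length() > 0) {
                prices.append(",");
            }
            prices.append(v.get());
        }
        context.write(new Text(key.getSymbol().toString()), new Text(prices.toString()));
    }
}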

NOTES:

The code was compiled using the command:

javac -d demo_classes CompositeKeyComparator.java NaturalKeyGroupingComparator.java NaturalKeyPartitioner.java SsJob.java SsMapper.java SsReducer.java StockKey.java

The classes were then packaged into demo_classes.jar as in test1, and the job was run using the command:

hadoop jar demo_classes.jar demo.SsJob -Dmapred.input.dir=hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test2/data -Dmapred.output.dir=hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test2/result

==========================================================
Nilay Roy, PhD Computational Physics, MS Computer Science
Assistant Director - Research Computing, Information Technology Services
Northeastern University, 360 Huntington Avenue, Boston, MA
[email protected] (C) (Preferred) / (O)
Northeastern Research Computing Website:
==========================================================
