Hadoop (Hands On)
Irene Finocchi and Emanuele Fusco
Big Data Computing, March 23, 2015
Master's Degree in Computer Science, Academic Year 2014-2015, spring semester
A quick glance at Hadoop

Hadoop is the industry-standard open source implementation of MapReduce (and something more that we will not address in this course).

Data intensive / computation intensive
Hadoop is mainly intended to solve data-intensive tasks. SETI@home moves data from central servers to idle desktops and performs computationally intensive analysis on small data; Hadoop, ideally, moves the code that performs the analysis to the machine that already holds the data to be analyzed.
Hadoop vs. SQL

Structured / unstructured data
SQL works on structured data; it enforces invariants and maintains relations among tuples. This comes at a price: larger datasets and increased performance needs require scaling up (buying ONE more powerful machine).
Hadoop works on loosely structured data (key-value pairs). It cannot enforce relations and invariants, but it allows scaling out: larger datasets and increased performance needs can be dealt with by buying more machines.
Batch processing

Online / batch data processing
Online queries involving a few key-value pairs tend to be inefficient: Hadoop works best on batch computations involving terabytes of data.

In the rest of this lesson we will introduce the main classes and objects you, as a Hadoop programmer, are required to interact with in order to write and execute your custom applications.
The APIs

The official JavaDoc of all Hadoop 2.6.0 classes is available at the following URL:
http://hadoop.apache.org/docs/r2.6.0/api/overview-summary.html

We will quickly review the interfaces, classes, and methods we need to run our first Hadoop program.
The package org.apache.hadoop.conf

This package contains two classes and one interface:

Interface Configurable: something that may be configured with a Configuration. We should check its subinterface org.apache.hadoop.util.Tool.
Class Configuration: the container class for <key, value> pairs of configuration parameters.
Class Configured: base class for objects that can be configured with a Configuration.
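As a first taste of the API: a Configuration behaves much like a map of <key, value> pairs with typed accessors. A minimal sketch (the property name my.custom.param is a made-up example):

// uses org.apache.hadoop.conf.Configuration
Configuration conf = new Configuration();
conf.setInt("my.custom.param", 42);            // store a parameter
int p = conf.getInt("my.custom.param", -1);    // read it back; -1 is the default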
The class org.apache.hadoop.mapreduce.Job

Job is the class used to submit a MapReduce job to the cluster:

Job job = Job.getInstance(new Configuration());
job.setJarByClass(MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
// Input/output paths are set through the FileInputFormat
// and FileOutputFormat helper classes
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
/* Submit the job, then poll for progress until
 * the job is complete */
job.waitForCompletion(true);
The Mapper and Reducer classes

The way to define what a Job should do is to assign it our custom Mapper and Reducer subclasses:

The Mapper class:
org.apache.hadoop.mapreduce.Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

The Reducer class:
org.apache.hadoop.mapreduce.Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
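For instance, in WordCount the type parameters are bound as follows (a sketch; the surrounding class and imports are omitted):

// input: <byte offset, line of text>, output: <word, 1>
public static class MyMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> { /* ... */ }

// input: <word, list of counts>, output: <word, total count>
public static class MyReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> { /* ... */ }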
Methods of the Mapper class

protected void setup(Context context)
    throws IOException, InterruptedException { }
protected void cleanup(Context context)
    throws IOException, InterruptedException { }
protected void map(KEYIN key, VALUEIN value, Context context)
    throws IOException, InterruptedException { }
public void run(Context context)
    throws IOException, InterruptedException { }
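To see how these methods fit together: in essence, the default run() implementation calls setup() once, then map() for every input record, then cleanup() (a sketch of the framework's behavior):

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}

Overriding map() is therefore enough for most jobs; run() is overridden only for exotic control flows.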
Methods of the Reducer class

protected void setup(Context context)
    throws IOException, InterruptedException { }
protected void cleanup(Context context)
    throws IOException, InterruptedException { }
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
    throws IOException, InterruptedException { }
public void run(Context context)
    throws IOException, InterruptedException { }
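The Reducer's default run() is analogous, except that reduce() is invoked once per distinct key, with an Iterable over all the values for that key (slightly simplified here; the real implementation also manages mark/reset of the value iterator):

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
    } finally {
        cleanup(context);
    }
}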
The Context(s)

Finally, we should consider the Context classes. These are inner classes of the Mapper and Reducer classes. We are interested in the method:

write(KEYOUT key, VALUEOUT value)

(inherited from org.apache.hadoop.mapreduce.TaskInputOutputContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>)
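For example, inside map() a word-count mapper emits each token paired with a count of one (a one-line illustration):

context.write(new Text("some-word"), new IntWritable(1)); // emit <word, 1>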
Creating the project

We'll start by creating a WordCount project in Eclipse.
Importing resources

Unzip the resources.tar.gz file you downloaded and change to the folder where you extracted the files.

Importing the WordCount.java skeleton file:
cp WordCount.java ~/Projects/WordCount/src/

Setting the classpath:
sed s?your_path_to_yarn?-? classpath > ~/Projects/WordCount/.classpath
Eclipse

Fill in the missing parts in Eclipse!
Missing parts

MyMapper fields:

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

MyMapper map function body:

Scanner scanner = new Scanner(value.toString());
scanner.useDelimiter(" ");
while (scanner.hasNext()) {
    word.set(scanner.next());
    context.write(word, one);
}
scanner.close();

MyReducer reduce function body:

int sum = 0;
for (IntWritable value : values) {
    sum += value.get();
}
context.write(key, new IntWritable(sum));
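Put together, the mapper should look roughly like this (a sketch assuming the skeleton declares MyMapper as a static nested class of WordCount and already imports java.util.Scanner, org.apache.hadoop.io.*, and org.apache.hadoop.mapreduce.Mapper):

public static class MyMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the input line and emit <word, 1> for each token
        Scanner scanner = new Scanner(value.toString());
        scanner.useDelimiter(" ");
        while (scanner.hasNext()) {
            word.set(scanner.next());
            context.write(word, one);
        }
        scanner.close();
    }
}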
Running WordCount

Start Hadoop:
start-dfs.sh
start-yarn.sh

A quick check:
$ jps
90540 Jps
90192 DataNode
90298 SecondaryNameNode
90502 NodeManager
90412 ResourceManager
90106 NameNode
Running WordCount (continued)

Midsummer Night's Dream:
hadoop fs -put MidSummerNightsDream.txt /in.txt
$ hadoop fs -ls /
92529 2014-03-28 12:08 /in.txt

Submit the jar file:
yarn jar ~/Projects/WordCount/WordCount.jar -m 4 -r 2 /in.txt /out
Running WordCount (continued)

Check the output:
hadoop fs -ls /out
    0 2014-03-28 12:22 /out/_SUCCESS
22172 2014-03-28 12:22 /out/part-r-00000
22328 2014-03-28 12:22 /out/part-r-00001
hadoop fs -cat /out/part-r-00000 | less
hadoop fs -cat /out/part-r-00001 | less
DegreeCalculator

With this example we will show:
how to let Hadoop parse common arguments automatically;
how to pass arguments to the mappers and the reducers.
Missing code parts I

Constant declaration:

public final static String KEEP_ABOVE = "KEEP_ABOVE";

DegreeCalculator class declaration:

public class DegreeCalculator extends Configured implements Tool { }

run(String[] args) method:

Configuration conf = this.getConf();
int keepAbove = -1;
if (args.length > 2) {
    keepAbove = Integer.parseInt(args[2]);
}
conf.setInt(KEEP_ABOVE, keepAbove);
Missing code parts II

main(String[] args) method:

Configuration conf = new Configuration();
DegreeCalculator dc = new DegreeCalculator();
dc.setConf(conf);
int res = ToolRunner.run(dc, args);
System.exit(res);
MyReducer class I

public static class MyReducer
        extends Reducer<Text, Text, Text, IntWritable> {

    IntWritable degree = new IntWritable();
    String keyVal;
    int keepAbove = -1;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        super.setup(context);
        // Read the threshold passed in through the job Configuration
        keepAbove = context.getConfiguration().getInt(
            DegreeCalculator.KEEP_ABOVE, -1);
    }
MyReducer class II

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int size = 0;
        keyVal = key.toString();
        for (Text t : values) {
            if (!t.toString().equals(keyVal)) // remove self-loops
                size++;
        }
        if (keepAbove < size) {
            degree.set(size);
            context.write(key, degree);
        }
    }
}
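Since DegreeCalculator runs through ToolRunner, the GenericOptionsParser consumes the standard Hadoop options (e.g. -D key=value) before our run() ever sees args. A hypothetical invocation (jar name and paths are made up for illustration):

yarn jar DegreeCalculator.jar DegreeCalculator \
    -D mapreduce.job.reduces=2 /graph-in /graph-out 5

# After generic-option parsing, run() receives
# args = { "/graph-in", "/graph-out", "5" },
# so args[2] = 5 becomes the KEEP_ABOVE threshold.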