
Hadoop (Hands On)
Irene Finocchi and Emanuele Fusco
Big Data Computing, March 23, 2015
Master's Degree in Computer Science, Academic Year 2014-2015, spring semester

A quick glance at Hadoop

Hadoop is the industry-standard open source implementation of MapReduce (and something more that we will not address in this course).

Data intensive vs. computation intensive
Hadoop is mainly intended to solve data-intensive tasks. SETI@home moves data from central servers to idle desktops and performs computationally intensive analysis on small data; Hadoop, ideally, moves the code that is needed to perform the analysis to the machine that already contains the data to be analyzed.

Hadoop vs. SQL

Structured vs. unstructured data
SQL works on structured data; it enforces invariants and maintains relations among tuples. This comes with a price: larger datasets and increased performance needs require scaling up (buying ONE more powerful machine). Hadoop works on loosely structured data (key-value pairs). It cannot enforce relations and invariants, but it allows scaling out: larger datasets and increased performance needs can be dealt with by buying more machines.

Batch processing

Online vs. batch data processing
Online queries involving a few key-value pairs tend to be inefficient: Hadoop works best on batch computations involving terabytes of data. In the rest of this lesson we will introduce the main classes and objects that you, as a Hadoop programmer, are required to interact with in order to write and execute your custom application.

The APIs

The official JavaDoc of all Hadoop 2.6.0 classes is available at the following URL:
http://hadoop.apache.org/docs/r2.6.0/api/overview-summary.html

We will quickly review the interfaces, classes, and methods we need to run our first Hadoop program.

The package org.apache.hadoop.conf

This package contains two classes and one interface:

Interface Configurable: something that may be configured with a Configuration. We should also check its subinterface org.apache.hadoop.util.Tool.
Class Configuration: the container class for <key, value> pairs of configuration parameters.
Class Configured: base class for objects that can be configured with a Configuration.
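
As a quick illustration (a minimal sketch, not from the slides; the property name is made up), a Configuration behaves like a typed <key, value> map:

import org.apache.hadoop.conf.Configuration;

public class ConfDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Store a parameter under a (hypothetical) property name...
        conf.setInt("myapp.threshold", 42);
        // ...and read it back later, providing a default value.
        int threshold = conf.getInt("myapp.threshold", -1);
        System.out.println("threshold = " + threshold);
    }
}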

The class org.apache.hadoop.mapreduce.Job

Job is the class used to submit a MapReduce task to the cluster:

Job job = Job.getInstance(new Configuration());
job.setJarByClass(MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
// Input/output paths are set through the FileInputFormat/FileOutputFormat
// helper classes (Job itself has no setInputPath/setOutputPath methods):
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
/* Submit the job, then poll for progress until
 * the job is complete. */
job.waitForCompletion(true);

The Mapper and Reducer classes

The way to define what a Job should do is to assign it our custom Mapper and Reducer subclasses:

The Mapper class:
org.apache.hadoop.mapreduce.Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

The Reducer class:
org.apache.hadoop.mapreduce.Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
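
As a concrete sketch of how the four type parameters get bound (the type choices anticipate the WordCount example below; the class names are placeholders), plain-text input fixes KEYIN and VALUEIN, while the output types are up to us:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountTypes {
    // KEYIN = LongWritable (byte offset of the line in the input file),
    // VALUEIN = Text (the line itself): both dictated by the default
    // TextInputFormat. KEYOUT/VALUEOUT are chosen by the programmer.
    public static class MyMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> { /* ... */ }

    // The reducer's KEYIN/VALUEIN must match the mapper's KEYOUT/VALUEOUT.
    public static class MyReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> { /* ... */ }
}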

Methods of the Mapper class

protected void setup(Context context)
    throws IOException, InterruptedException { }

protected void cleanup(Context context)
    throws IOException, InterruptedException { }

protected void map(KEYIN key, VALUEIN value, Context context)
    throws IOException, InterruptedException { }

public void run(Context context)
    throws IOException, InterruptedException { }
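
To see how these methods relate, this is essentially what the default run does in Hadoop 2.x (a paraphrase of the framework source, not code you need to write): setup once, map once per input record, cleanup at the end.

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        // One call to map() per input key-value pair.
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}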

Methods of the Reducer class

protected void setup(Context context)
    throws IOException, InterruptedException { }

protected void cleanup(Context context)
    throws IOException, InterruptedException { }

protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
    throws IOException, InterruptedException { }

public void run(Context context)
    throws IOException, InterruptedException { }
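
Again as a paraphrase of the framework source, the Reducer's default run drives its hooks the same way, except that reduce is called once per distinct key, with an Iterable over all the values the shuffle grouped under that key:

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        // One call to reduce() per distinct key; context.getValues()
        // iterates over every value that was shuffled to this key.
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
    } finally {
        cleanup(context);
    }
}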

The Context(s)

Finally, we should consider the Context classes. These are inner classes of the Mapper and Reducer classes. We are interested in the method:

write(KEYOUT, VALUEOUT)

(inherited from org.apache.hadoop.mapreduce.TaskInputOutputContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>)

Creating the project

We'll start by creating a WordCount project in Eclipse.


Importing resources

Unzip the resources.tar.gz file you downloaded and change to the folder where you extracted the files.

Importing the WordCount.java skeleton file:
cp WordCount.java ~/Projects/WordCount/src/

Setting the classpath:
sed s?YOUR_PATH_TO_YARN?-? classpath > ~/Projects/WordCount/.classpath

Eclipse

Fill in the missing parts in Eclipse!

Missing parts

MyMapper fields:

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

MyMapper map function body:

Scanner scanner = new Scanner(value.toString());
scanner.useDelimiter(" ");
while (scanner.hasNext()) {
    word.set(scanner.next());
    context.write(word, one);
}
scanner.close();

MyReducer reduce function body:

int sum = 0;
for (IntWritable value : values) {
    sum += value.get();
}
context.write(key, new IntWritable(sum));
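
Putting the pieces together, a complete WordCount might look like the following sketch. This is only an assumption about the skeleton's overall shape: the slides show just the fragments above, and the actual skeleton evidently also parses the -m and -r flags used on the command line later, which this minimal driver does not.

import java.io.IOException;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class MyMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tokenize the line on spaces and emit (word, 1) for each token.
            Scanner scanner = new Scanner(value.toString());
            scanner.useDelimiter(" ");
            while (scanner.hasNext()) {
                word.set(scanner.next());
                context.write(word, one);
            }
            scanner.close();
        }
    }

    public static class MyReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            // Sum the per-word counts produced by the mappers.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}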

Running WordCount

Start Hadoop:
start-dfs.sh
start-yarn.sh

A quick check:
$ jps
90540 Jps
90192 DataNode
90298 SecondaryNameNode
90502 NodeManager
90412 ResourceManager
90106 NameNode

Running WordCount (continued)

Midsummer Night's Dream:
hadoop fs -put MidSummerNightsDream.txt /in.txt
$ hadoop fs -ls /
92529 2014-03-28 12:08 /in.txt

Submit the jar file:
yarn jar ~/Projects/WordCount/WordCount.jar -m 4 -r 2 /in.txt /out

Running WordCount (continued)

Check the output:
hadoop fs -ls /out
0 2014-03-28 12:22 /out/_SUCCESS
22172 2014-03-28 12:22 /out/part-r-00000
22328 2014-03-28 12:22 /out/part-r-00001
hadoop fs -cat /out/part-r-00000 | less
hadoop fs -cat /out/part-r-00001 | less

DegreeCalculator

With this example we will show:
how to let Hadoop parse common arguments automatically;
how to pass arguments to the mappers and the reducers.

Missing code parts I

Constant declaration:

public final static String KEEP_ABOVE = "KEEP_ABOVE";

DegreeCalculator class declaration:

public class DegreeCalculator extends Configured implements Tool { }

run(String[] args) method:

Configuration conf = this.getConf();
int keepAbove = -1;
if (args.length > 2) {
    keepAbove = Integer.parseInt(args[2]);
}
conf.setInt(KEEP_ABOVE, keepAbove);
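
The slides leave the rest of run() to be filled in. A plausible continuation (an assumption; the Text/IntWritable type choices follow the MyReducer shown below, and the mapper class is the hypothetical one sketched at the end) builds and submits the job with the configured conf:

Job job = Job.getInstance(conf, "degree-calculator");
job.setJarByClass(DegreeCalculator.class);
job.setMapperClass(MyMapper.class);      // hypothetical mapper, sketched later
job.setReducerClass(MyReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;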

Missing code parts II

main(String[] args) method:

Configuration conf = new Configuration(); // assumed: the slide omits this line
DegreeCalculator dc = new DegreeCalculator();
dc.setConf(conf);
int res = ToolRunner.run(dc, args);
System.exit(res);

MyReducer class I

public static class MyReducer extends Reducer<Text, Text, Text, IntWritable> {

    IntWritable degree = new IntWritable();
    String keyVal;
    int keepAbove = -1;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        super.setup(context);
        keepAbove = context.getConfiguration().getInt(
                DegreeCalculator.KEEP_ABOVE, -1);
    }

MyReducer class II

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int size = 0;
        keyVal = key.toString();
        for (Text t : values) {
            if (!t.toString().equals(keyVal)) // remove self-loops
                size++;
        }
        if (keepAbove < size) {
            degree.set(size);
            context.write(key, degree);
        }
    }
}
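
The slides do not show the DegreeCalculator mapper. For completeness, here is a hypothetical sketch consistent with the reducer above: it assumes the input is an undirected edge list with one "u v" pair per line, and emits each endpoint under the other, so that the reducer receives all neighbors of a node together (self-loops included, which it then filters out).

public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text src = new Text();
    private final Text dst = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] endpoints = value.toString().trim().split("\\s+");
        if (endpoints.length < 2) return; // skip blank or malformed lines
        src.set(endpoints[0]);
        dst.set(endpoints[1]);
        context.write(src, dst); // dst is a neighbor of src
        context.write(dst, src); // undirected: src is a neighbor of dst
    }
}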