Hadoop WordCount Explained! IT332 Distributed Systems




Typical problem solved by MapReduce
Read a lot of data
Map: extract something you care about from each record
Shuffle and Sort
Reduce: aggregate, summarize, filter, or transform
Write the results
The outline stays the same; only Map and Reduce change to fit the problem.

Data Flow
1. Mappers read from HDFS.
2. Map output is partitioned by key and sent to Reducers.
3. Reducers sort their input by key.
4. Reduce output is written to HDFS.

Detailed Hadoop MapReduce data flow

MapReduce Paradigm
Basic data type: the key/value pair (k, v). For example, key = URL, value = HTML of the web page.
The programmer specifies two primary methods:
Map: (k, v) -> <(k1, v1), (k2, v2), (k3, v3), ..., (kn, vn)>
Reduce: (k', <v1, v2, ..., vn>) -> <(k', v'1), (k', v'2), ..., (k', v'n)>
All v' with the same k' are reduced together. (Remember the invisible Shuffle and Sort step.)

MapReduce Paradigm
Shuffle: each Reducer's input is the grouped output of the Mappers. In this phase the framework fetches, for each Reducer, the relevant partition of the output of all the Mappers, via HTTP.
Sort: in this stage the framework groups Reducer inputs by key (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously, i.e. outputs are merged while they are being fetched. The Hadoop framework handles the Shuffle and Sort step.

MapReduce Paradigm: WordCount flow
Key/value pairs: (fileoffset, line) -> Map -> (word, 1) -> Reduce -> (word, n)
The file offset is the position of the line within the file.
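The (fileoffset, line) -> (word, 1) -> (word, n) flow above can be sketched in plain Java, with no Hadoop installation required. The class and method names below are purely illustrative, not part of any Hadoop API; the in-memory TreeMap stands in for the shuffle/sort step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of the WordCount data flow (illustrative, not Hadoop API).
public class WordCountFlow {

    // "Map" step: emit a (word, 1) pair for every token in every input line.
    static List<String[]> map(List<String> lines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tok = new StringTokenizer(line);
            while (tok.hasMoreTokens()) {
                pairs.add(new String[] { tok.nextToken(), "1" });
            }
        }
        return pairs;
    }

    // "Shuffle/sort + reduce" step: group the pairs by key (TreeMap keeps the
    // keys sorted, mimicking the sort phase) and sum the values per key.
    static TreeMap<String, Integer> reduce(List<String[]> pairs) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String[] kv : pairs) {
            counts.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return counts;
    }
}
```

Running map() then reduce() over the lines "to be or" and "not to be" yields be=2, not=1, or=1, to=2, exactly the (word, n) pairs the slide describes.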

WordCount in Web Pages
A typical exercise for a new Google engineer in his or her first week.
Input: files with one document per record.
Specify a map function that takes a key/value pair: key = document URL, value = document contents.
The output of the map function is (potentially many) key/value pairs. In our case, output (word, 1) once per word in the document.

WordCount in Web Pages
The MapReduce library gathers together all pairs with the same key (shuffle/sort).
Specify a reduce function that combines the values for a key. In our case, compute the sum:
key = "be",  values = 1, 1 -> 2
key = "not", values = 1    -> 1
key = "or",  values = 1    -> 1
key = "to",  values = 1, 1 -> 2
The output of reduce (usually 0 or 1 value) is paired with the key and saved.

MapReduce Program
A MapReduce program consists of the following three parts:
Driver: main (triggers the map and reduce methods)
Mapper
Reducer
It is better to place the map, reduce, and main methods in three different classes.

Mapper

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

Mapper

//Map class header
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable>

extends MapReduceBase: the base class for Mapper and Reducer implementations. It provides default no-op implementations for a few methods; most non-trivial applications need to override some of them.
implements Mapper: the interface Mapper<K1, V1, K2, V2> declares the map method. The first pair <K1, V1> is the input key/value pair; the second <K2, V2> is the output key/value pair.
<LongWritable, Text, Text, IntWritable>: the data types used here are Hadoop-specific types designed for operational efficiency, suited for massively parallel and very fast read/write operations. All of them are based on Java's own types: LongWritable is the equivalent of long, IntWritable of int, and Text of String.

Mapper

//Hadoop-supported data types for the key/value pairs
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

//Map method header
public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {

LongWritable key, Text value: the data types of the input key and value to the mapper.
OutputCollector<Text, IntWritable> output: the output collector <K, V> collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job. Here K and V are the output key and value from the mapper, e.g. <the, 1>.
Reporter reporter: the reporter is used to report the task status internally in the Hadoop environment to avoid timeouts.

Mapper

public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  //Convert the input line from Text to a String
  String line = value.toString();
  //Use a tokenizer to split the line into words
  StringTokenizer tokenizer = new StringTokenizer(line);
  //Iterate through the words and form key/value pairs
  while (tokenizer.hasMoreTokens()) {
    //Assign each word from the tokenizer (of String type) to a Text word
    word.set(tokenizer.nextToken());
    //Form a key/value pair for each word as <word, one> and push it
    //to the output collector
    output.collect(word, one);
  }
}
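As a stand-alone illustration of the tokenization step, the snippet below reproduces the StringTokenizer loop in plain Java, outside of Hadoop. One detail worth noticing: StringTokenizer splits on whitespace only, so punctuation stays attached to a word, and "be," and "be" would end up as different keys in the real job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Stand-alone demo of the map() method's tokenization step (illustrative only).
public class TokenizeDemo {
    // Split a line exactly the way the mapper does: on whitespace.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(line);
        while (tok.hasMoreTokens()) {
            out.add(tok.nextToken());
        }
        return out;
    }
}
```

For the line "to be, or not" this produces the four tokens to, "be,", or, not; the comma stays attached.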

Reducer

//Class header similar to the one in Map
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  //Reduce method header similar to the one in map, with different key/value
  //data types. Data from map will be <word, {1, 1, ...}>, so we receive the
  //values as an Iterator and can go through the set of values
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    //Initialize a variable sum as 0
    int sum = 0;
    //Iterate through all the values with respect to a key and sum them up
    while (values.hasNext()) {
      sum += values.next().get();
    }
    //Push the key and the obtained sum as value to the output collector
    output.collect(key, new IntWritable(sum));
  }
}

Main (The Driver)
Given the Mapper and Reducer code, the short main() starts the MapReduce job. The Hadoop system picks up a bunch of values from the command line on its own. The main() also specifies a few key parameters of the problem in the JobConf object. JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution (such as which Map and Reduce classes to use and the format of the input and output files). Other parameters, e.g. the number of machines to use, are optional, and the system will determine good values for them if not specified. The framework then tries to faithfully execute the job as described by the JobConf.

Main (The Driver)
Suppose we ran word count on a book: how many (the, 1) pairs would we have to share with other machines?
Solution: we can perform a "local reduce" on the outputs produced by a single Mapper, before the intermediate values are shuffled (expensive I/O) to the Reducers.
The Combiner class is this "mini-reduce"; use it when the operation is commutative and associative. SUM is commutative and associative; median is an example that doesn't work.
Example: machine A emits <the, 1>, <the, 1> and machine B emits <the, 1>. A Combiner on machine A emits <the, 2>. This value, along with the <the, 1> from machine B, goes to the Reducer node. We have saved bandwidth but preserved the computation. (This is why our reducer actually reads the value out of its input, instead of simply assuming the value is 1.)
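The machine A / machine B example above can be sketched in plain Java, under the assumption that a combiner is just the reduce operation run locally on one mapper's output. The names below are illustrative, not Hadoop API.

```java
import java.util.List;
import java.util.TreeMap;

// Plain-Java sketch of the combiner idea: each "machine" pre-sums its own
// (word, n) pairs before anything crosses the network. Because SUM is
// commutative and associative, the final result is unchanged.
public class CombinerDemo {

    // Local "mini-reduce" on one mapper's output: <the,1>,<the,1> -> <the,2>.
    static TreeMap<String, Integer> combine(List<String[]> mapperOutput) {
        TreeMap<String, Integer> local = new TreeMap<>();
        for (String[] kv : mapperOutput) {
            local.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return local;
    }

    // Final reduce over the (already partially summed) values from all machines.
    static TreeMap<String, Integer> reduce(List<TreeMap<String, Integer>> perMachine) {
        TreeMap<String, Integer> total = new TreeMap<>();
        for (TreeMap<String, Integer> m : perMachine) {
            m.forEach((k, v) -> total.merge(k, v, Integer::sum));
        }
        return total;
    }
}
```

Machine A's two <the, 1> pairs combine into one <the, 2>, halving what A sends over the network, and the reducer still arrives at <the, 3> overall.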

Main

public static void main(String[] args) throws Exception {
  //Create a JobConf object and assign a job name for identification purposes
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  //Set the configuration object with the data types of the output key and
  //value for map and reduce; if they have different output types, there are
  //other set methods for them
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  //Provide the mapper and reducer class names
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class); //set the combiner class
  conf.setReducerClass(Reduce.class);

  //The default input format, TextInputFormat, will load data in as
  //(LongWritable, Text) pairs. The long value is the byte offset of the
  //line in the file.
  conf.setInputFormat(TextInputFormat.class);
  //The basic (default) instance is TextOutputFormat, which writes
  //(key, value) pairs on individual lines of a text file.
  conf.setOutputFormat(TextOutputFormat.class);

  //The HDFS input and output directories, fetched from the command line
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  //Submit the job to MapReduce; returns only after the job has completed
  JobClient.runJob(conf);
}

Chaining Jobs Not every problem can be solved with a MapReduce program, but fewer still are those which can be solved with a single MapReduce job. Many problems can be solved with MapReduce, by writing several MapReduce steps which run in series to accomplish a goal: Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3 You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job. Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, then call the next driver method, which creates a new JobConf object referring to different instances of Mapper and Reducer, etc.
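The Map1 -> Reduce1 -> Map2 -> Reduce2 chaining pattern can be illustrated in plain Java, with each "job" reduced to an in-memory function whose output feeds the next. The two jobs here (word count, then a count-of-counts over its output) are made up for illustration; in real Hadoop each would be its own driver method with its own JobConf.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative plain-Java analogue of job chaining: the output of job 1
// becomes the input of job 2 (not Hadoop API).
public class ChainDemo {

    // Job 1: word count over one line of input.
    static TreeMap<String, Integer> job1(String line) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String w : line.split("\\s+")) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Job 2: consume job 1's output records and invert them into a
    // "count of counts" (how many distinct words occurred n times).
    static TreeMap<Integer, Integer> job2(Map<String, Integer> counts) {
        TreeMap<Integer, Integer> byFrequency = new TreeMap<>();
        for (int n : counts.values()) {
            byFrequency.merge(n, 1, Integer::sum);
        }
        return byFrequency;
    }
}
```

For "to be or not to be", job1 yields {be=2, not=1, or=1, to=2}, and job2 over that output yields {1=2, 2=2}: two words appearing once, two words appearing twice.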

How it Works?
Interface Writable
Purpose: simple serialization for keys, values, and other data.
Reads and writes a binary format; convert to String for text formats.

Interface Text
Stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level.
Methods: set(byte[] utf8)
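What "stores text as UTF-8 and compares at the byte level" means can be shown with standard-library Java. Hadoop's Text class has its own Writable serialization machinery; the getBytes/Arrays.equals calls below merely stand in for the idea.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Plain-Java stand-in for Text's UTF-8 storage and byte-level comparison.
public class Utf8Demo {
    // Serialize a string to its UTF-8 byte representation.
    static byte[] serialize(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Deserialize UTF-8 bytes back into a string.
    static String deserialize(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Compare two texts at the byte level, with no decoding needed.
    static boolean sameBytes(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }
}
```

Note that byte length and character length differ: "h\u00e9llo" is five characters but six UTF-8 bytes, since \u00e9 encodes as two bytes.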

Classes
OutputCollector: collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.
Methods: collect(K key, V value)
The classes and interfaces above are linked in the previous slides.

Exercises
Get WordCount running!
Extend WordCount to count bigrams: bigrams are simply sequences of two consecutive words. For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.