Working With Hadoop



Now that we have covered the basics of MapReduce, let's look at some Hadoop specifics. The material is mostly based on Tom White's book "Hadoop: The Definitive Guide", 3rd edition.

Note: We will use the new org.apache.hadoop.mapreduce API, but
- many existing programs might be written using the old API org.apache.hadoop.mapred, and
- some old libraries might only support the old API.

Important Terminology

- NameNode daemon
  - Corresponds to the GFS master
  - Runs on the master node of the Hadoop Distributed File System (HDFS)
  - Directs DataNodes to perform their low-level I/O tasks
- DataNode daemon
  - Corresponds to a GFS chunkserver
  - Runs on each slave machine in the HDFS
  - Does the low-level I/O work
- Secondary NameNode daemon
  - One per cluster, monitors the status of HDFS
  - Takes snapshots of the HDFS metadata to facilitate recovery from NameNode failure
- JobTracker daemon
  - The MapReduce master in the Google paper
  - One per cluster, usually running on the master node
  - Communicates with the client application and controls MapReduce execution in the TaskTrackers
- TaskTracker daemon
  - The MapReduce worker in the Google paper
  - One TaskTracker per slave node
  - Performs the actual Map and Reduce execution
  - Can spawn multiple JVMs to do the work

Typical setup
- NameNode and JobTracker run on the cluster head node
- DataNode and TaskTracker run on all other nodes
- Secondary NameNode runs on a dedicated machine or on the cluster head node (usually not a good idea, but OK for small clusters)

Anatomy of a MapReduce Job Run

(The slide shows an illustration based on White's book; the numbered steps are:)
1. The MapReduce program on the client node runs the job through the Job object (client JVM)
2. The Job gets a new job ID from the JobTracker
3. The client copies the job resources (job JAR, config file, computed input split info) to HDFS
4. The client submits the job to the JobTracker
5. The JobTracker initializes the job
6. The JobTracker retrieves the input split info from HDFS
7. A TaskTracker sends a heartbeat (slots free); the JobTracker replies with a task assignment
8. The TaskTracker retrieves the job resources from HDFS
9. The TaskTracker launches a child JVM
10. The child JVM runs the Map or Reduce task

Job Submission

- The client submits a MapReduce job through a Job.submit() call
  - waitForCompletion() submits the job and polls the JobTracker about progress every second, printing to the console whenever the progress has changed
- Job submission process
  - Get a new job ID from the JobTracker
  - Determine the input splits for the job
  - Copy the job resources (job JAR file, configuration file, computed input splits) to HDFS, into a directory named after the job ID
  - Inform the JobTracker that the job is ready for execution

Job Initialization

- The JobTracker puts the ready job into an internal queue
- The job scheduler picks a job from the queue and initializes it by creating a job object
- It then creates the list of tasks
  - One map task for each input split
  - The number of reduce tasks is determined by the mapred.reduce.tasks property of the Job, which is set by setNumReduceTasks()
- Tasks then need to be assigned to worker nodes

Task Assignment

- TaskTrackers send heartbeats to the JobTracker
  - They indicate if they are ready to run new tasks
  - The number of task slots depends on the number of cores and the memory size
- The JobTracker replies with a new task
  - It chooses a task from the first job in its priority queue
  - It chooses map tasks before reduce tasks
  - It chooses the map task whose input split location is closest to the machine running the TaskTracker instance; ideal case: a data-local task
  - Other scheduling policies could also be used

Task Execution

- The TaskTracker copies the job JAR and other configuration data (e.g., the distributed cache) from HDFS to local disk
- It creates a local working directory
- It creates a TaskRunner instance
- The TaskRunner launches a new JVM (or reuses one from another task) to execute the JAR

Monitoring Job Progress

- Tasks report progress to their TaskTracker
- The TaskTracker includes task progress in its heartbeat messages to the JobTracker
- The JobTracker computes the global status of job progress
- The JobClient polls the JobTracker regularly for the status, which is visible on the console and in the Web UI

Handling Failures: Task

- Errors are reported to the TaskTracker and logged
- Hanging tasks are detected through a timeout
- The JobTracker automatically re-schedules failed tasks
  - It tries up to mapred.map.max.attempts times (similar for reduce)
  - The job is aborted when the task failure rate exceeds mapred.max.map.failures.percent (similar for reduce)
  - (The driver sketch below shows how these properties and the number of reduce tasks can be set.)
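To make the configuration hooks just mentioned concrete, here is a minimal driver sketch. It is not from the slides: the class name and values are made up for illustration, no mapper or reducer classes are set (so the identity implementations would run), and it uses the old Hadoop 1.x property names quoted above (newer releases renamed these keys).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobKnobsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapred.map.max.attempts", 4);          // attempts before a map task counts as failed
    conf.setInt("mapred.max.map.failures.percent", 5);  // % of failed map tasks tolerated before aborting the job

    Job job = new Job(conf, "knobs example");
    job.setJarByClass(JobKnobsExample.class);
    job.setNumReduceTasks(8);                           // sets mapred.reduce.tasks

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // waitForCompletion(true) submits the job and polls the JobTracker,
    // printing progress to the console as described above.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}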

Handling Failures: TaskTracker and JobTracker

- A TaskTracker failure is detected by the JobTracker through missing heartbeat messages
- The JobTracker re-schedules the map tasks and the not-yet-completed reduce tasks that ran on that TaskTracker
- Hadoop cannot deal with JobTracker failure
  - One could use Google's proposed take-over idea for the master, using ZooKeeper to make sure there is at most one JobTracker

Moving Data From Mappers to Reducers

- The shuffle and sort phase is a synchronization barrier between the map and reduce phases
- It is often one of the most expensive parts of a MapReduce execution
- Mappers need to separate output intended for different reducers
- Reducers need to collect their data from all mappers and group it by key
- Keys at each reducer are processed in order

Shuffle and Sort Overview

(The slide shows an illustration based on White's book. The recoverable details:)
- Map side: map output is collected in an in-memory buffer and spilled to a new disk file when the buffer is almost full; spill files on disk are partitioned by reduce task, with each partition sorted by key; the spill files are then merged into a single output file
- Reduce side: a reduce task starts copying data from a map task (fetching it over HTTP) as soon as that map task completes; the merge happens in memory if the data fits, otherwise also on disk; the reducer cannot start working on the data until all mappers have finished and their data has arrived
- There are tuning parameters to control the performance of this crucial phase

NCDC Weather Data Example

Raw data has lines like these (year and temperature were highlighted on the slide):

0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999

- Goal: find the maximum temperature for each year
- Map: emit (year, temp) for each record
- Reduce: compute the max over temp from the (year, (temp, temp, ...)) list

Map

- Hadoop's Mapper class: org.apache.hadoop.mapreduce.Mapper
- Type parameters: input key type, input value type, output key type, and output value type
  - Input key: the line's offset in the file (irrelevant here)
  - Input value: a line from the NCDC file
  - Output key: the year
  - Output value: the temperature
- The data types are optimized for network serialization and are found in the org.apache.hadoop.io package
- The work is done by the map() method

Map() Method

- Input: input key type, input value type (and a Context)
  - The line of text from the NCDC file is converted to a Java String, then parsed to get the year and the temperature
- Output: written using the Context, with the output key and value types
- The (year, temp) pair is only written if the temperature is present and the quality indicator reading is OK

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') {  // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

Reduce

- Implements org.apache.hadoop.mapreduce.Reducer
- Its input key and value types must match the Mapper's output key and value types
- The work is done by the reduce() method
  - The input values are passed as an Iterable list
  - It goes over all temperatures to find the max
  - The result pair is written using the Context
- The result is written to HDFS, Hadoop's distributed file system

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

Job Configuration

- The Job object forms the job specification and gives control for running the job
- Specify the data input path using addInputPath()
  - Can be a single file, a directory (to use all files there), or a file pattern
  - Can be called multiple times to add multiple paths
- Specify the output path using setOutputPath()
  - A single output path, which is a directory for all output files
- Set the mapper and reducer classes to be used
- Set the output key and value classes for the map and reduce functions
  - For the reducer: setOutputKeyClass(), setOutputValueClass()
  - For the mapper (omit if same as the reducer): setMapOutputKeyClass(), setMapOutputValueClass()
- The input types can be set similarly (the default is TextInputFormat)
- The method waitForCompletion() submits the job and waits for it to finish

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Extension: Combiner Functions

- Recall the earlier discussion about combiner functions
  - A combiner pre-reduces the mapper output before it is transferred to the reducers
  - It does not change the program semantics
- It is usually (almost) the same as the reduce function, but it has to have the same output type as the Map
- It works only for reduce functions that can be computed incrementally
  - MAX(5, 4, 1, 2) = MAX(MAX(5, 1), MAX(4, 2))
  - The same holds for SUM, MIN, and COUNT, and for AVG when it is computed as SUM/COUNT (averaging partial averages directly would give the wrong result)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> " + "<output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);  // Note: the combiner here is identical to the reducer class
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Extension: Custom Partitioner

- The partitioner determines which keys are assigned to which reduce task
- The default HashPartitioner essentially assigns keys randomly
- Create a custom partitioner by implementing your own getPartition() method of Partitioner in org.apache.hadoop.mapreduce (a sketch appears after this section)

MapReduce Development Steps

1. Write the Map and Reduce functions
   - Create unit tests
2. Write a driver program to run the job
   - It can be run from the IDE with a small data subset for testing
   - If a test fails, use the IDE for debugging
   - Update the unit tests and the Map/Reduce functions if necessary
3. Once the program works on the small test set, run it on the full data set
   - If there are problems, update tests and code accordingly
4. Fine-tune the code, do some profiling

Local (Standalone) Mode

- Runs the same MapReduce user program as the cluster version, but does it sequentially
- Does not use any of the Hadoop daemons
- Works directly with the local file system
  - No HDFS, hence no need to copy data to/from HDFS
- Great for development, testing, and initial debugging

Pseudo-Distributed Mode

- Still runs on a single machine, but now simulates a real Hadoop cluster
  - Simulates multiple nodes
  - Runs all daemons
  - Uses HDFS
- Main purpose: more advanced testing and debugging
- You can also set this up on your laptop

Extension: MapFile

- A sorted file of (key, value) pairs with an index for lookups by key
- New entries must be appended in order
- A MapFile can be created by sorting a SequenceFile
- The value for a specific key is retrieved by calling MapFile's get() method
  - It is found by performing a binary search on the index
- The method getClosest() finds the closest match to the search key
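The custom partitioner mentioned above can be sketched as follows. This example is not from the slides: the DecadePartitioner class and the decade-based routing of the NCDC (year, temperature) pairs are made up for illustration; the Partitioner base class, the getPartition() signature, and Job.setPartitionerClass() are the actual Hadoop APIs.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: send all years of one decade to the same reducer.
public class DecadePartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    int decade = Integer.parseInt(key.toString()) / 10;
    // Mask the sign bit before taking the modulo (as HashPartitioner does),
    // so the result always lies in [0, numPartitions).
    return (decade & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered in the driver with job.setPartitionerClass(DecadePartitioner.class).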

Extension: Counters

- Useful to get statistics about the MapReduce job, e.g., how many records were discarded in the Map phase
- Difficult to implement from scratch: mappers and reducers would need to communicate to compute a global counter
- Hadoop has built-in support for counters (a sketch follows at the end of this section)
- See ch. 8 in Tom White's book for details

Hadoop Job Tuning

- Choose an appropriate number of mappers and reducers
- Define combiners whenever possible
  - But see also the later discussion about local aggregation
- Consider compressing the map output
- Optimize the expensive shuffle phase (between mappers and reducers) by setting its tuning parameters
- Profiling distributed MapReduce jobs is challenging

Hadoop and Other Programming Languages

- Hadoop Streaming
  - API to write map and reduce functions in languages other than Java
  - Works with any language that can read from standard input and write to standard output
- Hadoop Pipes
  - API for using C++
  - Uses sockets to communicate with Hadoop's task trackers

Multiple MapReduce Steps

- Example: find the average maximum temperature for every day of the year and every weather station
  - Find the max temperature for each combination of station and day/month/year
  - Compute the average for each combination of station and day/month
- Can be done in two MapReduce jobs
- Could also be combined into a single job, which would be faster

Running a MapReduce Workflow

- Linear chain of jobs
  - To run job2 after job1, create JobConf objects conf1 and conf2 in the main function
  - Call JobClient.runJob(conf1); then JobClient.runJob(conf2);
  - Catch exceptions to re-start failed jobs in the pipeline
- More complex workflows
  - Use JobControl from org.apache.hadoop.mapred.jobcontrol
  - We will see soon how to use Pig for this

MapReduce Coding Summary

- Decompose the problem into an appropriate workflow of MapReduce jobs
- For each job, implement the following
  - Job configuration
  - Map function
  - Reduce function
  - Combiner function (optional)
  - Partition function (optional)
- Might have to create custom data types as well
  - WritableComparable for keys
  - Writable for values
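As a concrete illustration of the built-in counter support described above, here is a hedged sketch of a mapper that tallies discarded NCDC records. It is not from the slides: the class name and counter names are made up, and the quality check of the earlier mapper is omitted for brevity; context.getCounter() and Counter.increment() are the real Hadoop APIs.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical variant of the NCDC mapper that counts discarded records.
public class CountingMaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Each enum constant becomes one global counter, aggregated by the framework
  // across all map tasks and reported with the job status.
  enum Records { MALFORMED, MISSING_TEMPERATURE }

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    if (line.length() < 93) {  // too short to parse
      context.getCounter(Records.MALFORMED).increment(1);
      return;
    }
    String year = line.substring(15, 19);
    int airTemperature = Integer.parseInt(
        line.charAt(87) == '+' ? line.substring(88, 92) : line.substring(87, 92));
    if (airTemperature == MISSING) {
      context.getCounter(Records.MISSING_TEMPERATURE).increment(1);
      return;
    }
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}

Running the job would then report these two counters alongside Hadoop's built-in ones on the console and in the Web UI.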