Map Reduce Workflows



Similar documents
Virtual Machine (VM) For Hadoop Training

Hadoop Streaming coreservlets.com and Dima May coreservlets.com and Dima May

Apache Pig Joining Data-Sets

Advanced Java Client API

HBase Key Design coreservlets.com and Dima May coreservlets.com and Dima May

Hadoop Distributed File System (HDFS) Overview

HBase Java Administrative API

Hadoop Introduction coreservlets.com and Dima May coreservlets.com and Dima May

HDFS - Java API coreservlets.com and Dima May coreservlets.com and Dima May

HDFS Installation and Shell

MapReduce on YARN Job Execution

The Google Web Toolkit (GWT): The Model-View-Presenter (MVP) Architecture Official MVP Framework

Java with Eclipse: Setup & Getting Started

Building Web Services with Apache Axis2

Official Android Coding Style Conventions

Android Programming: 2D Drawing Part 1: Using ondraw

The Google Web Toolkit (GWT): Declarative Layout with UiBinder Basics

Managed Beans II Advanced Features

Android Programming: Installation, Setup, and Getting Started

Android Programming Basics

JHU/EP Server Originals of Slides and Source Code for Examples:

Word Count Code using MR2 Classes and API

The Google Web Toolkit (GWT): Overview & Getting Started

Session Tracking Customized Java EE Training:

Application Security

Servlet and JSP Filters

Handling the Client Request: Form Data

ITG Software Engineering

Localization and Resources

For live Java EE training, please see training courses

Object-Oriented Programming in Java: More Capabilities

Introduc)on to. Eric Nagler 11/15/11

Web Applications. For live Java training, please see training courses at

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2

Web Applications. Originals of Slides and Source Code for Examples:

Package hive. January 10, 2011

BIG DATA - HADOOP PROFESSIONAL amron

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Important Notice. (c) Cloudera, Inc. All rights reserved.

What servlets and JSP are all about

Debugging Ajax Pages: Firebug

CS 378 Big Data Programming. Lecture 2 Map- Reduce

Basic Java Syntax. Slides 2016 Marty Hall,

Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology

Running Hadoop on Windows CCNP Server

2011 Marty Hall An Overview of Servlet & JSP Technology Customized Java EE Training:

Package hive. July 3, 2015

University of Maryland. Tuesday, February 2, 2010

Introduction to Cloud Computing

CS 378 Big Data Programming

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Hadoop: The Definitive Guide

Enterprise Application Development In Java with AJAX and ORM

Hadoop Configuration and First Examples

Big Data and Scripting map/reduce in Hadoop

Hadoop Basics with InfoSphere BigInsights

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015

Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.

Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay

Introduc)on to Map- Reduce. Vincent Leroy

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

An Overview of Servlet & JSP Technology

Qsoft Inc

& JSP Technology Originals of Slides and Source Code for Examples:

Strategies for scheduling Hadoop Jobs. Pere Urbon-Bayes

Distributed Computing and Big Data: Hadoop and MapReduce

Peers Techno log ies Pv t. L td. HADOOP

CS506 Web Design and Development Solved Online Quiz No. 01

A very short Intro to Hadoop

L1: Introduction to Hadoop

Web Development in Java

Extreme Computing. Hadoop MapReduce in more detail.

MarkLogic Server. MarkLogic Connector for Hadoop Developer s Guide. MarkLogic 8 February, 2015

Xiaoming Gao Hui Li Thilina Gunarathne

The Hadoop Eco System Shanghai Data Science Meetup

OUR COURSES 19 November All prices are per person in Swedish Krona. Solid Beans AB Kungsgatan Göteborg Sweden

Hadoop Tutorial. General Instructions

This material is built based on, Patterns covered in this class FILTERING PATTERNS. Filtering pattern

Tutorial- Counting Words in File(s) using MapReduce

Software Development Interactief Centrum voor gerichte Training en Studie Edisonweg 14c, 1821 BN Alkmaar T:

Mrs: MapReduce for Scientific Computing in Python

Getting to know Apache Hadoop

Hadoop Ecosystem B Y R A H I M A.

COURSE CONTENT Big Data and Hadoop Training

ITG Software Engineering

Recommended Literature for this Lecture

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

IDS 561 Big data analytics Assignment 1

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Data processing goes big

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 308. Coding Conventions. Reference

Elastic Map Reduce. Shadi Khalifa Database Systems Laboratory (DSL)

Transcription:

2012 coreservlets.com and Dima May Map Reduce Workflows Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses (onsite or at public venues) http://courses.coreservlets.com/hadoop-training.html Customized Java EE Training: http://courses.coreservlets.com/ Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jquery, Spring, Hibernate, RESTful Web Services, Android. Developed and taught by well-known author and developer. At public venues or onsite at your location. 2012 coreservlets.com and Dima May For live customized Hadoop training (including prep for the Cloudera certification exam), please email info@coreservlets.com Taught by recognized Hadoop expert who spoke on Hadoop several times at JavaOne, and who uses Hadoop daily in real-world apps. Available at public venues, or customized versions can be held on-site at your organization. Courses developed and taught by Marty Hall JSF 2.2, PrimeFaces, servlets/jsp, Ajax, jquery, Android development, Java 7 or 8 programming, custom mix of topics Courses Customized available in any state Java or country. EE Training: Maryland/DC http://courses.coreservlets.com/ area companies can also choose afternoon/evening courses. Hadoop, Courses Java, developed JSF 2, PrimeFaces, and taught Servlets, by coreservlets.com JSP, Ajax, jquery, experts Spring, (edited Hibernate, by Marty) RESTful Web Services, Android. Spring, Hibernate/JPA, GWT, Hadoop, HTML5, RESTful Web Services Developed and taught by well-known author and developer. At public venues or onsite at your location. Contact info@coreservlets.com for details

Agenda Workflows Introduction Decomposing Problems into MapReduce Workflow Using JobControl class 4 MapReduce Workflows We ve looked at single MapReduce job Complex processing requires multiple steps Usually manifest in multiple MapReduce jobs rather than complex map and reduce functions May also want to consider higher-level MapReduce abstractions Pig, Hive, Cascading, Cascalog, Crunch Focus on business logic rather than MapReduce translation On the other hand you ll need to learn another syntax and methodology This lecture will focus on building MapReduce Workflows 5

Decomposing Problems into MapReduce Jobs Small map-reduce jobs are usually better Easier to implement, test and maintain Easier to scale and re-use Problem: Find a letter that occurs the most in the provided body of text 6 Decomposing the Problem Calculate number of occurrences of each letter in the provided body of text Traverse each letter comparing occurrence count Produce start letter that has the most occurrences (For so this side of our known world esteem'd him) Did slay this Fortinbras; who, by a seal'd compact, Well ratified by law and heraldry, Did forfeit, with his life, all those his lands Which he stood seiz'd of, to the conqueror; Against the which a moiety competent Was gaged by our king; which had return'd To the inheritance of Fortinbras, A 89530 B 3920.... Z 876 T 495959 7

Decomposing the Problem Count Each Letter MapReduce A 101 B 20 C 300.. Y 3 Z 1 Find Max Letter MapReduce C 300 Calculate number of occurrences of each letter in the provided body of text Produce start letter that has the most occurrences 8 MapReduce Workflows Your choices can depend on complexity of workflows Linear chain or simple set of dependent Jobs vs. Directed Acyclic Graph (DAG) http://en.wikipedia.org/wiki/directed_acyclic_graph 1 2 4 5 3 Simple Dependency or Linear chain VS. 1 3 2 5 7 4 6 Directed Acyclic Graph (DAG) 9

MapReduce Workflows JobControl class Create simple workflows Represents a graph of Jobs to run Specify dependencies in code Oozie An engine to build complex DAG workflows Runs in its own daemon Describe workflows in set of XML and configuration files Has coordinator engine that schedules workflows based on time and incoming data Provides ability to re-run failed portions of the workflow 10 Workflow with JobControl 11 1. Create JobControl Implements java.lang.runnable, will need to execute within a Thread 2. For each Job in the workflow Construct ControlledJob Wrapper for Job instance Constructor takes in dependent jobs 3. Add each ControlledJob to JobControl 4. Execute JobControl in a Thread Recall JobControl implements Runnable 5. Wait for JobControl to complete and report results Clean-up in case of a failure

1: Create JobControl public class MostSeenStartLetterJobControl extends Configured implements Tool{ private final Logger log = Logger.getLogger(MostSeenStartLetterJobControl.class); @Override public int run(string[] args) throws Exception { String inputtext = args[0]; String finaloutput = args[1]; Manage intermediate output directory String intermediatetempdir = "/" + getclass().getsimplename() + "-tmp"; Path intermediatepath = new Path(intermediateTempDir); deleteintermediatedir(intermediatepath); 12 try { JobControl control = new JobControl("Worklfow-Example"); JobControl manages workflow 2: For Each Job in the Workflow Construct ControlledJob ControlledJob step1 = new ControlledJob( getcountjob(inputtext, intermediatetempdir), null); Returns Job object with properly configured job ControlledJob step2 = new ControlledJob( getmostseenjob(intermediatetempdir, finaloutput), Arrays.asList(step1)); Specify dependencies, In this case step2 depends on step1 13

2: Build First Job private Job getcountjob(string inputtext, String tempoutputpath) throws IOException { Job job = Job.getInstance(getConf(), "StartsWithCount"); job.setjarbyclass(getclass()); // configure output and input source TextInputFormat.addInputPath(job, new Path(inputText)); job.setinputformatclass(textinputformat.class); // configure mapper and reducer job.setmapperclass(startswithcountmapper.class); job.setcombinerclass(startswithcountreducer.class); job.setreducerclass(startswithcountreducer.class); // configure output TextOutputFormat.setOutputPath(job, new Path(tempOutputPath)); job.setoutputformatclass(textoutputformat.class); job.setoutputkeyclass(text.class); job.setoutputvalueclass(intwritable.class); 14 return job; } Build each job the same way we ve done before 2: Build Second Job private Job getmostseenjob(string intermediatetempdir, String finaloutput) throws IOException { Job job = Job.getInstance(getConf(), "MostSeenStartLetter"); job.setjarbyclass(getclass()); // configure output and input source KeyValueTextInputFormat.addInputPath(job, new Path(intermediateTempDir)); job.setinputformatclass(keyvaluetextinputformat.class); // configure mapper and reducer job.setmapperclass(mostseenstartlettermapper.class); job.setcombinerclass(mostseendstartletterreducer.class); job.setreducerclass(mostseendstartletterreducer.class); 15 } // configure output TextOutputFormat.setOutputPath(job, new Path(finalOutput)); job.setoutputformatclass(textoutputformat.class); job.setoutputkeyclass(text.class); job.setoutputvalueclass(intwritable.class); return job; Build each job the same way we ve done before

3: Add Each ControlledJob to JobControl control.addjob(step1); control.addjob(step2); 16 4: Execute JobControl in a Thread Thread workflowthread = new Thread(control, "Workflow-Thread"); workflowthread.setdaemon(true); workflowthread.start(); 17

5: Wait for JobControl to Complete while (!control.allfinished()){ Thread.sleep(500); } if ( control.getfailedjoblist().size() > 0 ){ log.error(control.getfailedjoblist().size() + " jobs failed!"); Simply wait for all jobs to complete for ( ControlledJob job : control.getfailedjoblist()){ log.error(job.getjobname() + " failed"); } Report results or Display errors } else { log.info("success!! Workflow completed [" + control.getsuccessfuljoblist().size() + "] jobs"); } 18 Run JobControl Example yarn jar $PLAY_AREA/HadoopSamples.jar \ mr.workflows.mostseenstartletterjobcontrol \ /training/data/hamlet.txt \ /training/playarea/wordcount Two Jobs got executed 19

JobControl Workflow Result $ hdfs dfs -cat /training/playarea/wordcount/part-r-00000 t 3711 20 Study MapReduce Algorithms Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer Download the book for free http://lintool.github.com/mapreducealgorithm s/index.html Buy a copy http://www.amazon.com/data-intensive- Processing-MapReduce-Synthesis- Technologies/dp/1608453421/ref=sr_1_1?ie=U TF8&qid=1340464379&sr=8-1&keywords=Data- Intensive+Text+Processing+with+MapReduce 21

Study MapReduce Patterns MapReduce Design Patterns Donald Miner (Author), Adam Shook (Author) O'Reilly Media (November 22, 2012) 22 2012 coreservlets.com and Dima May Wrap-Up Customized Java EE Training: http://courses.coreservlets.com/ Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jquery, Spring, Hibernate, RESTful Web Services, Android. Developed and taught by well-known author and developer. At public venues or onsite at your location.

Summary We learned how to Decompose Problems into MapReduce Workflow Utilize JobControl to implement MapReduce workflow 24 2012 coreservlets.com and Dima May Questions? More info: http://www.coreservlets.com/hadoop-tutorial/ Hadoop programming tutorial http://courses.coreservlets.com/hadoop-training.html Customized Hadoop training courses, at public venues or onsite at your organization http://courses.coreservlets.com/course-materials/java.html General Java programming tutorial http://www.coreservlets.com/java-8-tutorial/ Java 8 tutorial http://www.coreservlets.com/jsf-tutorial/jsf2/ JSF 2.2 tutorial http://www.coreservlets.com/jsf-tutorial/primefaces/ PrimeFaces tutorial http://coreservlets.com/ JSF 2, PrimeFaces, Java 7 or 8, Ajax, jquery, Hadoop, RESTful Web Services, Android, HTML5, Spring, Hibernate, Servlets, JSP, GWT, and other Java EE training Customized Java EE Training: http://courses.coreservlets.com/ Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jquery, Spring, Hibernate, RESTful Web Services, Android. Developed and taught by well-known author and developer. At public venues or onsite at your location.