Introduction to Oozie. Eric Nagler 11/15/11




What is Oozie? Oozie is a workflow scheduler for Hadoop. It was originally designed at Yahoo! for their complex search-engine workflows, and it is now an open-source Apache Incubator project.

What is Oozie? Oozie allows a user to create Directed Acyclic Graphs of workflows, which can be run sequentially or in parallel in Hadoop. Oozie can also run plain Java classes and Pig workflows, and it can interact with HDFS directly, which is nice if you need to delete or move files before a job runs. Oozie can run jobs sequentially (one after the other) and in parallel (multiple at a time).

Why use Oozie instead of just cascading jobs one after another? Major flexibility: you can start, stop, suspend, and re-run jobs. Oozie also allows you to restart from a failure: you can tell Oozie to restart a job from a specific node in the graph, or to skip specific failed nodes.

Other Features

- Java client API / command-line interface: launch, control, and monitor jobs from your Java applications.
- Web service API: you can control jobs from anywhere.
- Periodic jobs: have jobs that you need to run every hour, day, or week? Have Oozie run them for you.
- Email notification: receive an email when a job is complete.
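As an illustration of the command-line interface, these are the standard Oozie CLI verbs for inspecting and controlling a job. The server URL and the job id are placeholders, not values from the slides:

```
oozie job -oozie http://localhost:11000/oozie -info <job-id>      # show job status
oozie job -oozie http://localhost:11000/oozie -suspend <job-id>   # pause a running job
oozie job -oozie http://localhost:11000/oozie -resume <job-id>    # resume a suspended job
oozie job -oozie http://localhost:11000/oozie -kill <job-id>      # stop a job
```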

How do you make a workflow? First, make a Hadoop job and make sure that it works using the hadoop jar command; this ensures that the configuration is correct for your job. Make a jar out of your classes. Then make a workflow.xml file and copy all of the job configuration properties into the XML file. These include:

- Input files
- Output files
- Input readers and writers
- Mappers and reducers
- Job-specific arguments
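The steps above can be sketched as a minimal workflow.xml. This is an illustrative sketch, not the author's actual file: the application name, node names, and parameter names (wordcount-wf, ${jobTracker}, and so on) are placeholder assumptions, and it uses the Oozie 0.2 workflow schema, in which the terminal failure node described later in these slides is written as a <kill> element.

```xml
<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Word count failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```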

How do you make a workflow? You also need a job.properties file. This file defines the NameNode, JobTracker, etc., and it gives the location of the shared jars and other files. When you have these files ready, copy them into HDFS; you can then run the workflow from the command line.
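A job.properties file might look like the following. All hostnames, ports, and paths here are illustrative assumptions, not values from the slides:

```
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8021
queueName=default
inputDir=${nameNode}/user/${user.name}/input
outputDir=${nameNode}/user/${user.name}/output
oozie.wf.application.path=${nameNode}/user/${user.name}/wordcount-wf
```

With workflow.xml copied into HDFS (for example, hadoop fs -put workflow.xml /user/$USER/wordcount-wf/), the job can be submitted with: oozie job -oozie http://localhost:11000/oozie -config job.properties -run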

Oozie Start, End, Error Nodes

Start node: tells the application where to start.

  <start to="[NODE-NAME]"/>

End node: signals the end of an Oozie job.

  <end name="[NODE-NAME]"/>

Error node: signals that an error occurred and that a message describing the error should be printed.

  <error name="[NODE-NAME]">
    <message>[A custom message]</message>
  </error>

Oozie Action Nodes

Action nodes specify the Map/Reduce job, Pig script, or Java class to run. All action nodes have ok and error tags: ok transitions to the next node; error goes to the error node and should print an error message.

  <action name="[NODE-NAME]">
    ...
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>

Oozie Map-Reduce Node

The map-reduce action tag is used to run a Map/Reduce process. You need to supply the JobTracker, NameNode, and your Hadoop job configuration details.

  <action name="[NODE-NAME]">
    <map-reduce>
      <job-tracker>[JOB-TRACKER ADDRESS]</job-tracker>
      <name-node>[NAME-NODE ADDRESS]</name-node>
      <configuration>
        [YOUR HADOOP CONFIGURATION]
      </configuration>
    </map-reduce>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>

Oozie Java Job Tag

The java action runs the main() function of a Java class.

  <action name="[NODE-NAME]">
    <java>
      <job-tracker>[JOB-TRACKER ADDRESS]</job-tracker>
      <name-node>[NAME-NODE ADDRESS]</name-node>
      <configuration>
        [OTHER HADOOP CONFIGURATION ITEMS]
      </configuration>
      <main-class>[MAIN-CLASS PATH]</main-class>
      <java-opts>[ANY -D JAVA ARGUMENTS]</java-opts>
      <arg>[COMMAND LINE ARGUMENTS]</arg>
    </java>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>

Oozie File System Tag

The fs action interacts with HDFS.

  <action name="[NODE-NAME]">
    <fs>
      <delete path="[PATH]"/>
      <mkdir path="[PATH]"/>
      <move source="[PATH]" target="[PATH]"/>
      <chmod path="[PATH]" permissions="[PERMISSIONS]" dir-files="false|true"/>
    </fs>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>

Oozie Sub-Workflow Tag

The sub-workflow action is likely the most important node in Oozie. It allows you to run a sub-workflow (another, separate workflow) within a job, which is good for specific jobs that you run all the time, or for long chains of jobs.

  <action name="[NODE-NAME]">
    <sub-workflow>
      <app-path>[CHILD-WORKFLOW-PATH]</app-path>
      <configuration>
        [Propagated configuration]
      </configuration>
    </sub-workflow>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>

Oozie Fork/Join Nodes

Fork: starts the parallel jobs.

  <fork name="[NODE-NAME]">
    <path start="firstjob"/>
    [OTHER JOBS]
  </fork>

Join: the parallel jobs re-join at this node. All forked jobs must complete before the workflow continues.

  <join name="[NODE-NAME]" to="[NEXT-NODE]"/>
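Putting fork and join together, a fragment that runs two hypothetical actions in parallel might look like this (all node names are invented for illustration):

```xml
<fork name="forking">
  <path start="first-parallel-job"/>
  <path start="second-parallel-job"/>
</fork>

<action name="first-parallel-job">
  <!-- action body omitted -->
  <ok to="joining"/>
  <error to="fail"/>
</action>

<action name="second-parallel-job">
  <!-- action body omitted -->
  <ok to="joining"/>
  <error to="fail"/>
</action>

<join name="joining" to="next-node"/>
```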

Oozie Decision Nodes

Need to make a decision? A decision node is a switch statement that runs different jobs based on the outcome of an expression.

  <decision name="[NODE-NAME]">
    <switch>
      <case to="singleThreadedJob">
        ${fs:fileSize(lastJob) lt 1 * GB}
      </case>
      <case to="MRJob">
        ${fs:fileSize(lastJob) ge 1 * GB}
      </case>
    </switch>
  </decision>

Other decision inputs include Hadoop counters, HDFS operations, and string manipulations.

Oozie Parameterization

Parameterization helps make flexible code. Items like JobTrackers, NameNodes, input paths, output tables, table names, and other constants should go in the job.properties file. If you want to use a parameter, just wrap it in ${}.

Oozie Parameterization Example

  <action name="[NODE-NAME]">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
        [OTHER HADOOP CONFIGURATION PARAMETERS]
      </configuration>
    </map-reduce>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>

Re-Running a Failed Job

Hadoop job failures happen, but re-running or finishing your job is easy. There are two ways of accomplishing this:

- Skip nodes (oozie.wf.rerun.skip.nodes): a comma-separated list of nodes to skip in the next run.
- Re-run failed nodes (oozie.wf.rerun.failnodes): re-run the job from the failed node (true/false).

Only one of these can be defined at a time: you can either skip nodes or re-run from the failed nodes, not both. Define them in your job.properties file when you re-run your job.
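For example, to restart a failed workflow from the node that failed, the property would be added to job.properties and the job re-submitted with the Oozie CLI's -rerun flag (the job id is a placeholder):

```
# job.properties, in addition to the original workflow properties
oozie.wf.rerun.failnodes=true
```

Re-submit with: oozie job -rerun <job-id> -config job.properties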

What haven't we talked about?

- Coordinator (periodic) jobs
- SLA monitoring
- Custom bundled actions
- Integrating Pig
- Local Oozie Runner: test a workflow locally to ensure that it runs

Notes

All workflow items start up a Map/Reduce job; this includes file-system manipulations and runs of Java main classes. If your jobs have a preparation phase, you need to separate the preparation phase from the execution of the job. If a job fails, you can re-run it or start it back up from where it failed.

Notes

Oozie still assumes that you are using the old Hadoop JobConf object, which is deprecated and should no longer be used. To fix this, two properties can be added to force Oozie to use the new configuration object. Sometimes you may see more jobs start than are listed in your workflow.xml file; this is fine, as Oozie may run some prep work before a job starts.
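The slide does not name the two properties. Assuming they are the standard Hadoop new-API switches (an assumption on my part, not something stated in the slides), they would be added to the action's <configuration> block like so:

```xml
<!-- Assumed properties to make the map-reduce action use the new
     org.apache.hadoop.mapreduce API instead of the old JobConf-based one -->
<property>
  <name>mapred.mapper.new-api</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reducer.new-api</name>
  <value>true</value>
</property>
```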

Example Workflow: Word Count URIs

[Diagram] start forks into parallel jobs (Extract URIs, Matching Word, Word Count Words, Extract Offsets), which join and end. Purple nodes are Map/Reduce jobs; blue nodes are Java jobs.

Want to find out more?

http://yahoo.github.com/oozie/
http://yahoo.github.com/oozie/releases/3.1.0/
http://incubator.apache.org/oozie/