Programming and Debugging Large-Scale Data Processing Workflows
Christopher Olston, Google Research
(work done at Yahoo! Research, with many colleagues)
Context
Elaborate processing of large data sets, e.g.: web search pre-processing, cross-dataset linkage, web information extraction.
Pipeline: ingestion → storage & processing → serving
Context
Storage & processing stack:
  workflow manager, e.g. Nova
  dataflow programming framework, e.g. Pig
  distributed sorting & hashing, e.g. Map-Reduce
  scalable file system, e.g. GFS
Overview, then detail: Inspector Gadget, RubySky
Debugging aides:
  Before: example data generator
  During: instrumentation framework
  After: provenance metadata manager
Pig: A High-Level Dataflow Language & Runtime for Hadoop
Web browsing sessions with happy endings:

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;

Pages = load '/data/pages' as (url, pagerank);

VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
Sessions = foreach UserVisits generate flatten(findsessions(*));
HappyEndings = filter Sessions by BestIsLast(*);

store HappyEndings into '/data/happy_endings';
vs. map-reduce: less code
"The [Hofmann PLSA E/M] algorithm was implemented in Pig in 30-35 lines of Pig Latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning Pig."
-- Prasenjit Mukherjee, Mahout project
[Charts: Pig needs 1/20 the lines of code and 1/16 the development time (in minutes) of raw Hadoop, and performs on par with raw Hadoop.]
vs. SQL: step-by-step style; lower-level control
"I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data."
-- Jasmine Novak, Engineer, Yahoo!
"PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP.. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on Map-Reduce] doesn't)."
-- Ricky Ho, Adobe Software
Conceptually: A Graph of Data Transformations
Find users who tend to visit good pages.
  Load Visits(user, url, time)          Load Pages(url, pagerank)
  Transform to (user, Canonicalize(url), time)
  Join url = url
  Group by user
  Transform to (user, Average(pagerank) as avgpr)
  Filter avgpr > 0.5
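The transformation graph above can be simulated operator-by-operator in ordinary code. The following Python sketch is illustrative only (the real pipeline is Pig Latin compiled to Map-Reduce jobs); `canonicalize` is a toy stand-in for the Canonicalize UDF, whose actual behavior the slides do not specify.

```python
from collections import defaultdict

def canonicalize(url):
    # toy stand-in for the Canonicalize UDF (hypothetical behavior)
    url = url.replace("http://", "")
    if url.endswith("/index.html"):
        url = url[:-len("/index.html")]
    if not url.startswith("www."):
        url = "www." + url
    return url

def good_page_visitors(visits, pages, threshold=0.5):
    # Transform: canonicalize urls
    visits = [(u, canonicalize(url), t) for (u, url, t) in visits]
    ranks = dict(pages)
    # Join on url
    joined = [(u, url, t, ranks[url]) for (u, url, t) in visits if url in ranks]
    # Group by user
    by_user = defaultdict(list)
    for (u, _url, _t, pr) in joined:
        by_user[u].append(pr)
    # Transform: average pagerank per user, then Filter
    avg = {u: sum(prs) / len(prs) for u, prs in by_user.items()}
    return {u: pr for u, pr in avg.items() if pr > threshold}

visits = [("Amy", "cnn.com", "8am"),
          ("Amy", "http://www.snails.com", "9am"),
          ("Fred", "www.snails.com/index.html", "11am")]
pages = [("www.cnn.com", 0.9), ("www.snails.com", 0.4)]
result = good_page_visitors(visits, pages)
```

On the slide's data this keeps Amy (average pagerank 0.65) and filters out Fred (0.4).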
Illustrated!
  Load Visits(user, url, time):
    (Amy, cnn.com, 8am)
    (Amy, http://www.snails.com, 9am)
    (Fred, www.snails.com/index.html, 11am)
  Transform to (user, Canonicalize(url), time):
    (Amy, www.cnn.com, 8am)
    (Amy, www.snails.com, 9am)
    (Fred, www.snails.com, 11am)
  Load Pages(url, pagerank):
    (www.cnn.com, 0.9)
    (www.snails.com, 0.4)
  Join url = url:
    (Amy, www.cnn.com, 8am, 0.9)
    (Amy, www.snails.com, 9am, 0.4)
    (Fred, www.snails.com, 11am, 0.4)
  Group by user:
    (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
    (Fred, { (Fred, www.snails.com, 11am, 0.4) })
  Transform to (user, Average(pagerank) as avgpr):
    (Amy, 0.65)
    (Fred, 0.4)
  Filter avgpr > 0.5:
    (Amy, 0.65)
"ILLUSTRATE lets me check the output of my lengthy batch jobs and their custom functions without having to do a lengthy run of a long pipeline. [This feature] enables me to be productive."
-- Russell Jurney, LinkedIn
(Naïve Algorithm)
  Load Visits(user, url, time):
    (Amy, cnn.com, 8am)
    (Amy, http://www.snails.com, 9am)
    (Fred, www.snails.com/index.html, 11am)
  Transform to (user, Canonicalize(url), time):
    (Amy, www.cnn.com, 8am)
    (Amy, www.snails.com, 9am)
    (Fred, www.snails.com, 11am)
  Load Pages(url, pagerank), sampled independently:
    (www.youtube.com, 0.9)
    (www.frogs.com, 0.4)
  Join url = url: the sampled Pages match no Visits urls, so the join (and everything downstream: Group, Transform, Filter) illustrates nothing.
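The failure mode of the naïve algorithm (sample each input independently, then push the samples through the graph) can be made concrete. A minimal Python sketch using the slide's data; `join_on_url` is a hypothetical helper mirroring the Join operator:

```python
def join_on_url(visits, pages):
    # equi-join Visits with Pages on url, as in the dataflow graph
    ranks = dict(pages)
    return [(u, url, t, ranks[url]) for (u, url, t) in visits if url in ranks]

# independently drawn samples of each input, as on the slide
visits_sample = [("Amy", "www.cnn.com", "8am"),
                 ("Amy", "www.snails.com", "9am"),
                 ("Fred", "www.snails.com", "11am")]
pages_sample = [("www.youtube.com", 0.9), ("www.frogs.com", 0.4)]

empty = join_on_url(visits_sample, pages_sample)  # no urls in common
```

Because the samples share no join keys, every operator after the join gets an empty illustration, which is why the dataflow illustrator must instead synthesize examples that respect operator semantics.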
Original UI Prototype (2008)
Interactive Graphics Programming UI, by Bret Victor, Apple Designer (2012)
"Creators need an immediate connection to what they create. If you make a change you need to see the effect of that immediately."
-- Bret Victor, Apple
Pig Today
Open-source (the Apache Pig project)
Dev./support/training by Cloudera, Hortonworks
Offered on Amazon Elastic Map-Reduce
Used by LinkedIn, Netflix, Salesforce, Twitter, Yahoo!...
At Yahoo!, as of early 2011: 1000s of jobs/day; 75%+ of Hadoop jobs
Has an interactive example-generator command, but no side-by-side UI
Next: NOVA
Stack: workflow manager, e.g. Nova; dataflow programming framework, e.g. Pig; distributed sorting & hashing, e.g. Map-Reduce; scalable file system, e.g. GFS
Debugging aides: Before: example data generator; During: instrumentation framework; After: provenance metadata manager
Why a Workflow Manager?
Continuous data processing (simulated on top of the static Pig/Hadoop processing layer)
Independent scheduling of workflow modules
Example
RSS feed → template detection → template tagging → shingling → de-duping → unique articles
  template detection: consumes ALL news articles; produces ALL news site templates
  template tagging: consumes NEW articles plus ALL templates; produces NEW tagged articles
  shingling: consumes NEW tagged articles; produces NEW shingle hashes
  de-duping: consumes NEW shingle hashes plus ALL shingle hashes seen; produces NEW unique articles
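The NEW/ALL distinction in the de-duping step can be sketched as a stateful incremental operator: each run consumes only the NEW batch but consults (and updates) the set of ALL shingle hashes seen so far. A hypothetical Python sketch, not Nova's actual implementation:

```python
def dedupe(new_items, seen_hashes):
    """Incremental de-duping: new_items is the NEW batch of
    (article, shingle_hash) pairs; seen_hashes is persistent state
    holding ALL hashes seen across prior runs."""
    unique = []
    for article, h in new_items:
        if h not in seen_hashes:
            seen_hashes.add(h)   # update the ALL-hashes-seen state
            unique.append(article)
    return unique

seen = set()  # persistent state, carried between workflow runs
batch1 = dedupe([("article1", "h1"), ("article2", "h2")], seen)
batch2 = dedupe([("article3", "h1"), ("article4", "h3")], seen)  # article3 repeats h1
```

The second run emits only the genuinely new article, which is the point of scheduling each module over deltas rather than reprocessing everything.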
Data Passes Through Many Sub-Systems
datum X → ingestion → Nova / Pig / Map-Reduce / GFS → low-latency processor → serving → datum Y
Metadata queries, e.g.: provenance of X?
Ibis Project
Data processing sub-systems send metadata to the Ibis metadata manager; users pose metadata queries and receive integrated metadata answers.
Benefits: provide uniform view to users; factor out metadata management code; decouple metadata lifetime from data/subsystem lifetime
Challenges: overhead of shipping metadata; disparate data/processing granularities
Example Data and Process Granularities
  data granularities: web page, table, row, cell, column group, column, alternative
  process granularities: workflow, Pig script, Pig job, Pig logical operation, Pig physical operation, MR program, MR job, MR job phase, MR task, task attempt
What's Hard About Multi-Granularity Provenance?
Inference: given relationships expressed at one granularity, answer queries about other granularities (the semantics are tricky here!)
Efficiency: implement inference without resorting to materializing everything in terms of the finest granularity (e.g. cells)
Inferring Transitive Links
  map phase:
    IMDB web page → IMDB extracted table (title, year, lead actor):
      (Avatar, 2009, Worthington), (Inception, 2010, DiCaprio)
    Yahoo! Movies web page → Yahoo extracted table (title, year, lead actor):
      (Avatar, 2009, Saldana), (Inception, 2010, DiCaprio)
  reduce phase:
    map output 1 + map output 2 → combined table (title, year, lead actor):
      (Avatar, 2009, A1: Worthington / A2: Saldana), (Inception, 2010, DiCaprio)
Q: Which web page said Saldana?
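Answering the question on this slide amounts to composing the direct derivation links recorded at each phase into a transitive chain. A minimal Python sketch of that inference; the item names and link table are invented for illustration, not Ibis's actual representation:

```python
# derived item -> set of items it was directly derived from
# (hypothetical provenance links mirroring the slide's example)
direct = {
    "imdb_table":   {"imdb_page"},
    "yahoo_table":  {"yahoo_movies_page"},
    "map_output_1": {"imdb_table"},
    "map_output_2": {"yahoo_table"},
    "combined.Avatar.actor.A1": {"map_output_1"},
    "combined.Avatar.actor.A2": {"map_output_2"},
}

def sources(item):
    """Transitively follow derivation links back to base inputs."""
    if item not in direct:
        return {item}  # a base input with no recorded ancestry
    result = set()
    for parent in direct[item]:
        result |= sources(parent)
    return result

# Alternative A2 ("Saldana") traces back to the Yahoo! Movies page:
answer = sources("combined.Avatar.actor.A2")
```

The hard part the slide alludes to is that real links mix granularities (cell-level alternatives, table-level map outputs, page-level inputs), so composing them needs careful semantics, not just graph reachability.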
Next: INSPECTOR GADGET
Stack: workflow manager, e.g. Nova; dataflow programming framework, e.g. Pig; distributed sorting & hashing, e.g. Map-Reduce; scalable file system, e.g. GFS
Debugging aides: Before: example data generator; During: instrumentation framework; After: provenance metadata manager
Motivated by User Interviews
Interviewed 10 Yahoo! dataflow programmers (mostly Pig users; some users of other dataflow environments)
Asked them how they debug, and how they wish they could debug
Summary of User Interviews
  # of requests | feature
  7 | crash culprit determination
  5 | row-level integrity alerts
  4 | table-level integrity alerts
  4 | data samples
  3 | data summaries
  3 | memory use monitoring
  3 | backward tracing (provenance)
  2 | forward tracing
  2 | golden data/logic testing
  2 | step-through debugging
  2 | latency alerts
  1 | latency profiling
  1 | overhead profiling
  1 | trial runs
Our Approach
Goal: a programming framework for adding these behaviors, and others, to Pig
Precept: avoid modifying Pig or tampering with data flowing through Pig
Approach: perform Pig script rewriting; insert special UDFs that look like no-ops to Pig
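The rewriting idea can be sketched generically: wrap each operator with an "agent" that observes every record but passes it through unchanged, so the engine sees a no-op. This Python sketch is a hypothetical stand-in for Inspector Gadget's Pig-script rewriting, using plain generator pipelines instead of Pig operators:

```python
def make_agent(name, observations):
    """A pass-through observer: records are seen but never modified."""
    def agent(records):
        for rec in records:
            observations.append((name, rec))  # observe...
            yield rec                         # ...but pass through unchanged
    return agent

def instrument(pipeline, observations):
    """Rewrite a pipeline by inserting an agent after every operator,
    analogous to inserting no-op UDFs into a Pig script."""
    rewritten = []
    for name, op in pipeline:
        rewritten.append((name, op))
        rewritten.append((name + "_agent", make_agent(name, observations)))
    return rewritten

obs = []
pipeline = [("filter_pos", lambda rs: (r for r in rs if r > 0))]
data = [1, -2, 3]
for _name, op in instrument(pipeline, obs):
    data = op(data)
out = list(data)
```

The output is identical to the uninstrumented run, while `obs` has recorded everything that flowed past the filter, which is the precept: observe without tampering.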
Pig w/ Inspector Gadget
Dataflow: load, filter, load → join → group → count → store, with an agent at each operator communicating with the IG coordinator.
Example: Integrity Alerts
An agent in the dataflow raises an alert; the IG coordinator propagates the alert to the user.
Example: Crash Culprit Determination
Phases 1 to n-1: agents send record counts; coordinator maintains count lower bounds
Phase n: agents send records; coordinator maintains last-seen records
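The final phase's "last-seen records" idea can be sketched in a few lines: remember the most recent record to enter each operator, so that when the job crashes, those records point at the likely culprit. A hypothetical, heavily simplified Python sketch (single process, no phases or counts):

```python
def run_with_culprit_tracking(records, ops):
    """Run records through a chain of (name, op) stages, remembering
    the last record seen entering each stage. On a crash, return that
    map; the in-flight records identify the likely culprit input."""
    last_seen = {}
    try:
        for rec in records:
            for name, op in ops:
                last_seen[name] = rec   # record entering this operator
                rec = op(rec)
    except Exception:
        return last_seen                # crash: report records in flight
    return None                         # clean run, no culprit

ops = [("canon", lambda r: r.lower()),
       ("parse", lambda r: r.split(",")[1])]  # crashes on one-field rows
culprits = run_with_culprit_tracking(["A,1", "B,2", "BAD"], ops)
```

Here the malformed row "BAD" survives `canon` and crashes `parse`, and the returned map shows exactly which record each operator was holding. The earlier count-only phases in the real design exist to keep this record-remembering overhead confined to a final narrowed-down run.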
Example: Forward Tracing
The IG coordinator sends tracing instructions to the agents; as traced records flow through the dataflow (load, filter, join, group, count, store), agents report them and the coordinator reports the traced records to the user.
Flow
The end user submits a dataflow program plus application parameters to an application built on the IG driver library. The application launches instrumented dataflow run(s) on the dataflow engine runtime (load, filter, join, store, with agents reporting to an IG coordinator), collects the raw result(s), and returns the result to the end user.
Agent & Coordinator APIs

Agent class:
  init(args)
  tags = observerecord(record, tags)
  receivemessage(source, message)
  finish()
Agent messaging:
  sendtocoordinator(message)
  sendtoagent(agentid, message)
  senddownstream(message)
  sendupstream(message)

Coordinator class:
  init(args)
  receivemessage(source, message)
  output = finish()
Coordinator messaging:
  sendtoagent(agentid, message)
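A minimal sketch of how these APIs compose, using the row-level integrity alert application as the example. This is a hypothetical single-process Python stand-in (class names, the `predicate` parameter, and the direct method call in place of real messaging are all invented for illustration):

```python
class AlertCoordinator:
    """Collects alert messages from agents; finish() returns them."""
    def init(self, args):
        self.alerts = []
    def receivemessage(self, source, message):
        self.alerts.append((source, message))
    def finish(self):
        return self.alerts

class IntegrityAgent:
    """Checks each observed record against a predicate and messages
    the coordinator on violation; records pass through untouched."""
    def __init__(self, agent_id, coordinator, predicate):
        self.id, self.coord, self.pred = agent_id, coordinator, predicate
    def observerecord(self, record, tags):
        if not self.pred(record):
            # stand-in for sendtocoordinator(message)
            self.coord.receivemessage(self.id, "bad record: %r" % (record,))
        return tags  # tags pass through unchanged

coord = AlertCoordinator()
coord.init({})
agent = IntegrityAgent("after_join", coord, lambda r: r.get("url"))
for rec in [{"url": "www.cnn.com"}, {"url": ""}]:
    agent.observerecord(rec, [])
alerts = coord.finish()
```

The 89-line Java figure on the next slide for row-level integrity alerts is plausible given how little logic the agent itself needs; the framework supplies the plumbing.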
Applications Developed For IG
  # of requests | feature | lines of code (Java)
  7 | crash culprit determination | 141
  5 | row-level integrity alerts | 89
  4 | table-level integrity alerts | 99
  4 | data samples | 97
  3 | data summaries | 130
  3 | memory use monitoring | N/A
  3 | backward tracing (provenance) | 237
  2 | forward tracing | 114
  2 | golden data/logic testing | 200
  2 | step-through debugging | N/A
  2 | latency alerts | 168
  1 | latency profiling | 136
  1 | overhead profiling | 124
  1 | trial runs | 93
RubySky: Scripting Framework with Built-in Debugging Support
Scenario: ad-hoc data analysis, e.g.: we just did a crawl; is it too biased toward Yahoo! content?
In this scenario, sometimes using new data; always running new code
→ Bugs are the norm
→ Debugging should be a first-class citizen!
RubySky Example

:Crawl_data <= load(context, "/crawl_data", schema { string :url; ... })

# quick-and-dirty determination of "domain" from "url"
# (e.g. http://www.yahoo.com/index.html => yahoo.com)
:With_domains <= Crawl_data.foreach() do |r|
  url = r[:url]
  prefix = "http://www."
  if (url[0, prefix.len] == prefix)
    domain = url[prefix.len, url.index('/', prefix.len) - prefix.len]
  else
    throw "Don't know how to parse this url: " + url
  end
  Tuple.new([url, domain])
end

:Yahoo_only <= With_domains.filter do |r|
  r[:domain] =~ /^yahoo/
end

:Grouped <= Yahoo_only.group

:Grouped.each { |r| puts "NUM. YAHOO URLs: " + r[:yahoo_only].size }
RubySky Example: w/ debug stmts

...
:With_domains <= Crawl_data.foreach() do |r|
  ...
  else
    domain = PUNT url
  end
  Tuple.new([url, domain])
end

# examine a few extracted domains, as sanity check
:Domain_sample <= With_domains.foreach() do |r|
  $count += 1
  ($count < SAMPLE_SIZE) ? [Tuple.new([r[:domain]])] : []
end
SHOW Domain_sample

# alert if any extracted domain is empty or null
:Domain_missing <= With_domains.foreach() do |r|
  (r[:domain] == "" or r[:domain] == nil) ?
    [Tuple.new(["ALERT! Missing domain from: " + r[:url]])] : []
end
SHOW Domain_missing
...
RubySky Execution
The client launches the dataflow: load → extract domain (with a spy agent) → filter → group → count. The client issues punt requests and receives responses (code and/or data), domain samples, empty-domain alerts, and the answer (# of Yahoo! URLs).
Summary
Stack: workflow manager, e.g. Nova; dataflow programming framework, e.g. Pig; distributed sorting & hashing, e.g. Map-Reduce; scalable file system, e.g. GFS
This work: Nova; Pig
Debugging aides: Before: example data generator (Dataflow Illustrator); During: instrumentation framework (Inspector Gadget, RubySky); After: provenance metadata manager (Ibis)
Related Work
Pig: DryadLINQ, Hive, Jaql, Scope, relational query languages
Nova: BigTable, CBP, Oozie, Percolator, scientific workflow, incremental view maintenance
Dataflow illustrator: [Mannila/Raiha, PODS 86], reverse query processing, constraint databases, hardware verification & model checking
Inspector Gadget, RubySky: X-Trace, taint tracking, aspect-oriented programming
Ibis: Kepler COMAD, ZOOM user views, provenance management for databases & scientific workflows
Collaborators Shubham Chopra Tyson Condie Anish Das Sarma Alan Gates Pradeep Kamath Ravi Kumar Shravan Narayanamurthy Olga Natkovich Benjamin Reed Santhosh Srinivasan Utkarsh Srivastava Andrew Tomkins