Web-scale data processing
Christopher Olston and many others
Yahoo! Research


Motivation
Projects increasingly revolve around analysis of big data sets:
- extracting structured data, e.g. face detection
- understanding complex large-scale phenomena: social systems (e.g. user-generated web content), economic systems (e.g. advertising markets), computer systems (e.g. web search engines)
Data analysis is the inner loop at Yahoo! et al. Big data necessitates parallel processing.

Examples
1. Detect faces: you have a function detectfaces() and you want to run it over n images, where n is big.
2. Study web usage: you have a web crawl and a click log; find sessions that end with the best page.

Existing Work
Parallel architectures: cluster computing, multi-core processors
Data-parallel software: parallel DBMS, Map-Reduce, Dryad
Data-parallel languages: SQL, NESL

Pig Project
Data-parallel language ("Pig Latin"): relational data manipulation primitives, imperative programming style, plug in code to customize processing.
Various crazy ideas: multi-program optimization, adaptive data placement, automatic example data generator.

Pig Latin Language [SIGMOD 08]

Example 1
Detect faces in many images.

I = load '/mydata/images' using ImageParser() as (id, image);
F = foreach I generate id, detectfaces(image);
store F into '/mydata/faces';

Example 2
Find sessions that end with the best page.

Visits (user, url, time):
Amy, www.cnn.com, 8:00
Amy, www.crap.com, 8:05
Amy, www.myblog.com, 10:00
Amy, www.flickr.com, 10:05
Fred, cnn.com/index.htm, 12:00
...

Pages (url, pagerank):
www.cnn.com, 0.9
www.flickr.com, 0.9
www.myblog.com, 0.7
www.crap.com, 0.2
...

Efficient Evaluation Method (dataflow diagram): load Visits and Pages, repartition both by url and join them, then repartition by user and do group-wise processing (identify sessions; examine pageranks) to produce the answer.

In Pig Latin

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
Sessions = foreach UserVisits generate flatten(FindSessions(*));
HappyEndings = filter Sessions by BestIsLast(*);
store HappyEndings into '/data/happy_endings';

Pig Latin, in general
Transformations on sets of records.
Easy for users: high-level, extensible data processing primitives.
Easy for the system: exposes opportunities for parallelism and reuse.
Operators: FILTER, FOREACH ... GENERATE, GROUP
Binary operators: JOIN, COGROUP, UNION
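To make the operator list concrete, here is a minimal sketch, not taken from the talk; the relations, fields, and paths are made up for illustration:

Logs = load '/data/logs' as (user, url, bytes);  -- hypothetical input
Ads = load '/data/ads' as (user, adid);          -- hypothetical input
Big = filter Logs by bytes > 1000;               -- FILTER
Pairs = foreach Big generate user, url;          -- FOREACH ... GENERATE
ByUser = group Pairs by user;                    -- GROUP
Joined = join Pairs by user, Ads by user;        -- JOIN (binary)
CoGrouped = cogroup Pairs by user, Ads by user;  -- COGROUP (binary)
LogUsers = foreach Logs generate user;
AdUsers = foreach Ads generate user;
AllUsers = union LogUsers, AdUsers;              -- UNION (binary)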

Related Languages
SQL: declarative, all-in-one blocks
NESL: lacks join, cogroup
Map-Reduce: special case of Pig Latin:
a = FOREACH input GENERATE flatten(Map(*));
b = GROUP a BY $0;
c = FOREACH b GENERATE Reduce(*);
Sawzall: rigid map-then-reduce structure
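As a concrete instance of the "special case" claim above, the canonical Map-Reduce word count collapses to a few statements. This is a sketch, not from the talk, assuming the standard Pig built-ins TOKENIZE and COUNT; the paths and relation names are made up:

Docs = load '/data/docs' as (line:chararray);
Words = foreach Docs generate flatten(TOKENIZE(line)) as word;  -- "map" phase: one record per word
Grouped = group Words by word;                                  -- shuffle: group records by key
Counts = foreach Grouped generate group as word, COUNT(Words);  -- "reduce" phase: aggregate each group
store Counts into '/data/wordcounts';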

Pig Latin = Sweet Spot Between SQL & Map-Reduce

Programming style: SQL = large blocks of declarative constraints; Map-Reduce = plug together pipes (but restricted to a linear, 2-step data flow).
Built-in data manipulations: SQL = Group-by, Sort, Join, Filter, Aggregate, Top-k, etc.; Map-Reduce = Group-by, Sort.
Execution model: SQL = fancy; trust the query optimizer; Map-Reduce = simple, transparent.
Opportunities for automatic optimization: SQL = many; Map-Reduce = few (logic buried in map() and reduce()).

"I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data." -- Jasmine Novak, Engineer, Yahoo!

"PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP .. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on top of Map-Reduce] doesn't)." -- Ricky Ho, Adobe Software

Map-Reduce as Backend (diagram): the user can write at the (SQL), Pig, or Map-Reduce layer; automatic rewrite + optimization translates each layer down to Map-Reduce jobs on the cluster. Details in [VLDB 09].

Pig Latin vs. Map-Reduce: Code Reduction

[This slide shows the raw Hadoop Map-Reduce implementation in Java (class MRExample, with mappers/reducers such as LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks, and LimitClicks, chained together via JobConf and JobControl), rendered in tiny type and roughly 20 times longer than the equivalent Pig Latin script overlaid on top of it:]

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Views = load 'views' as (user, url);
Jnd = join Fltrd by name, Views by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

Comparison (bar charts): the Pig Latin version is about 1/20 the lines of code and took about 1/16 the development time (measured in minutes) of the raw Hadoop version; its runtime performance is about 1.5x that of hand-written Hadoop.

Ways to Run Pig
- Interactive shell
- Script file
- Embed in host language (e.g., Java)
- Soon: graphical editor

Status
Open-source implementation: http://hadoop.apache.org/pig
Runs on Hadoop or a local machine.
Active project; many refinements in the works.
Wide adoption in Yahoo!: 100s of users, 1000s of Pig jobs/day, 60% of ad-hoc Hadoop jobs via Pig, 40% of production jobs via Pig.

Status
Gaining traction externally: log processing & aggregation, building text indexes, collaborative filtering applied to image & video recommendation systems.

"The [Hofmann PLSA E/M] algorithm was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig." -- Prasenjit Mukherjee, Mahout project

Crazy Ideas [USENIX 08] [VLDB 08] [SIGMOD 09]

Crazy Idea #1 Multi-Program Optimization

Motivation
User programs repeatedly scan the same data files (web crawl, search log).
Goal: reduce redundant IOs, and hence improve overall system throughput.
Approach: introduce a shared-scans capability, with careful scheduling of jobs to maximize the benefit of shared scans.

Scheduling Shared Scans (diagram): jobs queue up against each data file (search log, web crawl, click data; some files are not popular); the scheduler looks at the queues and anticipates future arrivals when deciding which shared scan the executor runs next.

Crazy Idea #2 Adaptive Data Placement

Motivation
Hadoop is good at localizing computation to data, for conventional map-reduce scenarios.
However, advanced query processing operations change this:
- pre-hashed join of A & B: co-locate A & B?
- frag-repl join of A & B: more replicas of B?
Our idea: an adaptive, pressure-based mechanism to move data such that better locality arises.
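For reference, a fragment-replicate join can be expressed in Pig Latin as sketched below (not from the talk; relation names and paths are made up, and the using 'replicated' hint is a Pig feature whose availability at the time of this talk is not stated here). The small relation B is shipped to every task that scans a fragment of A, which is why the number and placement of B's replicas matters for locality.

A = load '/data/big_table' as (key, val);        -- large relation, scanned in fragments
B = load '/data/small_table' as (key, attr);     -- small relation, replicated to each task
J = join A by key, B by key using 'replicated';  -- fragment-replicate join hint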

Adaptive Data Placement (diagram): jobs pass through a Job Localizer before reaching the execution engine; blocks of data sets A, B, C, and D are shifted between Worker 1 and Worker 2 so that operators such as "Scan C", "Scan A", and "Join A & B" run where their data is.

Crazy Idea #3 Automatic Example Generator

Example Program (abstract view)
Find users who tend to visit good pages.

Load Visits(user, url, time)
Load Pages(url, pagerank)
Transform to (user, Canonicalize(url), time)
Join url = url
Group by user
Transform to (user, Average(pagerank) as avgpr)
Filter avgpr > 0.5

Load Visits(user, url, time):
(Amy, cnn.com, 8am), (Amy, http://www.snails.com, 9am), (Fred, www.snails.com/index.html, 11am)

Transform to (user, Canonicalize(url), time):
(Amy, www.cnn.com, 8am), (Amy, www.snails.com, 9am), (Fred, www.snails.com, 11am)

Load Pages(url, pagerank):
(www.cnn.com, 0.9), (www.snails.com, 0.4)

Join url = url:
(Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4), (Fred, www.snails.com, 11am, 0.4)

Group by user:
(Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) }), (Fred, { (Fred, www.snails.com, 11am, 0.4) })

Transform to (user, Average(pagerank) as avgpr):
(Amy, 0.65), (Fred, 0.4)

Filter avgpr > 0.5:
(Amy, 0.65)
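Written out in Pig Latin, the program above looks roughly like the sketch below (not from the talk; relation names and paths are made up, Canonicalize is the user-defined function from the example, and AVG is the standard Pig built-in standing in for the Average step):

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url) as url, time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
ByUser = group VP by user;
AvgPR = foreach ByUser generate group as user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter AvgPR by avgpr > 0.5;
store GoodUsers into '/data/good_users';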

Automatic Example Data Generator
Objectives: realism, conciseness, completeness.
Challenges: large original data; selective operators (e.g., join, filter); noninvertible operators (e.g., UDFs).

Talk Summary
Data-parallel language ("Pig Latin"): a sequence of data transformation steps; users can plug in custom code.
Research nuggets:
- joint scheduling of related programs, to amortize IOs
- adaptive data placement, to enhance locality
- automatic example data generator, to make the user's life easier

Credits
Yahoo! Grid Team project leads: Alan Gates, Olga Natkovich
Yahoo! Research project leads: Chris Olston, Utkarsh Srivastava, Ben Reed