Web-Scale Data Processing
Christopher Olston and many others
Yahoo! Research
Motivation
Projects increasingly revolve around analysis of big data sets:
  Extracting structured data, e.g. face detection
  Understanding complex large-scale phenomena: social systems (e.g., user-generated web content), economic systems (e.g., advertising markets), computer systems (e.g., web search engines)
Data analysis is the inner loop at Yahoo! et al.
Big data necessitates parallel processing.
Examples
1. Detect faces: you have a function detectfaces(); you want to run it over n images; n is big.
2. Study web usage: you have a web crawl and a click log; find sessions that end with the best page.
Existing Work
Parallel architectures: cluster computing; multi-core processors
Data-parallel software: parallel DBMS; Map-Reduce, Dryad
Data-parallel languages: SQL; NESL
Pig Project
Data-parallel language (Pig Latin): relational data manipulation primitives; imperative programming style; plug in code to customize processing
Various crazy ideas: multi-program optimization; adaptive data placement; automatic example data generator
Pig Latin Language [SIGMOD 08]
Example 1
Detect faces in many images.

I = load '/mydata/images' using ImageParser() as (id, image);
F = foreach I generate id, detectfaces(image);
store F into '/mydata/faces';
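To make the "plug in code" point concrete, here is a minimal sketch of how a face-detection UDF might be written against Pig's EvalFunc API. The class name, the return type (a face count), and the stubbed detection routine are assumptions for illustration, not part of the talk; a real user would call into whatever vision library they already have.

// Hypothetical UDF backing detectfaces() in the script above (a sketch, not
// the talk's actual code).  Assumes the loader hands each image to Pig as a
// DataByteArray of raw bytes and that a face count is a useful output.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;

public class DetectFaces extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                     // propagate nulls, Pig-style
        }
        DataByteArray image = (DataByteArray) input.get(0);
        return countFaces(image.get());      // hand off to the user's library
    }

    // Stand-in for the user's actual face-detection code.
    private int countFaces(byte[] imageBytes) {
        return 0;                            // placeholder
    }
}

Packaged in a jar, such a class would be made visible to the script with a register statement and invoked by name, exactly where detectfaces(image) appears above.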
Example 2
Find sessions that end with the best page.

Visits:
  user  url                time
  Amy   www.cnn.com        8:00
  Amy   www.crap.com       8:05
  Amy   www.myblog.com     10:00
  Amy   www.flickr.com     10:05
  Fred  cnn.com/index.htm  12:00
  ...

Pages:
  url             pagerank
  www.cnn.com     0.9
  www.flickr.com  0.9
  www.myblog.com  0.7
  www.crap.com    0.2
  ...
Efficient Evaluation Method
[Dataflow diagram: Visits and Pages are each repartitioned by url and joined; the join output is repartitioned by user for group-wise processing (identify sessions; examine pageranks), which produces the answer.]
In Pig Latin

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
Sessions = foreach UserVisits generate flatten(findsessions(*));
HappyEndings = filter Sessions by BestIsLast(*);
store HappyEndings into '/data/happy_endings';
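The talk does not show BestIsLast or findsessions; the sketch below suggests what the filter UDF could look like. It assumes (purely for illustration) that each session tuple produced by findsessions carries, as its first field, a time-ordered bag of visit tuples shaped (user, url, time, pagerank) with pagerank stored as a double.

// Hypothetical implementation of the BestIsLast filter used in the script
// above; the session layout is an assumption, not the talk's definition.
import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class BestIsLast extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        DataBag visits = (DataBag) input.get(0);      // assumed session field
        double best = Double.NEGATIVE_INFINITY;
        double last = Double.NEGATIVE_INFINITY;
        for (Iterator<Tuple> it = visits.iterator(); it.hasNext(); ) {
            Tuple visit = it.next();
            double pagerank = (Double) visit.get(3);  // (user, url, time, pagerank)
            best = Math.max(best, pagerank);
            last = pagerank;                          // bag assumed time-ordered
        }
        return last >= best;  // keep sessions that end on their best page
    }
}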
Pig Latin, in general
Transformations on sets of records:
  Easy for users: high-level, extensible data processing primitives
  Easy for the system: exposes opportunities for parallelism and reuse
Operators: FILTER, FOREACH ... GENERATE, GROUP
Binary operators: JOIN, COGROUP, UNION
Related Languages
SQL: declarative, all-in-one blocks
NESL: lacks join, cogroup
Map-Reduce: a special case of Pig Latin:
  a = FOREACH input GENERATE flatten(map(*));
  b = GROUP a BY $0;
  c = FOREACH b GENERATE Reduce(*);
Sawzall: rigid map-then-reduce structure
Pig Latin = Sweet Spot Between SQL & Map-Reduce
Programming style: SQL uses large blocks of declarative constraints; Map-Reduce plugs together pipes (but is restricted to a linear, 2-step data flow).
Built-in data manipulations: SQL offers group-by, sort, join, filter, aggregate, top-k, etc.; Map-Reduce offers group-by and sort.
Execution model: SQL's is fancy (trust the query optimizer); Map-Reduce's is simple and transparent.
Opportunities for automatic optimization: many in SQL; few in Map-Reduce (logic is buried in map() and reduce()).

"I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data."
-- Jasmine Novak, Engineer, Yahoo!

"PIG seems to give the necessary parallel programming constructs (FOREACH, FLATTEN, COGROUP .. etc) and also gives sufficient control back to the programmer (which a purely declarative approach like [SQL on top of Map-Reduce] doesn't)."
-- Ricky Ho, Adobe Software
Map-Reduce as Backend
[Diagram: the user writes Pig Latin (or SQL); the system automatically rewrites and optimizes it into jobs for a Map-Reduce cluster. Details in [VLDB 09].]
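To make the "automatic rewrite" idea concrete, here is a rough sketch of the kind of map-reduce stage a statement such as "UserVisits = group VP by user;" boils down to: the map side tags each record with its grouping key, and the reduce side receives the co-grouped records as the bag that a following FOREACH would consume. This is an illustration of the idea using the same old-style Hadoop API as the code-reduction slide, not Pig's actual generated plan; the tab-delimited record format is an assumption.

// Conceptual sketch only: how a GROUP BY compiles to one map-reduce stage.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class GroupByUser {
    // Map: emit (grouping key, whole record) so the shuffle co-locates records.
    public static class TagWithUser extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text record,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            String line = record.toString();
            int tab = line.indexOf('\t');        // assume tab-delimited, user first
            if (tab < 0) return;
            out.collect(new Text(line.substring(0, tab)), record);
        }
    }

    // Reduce: each call sees one group; the iterator plays the role of the bag
    // that the FOREACH following the GROUP operates on.
    public static class CollectGroup extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text user, Iterator<Text> records,
                           OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            StringBuilder bag = new StringBuilder();
            while (records.hasNext()) {
                bag.append('{').append(records.next().toString()).append('}');
            }
            out.collect(user, new Text(bag.toString()));
        }
    }
}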
Pig Latin vs. Map-Reduce: Code Reduction
"Find the top 5 sites visited by users aged 18 to 25," in Pig Latin:

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Views = load 'views' as (user, url);
Jnd = join Fltrd by name, Views by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

[Alongside: the hand-written Hadoop Java equivalent, class MRExample, comprising seven Mapper/Reducer classes (LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks, LimitClicks) and a main() that wires five separate map-reduce jobs together with JobControl; roughly 20 times the code (see the comparison on the next slide).]
Comparison
[Charts comparing Hadoop (Java) vs. Pig:]
Lines of code: Pig is about 1/20 the lines of code.
Development time (minutes): Pig takes about 1/16 the development time.
Performance: within about 1.5x of hand-coded Hadoop.
Ways to Run Pig
Interactive shell
Script file
Embed in host language (e.g., Java); see the sketch below
Soon: graphical editor
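A minimal sketch of the embedding option using Pig's PigServer Java API. The specific statements and paths are placeholders chosen for illustration, not from the talk.

// Embedding Pig Latin in a Java host program (a sketch with assumed paths).
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Connect to the Hadoop cluster; ExecType.LOCAL would run on one machine.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registerQuery call adds one Pig Latin statement to the plan;
        // nothing executes until an output is requested.
        pig.registerQuery("Visits = load '/data/visits' as (user, url, time);");
        pig.registerQuery("Grpd = group Visits by user;");
        pig.registerQuery("VisitCounts = foreach Grpd generate group, COUNT(Visits);");

        // Triggers compilation to map-reduce jobs and writes the result.
        pig.store("VisitCounts", "/data/visit_counts");
    }
}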
Status
Open-source implementation: http://hadoop.apache.org/pig
Runs on Hadoop or a local machine
Active project; many refinements in the works
Wide adoption within Yahoo!:
  100s of users
  1000s of Pig jobs/day
  60% of ad-hoc Hadoop jobs are via Pig
  40% of production jobs are via Pig
Status
Gaining traction externally:
  log processing & aggregation
  building text indexes
  collaborative filtering, applied to image & video recommendation systems

"The [Hofmann PLSA E/M] algorithm was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig."
-- Prasenjit Mukherjee, Mahout project
Crazy Ideas [USENIX 08] [VLDB 08] [SIGMOD 09]
Crazy Idea #1 Multi-Program Optimization
Motivation
User programs repeatedly scan the same data files (web crawl, search log).
Goal: reduce redundant IOs, and hence improve overall system throughput.
Approach: introduce a shared-scans capability, with careful scheduling of jobs to maximize the benefit of shared scans.
Scheduling Shared Scans
[Diagram: per-dataset job queues (Search Log, Web Crawl, Click Data) feed an executor; the scheduler looks at the queues and anticipates future arrivals; a dataset that is not popular can leave its executor slot idle.]
Crazy Idea #2 Adaptive Data Placement
Motivation
Hadoop is good at localizing computation to data for conventional map-reduce scenarios.
However, advanced query processing operations change this:
  Pre-hashed join of A & B: co-locate A & B?
  Fragment-replicate join of A & B: more replicas of B?
Our idea: an adaptive, pressure-based mechanism to move data such that better locality arises.
Adaptive Data Placement
[Diagram: jobs flow through a Job Localizer into the execution engine; scan and join operators (e.g., Scan C, Scan A, Join A & B) are assigned to Worker 1 and Worker 2, and blocks of data sets A, B, C, D are migrated between workers so that operators run where their data lives.]
Crazy Idea #3 Automatic Example Generator
Example Program (abstract view)
Find users who tend to visit good pages.
  Load Visits(user, url, time)
  Load Pages(url, pagerank)
  Transform Visits to (user, Canonicalize(url), time)
  Join on url = url
  Group by user
  Transform to (user, Average(pagerank) as avgpr)
  Filter avgpr > 0.5
Load Visits(user, url, time):
  (Amy, cnn.com, 8am)
  (Amy, http://www.snails.com, 9am)
  (Fred, www.snails.com/index.html, 11am)
Transform to (user, Canonicalize(url), time):
  (Amy, www.cnn.com, 8am)
  (Amy, www.snails.com, 9am)
  (Fred, www.snails.com, 11am)
Load Pages(url, pagerank):
  (www.cnn.com, 0.9)
  (www.snails.com, 0.4)
Join url = url:
  (Amy, www.cnn.com, 8am, 0.9)
  (Amy, www.snails.com, 9am, 0.4)
  (Fred, www.snails.com, 11am, 0.4)
Group by user:
  (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
  (Fred, { (Fred, www.snails.com, 11am, 0.4) })
Transform to (user, Average(pagerank) as avgpr):
  (Amy, 0.65)
  (Fred, 0.4)
Filter avgpr > 0.5:
  (Amy, 0.65)
Automatic Example Data Generator
Objectives: realism, conciseness, completeness
Challenges:
  Large original data
  Selective operators (e.g., join, filter)
  Noninvertible operators (e.g., UDFs)
Talk Summary
Data-parallel language (Pig Latin): a sequence of data transformation steps; users can plug in custom code.
Research nuggets:
  Joint scheduling of related programs, to amortize IOs
  Adaptive data placement, to enhance locality
  Automatic example data generator, to make the user's life easier
Credits
Yahoo! Grid Team project leads: Alan Gates, Olga Natkovich
Yahoo! Research project leads: Chris Olston, Utkarsh Srivastava, Ben Reed