Database design and implementation CMPSCI 645. Lectures 22: Parallel Databases and MapReduce


1 Database design and implementation CMPSCI 645 Lectures 22: Parallel Databases and MapReduce

2 What is a parallel database?

3 Parallel vs. distributed databases
- Parallel database system:
  - Improves performance through parallel implementation
- Distributed database system:
  - Data is stored across several sites, each site managed by a DBMS capable of running independently

4 Parallel DBMSs
- Goal
  - Improve performance by executing multiple operations in parallel
- Key benefit
  - Cheaper to scale than relying on a single, increasingly powerful processor
- Key challenge
  - Ensure overhead and contention do not kill performance

5 Performance metrics for parallel DBMSs
- Speedup
  - More processors, higher speed
  - Individual queries should run faster
  - Should do more transactions per second (TPS)
  - Fixed problem size overall, vary # of processors ("strong scaling")
- Scaleup
  - More processors can process more data
  - Fixed problem size per processor, vary # of processors ("weak scaling")
  - Batch scaleup
    - Same query on larger input data should take the same time
  - Transaction scaleup
    - N times as many TPS on an N-times larger database
    - But each transaction typically remains small
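Both metrics can be written as simple ratios of running times. The sketch below is illustrative only: the function names and the timings are made up for the example and do not come from the lecture.

```python
def speedup(time_on_one: float, time_on_p: float) -> float:
    """Strong scaling: same problem size, P processors instead of one."""
    return time_on_one / time_on_p


def efficiency(time_on_one: float, time_on_p: float, p: int) -> float:
    """Fraction of linear speedup actually achieved with p processors."""
    return speedup(time_on_one, time_on_p) / p


def batch_scaleup(time_small: float, time_large: float) -> float:
    """Weak scaling: P times the data on P times the processors.

    1.0 means the larger job took the same time (linear scaleup).
    """
    return time_small / time_large


# Hypothetical timings in seconds: 1 processor vs. 8 processors.
print(speedup(120.0, 20.0))         # linear speedup would be 8.0
print(efficiency(120.0, 20.0, 8))
print(batch_scaleup(120.0, 150.0))  # below 1.0: sub-linear scaleup
```

With these made-up numbers the speedup falls short of the ideal factor of 8, which is the gap that startup cost, interference, and skew typically explain.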

6 Linear vs. non-linear speedup
[Plot: speedup (y-axis) against # processors (= P) (x-axis)]

7 Linear vs. non-linear scaleup
[Plot: batch scaleup (y-axis) against # processors (= P) AND data size (x-axis)]

8 Challenges to linear speedup and scaleup
- Startup cost
  - Cost of starting an operation on many processors
- Interference
  - Contention for resources between processors
- Skew
  - Slowest processor becomes the bottleneck

9 Architectures for parallel databases
- Shared memory
- Shared disk
- Shared nothing

10 Shared memory
[Diagram, top to bottom: P P P; Interconnection Network; Global Shared Memory; D D D]

11 Shared disk
[Diagram, top to bottom: M M M; P P P; Interconnection Network; D D D]

12 Shared nothing
[Diagram, top to bottom: Interconnection Network; P P P; M M M; D D D]

13 Shared nothing
- Most scalable architecture
  - Minimizes interference by minimizing resource sharing
  - Can use commodity hardware
- Also most difficult to program and manage
We will focus on shared nothing.
Important question: what exactly can we parallelize in a parallel database?

14 Taxonomy for parallel query evaluation
- Inter-query parallelism
  - Each query runs on one processor
- Inter-operator parallelism
  - A query runs on multiple processors
  - An operator runs on one processor
- Intra-operator parallelism
  - An operator runs on multiple processors

15 Different types of parallelism
- Partitioned parallelism
  - Partition data over all nodes, get the nodes working to compute a given operation (scan, sort, join)
- Pipelined parallelism
  - A chain of operators O_1, O_2, ..., O_k run in parallel, with O_1 working on tuple t_n, O_2 on t_(n-1), ..., O_k on t_(n-k+1)
  - Can run these operators on different nodes
  - Some operators break pipelining, e.g. sort, hash
- Independent operators
  - Consider bushy query plans
  - A join B, C join D are independent
[Diagram: bushy plan over A, B, C, D]

16 Data partitioning schemes
Partitioning a table: Range, Hash, Round Robin
[Diagram: for each of the three schemes, five partitions labeled A...E, F...J, K...N, O...S, T...Z]

17 Data partitioning
What are the pros and cons?
- Round robin
  - Good load balance but always needs to read all the data
- Hash-based partitioning
  - Good load balance but works only for equality predicates and full scans
- Range-based partitioning
  - Works well for range predicates but can suffer from data skew
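The three schemes differ only in the rule that picks a partition for each row. A minimal sketch, assuming rows are plain Python values and using hypothetical names and boundary letters chosen to mirror the A...E / F...J / K...N / O...S / T...Z picture:

```python
import bisect
import zlib


def round_robin(rows, n):
    """Row i goes to partition i mod n: even sizes, no locality."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Rows with equal keys always land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is deterministic across runs, unlike hash() on strings.
        h = zlib.crc32(str(key(row)).encode("utf-8"))
        parts[h % n].append(row)
    return parts


def range_partition(rows, boundaries, key):
    """boundaries: sorted lower bounds of partitions 1..n-1."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_right(boundaries, key(row))].append(row)
    return parts


# Hypothetical data.
names = ["Alice", "Bob", "Frank", "Kim", "Olga", "Tom", "Zed", "Adam"]
print(round_robin(names, 3))
print(hash_partition(names, 3, key=lambda s: s))
print(range_partition(names, ["F", "K", "O", "T"], key=lambda s: s))
```

In the range output the first partition holds three names and the middle ones a single name each, a small instance of the skew problem; a range predicate such as "names from F to J" nevertheless touches only one partition.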

18 Parallel evaluation of operators
- Selection?
- Aggregates?
- Joins?

19 Parallel join: R join S on attribute x
[Diagram: R and S are each hashed on x; each hash bucket is then joined separately]
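The idea on this slide can be sketched as follows: hash both relations on x so that matching tuples meet in the same bucket, then join each bucket independently. This is a single-process illustration under assumed names (rows are dictionaries, `n` stands for the number of nodes); a real system would run each bucket's join on a different node.

```python
from collections import defaultdict


def partition(rows, n, x):
    """Send each row to bucket hash(x-value) mod n."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[x]) % n].append(row)
    return buckets


def local_hash_join(r_rows, s_rows, x):
    """Classic build/probe hash join on one node."""
    table = defaultdict(list)
    for r in r_rows:                       # build phase on R
        table[r[x]].append(r)
    return [{**r, **s}                     # probe phase with S
            for s in s_rows
            for r in table.get(s[x], [])]


def parallel_join(r, s, x, n):
    r_buckets = partition(r, n, x)
    s_buckets = partition(s, n, x)
    out = []
    for i in range(n):                     # each i could be a separate node
        out.extend(local_hash_join(r_buckets[i], s_buckets[i], x))
    return out


# Hypothetical relations R(x, a) and S(x, b).
R = [{"x": 1, "a": "r1"}, {"x": 2, "a": "r2"}, {"x": 2, "a": "r3"}]
S = [{"x": 2, "b": "s1"}, {"x": 3, "b": "s2"}]
print(parallel_join(R, S, "x", n=4))
```

Because equal x values hash to the same bucket, no pair of matching tuples is ever split across nodes, so the union of the per-bucket joins equals the full join.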

20 MapReduce
[Diagram: Map, Reduce]

21 Example: document processing Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. 
Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.

How many big, medium, and small words are used?

22 Example: word length histogram
- Big = Yellow = 10+ letters
- Medium = Red = 5..9 letters
- Small = Blue = 2..4 letters
- Tiny = Pink = 1 letter
[Slide repeats the abridged Declaration text from slide 21, with the words colored by length class]

23 Example: word length histogram
Process each chunk on a different computer
- Chunk 1: from "A Declaration By the Representatives ..." through "... Prudence indeed will"
- Chunk 2: from "dictate that governments long established ..." through "... a faith yet unsullied by falsehood."
[Slide repeats the abridged Declaration text from slide 21, split into these two chunks]

24 Example: word length histogram
- Map Task 1 (204 words; the text up to "Prudence indeed will") emits (key, value) pairs: (yellow, 16) (red, 78) (blue, 107) (pink, 3)
- Map Task 2 (190 words; the text from "dictate that governments" to the end) emits: (yellow, 19) (red, 72) (blue, 93) (pink, 6)
[Slide repeats the abridged Declaration text from slide 21 next to each map task]

25 Example: word length histogram
Shuffle step
- Map task 1 output: (yellow, 16) (red, 78) (blue, 107) (pink, 3)
- Map task 2 output: (yellow, 19) (red, 72) (blue, 93) (pink, 6)
- Reduce tasks:
  - (yellow, 16) (yellow, 19) -> (yellow, 35)
  - (red, 78) (red, 72) -> (red, 150)
  - (blue, 93) (blue, 107) -> (blue, 200)
  - (pink, 6) (pink, 3) -> (pink, 9)
[Slide repeats the abridged Declaration text from slide 21 next to each map task]
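The map, shuffle, and reduce steps of the histogram example can be sketched in a few lines. The tokenization rule below (runs of letters and apostrophes) is an assumption, so it may not reproduce the slide's exact per-chunk word counts; the function names are likewise hypothetical. The final lines feed the slide's own partial counts through the shuffle and reduce steps.

```python
import re
from collections import Counter


def color_of(word):
    """Length classes from the slide: 10+, 5..9, 2..4, 1 letters."""
    n = len(word)
    if n >= 10:
        return "yellow"
    if n >= 5:
        return "red"
    if n >= 2:
        return "blue"
    return "pink"


def map_task(chunk):
    """One map task: emit one (color, count) pair per color in the chunk."""
    words = re.findall(r"[A-Za-z']+", chunk)  # assumed tokenization
    return sorted(Counter(color_of(w) for w in words).items())


def shuffle(map_outputs):
    """Group all values that share a key, across every map task."""
    groups = {}
    for output in map_outputs:
        for key, value in output:
            groups.setdefault(key, []).append(value)
    return groups


def reduce_task(key, values):
    return key, sum(values)


print(map_task("a decent respect to the opinions of mankind"))

# Partial counts exactly as shown on the slides.
task1 = [("yellow", 16), ("red", 78), ("blue", 107), ("pink", 3)]
task2 = [("yellow", 19), ("red", 72), ("blue", 93), ("pink", 6)]
totals = dict(reduce_task(k, v) for k, v in shuffle([task1, task2]).items())
print(totals)
```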

26 MapReduce programming model
- Input & output: each a set of key/value pairs
- Programmer specifies two functions:
  map (in_key, in_value) -> list(out_key, intermediate_value)
  - Processes input key/value pair
  - Produces set of intermediate pairs
  reduce (out_key, list(intermediate_value)) -> list(out_value)
  - Combines all intermediate values for a particular key
  - Produces a set of merged output values (usually just one)
- Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell

27 MapReduce
- Google: [Dean 2004]
- Open source implementation: Hadoop
- MapReduce = high-level programming model and implementation for large-scale parallel data processing

28 Motivation: large-scale data processing
- Want to process lots of data, unstructured or structured
- Want to parallelize across hundreds/thousands of commodity computers
  - New definition of cluster computing: large numbers of low-end processors working in parallel to solve a computing problem.
  - Parallel DB: a small number of high-end servers.
- Want to make this easy

29 Implementation
- There is one master node
- Master partitions input file into M splits, by key
- Master assigns workers (= servers) to the M map tasks, keeps track of their progress
- Workers write their output to local disk, partitioned into R regions
- Master assigns workers to the R reduce tasks
- Reduce workers read regions from the map workers' local disks
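The data movement described above can be simulated in one process. This is a rough sketch, not how Hadoop or Google's system is implemented: `run_job`, the contiguous-slice splitting, and the hash-mod-R region rule are assumptions made for illustration, and the log records are hypothetical.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn, m, r):
    # Master: cut the input into m splits (here: contiguous slices).
    size = -(-len(records) // m)  # ceiling division
    splits = [records[i * size:(i + 1) * size] for i in range(m)]

    # Map workers: each writes its output into r regions, chosen by key.
    map_outputs = []
    for split in splits:
        regions = [[] for _ in range(r)]
        for key, value in split:
            for out_key, out_value in map_fn(key, value):
                regions[hash(out_key) % r].append((out_key, out_value))
        map_outputs.append(regions)

    # Reduce workers: worker j reads region j from every map worker.
    results = {}
    for j in range(r):
        groups = defaultdict(list)
        for regions in map_outputs:
            for out_key, out_value in regions[j]:
                groups[out_key].append(out_value)
        for out_key in sorted(groups):
            results[out_key] = reduce_fn(out_key, groups[out_key])
    return results


def parse_line(line_no, line):
    host, size = line.split()
    yield host, int(size)


def total_bytes(host, sizes):
    return sum(sizes)


# Hypothetical log lines: "<host> <bytes>".
log = [(0, "hostA 120"), (1, "hostB 30"), (2, "hostA 80"),
       (3, "hostC 5"), (4, "hostB 70")]
print(run_job(log, parse_line, total_bytes, m=2, r=3))
```

Since every pair with the same key falls into the same region number on every map worker, each reduce worker sees all values for its keys regardless of the choice of M and R.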

30 Why is MapReduce successful?
- Easy
  - Democratization of parallel computing
  - Just two serial functions
  - Time to first query: a few hours (contrast with parallel DB)
- Flexible
  - Schema-free, in situ processing
  - First, load your data into the database
  - First, convert your images to bitmaps
  - First, encode your 3D mesh as triangle soup
- Fault-tolerance

31 Example: Count Word Occurrences

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));

How do we implement this using a relational DBMS? Customized data loading (data may be used only once), then Group By.
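A runnable version of the pseudocode, as a sketch: `map_fn` and `reduce_fn` follow the slide line by line, while the tiny `map_reduce` driver and the whitespace tokenization are assumptions added so the example runs in a single process. The documents are made up.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    # input_key: document name, input_value: document contents
    for word in input_value.split():
        yield word, "1"


def reduce_fn(output_key, intermediate_values):
    # output_key: a word, intermediate_values: a list of counts
    result = 0
    for v in intermediate_values:
        result += int(v)
    return str(result)


def map_reduce(documents):
    """Hypothetical driver: run map, group by key, then run reduce."""
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)
    return {word: reduce_fn(word, counts)
            for word, counts in sorted(intermediate.items())}


docs = {"d1": "to be or not to be", "d2": "to do"}
print(map_reduce(docs))
```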

32 Click Stream Analysis: Page Frequencies

Clicks(time, url, referral_url, user_id, geo_info)

map(String tuple_id, String tuple):
    EmitIntermediate(url, "1");

reduce(String url, Iterator list_tuples):
    int result = 0;
    for each t in list_tuples:
        result += ParseInt(t);
    Emit(AsString(result));

Select count(*) From Clicks Group By url;
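To see that the map/reduce pair and the Group By query compute the same thing, the sketch below runs both on a few made-up click tuples, using Python's built-in sqlite3 module as a stand-in for the relational DBMS. One deliberate difference from the slide: the query also selects url, so that each count can be matched to its page.

```python
import sqlite3
from collections import defaultdict

# Hypothetical click tuples: (time, url, referral_url, user_id, geo_info).
clicks = [
    (1, "/home", None, 10, "US"),
    (2, "/home", "/search", 11, "DE"),
    (3, "/about", "/home", 10, "US"),
    (4, "/home", None, 12, "FR"),
]

# Relational version: load, then GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Clicks (time INTEGER, url TEXT, referral_url TEXT, "
    "user_id INTEGER, geo_info TEXT)"
)
conn.executemany("INSERT INTO Clicks VALUES (?, ?, ?, ?, ?)", clicks)
sql_counts = dict(conn.execute("SELECT url, count(*) FROM Clicks GROUP BY url"))


# MapReduce version: emit (url, "1") per tuple, then sum per url.
def map_fn(tuple_id, row):
    yield row[1], "1"


def reduce_fn(url, list_tuples):
    return sum(int(t) for t in list_tuples)


groups = defaultdict(list)
for tuple_id, row in enumerate(clicks):
    for url, one in map_fn(tuple_id, row):
        groups[url].append(one)
mr_counts = {url: reduce_fn(url, ones) for url, ones in groups.items()}

print(sql_counts)
print(mr_counts)
```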

33 Parallelism
- The map() function is stateless, so many instances can run in parallel on different splits (chunks) of input data
- The reduce() function is stateful, but works on one output key at a time, so many copies can run in parallel on different keys (groups)
- Performance bottleneck: reduce phase can't start until map phase is completely finished.

34 MapReduce summary
- Hides scheduling and parallelization details
- However, very limited queries
  - Difficult to write more complex tasks
  - Need multiple MapReduce operations
- Solution:
  - Use MapReduce as a runtime for higher-level languages
  - Pig (Yahoo!, now Apache project): RA-like operators
  - Hive (Apache project): SQL
  - Scope (MS): SQL! But proprietary
  - DryadLINQ (MS): LINQ! But also proprietary

35 MapReduce: A major step backwards?
- Seminal debate in Jan 2008
  - David DeWitt, Michael Stonebraker
- Five points
  - MapReduce is a step backwards in database access
  - MapReduce is a poor implementation
  - MapReduce is not novel
  - MapReduce is missing features
  - MapReduce is incompatible with the DBMS tools

36 MapReduce is a step backwards in database access
- No schema or schema free
- Separation of the schema from the application is good
- High-level access languages are good

37 MapReduce is a poor implementation
- No index. Only offers brute force access.
- Poor handling of skew
- Shuffle phase incurs a huge number of random disk accesses

38 MapReduce is not novel
- User-defined functions have been around in databases for decades
- Many of the parallel distributed processing techniques have been extensively researched in the database literature

39 MapReduce is missing features
- Bulk loader
- Indexing
- Updates
- Transactions
- Integrity constraints
- Referential integrity
- Views

40 MapReduce is incompatible with the DBMS tools
- Report writers
- Business intelligence tools
- Data mining tools
- Replication tools
- Database design tools

41 Do you agree or disagree?

42 Making parallelism simple
- Sequential reads = good read speeds
- In a large cluster failures are guaranteed; MapReduce handles retries
- Good fit for batch processing applications that need to touch all your data:
  - data mining
  - model tuning
- Bad fit for applications that need to find one particular record
- Bad fit for applications that need to communicate between processes; oriented around independent units of work

43 MapReduce vs RDBMS
- RDBMS
  - Declarative query languages
  - Schemas
  - Logical data independence
  - Indexing
  - Algebraic optimization
  - Caching/materialized views
  - ACID/transactions
- MapReduce
  - High scalability
  - Fault-tolerance
  - "One-person deployment"
[Annotations beside the RDBMS list: DryadLINQ, Pig, HIVE; HIVE, Pig; Hbase; Pig, (Dryad, HIVE)]

44 Questions
- Is it silly for MapReduce to try and be like a parallel database?
  - SQL on Hadoop: Hive
- Is it silly for a parallel database to try and be like MapReduce?
  - MapReduce in SQL: Greenplum, Aster Data, ...

45 Things people like about Hadoop
- $0 to get started
  - No viable open source parallel database
- Scalability and fault tolerance on commodity hardware
- Easy to set up
  - Works OK without tuning
- Freedom from the "Warehouse Priesthood"
  - Analysts like to load data in HDFS and experiment with it
  - A rigid warehouse schema is often not what they want
- Open source brings innovation and choices
  - Languages, ML libs, ...
- Extensibility & programmability of the platform
  - Able to do stuff they probably couldn't do in a parallel database

46 Shared-nothing parallel databases
- Teradata
- Greenplum: acquired by EMC (July 2010)
- Netezza: acquired by IBM (Sep 2010)
- Aster Data Systems: acquired by Teradata (March 2011)
- Datallegro: acquired by Microsoft (July 2008)
- Vertica: acquired by HP (Feb 2011)
- MonetDB: commercialized as Vectorwise

47 What is Pig?
- An engine for executing programs on top of Hadoop
- It provides a language, Pig Latin, to specify these programs
- An Apache open source project
  - http://hadoop.apache.org/pig/

48 Why use Pig?
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18 to 25.
[Dataflow diagram: Load Users -> Filter by age; Load Pages; both feed Join on name -> Group on url -> Count clicks -> Order by clicks -> Take top 5]

49 In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            // Keep only users aged 18 to 25
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it
            // accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outVal = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outVal));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url: the field between the first and second comma.
            // (The search for the second comma must start after the first.)
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma + 1);
            String key = line.substring(firstComma + 1, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Swap key and value so records are sorted by click count
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}

170 lines of code, 4 hours to write

50 In Pig Latin

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

9 lines of code, 15 minutes to write
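To make the dataflow concrete, here is the same sequence of steps on in-memory Python lists. It is only a sketch of what each Pig Latin statement computes, not of how Pig executes it; the sample users and pages are hypothetical, and the join shortcut assumes user names are unique in the users data.

```python
from collections import Counter


def top_sites(users, pages, k=5):
    # filter Users by age >= 18 and age <= 25
    fltrd = {name for name, age in users if 18 <= age <= 25}
    # join Fltrd by name, Pages by user (names assumed unique in users)
    jnd = [(user, url) for user, url in pages if user in fltrd]
    # group Jnd by url; foreach ... generate group, COUNT(Jnd) as clicks
    smmd = Counter(url for _, url in jnd)
    # order Smmd by clicks desc (ties broken by url to be deterministic)
    srtd = sorted(smmd.items(), key=lambda kv: (-kv[1], kv[0]))
    # limit Srtd k
    return srtd[:k]


# Hypothetical inputs.
users = [("amy", 22), ("bob", 31), ("cat", 19), ("dan", 17)]
pages = [("amy", "a.com"), ("amy", "b.com"), ("cat", "a.com"),
         ("bob", "a.com"), ("dan", "c.com"), ("cat", "a.com")]
print(top_sites(users, pages))
```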

51 Essence of Pig
- MapReduce is too low a level to program, SQL too high
- Pig Latin, a language intended to sit between the two:
  - Imperative
  - Provides standard relational transforms (join, sort, etc.)
  - Schemas are optional, used when available, can be defined at runtime
  - User Defined Functions are first-class citizens
  - Opportunities for an advanced optimizer, but optimizations by the programmer are also possible

52 How It Works
[Pipeline diagram: Script (A = load, B = filter, C = group, D = foreach) -> Parser -> Logical Plan -> Semantic Checks -> Logical Plan -> Logical Optimizer -> Logical Plan -> Logical to Physical Translator -> Physical Plan -> Physical To MR Translator -> Map-Reduce Plan -> MapReduce Launcher -> Jar to hadoop]
- Logical Plan: relational algebra
- Plan: standard optimizations
- Physical Plan = physical operators to be executed
- Map-Reduce Plan = physical operators broken into Map, Combine, and Reduce stages

53 Fragment replicate join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';

[Diagram: Map 1 reads Pages block 1 plus a full copy of Users; Map 2 reads Pages block 2 plus a full copy of Users]
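The picture above amounts to a map-only join: the large input is fragmented into blocks, and the small input is replicated to every map task. A minimal single-process sketch, with hypothetical data and an output layout of (user, url, age) chosen for brevity:

```python
def replicated_join(pages_blocks, users):
    """Map-only join: every map task gets the whole (small) Users table."""
    joined = []
    for block in pages_blocks:           # one iteration = one map task
        lookup = {}                      # built from the replicated copy
        for name, age in users:
            lookup.setdefault(name, []).append(age)
        for user, url in block:          # stream the task's own Pages block
            for age in lookup.get(user, []):
                joined.append((user, url, age))
    return joined


# Hypothetical inputs: Pages split into two blocks, Users small.
pages_blocks = [
    [("fred", "a.com"), ("jane", "b.com")],
    [("fred", "c.com"), ("zoe", "a.com")],
]
users = [("fred", 21), ("jane", 34)]
print(replicated_join(pages_blocks, users))
```

No shuffle or reduce phase is needed, which is why this strategy only pays off when the replicated input is small enough to fit in each map task's memory.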

54 Hash join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;

[Diagram: Map 1 reads Users block n and emits (1, user); Map 2 reads Pages block m and emits (2, name); Reducer 1 receives (1, fred), (2, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane), (2, jane)]

55 Skew join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';

[Diagram: Map 1 reads Pages block n (S P) and emits (1, user); Map 2 reads Users block m (S P) and emits (2, name); Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred)]

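56 Merge join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';

[Diagram: Pages and Users are both sorted by name, starting at aaron. Map 1 reads the Pages block aaron ... amr with Users positioned at aaron; Map 2 reads the Pages block amy ... barb with Users positioned at amy]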
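The core of a merge join is a single coordinated pass over two inputs that are already sorted on the join key. The sketch below shows only that merge step on in-memory lists; how Pig locates the right starting position in Users for each map task is not modeled, and the rows are hypothetical.

```python
def merge_join(pages, users):
    """Join two lists of tuples, both sorted on their first field."""
    out = []
    i = j = 0
    while i < len(pages) and j < len(users):
        p_key, u_key = pages[i][0], users[j][0]
        if p_key < u_key:
            i += 1
        elif p_key > u_key:
            j += 1
        else:
            # Find the run of users rows sharing this key, then pair the
            # run with every pages row that has the same key.
            j_end = j
            while j_end < len(users) and users[j_end][0] == p_key:
                j_end += 1
            while i < len(pages) and pages[i][0] == p_key:
                for u in users[j:j_end]:
                    out.append(pages[i] + u[1:])
                i += 1
            j = j_end
    return out


# Hypothetical sorted inputs.
pages = [("aaron", "a.com"), ("aaron", "b.com"), ("amr", "c.com"),
         ("amy", "a.com"), ("barb", "d.com")]
users = [("aaron", 20), ("amy", 30), ("zach", 40)]
print(merge_join(pages, users))
```

Each input is read once and no hash table is built, which is the appeal of this strategy when the data is already stored in sorted order.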

57 What are people doing with Pig
- At Yahoo ~70% of Hadoop jobs are Pig jobs
- Being used at Twitter, LinkedIn, and other companies
- Available as part of Amazon EMR web service and Cloudera Hadoop distribution
- What users use Pig for:
  - Search infrastructure
  - Ad relevance
  - Model training
  - User intent analysis
  - Web log processing
  - Image processing
  - Incremental processing of large data sets

58 Questions
- Is it silly for MapReduce to try and be like a parallel database?
  - SQL on Hadoop: Hive
- Is it silly for a parallel database to try and be like MapReduce?
  - MapReduce in SQL: Greenplum, Aster Data, ...