Data Science in the Wild, Lecture 3
Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Data Science and Big Data
Big Data: data that cannot be processed on a single machine
- Required in many data science tasks
- Calls for tools and techniques based on distributed systems
Examples of Tools
- Computation: MapReduce (Hadoop), Spark
- Platforms / warehouses: Pig, Hive, Shark
- Database systems: CouchDB, MongoDB
- Analytics: Mahout

Motivation: Google Example
- 20+ billion web pages x 20 KB = 400+ TB
- One computer reads 30-35 MB/sec from disk, so it would take ~4 months just to read the web
- ~1,000 hard drives are needed just to store the web
- It takes even more to do something useful with the data!
- Conclusion: we need a distributed system
Cluster Architecture
- 1 Gbps between any pair of nodes in a rack
- 2-10 Gbps backbone between racks
- Each rack contains 16-64 nodes (each with its own CPU, memory, and disk) behind a switch; rack switches connect to a backbone switch
[Figure: two racks of CPU/memory/disk nodes, each behind a switch, joined by a backbone switch]

In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/shh0ro
Important Challenges
- How to utilize many servers and synchronize their work
- How to easily write distributed programs
- How to cope with machine failure

Machine failure example:
- 10,000 servers
- Expected lifespan of a server is 3 years
- So about 10,000 / (3 * 365) ≈ 9, i.e., roughly 10 machines will crash every day

Important Challenges (cont.)
- Utilize many servers and synchronize their work
- Cope with machine failure
Suppose there is a machine (a master) that synchronizes the work of the other servers:
- How to handle a crash of such a machine?
- Who will detect that the machine is not responding?
- How to avoid having two different masters?
Distributed Data Processing
Storage infrastructure: a distributed file system
- Google: GFS
- Hadoop: HDFS
Programming model: MapReduce
Typical usage pattern:
- Huge files (100s of GB to TB)
- Data is rarely updated in place
- Reads and appends are common

Distributed File System
Chunk servers
- A file is split into contiguous chunks
- Typically each chunk is 16-64 MB
- Each chunk is replicated (usually 2x or 3x)
- Replicas are kept in different racks when possible
Master node (a.k.a. Name Node in Hadoop's HDFS)
- Stores metadata about where files are stored
- Might itself be replicated
Client library for file access
- Talks to the master to find chunk servers
- Connects directly to chunk servers to access the data
Distributed File System (cont.)
Reliable distributed file system:
- Data kept in chunks spread across machines
- Each chunk replicated on different machines
- Seamless recovery from disk or machine failure
[Figure: chunks C0-C5 and D0-D1 replicated across chunk servers 1 through N, with each chunk stored on multiple servers]

MapReduce: Distributed Computing
MapReduce Model
Basic operations:
- Map: produce a list of (key, value) pairs from an input structured as a (key, value) pair of a different type
  map: (k1, v1) -> list(k2, v2)
- Reduce: produce a list of values from an input that consists of a key and the list of values associated with that key
  reduce: (k2, list(v2)) -> list(v2)
Note: inspired by the map/reduce functions in Lisp and other functional programming languages.

Data Flow
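To make the type contract concrete before the Hadoop code later in this lecture, here is a toy, single-JVM sketch of this data flow in plain Java (Java 10+; the class and variable names are ours, and the in-memory TreeMap stands in for the distributed shuffle):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of the MapReduce contract on one machine:
//   map:    (k1, v1)       -> list(k2, v2)   here: (doc, text) -> (word, 1)
//   reduce: (k2, list(v2)) -> list(v2)       here: (word, [1,1,...]) -> count
public class MiniMapReduce {
    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = List.of(
            Map.entry("doc1", "the best of times"),
            Map.entry("doc2", "the worst of times"));

        // Map phase + shuffle: emit (word, 1) and group the values by key
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (var kv : input) {
            for (String w : kv.getValue().split("\\s+")) {
                groups.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
            }
        }

        // Reduce phase: sum each key's list of values
        for (var e : groups.entrySet()) {
            int total = e.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(e.getKey() + "\t" + total);
        }
    }
}
```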
Example

```
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
```

Example: Map Phase
[Figure: input documents ("When in the course of human events it ...", "It was the best of times and the worst of times ...", "Over the past five years, the authors and many ...", "This paper evaluates the suitability of the ...") are processed by M=3 map tasks, which emit pairs such as (when,1), (course,1), (human,1), (events,1), (best,1), (over,1), (past,1), (five,1), (years,1), (authors,1), (many,1), (this,1), (paper,1), (evaluates,1), (suitability,1), (in,1), (it,1), (was,1), (the,1), (of,1), (and,1) into R=2 partitions (intermediate files).]
Note: the partition function places small words in one partition and large words in the other.
Example: Reduce Phase
[Figure: one of the two reduce tasks. Its partition (intermediate file) holds (in,1), (the,1), (of,1), (it,1), (it,1), (was,1), (the,1), (of,1), (the,1), (the,1), (and,1), (the,1), (of,1), (the,1). The run-time sort function groups these into (and,(1)), (in,(1)), (it,(1,1)), (of,(1,1,1)), (the,(1,1,1,1,1,1)), (was,(1)), and the user's reduce function emits (and,1), (in,1), (it,2), (of,3), (the,6), (was,1).]
Note: only one of the two reduce tasks is shown.

More Examples
- Count of URL access frequency
- Reverse web-link graph
- Term vector per host
- Inverted index
- Distributed sort
- Sample
Example: Language Model
Statistical machine translation needs to count the number of times every 5-word sequence occurs in a large corpus of documents.
Very easy with MapReduce:
- Map: extract (5-word sequence, count) pairs from each document (see the mapper sketch after the next figure)
- Reduce: combine the counts

Distributed Execution Overview
[Figure: the user program forks a master and worker processes. The master assigns map tasks and reduce tasks to idle workers. Map workers read input splits 0-2 and write intermediate files to their local disks; reduce workers remote-read and sort the intermediate data, then write output files 0 and 1.]
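A possible mapper for the language-model job above, written against the same classic mapred API as the word-count code later in this lecture (the class name is ours, and for simplicity it ignores 5-word sequences that span line boundaries):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (5-word sequence, 1) for every 5-gram in the input line;
// the word-count reducer shown later can sum the counts unchanged.
public class FiveGramMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text gram = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String[] words = value.toString().split("\\s+");
        for (int i = 0; i + 5 <= words.length; i++) {
            gram.set(String.join(" ",
                words[i], words[i + 1], words[i + 2], words[i + 3], words[i + 4]));
            output.collect(gram, ONE);
        }
    }
}
```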
Operation
1. Split the input files into M pieces and assign the pieces to Map processes.
2. The master assigns the M map tasks and the R reduce tasks to idle workers.
3. A map process reads its input split, parses the key/value pairs, and applies the Map function. Results are buffered in memory.
4. Periodically, the buffer is written to local disk, partitioned into R regions (by the partition function). The worker lets the master know where the partitions are.
5. A reducer receives the locations of its M chunks from the master. It reads them and sorts all the data it read. (How?)
6. The reducer passes each key and its list of values to the Reduce function and appends the output to the result. After this, the client can be woken up.
Execution Environment
[Figure: map tasks read input (key, value) pairs from data stores 1 through n and emit intermediate (key, values...) pairs; a barrier aggregates the intermediate values by output key; reduce tasks then produce the final values for key 1, key 2, key 3, ...]
- No reduce can begin until map is complete
- Tasks are scheduled based on the location of the data
- If a map worker fails at any time before reduce finishes, its task must be completely rerun
- The master must communicate the locations of the intermediate files
Note: figure and text from a presentation by Jeff Dean.

Worker Failure
- When worker A is idle or inaccessible for a long time, the master assigns its task to another machine
- What happens if the two machines both complete the task and report this to the master?
Master Failure
- In principle, it can be handled by writing checkpoints
- In practice, the user should re-execute the entire MapReduce job

How Many Hosts to Allocate?
- The number of map processes (M) depends on the size of the input and the size of a block
- The number of reducers (R) should be chosen so that M*R is not too big for the memory of the master and does not produce too many chunks of output
Sampling Using MapReduce

Simple Random Sample
- Given a population P of size N, choose a sample of size k
- A simple random sample of size k: any two subsets of the population of size k have the same probability of being selected
- Different from Bernoulli sampling or Poisson sampling, for which the sample size is not a constant
Simple Random Sample (cont.)
- Common trick in databases: assign a random number to each item, sort, and take the first k items
- Costly if N is very big (what is the cost? Sorting is O(N log N))
- Is there a more efficient way to sample the data?

Reservoir Sampling
- Reservoir sampling described by [Knuth 69, 81]; enhancements by [Vitter 85]
- Fixed-size k uniform sample from an arbitrary-size N stream, in one pass
- No need to know the stream size in advance
- Include the first k items w.p. 1
- Include item n > k with probability p(n) = k/n:
  - Pick j uniformly from {1, 2, ..., n}
  - If j <= k, swap item n into location j in the reservoir and discard the replaced item
- A simple induction proof shows the uniformity of the sampling method. Let S_n be the sample set after n arrivals:
  - New item n: selected with probability Prob[n in S_n] = p_n := k/n
  - Previously sampled item m (m < n): it is in S_{n-1} w.p. p_{n-1} and is evicted only if item n is selected (w.p. p_n) and lands on m's slot (w.p. 1/k), so Prob[m in S_n] = p_{n-1} * (1 - p_n/k) = (k/(n-1)) * ((n-1)/n) = k/n = p_n
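A minimal sketch of this algorithm (Algorithm R) in Java; the class and method names are ours:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Algorithm R: one-pass uniform sample of k items from a stream of unknown length.
public class ReservoirSampler<T> {
    private final int k;
    private final List<T> reservoir;
    private final Random rnd = new Random();
    private long n = 0; // number of items seen so far

    public ReservoirSampler(int k) {
        this.k = k;
        this.reservoir = new ArrayList<>(k);
    }

    public void offer(T item) {
        n++;
        if (reservoir.size() < k) {
            reservoir.add(item);            // the first k items are kept w.p. 1
        } else {
            // item n is kept w.p. k/n: pick j uniformly from {0, ..., n-1};
            // if j < k, item n replaces the item currently at slot j
            long j = (long) (rnd.nextDouble() * n);
            if (j < k) {
                reservoir.set((int) j, item);
            }
        }
    }

    public List<T> sample() {
        return reservoir;
    }
}
```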
Using MapReduce to Sample a Distributed Dataset
- The population is distributed over several servers
- A constraint s_k defines which objects of the population are relevant
- There may be different constraints that refer to different samples
- The goal: for each constraint, find a simple random sample of the objects that satisfy the constraint

Naïve Approach
map(null, t) -> [(s_k, t)]  (if t satisfies s_k)
reduce(s_k, [t_1, ..., t_N]) -> SRS([t_1, ..., t_N], f_k)
- Apply the map based on the constraints
- Apply a simple random sample in the reduce phase
MapReduce Evaluator
[Figure: map phase over splits 1-3 for a query Q over strata x1 and x2, with a combiner phase that runs Algorithm R on each split's matching tuples.]
[Figure: reduce phase for the same query Q, again using Algorithm R to combine the per-split samples.]
Proper Sampling in the Reduce Phase
Samples may have been drawn from subsets of different sizes, so their elements have different a-priori inclusion probabilities and the samples cannot simply be unioned.
[Figure: items 1-6 are split into two subsets of different sizes; a sample of size 2 is drawn from each, e.g., {2, 3} and {5, 6}.]

MR-SQE
map(null, t) -> [(s_k, t)]  (if t satisfies s_k)
combine(s_k, [t_1, ..., t_N]) -> (SRS([t_1, ..., t_N], f_k), N)
reduce(s_k, [(S_1, N_1), ..., (S_K, N_K)]) -> [unified-sampler({(S_1, N_1), ..., (S_K, N_K)}, f_k)]
- Apply the map based on the stratum constraints
- Apply a combiner that samples the sets during the map process, to decrease the amount of transmitted data
- In the reduce, unify the samples while taking into account the initial sizes of the sets and the required strata sizes
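The slides leave unified-sampler abstract. One way to realize it is sequential weighted drawing: repeatedly pick a subset with probability proportional to its remaining original size, then consume a random unused element of that subset's sample. A hedged sketch, with names of our own choosing; it assumes each sample S_i holds min(N_i, f) items, which the combiner guarantees, and that the sample lists are mutable:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Merge per-subset simple random samples (S_i drawn from a subset of size N_i)
// into one simple random sample of size f from the union of the subsets.
public class UnifiedSampler {
    public static <T> List<T> unify(List<List<T>> samples, long[] sizes, int f, Random rnd) {
        long remaining = 0;
        for (long s : sizes) remaining += s;
        long[] left = sizes.clone();            // unconsumed items per subset
        List<T> out = new ArrayList<>(f);
        for (int drawn = 0; drawn < f; drawn++) {
            // pick subset i with probability left[i] / remaining
            long r = (long) (rnd.nextDouble() * remaining);
            int i = 0;
            while (r >= left[i]) { r -= left[i]; i++; }
            // consume a uniformly random unused element of subset i's sample
            List<T> s = samples.get(i);
            out.add(s.remove(rnd.nextInt(s.size())));
            left[i]--;
            remaining--;
        }
        return out;
    }
}
```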
HADOOP (MapReduce)

MapReduce vs. Hadoop
- Unless you work at Google, or use Google App Engine, you will use Hadoop rather than Google's MapReduce
- Hadoop was started by Yahoo and is developed by Apache
- Google's implementation (at least the one described) is written in C++
- Hadoop is written in Java
Major Components
User components:
- Mapper
- Reducer
- Combiner (optional)
- Partitioner (optional) (shuffle)
- Writable(s) (optional)
System components:
- Master
- Input Splitter*
- Output Committer*
* You can use your own if you really want!
Image source: http://www.ibm.com/developerworks/java/library/l-hadoop-3/index.html

Notes
- Mappers and Reducers are typically single-threaded and deterministic
  - Determinism allows restarting of failed jobs, or speculative execution
- Need to handle more data? Just add more Mappers/Reducers!
  - No need to write multithreaded code
  - Since they are all independent of each other, you can run an (almost) arbitrary number of nodes
Notes (cont.)
- Mappers/Reducers run on arbitrary machines
  - A machine typically has multiple map and reduce slots available to it, typically one per processor core
- Mappers/Reducers run entirely independently of each other
  - In Hadoop, they run in separate JVMs

Input Splitter
- Responsible for splitting your input into multiple chunks, which are then used as input for your mappers
- Splits on logical boundaries; the default is 64 MB per chunk
- Typically, the user can just use one of the built-in splitters, unless required to read from a specially formatted file
Code Example
Next, we present a code example of the word-count application. This version is not optimized; see the following tutorial for how to write it in an optimized way:
http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

```java
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
```
```java
// Define the map function (the Reporter argument is for reporting progress)
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

// Define the reduce function
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
```
```java
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);       // set the mapper
  conf.setCombinerClass(Reduce.class);  // set a combiner
  conf.setReducerClass(Reduce.class);   // set the reducer
```

Related JobConf settings:
- JobConf.setOutputKeyComparatorClass(Class) controls how the map output keys are compared when they are sorted for the reducers
- JobConf.setNumReduceTasks(int) sets the number of reduce tasks
- JobConf.setNumMapTasks(int) gives a hint for the number of map tasks
Combiner
- Essentially an intermediate reducer; it is optional
- Reduces the output from each mapper, reducing bandwidth and sorting work
- Cannot change the type of its input: its input types must be the same as its output types

```java
  conf.setInputFormat(TextInputFormat.class);    // the InputFormat generates the InputSplits
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}  // end of main
}  // end of class WordCount
```
Output Committer
- Responsible for taking the reduce output and committing it to a file
- Typically, this committer needs a corresponding input splitter (so that another job can read the output)
- Again, the built-in committers are usually good enough, unless you need to output a special kind of file

Partitioner (Shuffler)
- Decides which pairs are sent to which reducer
- The default is simply: key.hashCode() % numOfReducers
- The user can override it to:
  - Provide a (more) uniform distribution of load between reducers
  - Send values that must be processed together to the same reducer
    - E.g., to compute the relative frequency of a pair of words <W1, W2>, you need to make sure all pairs with the word W1 are sent to the same reducer (see the sketch below)
  - Bin the results
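A hedged sketch of such a custom partitioner against the classic mapred API; the class name and the key format "W1,W2" are our assumptions:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends every key of the form "W1,W2" to the reducer chosen by W1 alone,
// so all pairs that share a first word meet at the same reducer.
public class FirstWordPartitioner implements Partitioner<Text, IntWritable> {
    @Override
    public void configure(JobConf job) { /* no configuration needed */ }

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String firstWord = key.toString().split(",")[0];
        // mask the sign bit so the partition index is non-negative
        return (firstWord.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

It would be registered in the driver with conf.setPartitionerClass(FirstWordPartitioner.class).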
Master
- Responsible for scheduling and managing jobs
- Scheduled computation should be close to the data if possible
  - Bandwidth is expensive (and slow)!
  - This relies on a distributed file system (GFS / HDFS)!
- If a task fails to report progress (such as reading input, writing output, etc.), crashes, or its machine goes down, it is assumed to be stuck; it is killed and the step is re-launched (with the same input)
- The Master is handled by the framework; no user code is necessary

Master (cont.)
- HDFS can replicate data to be local if necessary for scheduling
- Because our nodes are (or at least should be) deterministic, the Master can restart failed nodes (nodes should have no side effects!)
- If a node is the last step and is completing slowly, the master can launch a second copy of that node
  - The slowness can be due to hardware issues, network issues, etc.
  - The first copy to complete wins; any other runs are then killed
Backup Tasks
- A slow-running task (a straggler) prolongs the overall execution
- Stragglers are often caused by circumstances local to the worker on which the straggler task is running
  - Overload on the worker machine due to the scheduler
  - Frequent recoverable disk errors
- Solution:
  - Abort stragglers when the map/reduce computation is near its end (progress is monitored by the Master)
  - For each aborted straggler, schedule a backup (replacement) task on another worker
- Can significantly improve the overall completion time

Backup Tasks (cont.)
[Figure: execution timelines (1) without backup tasks and (2) with backup tasks (normal); backup tasks cut off the long tail caused by stragglers.]
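In Hadoop, backup tasks go by the name speculative execution and are toggled per job in the driver; a minimal sketch of lines that could be added to the main() above (the chosen values are illustrative):

```java
// Speculative execution is Hadoop's version of backup tasks; it is on by
// default and can be controlled separately for the map and reduce phases.
conf.setMapSpeculativeExecution(true);     // allow backup copies of slow map tasks
conf.setReduceSpeculativeExecution(false); // but not of reduce tasks
```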
Writables
- Types that can be serialized / deserialized to a stream
- The input/output classes are required to be Writables, as the framework serializes the data before writing it to disk
- The user can implement this interface and use their own types for their input/output/intermediate values
- There are default classes for basic values, like Strings, Integers, Longs, etc.
- Writables can also handle data structures, such as arrays, maps, etc.
- Your application needs at least six Writables:
  - 2 for your input
  - 2 for your intermediate values (Map <-> Reduce)
  - 2 for your output
A sketch of a user-defined Writable follows.
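A hedged sketch of a custom Writable; the type and its fields are our invention, for illustration only:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom value type holding a (count, sum) pair, e.g., for computing averages.
public class CountSumWritable implements Writable {
    private long count;
    private double sum;

    public CountSumWritable() { }               // required no-arg constructor

    public CountSumWritable(long count, double sum) {
        this.count = count;
        this.sum = sum;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(count);    // serialize the fields in a fixed order...
        out.writeDouble(sum);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        count = in.readLong();   // ...and deserialize them in the same order
        sum = in.readDouble();
    }

    public long getCount() { return count; }
    public double getSum() { return sum; }
}
```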