Data Science in the Wild

Lecture 3
Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Data Science and Big Data
Big Data: data that cannot be processed on a single machine
Required in many data science tasks
Tools and techniques based on distributed systems

Examples of Tools
Computation: MapReduce (Hadoop), Spark
Platforms / warehouses: Pig, Hive, Shark
Database systems: CouchDB, MongoDB
Analytics: Mahout

Motivation: Google Example
20+ billion web pages x 20 KB = 400+ TB
One computer reads 30-35 MB/sec from disk, so it would take ~4 months just to read the web
~1,000 hard drives just to store the web
It takes even more to do something useful with the data!
Need a distributed system
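A back-of-the-envelope check of the read-time claim above (assuming 400 TB of data and a 35 MB/sec sequential read rate):

\[ \frac{400 \times 10^{12}\ \text{bytes}}{35 \times 10^{6}\ \text{bytes/sec}} \approx 1.1 \times 10^{7}\ \text{sec} \approx 132\ \text{days} \approx 4.4\ \text{months} \]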

Cluster Architecture
1 Gbps bandwidth between any pair of nodes in a rack
2-10 Gbps backbone between racks
Each rack contains 16-64 nodes, each with its own CPU, memory, and disk
(figure: racks of nodes connected through switches)
In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/shh0ro

Important Challenges
How to utilize many servers and synchronize their work
How to easily write distributed programs
How to cope with machine failure
Machine failure example:
- 10,000 servers
- Expected lifespan of a server is 3 years
- About 10 machines will crash every day

Important Challenges (cont.)
Suppose there is a machine (a master) that synchronizes the work of the other servers:
- How to handle a crash of such a machine?
- Who will detect that the machine is not responding?
- How to avoid having two different masters?
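A quick check of the failure rate quoted in the example above (assuming failures spread uniformly over the 3-year lifespan):

\[ \frac{10{,}000\ \text{servers}}{3\ \text{years} \times 365\ \text{days/year}} \approx 9\ \text{failures per day} \]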

Distributed Data Processing
Storage infrastructure: a distributed file system (Google: GFS; Hadoop: HDFS)
Programming model: MapReduce
Typical usage pattern:
- Huge files (100s of GB to TB)
- Data is rarely updated in place
- Reads and appends are common

Distributed File System
Chunk servers:
- A file is split into contiguous chunks, typically 16-64 MB each
- Each chunk is replicated (usually 2x or 3x), trying to keep replicas in different racks
Master node (a.k.a. Name Node in Hadoop's HDFS):
- Stores metadata about where files are stored
- Might itself be replicated
Client library for file access:
- Talks to the master to find chunk servers
- Connects directly to chunk servers to access data

Distributed File System (cont.)
A reliable distributed file system:
- Data is kept in chunks spread across machines
- Each chunk is replicated on different machines
- Seamless recovery from disk or machine failure
(figure: chunks C0-C5 and D0-D1, each replicated across several of chunk servers 1..N)

MapReduce: Distributed Computing

MapReduce Model
Two basic operations:
Map: from an input (key, value) pair, produce a list of (key, value) pairs of a possibly different type:
  (k1, v1) -> list(k2, v2)
Reduce: from a key and the list of values associated with that key, produce a list of values:
  (k2, list(v2)) -> list(v2)
Note: inspired by the map/reduce functions in Lisp and other functional programming languages.

Data Flow

Example

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

Example: Map Phase
(figure: three input splits, e.g. "When in the course of human events it ...", "It was the best of times and the worst of times ...", each handled by a map task (M=3); the emitted (word, 1) pairs, e.g. (when,1), (course,1), (the,1), (of,1), are written into R=2 partitions, i.e. intermediate files)
Note: the partition function places small words in one partition and large words in another.

Example: Reduce Phase
(figure: one of the two reduce tasks; a run-time sort function groups the intermediate (word, 1) pairs from its partition by key, e.g. (it, (1,1)), (the, (1,1,1,1,1,1)), and the user's reduce function then emits the counts (and,1), (in,1), (it,2), (of,3), (the,6), (was,1))

More Examples
- Count of URL access frequency
- Reverse web-link graph
- Term-vector per host
- Inverted index
- Distributed sort
- Sampling

Example: Language Model
Statistical machine translation needs to count the number of times every 5-word sequence occurs in a large corpus of documents
Very easy with MapReduce (a mapper sketch follows the overview below):
- Map: extract (5-word sequence, count) pairs from each document
- Reduce: combine the counts

Distributed Execution Overview
(figure: the user program forks a master and workers; the master assigns map tasks and reduce tasks; each map worker reads one of the input splits 0-2 and writes intermediate files to its local disk; reduce workers remote-read and sort those files and write output files 0 and 1)
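A minimal sketch of the 5-gram mapper mentioned above, written in the style of the WordCount example later in the lecture (the class name, whitespace tokenization, and use of the old org.apache.hadoop.mapred API are our assumptions, not from the slides); the reducer is the same summing reducer as in word count:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FiveGramMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text gram = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String[] words = value.toString().trim().split("\\s+");
        // Emit every consecutive 5-word window with count 1.
        for (int i = 0; i + 5 <= words.length; i++) {
            gram.set(String.join(" ", Arrays.copyOfRange(words, i, i + 5)));
            output.collect(gram, ONE);
        }
    }
}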

Operation
1. The input files are split into M pieces, which are assigned to map processes
2. The master assigns the M map tasks and the R reduce tasks to idle workers
3. A map process reads its input split, parses the key/value pairs, and applies the Map function; results are buffered in memory
4. Periodically, the buffer is written to local disk, partitioned into R regions (by the partition function); the master is told where the partitions are
5. A reducer receives the locations of its M chunks from the master, reads them, and sorts all the data it read (how?)
6. The reducer passes each key and its list of values to the Reduce function and appends the output to the result; after this, the client can be woken up

Execution Environment
(figure: input key/value pairs from data stores 1..n flow into map tasks; a barrier aggregates the intermediate values by output key; reduce tasks then produce the final values for key 1, key 2, key 3, ...)
- No reduce can begin until the map phase is complete
- Tasks are scheduled based on the location of the data
- If a map worker fails at any time before the reduce finishes, its task must be completely rerun
- The master must communicate the locations of the intermediate files
Note: figure and text from a presentation by Jeff Dean.

Worker Failure
When a worker is idle or inaccessible for a long time, the master assigns its task to another machine
What happens if the two machines both complete the task and report this to the master?

Master Failure
In principle, can be handled by writing checkpoints
In practice, the user should re-execute the entire MapReduce task

How Many Hosts to Allocate?
The number of map processes, M, depends on the size of the input and the size of a block
The number of reducers, R, should be chosen so that M*R is not too big for the memory of the master, and so that the output is not split into too many chunks
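As a rough illustration of the first rule (the 1 TB input is our own example, using the 64 MB default chunk size mentioned later):

\[ M \approx \frac{\text{input size}}{\text{block size}}, \quad \text{e.g.}\ \frac{1\ \text{TB}}{64\ \text{MB}} \approx 16{,}000\ \text{map tasks} \]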

Sampling Using MapReduce

Simple Random Sample
Given a population P of size N, choose a sample of size k
A simple random sample of size k: any two subsets of the population of size k have the same probability of being selected
Different from Bernoulli sampling or Poisson sampling, for which the sample size is not a constant

Simple Random Sample (cont.)
Common trick in databases: assign a random number to each item and sort
Costly if N is very big (what is the cost? sorting is O(N log N))
Is there a more efficient way to sample the data?

Reservoir Sampling
Reservoir sampling was described by [Knuth 69, 81]; enhancements in [Vitter 85]
Maintains a fixed-size uniform sample of k items from a stream of arbitrary size N, in one pass, with no need to know the stream size in advance:
- Include the first k items w.p. 1
- Include item n > k with probability p(n) = k/n: pick j uniformly from {1, 2, ..., n}; if j <= k, swap item n into location j of the reservoir and discard the replaced item
A simple induction shows the uniformity of the sampling method. Let S_n be the sample set after n arrivals. A new item enters with the selection probability Pr[n in S_n] = p_n = k/n. For a previously sampled item m < n: by induction, m is in S_(n-1) w.p. p_(n-1), and item n evicts it w.p. p_n * (1/k) = 1/n, so
  Pr[m in S_n] = p_(n-1) * (1 - p_n/k) = (k/(n-1)) * ((n-1)/n) = k/n = p_n
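A minimal Java sketch of Algorithm R as described above (the class and method names are ours, not from the lecture):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class ReservoirSampler {
    // Returns a uniform random sample of k items from a stream of unknown length.
    public static <T> List<T> sample(Iterator<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        long n = 0;                            // items seen so far
        while (stream.hasNext()) {
            T item = stream.next();
            n++;
            if (reservoir.size() < k) {
                reservoir.add(item);           // first k items enter w.p. 1
            } else {
                // Item n is kept w.p. k/n: pick j uniform in {0, ..., n-1}.
                long j = (long) (rnd.nextDouble() * n);
                if (j < k) {
                    reservoir.set((int) j, item);  // replace a uniform slot
                }
            }
        }
        return reservoir;
    }
}

For example, sample(lines.iterator(), 7, new Random()) keeps 7 uniformly chosen lines in one pass over an arbitrarily long stream.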

Using MapReduce to Sample a Distributed Dataset
The population is distributed over several servers
A constraint s_k defines which objects of the population are relevant
There may be different constraints that refer to different samples
The goal: for each constraint, find a simple random sample of the objects that satisfy the constraint

Naïve Approach
map: (null, t) -> [(s_k, t)] (if t satisfies s_k)
reduce: (s_k, [t_1, ..., t_N]) -> (SRS([t_1, ..., t_N], f_k))
- Apply map based on the constraints
- Apply a simple random sample in the reduce phase

MapReduce Evaluator
(figure: splits 1-3 flow through the map phase into a combiner phase that samples each split using Algorithm R, for a query Q over constraints x1 and x2; a second figure shows the reduce phase merging these per-split samples, again using Algorithm R)

Proper Sampling in the Reduce Phase
Samples may have been drawn from subsets of different sizes, so the merge must account for the a-priori probability of each subset
(figure: six items split into subsets of different sizes, each contributing a sample of size 2)

MR-SQE
map: (null, t) -> [(s_k, t)] (if t satisfies s_k)
combine: (s_k, [t_1, ..., t_N]) -> (SRS([t_1, ..., t_N], f_k), N)
reduce: (s_k, [(Š_1, Ň_1), ..., (Š_K, Ň_K)]) -> [unified-sampler({(Š_1, Ň_1), ..., (Š_K, Ň_K)}, f_k)]
- Apply map based on the stratum constraints
- Apply a combiner that samples the sets during the map process, to decrease the amount of transmitted data
- In the reduce, unify the samples, taking into account the initial sizes of the sets and the required strata sizes
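A hedged sketch of what the unified-sampler step could look like (our own illustration of merging per-split simple random samples, not code from MR-SQE). Each combiner output is an SRS Š_i of its split together with the split size Ň_i; drawing each of the k output items from a split chosen with probability proportional to that split's remaining size yields an SRS of the union:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class UnifiedSampler {
    // samples.get(i) is an SRS of split i; sizes[i] is that split's true size N_i.
    // Assumes each per-split sample holds at least min(k, N_i) items.
    public static <T> List<T> merge(List<List<T>> samples, long[] sizes, int k, Random rnd) {
        long[] remaining = sizes.clone();
        long remainingTotal = 0;
        for (long s : sizes) remainingTotal += s;

        List<T> out = new ArrayList<>(k);
        for (int draw = 0; draw < k && remainingTotal > 0; draw++) {
            // Choose a split with probability proportional to its remaining size.
            long r = (long) (rnd.nextDouble() * remainingTotal);
            int g = 0;
            while (r >= remaining[g]) { r -= remaining[g]; g++; }
            // Take a uniformly random, not-yet-used element of that split's sample.
            List<T> s = samples.get(g);
            out.add(s.remove(rnd.nextInt(s.size())));
            remaining[g]--;
            remainingTotal--;
        }
        return out;
    }
}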

HADOOP (MapReduce)

MapReduce vs. Hadoop
Unless you work at Google, or use Google App Engine, you will use Hadoop MapReduce
Started by Yahoo!, developed by Apache
Google's implementation (at least the one described in the paper) is written in C++; Hadoop is written in Java

Major Components
User components:
- Mapper
- Reducer
- Combiner (optional)
- Partitioner (optional) (shuffle)
- Writable(s) (optional)
System components:
- Master
- Input splitter*
- Output committer*
* You can use your own if you really want!
Image source: http://www.ibm.com/developerworks/java/library/l-hadoop-3/index.html

Notes
Mappers and Reducers are typically single-threaded and deterministic
- Determinism allows restarting failed jobs and speculative execution
Need to handle more data? Just add more Mappers/Reducers!
- No need to write multithreaded code
- Since they are all independent of each other, you can run an (almost) arbitrary number of nodes

Notes (cont.)
Mappers/Reducers run on arbitrary machines
- A machine typically has multiple map and reduce slots, typically one per processor core
Mappers/Reducers run entirely independently of each other
- In Hadoop, they run in separate JVMs

Input Splitter
Responsible for splitting your input into multiple chunks, which are then used as input for your mappers
Splits on logical boundaries; the default is 64 MB per chunk
Typically, the user can just use one of the built-in splitters, unless required to read from a specially formatted file

Code Example
Next, we present a code example: the word-count application
This version is not optimized; see the following tutorial for how to write it in an optimized way: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  // Define the map function; the Reporter can be used to report progress
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Define the reduce function
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);      // set the mapper
    conf.setCombinerClass(Reduce.class); // set a combiner
    conf.setReducerClass(Reduce.class);  // set the reducer

Related JobConf settings:
- JobConf.setOutputKeyComparatorClass(Class) controls how the intermediate keys are sorted before they reach the reducers
- JobConf.setNumReduceTasks(int) sets the number of reduce tasks
- JobConf.setNumMapTasks(int) is only a hint for the number of map tasks

Combiner
Essentially an intermediate reducer; optional
Reduces the output from each mapper, reducing bandwidth and sorting work
Cannot change the type of its input: input types must be the same as output types

    // The InputFormat generates the InputSplits
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
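Assuming the class is packaged into a jar (the jar name and HDFS paths below are placeholders), the job can be launched from the command line in the style of the tutorial linked above:

hadoop jar wordcount.jar org.myorg.WordCount /user/joe/input /user/joe/output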

Output Committer
Responsible for taking the reduce output and committing it to a file
Typically, this committer needs a corresponding input splitter (so that another job can read the output)
Again, the built-in committers are usually good enough, unless you need to output a special kind of file

Partitioner (Shuffler)
Decides which (key, value) pairs are sent to which reducer
Default is simply: key.hashCode() % numOfReducers
The user can override it to:
- Provide a (more) uniform distribution of load between reducers
- Ensure that certain values are sent to the same reducer
  E.g., to compute the relative frequency of a pair of words <W1, W2>, you would need to make sure all pairs starting with W1 are sent to the same reducer (see the sketch below)
- Bin the results
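A minimal sketch of such an override (our own example, using the same old org.apache.hadoop.mapred API as the WordCount code): a partitioner that routes a composite "W1 W2" key by its first word only, so all pairs sharing W1 meet at the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstWordPartitioner implements Partitioner<Text, IntWritable> {
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Partition on the first word of a "W1 W2" key.
        String firstWord = key.toString().split(" ")[0];
        // Mask the sign bit so a negative hashCode cannot give a negative index.
        return (firstWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public void configure(JobConf job) {
        // No configuration needed for this sketch.
    }
}

It would be registered on the job with conf.setPartitionerClass(FirstWordPartitioner.class).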

Master
Responsible for scheduling and managing jobs
Scheduled computation should be close to the data if possible
- Bandwidth is expensive (and slow)!
- This relies on a distributed file system (GFS / HDFS)
If a task fails to report progress (such as reading input or writing output), crashes, or its machine goes down, it is assumed to be stuck; it is killed and the step is re-launched (with the same input)
The Master is handled by the framework; no user code is necessary

Master (cont.)
HDFS can replicate data to be local if necessary for scheduling
Because nodes are (or at least should be) deterministic, the Master can restart failed nodes (nodes should have no side effects!)
If a node runs the last step and is completing slowly, the Master can launch a second copy of that node
- Slowness can be due to hardware issues, network issues, etc.
- The first copy to complete wins; any other runs are then killed

Backup Tasks
A slow-running task (straggler) prolongs overall execution
Stragglers are often caused by circumstances local to the worker on which the straggler task runs:
- Overload on the worker machine due to the scheduler
- Frequent recoverable disk errors
Solution:
- Abort stragglers when the map/reduce computation is near its end (progress is monitored by the Master)
- For each aborted straggler, schedule a backup (replacement) task on another worker
Can significantly improve overall completion time

Backup Tasks (cont.)
(figure: execution timelines (1) without backup tasks and (2) with backup tasks, the normal case)

Writables
Types that can be serialized to / deserialized from a stream
Required for input/output classes, as the framework serializes the data before writing it to disk
The user can implement this interface and use their own types for input/output/intermediate values
There are default classes for basic values, like Strings, Integers, Longs, etc.
Writables can also handle data structures, such as arrays, maps, etc.
Your application needs at least six writables:
- 2 for your input
- 2 for your intermediate values (Map <-> Reduce)
- 2 for your output
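A minimal sketch of a user-defined Writable (our own example, not from the slides): a (sum, count) pair usable as an intermediate value, e.g. when computing averages:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class SumCountWritable implements Writable {
    private long sum;
    private long count;

    public void set(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(sum);    // field order here...
        out.writeLong(count);
    }

    public void readFields(DataInput in) throws IOException {
        sum = in.readLong();   // ...must match the order read back here
        count = in.readLong();
    }

    public double average() {
        return count == 0 ? 0.0 : (double) sum / count;
    }
}

Note that types used as keys must additionally implement WritableComparable, so the framework can sort them during the shuffle.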