MapReduce Framework for Distributed Computation


1 MapReduce Framework for Distributed Computation Summer School on Massive Data Management Daniel McDermott Eastern Washington University Cheney, WA, U.S.A July 4, 2013 Daniel McDermott (EWU) MapReduce for Distributed Computation July 4, / 89

2 About Me
  Born in Los Angeles, CA, USA
  Currently live in Spokane, WA, USA
  Graduate student at Eastern Washington University
  Systems administrator for the CS department
  You can contact me for help with any of the materials at:
  I am always on the Freenode IRC network as username onefish
  irc://irc.freenode.net/#discoproject

3 Outline
  1 What is MapReduce?
  2 Distributed Systems Background
  3 Functional Programming Roots
  4 MapReduce Execution Framework
  5 Inverted Index
  6 Relational Algebra
  7 Lab / Demonstration
  8 MapReduce Algorithm Design
  9 Local Aggregation
  10 Pairs and Stripes

4 Outline: 1 What is MapReduce?

5 Defined
  "MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers." (Data-Intensive Text Processing with MapReduce, 2010)
  First described in MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat, Google Inc., 2004

6 Two Aspects
  A Programming Model / Paradigm
    from functional programming
    algorithm design restriction!
  An Execution Framework / Runtime
    Automatic parallel/distributed execution on large cluster computers.
    Scales horizontally, not vertically.
    Assumes failures are common.
    Moves computation to the data.
    Hides system-level details from the programmer.
    Scales seamlessly.
  The restrictions of the Programming Model enable the features of the Execution Framework

7 Overall Theme of MapReduce
  Given a large dataset, apply some transformation to each element or record of the dataset (map), producing a temporary intermediate dataset. Then iterate over the intermediate dataset performing an aggregation, summarization, or similar reduction (reduce).
  A surprising number of problems in data mining and computer science can be phrased in this way.

8 The Big Data Revolution
  "Problems that involve interacting with humans, such as natural language understanding, have not proven to be solvable by concise, neat formulas like F = ma. Instead the best approach appears to be harnessing the power of data..." (The Unreasonable Effectiveness of Data, 2009)
  Sloan Digital Sky Survey produces 500TB of astronomical images each month
  CERN estimates 15PB of data generated each year (home.web.cern.ch/about/computing)
  In the future, personal DNA sequencing will become routine
  The amount of data in existence doubles roughly every two years (Cisco, The Economist)

9 Scope of problems
  A word of warning: "To a man with a hammer, every problem looks like a nail."
  MapReduce is a data-parallel paradigm focused on scalability and data bandwidth
  Most solutions involve sequentially reading large amounts of static data from disks and moving it through a wide computation pipeline
  MapReduce is a batch processing system
  Moving data will dominate the cost of computation
  MapReduce is not:
    Low latency
    A data retrieval method
    A supercomputing method
  Data is large, computation is relatively small

10 Some MapReduce Solutions
  Text processing: wordcount, co-occurrence matrix, relative frequencies, distributed grep
  Inverted index construction
  Graph algorithms (SSSP, PageRank)
  Relational algebra
  Clustering algorithms
  Matrix multiplications
  Summarization / histogram construction
  Distributed sort

11 MapReduce Implementations
  Strictly MapReduce:
    Hadoop (Apache Foundation, Java)
    Disco (Nokia Research Center, Python)
    Google's (in-house, C++)
  Systems which use MapReduce:
    MongoDB (distributed document store)
    Cassandra (distributed database system)
    CouchDB (distributed document store)
    Lucene (free search engine software)

12 Who Uses MapReduce?
  Deployed widely at Google, Amazon, Facebook, Yahoo, eBay, IBM, Nokia, Qualcomm, LinkedIn, CERN, and others.

13 Outline: 2 Distributed Systems Background

14 Parallel Programming Review
  Parallel: multiple threads / processors acting on the same problem
  Single greatest challenge to parallel algorithm design: shared state
    int x = 4;   # global / shared variables
    int y = 0;
    func foo() { x++; x = y; }
    func bar() { y++; x += 2; }
    new thread(foo).run()
    new thread(bar).run()
  What are x and y???

15 Parallel = Unpredictable
  Applies to more than just integers
  Producer / consumer problems
  Reporting status to a master process
  Notifying other threads of state changes
  All require some synchronization

16 Synchronization Primitives
  Special shared variables which support atomic operations
  Semaphores: imagine a set of train tracks that must cross a bridge; the semaphore is the flag on each side of the bridge
  Mutex: unlock() and lock()
  Condition variables: wait() and notify(); wait blocks the thread until it receives a notify

17 Solution?
  Protect critical section...
    semaphore sem;
    func foo() { sem.lock()  x++; x = y;  sem.unlock() }
    func bar() { sem.lock()  y++; x += 2; sem.unlock() }
  Wait a second...

18 Fixed Solution
  Force foo to run before bar.
    semaphore sem;
    bool done = false;
    conditionVar cv;
    func foo() {
      sem.lock()
      x++; x = y;
      done = true;
      sem.unlock()
      cv.notify()
    }
    func bar() {
      sem.lock()
      if (!done) cv.wait(sem)
      y++; x += 2;
      sem.unlock()
    }
  We need to synchronize every time we access or update shared state
  What if the threads are distributed?
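The ordering idea on this slide can be tried out with Python's standard threading module. This is a minimal sketch, not the slide's exact code: it assumes a single `threading.Condition` in place of the separate semaphore and condition variable (a Condition bundles a lock with wait/notify), calls notify while the lock is held as Python requires, and uses a `while` loop rather than `if` to guard against spurious wakeups.

```python
import threading

x, y = 4, 0                      # shared state, same starting values as the slide
done = False                     # set once foo has finished
cv = threading.Condition()       # a lock plus wait()/notify() in one object

def foo():
    global x, y, done
    with cv:                     # enter the critical section
        x += 1
        x = y
        done = True
        cv.notify()              # wake bar if it is already waiting

def bar():
    global x, y
    with cv:
        while not done:          # re-check the flag after every wakeup
            cv.wait()            # releases the lock while blocked
        y += 1
        x += 2

# Start bar first on purpose: it still has to wait for foo.
threads = [threading.Thread(target=bar), threading.Thread(target=foo)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(x, y)
```

Because foo is forced to finish first, the result no longer depends on scheduling: foo leaves x = 0, then bar sets y = 1 and x = 2.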

19 Amdahl's Law
  $S(n) = \frac{1}{B + \frac{1 - B}{n}}$, where B is the strictly sequential fraction of the work and n is the number of processors
  The theoretical speedup of any parallel algorithm is bounded by its strictly sequential portion.
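A small numeric illustration of the bound may help. The sketch below assumes the usual form of the law, speedup = 1 / (B + (1 - B) / n), with B the sequential fraction and n the number of workers; the function name and the 10% example fraction are chosen here for illustration only.

```python
def amdahl_speedup(sequential_fraction, n_workers):
    """Theoretical speedup with the given sequential fraction and worker count."""
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n_workers)

# With 10% sequential work, adding workers approaches but never reaches 1 / 0.1 = 10.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.1, n), 2))
```

Going from 100 to 1000 workers barely changes the result, which is why the sequential (here, communication-bound) portion is the part worth shrinking.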

20 Distributed Computing Bottlenecks
  In the case of distributed computing, this sequential portion will almost always be dominated by the cost of communication across an interconnect
  Even with libraries, distributed systems programming is burdensome
  Even the fastest interconnects incur enormous delay compared to the speed of modern CPUs
  The focus of distributed algorithm design is on reducing this communication

21 Parallel vs Distributed
  Parallel: special-purpose interconnects focused on low latency, shared memory model, computationally intense problems such as physics simulations and dynamic systems modeling. Frequent, random-access synchronization of a few bytes at a time
  Distributed: inexpensive hardware, 1Gbit Ethernet, data-intensive problems with minimal shared state between compute elements, large streaming reads and writes
  Scaling Models
    vertical ("scale up"): buy better, faster, special-purpose hardware (HPC)
    horizontal ("scale out"): buy more hardware of the same type (MapReduce)

22 Traditional HPC Datacenter
  Must scale vertically
  Parallel system focused on low latency

23 The MapReduce Datacenter
  Scales horizontally with cheap, consumer-grade hardware
  Distributed system focused on wide throughput

24 Distributed File Systems
  MapReduce file systems are oriented towards large, sequential reads and writes.
  Streaming data to and from the disk.
  Fault tolerant via block replication
  Logical
  Generally not POSIX compatible
  Examples:
    Google's BigTable (Bigtable: A Distributed Storage System for Structured Data, Dean, Burrows, et al. 2006)
    HDFS - Hadoop Distributed File System
    DDFS - Disco Distributed File System

25 Distributed File System Architecture
  Large block size (typically 64MB)
  With replication factor K, K − 1 nodes can fail
  Name node serves metadata about all files on the file system

26 Outline: 3 Functional Programming Roots

27 Functional Programming Review
  Applications of functions do not modify data, they always create new data
  Original data structures always exist unmodified
  Data flow is implicit in program design
  ⇒ Order of application across threads or functions does not matter
  There are no side effects and thus no shared state between function applications or threads

28 Functional Programming Review
  Pure functions
    func foo(mylist : int list) = sum(mylist) + prod(mylist) + len(mylist)
  Order of application of sum(), prod(), etc. does not matter; they only produce new data
  Each purely expresses a mathematical computation
  Thus each function can be put in its own thread

29 No Side Effects
  Functional updates do not modify structures
    func append(x, lst) = let lst' = reverse lst in reverse(x :: lst')
  This reverses a list (which creates a new list), prepends an element, then reverses it again
  But it never modifies lst

30 Higher Order Functions
  Functions can take other functions as their arguments
    func dodouble(fun, x) = fun(fun(x))
    func myfun(x) = x * 3
    dodouble(myfun, 2)   returns (2 * 3) * 3 = 18
  What about?:
    func myfun(x) = x + x
    dodouble(myfun, 5)
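The same two calls can be checked directly in Python, where functions are ordinary values. This is a sketch only; the names `do_double`, `triple`, and `add_self` are chosen here and stand in for the slide's dodouble and the two versions of myfun.

```python
def do_double(fun, x):
    """Apply fun twice: fun(fun(x))."""
    return fun(fun(x))

def triple(x):
    return x * 3

def add_self(x):
    return x + x

print(do_double(triple, 2))    # (2 * 3) * 3 = 18
print(do_double(add_self, 5))  # (5 + 5) + (5 + 5) = 20
```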

31 Common Functional Programming Patterns
  Functional programming often operates over lists...
  Just as in procedural programming, functional programming has common patterns which are built into the languages
  Map: element-wise transform of a list
  Fold: accumulate a list into a single value

32 Map
  Map Function: map(func, list)
  Input:
    list: A list to map over.
    func: A function to apply to each element in the list, which returns a new element.
  Output: A new list, representing func applied to every element of list
    newlist ← new empty list
    foreach element in list do
      newlist.append(func(element))
    return newlist

33 Fold or Accumulate (Reduce)
  Fold Function: fold(acc, func, list)
  Input:
    acc: An accumulator value.
    func: A function which takes the accumulator value and a list element and returns a new accumulator value.
    list: A list to read.
  Output: The final value of acc.
    foreach element in list do
      acc = func(acc, element)
    return acc
  Can acc be a list?
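Both patterns translate almost line for line into Python. The sketch below follows the pseudocode's argument order, fold(acc, func, list); the name `my_map` is used only to avoid shadowing Python's built-in `map`. The last line suggests an answer to the closing question: the accumulator can be any value, including a list.

```python
def my_map(func, items):
    """Return a new list with func applied to every element."""
    new_list = []
    for element in items:
        new_list.append(func(element))
    return new_list

def fold(acc, func, items):
    """Thread an accumulator through the list, left to right."""
    for element in items:
        acc = func(acc, element)
    return acc

squares = my_map(lambda v: v * v, [1, 2, 3])
total = fold(0, lambda acc, v: acc + v, squares)

# The accumulator may itself be a list: this fold builds a reversed copy.
reversed_list = fold([], lambda acc, v: [v] + acc, [1, 2, 3])
```

Neither function modifies its input list; each builds and returns new data.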

34 Fold Exercise
  Can we implement:
    func foo(mylist : int list) = sum(mylist) + prod(mylist) + len(mylist)
  using only folds?

35 Exercise Solved
    func foo(mylist : int list) = sum(mylist) + prod(mylist) + len(mylist)
    func sum(lst)  = fold(0, lambda(a, x) => a + x, lst)
    func prod(lst) = fold(1, lambda(a, x) => a * x, lst)
    func len(lst)  = fold(0, lambda(a, x) => a + 1, lst)

36 Map and Fold Together
    fold(acc, f, map(g, list))
  Length of longest word
    Given a list of words, output the length of the longest string in the set
    Given input "one fish two fish red fish blue fish", should output: 4
  Wordcount
    Given a string (or document), output the number of times each word appears in the string.
    Given input "one fish two fish red fish blue fish", should output: {one: 1, two: 1, red: 1, blue: 1, fish: 4}

37 Exercises Solved
    func lenLongest(lst) = fold(0, max, map(len, lst))
    func wordcount(lst) = fold(dict{}, combine, map(count, lst))
    count(str) = (str, 1)
    combine(dict{}, elem) = dict{elem[0]} += elem[1]
  There is a simpler way to do wordcount...
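For readers who want to run these, here is one possible Python rendering using the standard library's `functools.reduce` as the fold (note that its argument order is function, iterable, initial value). The `combine` helper copies the dictionary instead of updating it in place, an assumption made here to stay close to the no-side-effects style of the earlier slides.

```python
from functools import reduce

def len_longest(words):
    # fold(0, max, map(len, words))
    return reduce(max, map(len, words), 0)

def count(word):
    return (word, 1)

def combine(counts, pair):
    word, n = pair
    merged = dict(counts)                    # copy, so the input is not modified
    merged[word] = merged.get(word, 0) + n
    return merged

def wordcount(words):
    # fold({}, combine, map(count, words))
    return reduce(combine, map(count, words), {})

words = "one fish two fish red fish blue fish".split()
print(len_longest(words))
print(wordcount(words))
```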

38 Parallelizing Map
  In a purely functional setting, there are no side effects from elements of a list being computed by map
  The order in which func is applied to the list elements does not matter, thus we can parallelize
  This is the core property of MapReduce as a programming model

39 Outline: 4 MapReduce Execution Framework

40 Redefining Map
  Instead of a list, the input to map will be records from a data source, supplied as (key, value) pairs (e.g. (filename, contents))
  Mapping over the entire data source is accomplished by mapping over each separate record in parallel (a two-level map)
  Each map process will output one or more intermediate (key, value) pairs
  User code is responsible for iteration and application
  User Map Function
    method Map(docid a, doc d)
      for all term t ∈ doc d do
        emit(term t, count 1)

41 Redefining Reduce
  After the map phase, all intermediate values for a given key are aggregated into a list
  Reduce summarizes or combines the intermediate values per key into one or more final values
  In practice, there is usually one final value per key
  Again, the user code is responsible for iteration and application
  User Reduce Function
    method Reduce(term t, counts [c1, c2, ...])
      sum ← 0
      for all count c ∈ counts [c1, c2, ...] do
        sum ← sum + c
      emit(term t, count sum)

42 Paradigm and Runtime Differences
  How MapReduce differs from FP:
    The runtime map and reduce are not the functional programming map and fold!
    The user code is responsible for the iteration over datasets and application of computation
    The programming model serves as a guideline for implementing the functions, a blueprint
  Aside: sometimes, we can cheat / break the programming model

43 Programming Model to Runtime
  The programmer defines a map and a reduce function with the types:
    map(K1, V1) → (K2, [V2])
    reduce(K2, [V2]) → (K3, V3)
  The framework:
    1 Splits the input into chunks or blocks
    2 Assigns each block to a map task, assigns all tasks to worker machines (mappers)
    3 Each worker applies the map function to each element of its assigned block, outputting keys and values
    4 Aggregates these values by key (shuffle and sort)
    5 Assigns (partitions) each key to a reduce task, assigns all tasks to worker machines (reducers)
    6 Each worker applies the reduce function over the values for its keys

44 Programming Model to Runtime

45 Wordcount example
  WordCount class Job
    method Map(docid a, doc d)
      for all term t ∈ doc d do
        emit(term t, count 1)
    method Reduce(term t, counts [c1, c2, ...])
      sum ← 0
      for all count c ∈ counts [c1, c2, ...] do
        sum ← sum + c
      emit(term t, count sum)
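To see how a runtime might drive the Map and Reduce methods above, here is a minimal single-process sketch. It is not how Hadoop or Disco are implemented: `run_mapreduce` is a hypothetical helper written for this example that runs the map phase, groups intermediate values by key (standing in for shuffle and sort), and then runs the reduce phase, all in memory and with no parallelism.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (key, value) input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle and sort: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: apply reduce_fn to each key and its list of values.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def wc_map(docid, doc):
    for term in doc.split():
        yield (term, 1)

def wc_reduce(term, counts):
    yield (term, sum(counts))

docs = [("d1", "one fish two fish"), ("d2", "red fish blue fish")]
result = run_mapreduce(docs, wc_map, wc_reduce)
print(result)
```

The user-supplied functions only ever see one record or one key at a time, which is what lets a real runtime spread them across machines.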

46 Wordcount Data Flow

47 Refinements
  Combiner Function
    Like a mini-reducer running on the mapper side.
    Goal is to aggregate values to minimize data transferred across the network.
    Runs on the mappers before data is shuffled across the network
    Viewed by the framework as optional
  Partition Function
    By default: with r partitions, send the (key, val) pair to reduce task number hash(key) mod r.
    Redefinable: may want certain keys to always appear together, for instance all URLs for a particular site.
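The default and a custom partition function might look like the following sketch. Two assumptions are made here: `zlib.crc32` stands in for hash(key), because Python's built-in `hash` of a string changes between processes, and `site_partition` is a hypothetical custom partitioner that hashes only the host part of a URL so that all pages of one site reach the same reduce task.

```python
import zlib

def default_partition(key, r):
    """Send a key to reduce task number hash(key) mod r."""
    return zlib.crc32(key.encode("utf-8")) % r

def site_partition(url, r):
    """Hypothetical custom partitioner: partition by host only."""
    host = url.split("/")[2]      # "http://example.com/a" -> "example.com"
    return default_partition(host, r)

r = 4
print(default_partition("http://example.com/a", r),
      default_partition("http://example.com/b/c", r))
print(site_partition("http://example.com/a", r),
      site_partition("http://example.com/b/c", r))
```

With the default partitioner the two URLs may land on different reduce tasks; with the site-based one they always land on the same task.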

48 Improved Wordcount
  WordCount class Job
    method Map(docid a, doc d)
      for all term t ∈ doc d do
        emit(term t, count 1)
    method Combine(term t, counts [c1, c2, ...])
      partsum ← 0
      for all count c ∈ counts [c1, c2, ...] do
        partsum ← partsum + c
      emit(term t, count partsum)
    method Reduce(term t, partsums [c1, c2, ...])
      sum ← 0
      for all partsum c ∈ partsums [c1, c2, ...] do
        sum ← sum + c
      emit(term t, count sum)

49 More Complete Wordcount Example

50 Systems Terminology
  Job: An API call to a MapReduce cluster; includes a Map and Reduce function and associated parameters. Each job will be divided into tasks by the runtime.
  Task: A unit of work, run as a subprocess on the cluster, divided up by:
    Map Task: each file or block on the distributed FS
    Reduce Task: each partition of the intermediate key space
  Worker: A slot of computation, often a processor in the cluster, where tasks can be assigned.
    Mapper: A worker assigned a Map Task
    Reducer: A worker assigned a Reduce Task

51 MapReduce Systems Perspective

52 Locality Aware Scheduling
  Data is replicated across the cluster on the same machines that perform computation.
  Increased replication factor gives more flexibility to the scheduler.
  Key feature: move computation to data, do not fetch data to computation.
  Programs are small, data is large.

53 Fault Tolerance
  Inexpensive, commodity hardware
  Re-execution: machine or rack failure → run the task again. (Note we must re-run all completed tasks on that node as well. Why?)
  Bad record skipping: user code crashes on certain input → log the error and skip
  Speculative execution: stragglers → use idle machines to replicate in-progress tasks

54 Outline: 5 Inverted Index

55 Inverted Index
  Given a set of documents which contain terms, output a set of terms which contain document IDs.
  The document IDs should be sorted within their respective terms, for faster indexing
  Each document ID can have some payload associated with it (e.g. the term frequency per document)

56 Inverted Index Algorithm
  InvertedIndex class Job
    method Map(docid a, doc d)
      H ← new AssociativeArray
      for all term t ∈ doc d do
        H{t} ← H{t} + 1
      for all term t ∈ H do
        emit(term t, posting (a, H{t}))
    method Reduce(term t, postings [(a1, f1), (a2, f2), ...])
      P ← new list
      for all posting (a, f) ∈ postings [(a1, f1), (a2, f2), ...] do
        P.add((a, f))
      P.sort()
      emit(term t, postings P)
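A runnable sketch of the inverted index job follows. It assumes documents are whitespace-separated strings with integer IDs, and `build_index` is a hypothetical in-memory driver standing in for the runtime's shuffle step; only `ii_map` and `ii_reduce` correspond to the Map and Reduce methods on this slide.

```python
from collections import Counter, defaultdict

def ii_map(docid, doc):
    term_freqs = Counter(doc.split())          # H: term -> frequency in this doc
    for term, freq in term_freqs.items():
        yield (term, (docid, freq))            # posting = (docid, term frequency)

def ii_reduce(term, postings):
    yield (term, sorted(postings))             # sort postings by document ID

def build_index(docs):
    grouped = defaultdict(list)                # stands in for shuffle and sort
    for docid, doc in docs:
        for term, posting in ii_map(docid, doc):
            grouped[term].append(posting)
    index = {}
    for term, postings in grouped.items():
        for key, value in ii_reduce(term, postings):
            index[key] = value
    return index

# Documents arrive out of ID order on purpose, so the sort in Reduce matters.
docs = [(2, "red fish blue fish"), (1, "one fish two fish")]
index = build_index(docs)
print(index["fish"])
```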

57 Inverted Index Execution

58 Outline: 6 Relational Algebra

59 Relational Joins Review
  All common relational algebra operations can be computed by MapReduce. See Ullman, 2013 for algorithms for selection, projection, set operations, etc.
  Defn: Natural Join
    Given two relations, R and S, which share a common descriptor in their schemas, output the corresponding tuples from both relations whose values agree on the common descriptor.
  Example: given relations R(a, b) and S(b, c), joining R and S on b would output
    (5,5,6) (2,9,7) (9,1,8)

60 Natural Join Algorithms
  Relational algebra in MapReduce is critical to many higher-level systems which use MapReduce as their foundation
  Three types of MapReduce join algorithms:
    Reduce-side join
    Map-side join
    Memory-backed join

61 Reduce-side Join, one-to-one
  NaturalJoin class Job
    method Map(relation id, tuple (x, y))
      if id == R then emit(y, tuple (x, y))
      if id == S then emit(x, tuple (x, y))
    method Reduce(joinkey b, tuples [t1, ...])
      if size(tuples) == 2 then
        emit(b, merge(tuples [t1, t2]))
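The one-to-one reduce-side join can be sketched in Python as below. Two things here are assumptions rather than part of the slide: each tuple is tagged with its relation name so that the merge step knows which side it came from, and the input rows are hypothetical data chosen so that the join yields the three output tuples quoted in the earlier example (the original input tables are not available).

```python
from collections import defaultdict

def join_map(relation_id, tup):
    x, y = tup
    if relation_id == "R":            # R(a, b): the join key b is the second field
        yield (y, ("R", tup))
    elif relation_id == "S":          # S(b, c): the join key b is the first field
        yield (x, ("S", tup))

def join_reduce(b, tagged):
    if len(tagged) == 2:              # one-to-one: exactly one tuple from each side
        by_relation = dict(tagged)
        if "R" in by_relation and "S" in by_relation:
            a, _ = by_relation["R"]
            _, c = by_relation["S"]
            yield (a, b, c)           # merged tuple

def natural_join(r_rows, s_rows):
    groups = defaultdict(list)        # stands in for shuffle and sort
    for relation_id, rows in (("R", r_rows), ("S", s_rows)):
        for tup in rows:
            for key, value in join_map(relation_id, tup):
                groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(join_reduce(key, groups[key]))
    return out

R = [(5, 5), (2, 9), (9, 1), (4, 3)]  # hypothetical rows (a, b)
S = [(5, 6), (9, 7), (1, 8), (2, 4)]  # hypothetical rows (b, c)
joined = natural_join(R, S)
print(joined)
```

Rows whose join key appears in only one relation, such as (4, 3) in R, reach a reducer alone and produce no output.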

62 Outline: 7 Lab / Demonstration

63 Cluster Setup
  Handout
  Six machines, each:
    4x 2.8GHz Intel Xeon
    5GB memory
    4x 73GB 10,000rpm SCSI U320 in RAID 5 (175GB after OS + formatting)
  30GB memory total
  24 processors total
  1TB distributed storage
  1Gbit switched network

64 Cluster Setup

65 Disco Framework
  MapReduce runtime in Erlang + Python

66 Outline: 8 MapReduce Algorithm Design

67 Main Theme of MapReduce Algorithms
  Primary focuses of efficient MapReduce algorithm design:
    phrasing the solution within the restrictions of the programming model
    reducing the amount of intermediate data that may need to be fetched across the network
    dealing with scalability concerns and set size vs I/O tradeoffs

68 What the programmer cannot control
  Where a map task or reduce task runs (i.e. on which node)
  When a map task or reduce task finishes
  Which input key-value pairs are processed by a specific map task
  Which intermediate key-value pairs are processed by a specific reducer

69 What the programmer can control
  Construct complex data structures as keys and values to store and communicate partial results
  Run initialization code at the beginning, and cleanup code at the end, of each Map and Reduce task
  Preserve state in map and reduce functions across multiple input or intermediate keys
  Control the sort order of intermediate keys, and thus the order in which the reducer will encounter the keys
  Control the partition of the key space, and thus the set of keys that will appear at a particular reduce task

70 The Communication Cost Model
  Dean, 2011
  The communication cost of an algorithm is the sum of the communication costs of all the tasks implementing the algorithm
  The cost of communication will vastly dominate the cost of CPU operations.
  The algorithm being executed by each task is typically very simple, often linear in its input
  Due to horizontal scalability, computation is cheap compared to communication
  When measuring the communication cost, we only count the outputs of each task

71 Outline: 9 Local Aggregation

72 WordCount Revisited
  WordCount class Job
    method Map(docid a, doc d)
      for all term t ∈ doc d do
        emit(term t, count 1)
    method Reduce(term t, counts [c1, c2, ...])
      sum ← 0
      for all count c ∈ counts [c1, c2, ...] do
        sum ← sum + c
      emit(term t, count sum)
  The communication cost of this implementation is O(n), with n terms in the entire collection

73 In-mapper combining
  Local Aggregation
    Leveraging the associativity and commutativity of the Reduce operation to combine values before Reducers fetch them across the network.
  In MapReduce frameworks, combiners are viewed by the runtime as optional optimizations.
  The correctness of the algorithm cannot depend on the combiner.
  Force in-mapper combining

74 WordCount Improved
  Tally counts for the entire document
  WordCount
    method Map(docid a, doc d)
      H ← new AssociativeArray
      for all term t ∈ doc d do
        H{t} ← H{t} + 1
      for all term t ∈ H do
        emit(term t, count H{t})
    (Reduce is the same)
  The communication cost is O(dσ), with d documents and a language of size σ

75 Wordcount Best
  Tally counts across documents by preserving state across calls to map + in-mapper combining
  WordCount
    method Initialize
      H ← new AssociativeArray
    method Map(docid a, doc d)
      for all term t ∈ doc d do
        H{t} ← H{t} + 1
    method Finalize
      for all term t ∈ H do
        emit(term t, count H{t})
  The communication cost is O(kσ), with k workers and a language of size σ
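The Initialize / Map / Finalize pattern can be sketched as a small Python class. The method names mirror the pseudocode; the `run_map_task` driver is hypothetical and simply plays the role of a runtime that calls Initialize once, Map once per document, and Finalize once at the end of the map task.

```python
from collections import Counter

class WordCountMapper:
    def initialize(self):
        self.counts = Counter()               # H: state preserved across map calls

    def map(self, docid, doc):
        for term in doc.split():
            self.counts[term] += 1            # nothing is emitted here

    def finalize(self):
        for term, count in sorted(self.counts.items()):
            yield (term, count)               # one pair per distinct term

def run_map_task(docs):
    mapper = WordCountMapper()
    mapper.initialize()
    for docid, doc in docs:
        mapper.map(docid, doc)
    return list(mapper.finalize())

emitted = run_map_task([("d1", "one fish two fish"), ("d2", "red fish blue fish")])
print(emitted)
```

For these two documents the task emits 5 pairs instead of the 8 that the basic version would send across the network, at the price of holding the whole table in the mapper's memory.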

76 Distributed Mean
  Problem: Given a large dataset where input keys are strings and values are integers, we wish to compute the mean of all integers associated with the same key, rounded to the nearest integer
  For example, a user log representing time spent viewing particular web pages on your site:
    PageA 23
    PageB 38
    PageA 9
    PageB 40
    PageB 89
    PageA 26
    PageB 33
  This user spends 19 seconds on PageA and 50 seconds on PageB, on average

77 Naive Implementation
  DistributedMeans
    method Map(string t, int r)
      emit(string t, int r)
    method Reduce(string t, ints [r1, r2, ...])
      sum ← 0
      cnt ← 0
      for all int r ∈ ints [r1, r2, ...] do
        sum ← sum + r
        cnt ← cnt + 1
      r_avg ← sum / cnt
      emit(string t, int r_avg)
  Because Map is an identity, we will need to shuffle the entire dataset across the network; in the worst case, the communication cost will be O(n)

78 Correctness of Local Aggregation
  Simple local aggregation is only correct when the function that Reduce applies is both associative and commutative.
  Notice that: Mean(1,2,3,4,5) ≠ Mean(Mean(1,2), Mean(3,4,5))
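The inequality is easy to confirm numerically. A short check, using a helper `mean` defined here for the purpose:

```python
def mean(values):
    return sum(values) / len(values)

direct = mean([1, 2, 3, 4, 5])                    # 15 / 5 = 3.0
nested = mean([mean([1, 2]), mean([3, 4, 5])])    # mean([1.5, 4.0]) = 2.75
print(direct, nested)
```

The nested version weights the two partial means equally even though they summarize different numbers of values, which is why a combiner cannot simply emit partial means.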

79 First Attempt
  DistributedMeans
    method Map(string t, int r)
      emit(string t, int r)
    method Combine(string t, ints [r1, r2, ...])
      sum ← 0
      cnt ← 0
      for all int r ∈ ints [r1, r2, ...] do
        sum ← sum + r
        cnt ← cnt + 1
      emit(string t, pair (sum, cnt))
    method Reduce(string t, pairs [(s1, c1), (s2, c2), ...])
      sum ← 0
      cnt ← 0
      for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
        sum ← sum + s
        cnt ← cnt + c
      r_avg ← sum / cnt
      emit(string t, int r_avg)
  Map emits ints but Reduce expects pairs, so the result is only correct if the combiner runs on every value, and the runtime does not guarantee that

80 Correct Solution
  DistributedMeans
    method Map(string t, int r)
      emit(string t, pair (r, 1))
    method Combine(string t, pairs [(s1, c1), (s2, c2), ...])
      sum ← 0
      cnt ← 0
      for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
        sum ← sum + s
        cnt ← cnt + c
      emit(string t, pair (sum, cnt))
    method Reduce(string t, pairs [(s1, c1), (s2, c2), ...])
      sum ← 0
      cnt ← 0
      for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
        sum ← sum + s
        cnt ← cnt + c
      r_avg ← sum / cnt
      emit(string t, int r_avg)
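One way to check that a mean computation does not depend on the combiner is to run it both with and without one. The sketch below assumes the pair-based design described by Lin and Dyer, in which Map emits (value, 1) pairs so that Combine and Reduce consume and produce the same (sum, count) type. The `run` driver and the split of the log into two map tasks are hypothetical.

```python
from collections import defaultdict

def mean_map(key, value):
    yield (key, (value, 1))                       # (sum, count) for a single value

def mean_combine(key, pairs):
    yield (key, (sum(s for s, _ in pairs), sum(c for _, c in pairs)))

def mean_reduce(key, pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, round(total / count))

def group(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run(map_task_inputs, use_combiner):
    shuffled = []
    for records in map_task_inputs:               # one list of records per map task
        mapped = [kv for k, v in records for kv in mean_map(k, v)]
        if use_combiner:                          # the runtime may or may not do this
            mapped = [kv for k, vs in group(mapped).items()
                      for kv in mean_combine(k, vs)]
        shuffled.extend(mapped)
    return dict(kv for k, vs in group(shuffled).items()
                for kv in mean_reduce(k, vs))

log = [("PageA", 23), ("PageB", 38), ("PageA", 9), ("PageB", 40),
       ("PageB", 89), ("PageA", 26), ("PageB", 33)]
tasks = [log[:4], log[4:]]
print(run(tasks, use_combiner=True), run(tasks, use_combiner=False))
```

Both runs give 19 for PageA and 50 for PageB, matching the example log in the problem statement.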

81 Effectiveness of Local Aggregation
  Highly dependent on the size of the intermediate key space, the number of workers and reduce tasks, and the distribution of intermediate keys via the partition function.

82 Outline: 10 Pairs and Stripes

83 Co-occurrence Matrix Construction
  Definition: An n × n matrix where n is the number of unique words in the corpus. A cell m_ij contains the number of times word w_i occurs with w_j. Occurrence is defined by some context, such as the same sentence, paragraph, or a sliding window of k words.
  For example, with k = 1 and the string "one fish two fish": [3 × 3 matrix with rows and columns one, fish, two; cell values not recoverable]
  For large vocabularies, the size of this problem quickly grows out of control.

84 Pairs Solution
  CooccurPairs
    method Map(docid a, doc d)
      for all term w ∈ doc d do
        for all term u ∈ Neighbors(w) do
          emit(pair (w, u), count 1)
    method Reduce(pair p, counts [c1, c2, ...])
      s ← 0
      for all count c ∈ counts [c1, c2, ...] do
        s ← s + c
      emit(pair p, count s)
  Emit a count for each co-occurrence
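A Python sketch of the pairs approach follows. It assumes that Neighbors(w) means the words within a window of k positions on either side of w (k = 1 here); the slide leaves the definition of the context open, so other choices would give different counts. `cooccur_pairs` is a hypothetical in-memory driver standing in for the runtime.

```python
def neighbors(terms, i, k=1):
    """Words within k positions of index i, excluding the word itself."""
    lo, hi = max(0, i - k), min(len(terms), i + k + 1)
    return [terms[j] for j in range(lo, hi) if j != i]

def pairs_map(docid, doc):
    terms = doc.split()
    for i, w in enumerate(terms):
        for u in neighbors(terms, i):
            yield ((w, u), 1)                 # one record per co-occurrence

def pairs_reduce(pair, counts):
    yield (pair, sum(counts))

def cooccur_pairs(docs):
    groups = {}                               # stands in for shuffle and sort
    for docid, doc in docs:
        for pair, count in pairs_map(docid, doc):
            groups.setdefault(pair, []).append(count)
    return dict(kv for p, cs in groups.items() for kv in pairs_reduce(p, cs))

matrix = cooccur_pairs([("d1", "one fish two fish")])
print(matrix)
```

Each mapper only ever holds one pair at a time, which is the low-memory property of this design.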

85 Analysis of Pairs
For a corpus of size n, in the worst case we produce O(n²) communications. However, the approach needs very little memory. If the vocabulary is very large and the neighbor context very wide, this solution will continue to scale.

86 Stripes Solution
CooccurStripes
method Map(docid a, doc d)
    forall the term w ∈ doc d do
        H ← new AssociativeArray
        forall the term u ∈ Neighbors(w) do
            H{u} ← H{u} + 1
        emit(term w, stripe H)
method Reduce(term w, stripes [H_1, H_2, ...])
    H_f ← new AssociativeArray
    forall the stripe H ∈ stripes [H_1, H_2, ...] do
        Sum(H_f, H)
    emit(term w, stripe H_f)
Emit a stripe for each word.
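The stripes approach can be sketched the same way. This is an in-process simulation under the assumption of a symmetric window of k words; `map_stripes`, `reduce_stripes`, and `run_stripes` are hypothetical names, and `collections.Counter` stands in for the associative array.

```python
from collections import Counter, defaultdict


def map_stripes(docid, doc, k=1):
    """Emit one stripe (neighbor -> count) per word occurrence."""
    tokens = doc.split()
    for i, w in enumerate(tokens):
        stripe = Counter()
        lo = max(0, i - k)
        hi = min(len(tokens), i + k + 1)
        for j in range(lo, hi):
            if j != i:
                stripe[tokens[j]] += 1
        yield w, stripe


def reduce_stripes(term, stripes):
    """Merge all stripes for a term by element-wise addition."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # Counter.update adds counts instead of replacing them
    yield term, total


def run_stripes(docs, k=1):
    """Simulate map, shuffle (group by term), and reduce in one process."""
    groups = defaultdict(list)
    for docid, doc in docs.items():
        for term, stripe in map_stripes(docid, doc, k):
            groups[term].append(stripe)
    result = {}
    for term, stripes in groups.items():
        for key, merged in reduce_stripes(term, stripes):
            result[key] = merged
    return result


stripe_counts = run_stripes({"d1": "one fish two fish"})
```

The final stripe for a term is one full row of the co-occurrence matrix, so the reducer must be able to hold a row in memory.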

87 Analysis of Stripes
We now produce O(kσ) potential communications in the worst case, where σ is our vocabulary size and k is the number of reduce tasks. However, there is a memory tradeoff: what if there is not enough memory to fit the data into a stripe?
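The tradeoff can be seen on a toy input by counting what each approach emits from the map side. This is a rough sketch with no local aggregation, assuming a symmetric window of k words; the function name `emitted_records` is hypothetical, and the counts say nothing about real cluster behavior.

```python
def emitted_records(tokens, k=1):
    """Return (pair records, stripe records, largest stripe size) for one document."""
    pairs = 0
    stripes = 0
    largest_stripe = 0
    for i in range(len(tokens)):
        lo = max(0, i - k)
        hi = min(len(tokens), i + k + 1)
        window = [tokens[j] for j in range(lo, hi) if j != i]
        pairs += len(window)  # pairs: one record per co-occurrence
        stripes += 1          # stripes: one record per word occurrence
        largest_stripe = max(largest_stripe, len(set(window)))
    return pairs, stripes, largest_stripe


tokens = "one fish two fish".split()
narrow = emitted_records(tokens, k=1)  # (6, 4, 2)
wide = emitted_records(tokens, k=3)    # (12, 4, 3)
```

Widening the window doubles the number of pair records here while the number of stripes stays the same, but each stripe grows, which is the memory cost the slide asks about.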

88 Further Resources
Runtimes / Frameworks:
    Disco - MapReduce in Python + Erlang
    Hadoop - MapReduce in Java
Free Online Books:
    Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer.
    Mining of Massive Datasets, Anand Rajaraman and Jeffrey Ullman.
    The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Luiz Barroso and Urs Hölzle. S00193ED1V01Y200905CAC006

89 Further Resources
Papers:
    Bigtable: A Distributed Storage System for Structured Data, Dean, Burrows, et al. (2006)
    MapReduce: Simplified Data Processing on Large Clusters, Dean & Ghemawat (2004)
Videos:
    Google Developer Series on Cluster Computing and MapReduce
Questions?


More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

A programming model in Cloud: MapReduce

A programming model in Cloud: MapReduce A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value

More information

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju Chapter 7: Distributed Systems: Warehouse-Scale Computing Fall 2011 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note:

More information

Can the Elephants Handle the NoSQL Onslaught?

Can the Elephants Handle the NoSQL Onslaught? Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented

More information

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #13: NoSQL and MapReduce

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #13: NoSQL and MapReduce CS 4604: Introduc0on to Database Management Systems B. Aditya Prakash Lecture #13: NoSQL and MapReduce Announcements HW4 is out You have to use the PGSQL server START EARLY!! We can not help if everyone

More information

MapReduce for Data Warehouses

MapReduce for Data Warehouses MapReduce for Data Warehouses Data Warehouses: Hadoop and Relational Databases In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information