MapReduce. Olivier Curé. January 6, 2014. Université Paris-Est Marne la Vallée, LIGM UMR CNRS 8049, France


1 MapReduce. Olivier Curé, Université Paris-Est Marne la Vallée, LIGM UMR CNRS 8049, France. January 6, 2014

2 In more and more situations, data is so big that it cannot be processed on a single machine. Examples: storage of new data per day (2012): Facebook (500TB), Twitter (175 million tweets/day). Facebook has over 100PB of photos, 250 billion photos, and 350 million new ones uploaded every day. YouTube receives 60 hours of video uploads every minute. LHC machines produce 1PB/second (filtered in hardware).

3 Big data (figure)

4 To process big data, parallelisation is needed. MapReduce is a popular framework to support this kind of processing. Some refer to it as the killer app of cloud computing. It became popular after the publication of a Google paper. It is interesting to have an overall view of the Google infrastructure to understand MapReduce.

5 Google's infrastructure: Google File System (GFS), MapReduce, Chubby, WorkQueue, BigTable, Sawzall.

6 Google cluster (figure)

7 Google File System [1]: a distributed file system optimized for large files (up to several GB), mainly for read access; write access appends data at the end of the file. It works on clusters of commodity hardware. [1] Ghemawat, S. et al.: The Google File System (SOSP '03)

8 Large files are partitioned into chunks of 64MB. A file is replicated on several machines (usually 3). On a chunkserver, a chunk is sliced into 64KB blocks. A chunk is stored as a standard Linux file. Files must be stored redundantly.
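As a quick illustration of the chunking arithmetic above, the sketch below maps a byte offset in a file to the index of the 64MB chunk that holds it. The function name is made up for illustration and is not part of the GFS API.

CHUNK_SIZE = 64 * 1024 * 1024   # 64MB chunks, as stated above

def chunk_index(byte_offset):
    # The chunk holding a byte is simply the offset divided by the chunk size.
    return byte_offset // CHUNK_SIZE

print(chunk_index(200 * 1024 * 1024))   # the byte at offset 200MB lives in chunk 3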

9 Nodes of a GFS cluster are either a master or a chunkserver. A cluster has one master (which has one copy). The master handles all the metadata, verifies the integrity of the data, grants authorization to write to a file, manages load balancing, and makes sure that blocks are copied sufficiently (in general 3 copies). Interactions with the master are kept to a minimum so that it does not become the bottleneck of the system.

10 Most nodes are chunkservers: they handle storage of chunks and provide read and write access to clients.

11 Reading data: the client asks the master for the addresses of the machines storing the requested chunks. If no read operation is active on this chunk, the master provides this list. The client then communicates directly with a chunkserver.

12 Writing data: a chunkserver storing the data becomes a (temporary) primary chunkserver. The master gives the client the primary and the secondary chunkservers. The client sends the data to write to the primary chunkserver, along with the list of secondaries. The primary chunkserver sends the data to the closest secondary, which does the same thing; thus all copies receive the data to append to the chunk.

13 GFS summary (figure)

14 Chubby [2]: a highly available and persistent distributed lock service. Heavily used inside Google as a name server, supplanting DNS. It uses the Paxos [3] algorithm to keep its (5) replicas consistent in the face of failures and to elect a machine to play the role of master. The master grants clients read/write locks on files. Limited to coarse-grained locks. [2] Burrows, M.: The Chubby Lock Service for Loosely-Coupled Distributed Systems (OSDI '06) [3] Lamport, L.: Paxos Made Simple

15 WorkQueue: Google's job scheduler on a cluster of machines. It creates a large-scale time-sharing system out of an array of computers and their disks. It schedules jobs, allocates resources, reports status and collects results. Tasks run on the same machines as GFS (since GFS as a storage system does not heavily load CPUs). It can reschedule jobs that fail.

16 BigTable [4]: a distributed storage system for managing structured data. It provides clients with a simple data model: a sparse, distributed, persistent, multidimensional sorted map. The map is indexed by a row, a column and a timestamp. Bigtable uses GFS to store log and data files. Bigtable relies on Chubby to elect a master, to allow the master to discover the servers it controls, and to permit clients to find the master. Bigtable only supports single-row transactions. [4] Chang, F. et al.: Bigtable: A Distributed Storage System for Structured Data (OSDI '06)

17 Data model: a tablet, a set of rows, is the unit of distribution and load balancing. Column keys are grouped into sets called column families (the basic unit of access control). Cell data is versioned with timestamps.
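To make the (row, column, timestamp) indexing concrete, here is a minimal in-memory sketch of such a sparse, versioned map. It is purely illustrative, not Bigtable's API; the row and column values echo the webtable example from the Bigtable paper.

table = {}   # sparse map: (row key, "family:qualifier", timestamp) -> value

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def read_latest(row, column):
    # Return the most recent version of a cell, if any.
    versions = [(ts, val) for (r, c, ts), val in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", 1, "<html>v1</html>")
put("com.cnn.www", "contents:", 2, "<html>v2</html>")
put("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")
print(read_latest("com.cnn.www", "contents:"))   # "<html>v2</html>"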

18 Building blocks: Bigtable data is stored using the SSTable file format. An SSTable provides a persistent, ordered, immutable map from keys to values. An SSTable contains a sequence of blocks and a block index (generally loaded into memory).
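The sketch below shows, under simplifying assumptions, how an ordered immutable map plus an in-memory block index lets a lookup touch at most one block; it illustrates the idea only, not the SSTable on-disk format.

import bisect

blocks = [                              # each block: a sorted run of (key, value) pairs
    [("apple", 1), ("banana", 2)],
    [("cherry", 3), ("grape", 4)],
]
block_index = ["apple", "cherry"]       # first key of each block, kept in memory

def sstable_get(key):
    b = bisect.bisect_right(block_index, key) - 1   # pick the candidate block
    if b < 0:
        return None                                  # key sorts before the first block
    return dict(blocks[b]).get(key)

print(sstable_get("cherry"))   # 3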

19 Components: a client library (clients rarely communicate with the master); a master, which assigns tablets to tablet servers, detects the addition and removal of tablet servers, balances the tablet load across tablet servers and handles schema changes; tablet servers, which handle read and write requests.

20 Tablet location: a hierarchy analogous to a B+-tree. Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory.

21 Tablet assignment: the master keeps track of the set of live tablet servers, including which tablets are unassigned. A tablet server creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory. The master monitors this servers directory to discover tablet servers. A tablet server stops serving if it loses its exclusive lock. A tablet server tries to reacquire its lock as long as its file exists; if the file no longer exists, the server kills itself. The master periodically asks each tablet server for the status of its lock.
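A minimal sketch of this liveness scheme, approximating the Chubby lock with exclusive file creation; all names are made up for illustration and this is not how Chubby is actually accessed.

import os

def acquire_server_lock(servers_dir, server_name):
    # A starting tablet server creates a uniquely named file and holds it
    # exclusively; creation fails if the file already exists.
    path = os.path.join(servers_dir, server_name)
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    return path

def still_serving(lock_path):
    # The server keeps serving only while its file exists; the master discovers
    # servers by listing servers_dir and periodically checking their locks.
    return os.path.exists(lock_path)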

22 Bigtable cell (figure)

23 MapReduce [5]: a programming model and an implementation for processing and generating large datasets. Programs are automatically parallelized and executed on a cluster of machines. Partitioning the data, scheduling execution, handling machine failures and managing inter-machine communication are taken care of by the framework. [5] Dean, J. et al.: MapReduce: Simplified Data Processing on Large Clusters (OSDI '04)

24 Origin: functional programming, e.g. Lisp (John McCarthy). map: application of a function over a list of elements. reduce: reduces a list to a scalar value. Example in Haskell:
let mysquare x = x * x
map mysquare [1,2,3,4]   -- [1,4,9,16]
mysum [1,2,3,4,5]        -- 15
where mysum is defined as:
mysum []     = 0
mysum (x:xs) = x + mysum xs

25 So the programmer only has to write two functions, a map and a reduce. map: processes key-value pairs and outputs an intermediate set of key-value pairs. reduce: processes the intermediate key-value pairs, merging the values for each associated key.

26 Execution: the MapReduce framework partitions the input data so that it can be distributed over the cluster of machines for parallel processing. It runs the map tasks in parallel; after all maps are complete, it consolidates all emitted values for each unique key; it then partitions the space of output map keys and runs the reduce function in parallel.
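The toy Python sketch below mirrors this flow (map, shuffle by key, reduce) in a single process. It is only an illustration of the execution model, not Google's or Hadoop's implementation, and the function names are made up.

from collections import defaultdict
from itertools import chain

def run_mapreduce(records, map_fn, reduce_fn):
    # records: iterable of (key, value) input pairs.
    # map_fn(key, value) yields intermediate (key, value) pairs.
    # reduce_fn(key, values) yields output (key, value) pairs.

    # Map phase (run in parallel over input partitions in a real framework).
    intermediate = chain.from_iterable(map_fn(k, v) for k, v in records)

    # Shuffle phase: consolidate all emitted values for each unique key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce phase (run in parallel over key partitions in a real framework).
    return dict(chain.from_iterable(reduce_fn(k, vs) for k, vs in groups.items()))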

27 (figure)

28 Example: count the words of a very big file.
map(string key, string value) {
  for each word w in value:
    EmitIntermediate(w, "1");
}
reduce(string key, Iterator values) {
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
}
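For comparison, here is the same word count expressed against the toy run_mapreduce sketch given after slide 26; the documents and their contents are invented for illustration.

def wc_map(doc_id, text):
    for word in text.split():
        yield (word, 1)            # EmitIntermediate(w, 1)

def wc_reduce(word, counts):
    yield (word, sum(counts))      # emit the total count for this word

docs = [("doc1", "to be or not to be"), ("doc2", "to do")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}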

29 (figure)

30 PageRank: a tool for evaluating the importance of Web pages in a way that is not easy to fool. It is a function that assigns a real number to each page on the Web. The higher the PageRank of a page, the more important it is.

31 Matrix-vector multiplication for PageRank. Given an n x n matrix M and a vector v of length n, the matrix-vector product is a vector x of length n with x_i = sum_{j=1..n} m_ij * v_j. If n is in the tens of billions, you need something like MapReduce. We assume that v fits in main memory and is part of the input of every Map task. M and v are stored in GFS.

32 Matrix-vector multiplication for PageRank (2). The row and column of each element of the matrix (resp. the position of each element of the vector) are discoverable from its position in the file, e.g. (i, j, m_ij). Each Map task takes the entire vector v and a chunk of the matrix M. From each matrix element m_ij it produces the key-value pair (i, m_ij * v_j). Thus, all the terms of the sum that make up the component x_i of the matrix-vector product get the same key. A Reduce task simply has to sum all the values associated with a given key i. The result is a pair (i, x_i).
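A hedged sketch of this job, reusing the toy run_mapreduce from after slide 26. The small matrix, the module-level v (standing in for "each Map task gets the entire vector") and the function names are illustrative.

v = [1.0, 2.0]                                   # the vector, assumed to fit in memory
M = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]      # sparse matrix as (i, j, m_ij) triples

def mv_map(_, elem):
    i, j, m_ij = elem
    yield (i, m_ij * v[j])        # one term m_ij * v_j of the sum for x_i

def mv_reduce(i, terms):
    yield (i, sum(terms))         # x_i = sum of all terms that share key i

x = run_mapreduce([(None, e) for e in M], mv_map, mv_reduce)
print(x)   # {0: 4.0, 1: 6.0}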

33 Matrix-vector multiplication for PageRank (3). If v does not fit into main memory, we can divide the matrix and the vector into stripes such that a stripe of the vector fits into main memory. The ith stripe of the matrix multiplies only components from the ith stripe of the vector. Each Map task is assigned a chunk from one of the stripes of the matrix and gets the entire corresponding stripe of the vector. The Map and Reduce tasks can then act exactly as described above for the case where Map tasks get the entire vector.

34 What is MapReduce used for? User behavior analysis, A/B testing, ad targeting, trending topics, recommendation.

35 Sawzall [6]: a domain-specific language built on top of MapReduce; an interpreted, procedural programming language used to define the Map functions. Only for commutative and associative tasks. Aggregators (Reduce functions) are written in another language (Python, C++, ...). [6] Pike, R. et al.: Interpreting the Data: Parallel Analysis with Sawzall (Scientific Programming Journal 13:4, 2005)

36 Pig Latin [7]: Pig is the system that compiles Pig Latin scripts into physical plans that are executed over Hadoop. Pig Latin combines the high-level declarative querying approach of SQL with the low-level programming of MapReduce. Pig Latin programs specify a query execution plan. It has a flexible, fully nested data model and extensive support for user-defined functions (currently written in Java). It is meant for offline, scan-centric workloads. Other high-level interfaces on top of MR: Hive, Scope, Dryad/LINQ, Cascalog, ... [7] Olston, C. et al.: Pig Latin: A Not-So-Foreign Language for Data Processing (SIGMOD '08)

37 Apache's stack vs. Google's infrastructure: MapReduce → Hadoop; Google File System → HDFS; Chubby → ZooKeeper; WorkQueue; BigTable → HBase; Sawzall → Pig Latin.

38 MapReduce and RDBMS. For the RDBMS community, MapReduce has no schemas (key-value pairs), uses brute force instead of indexing, and is not compatible with DBMS tools. The map function is like a GROUP BY clause; the reduce function is like an aggregation function (e.g. average, sum). Parallel DBs have been around for more than 30 years (e.g. Teradata).
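To make the analogy concrete, a hedged sketch: the aggregation below corresponds roughly to SELECT word, COUNT(*) FROM words GROUP BY word, with map playing the GROUP BY key extraction and reduce the aggregate. It reuses the toy run_mapreduce from after slide 26 and the data is invented.

# SQL analogue (illustrative): SELECT word, COUNT(*) FROM words GROUP BY word;
rows = [(None, w) for w in ["to", "be", "or", "not", "to", "be"]]
counts = run_mapreduce(
    rows,
    lambda _, word: [(word, 1)],                    # map ~ extract the GROUP BY key
    lambda word, ones: [(word, sum(ones))])         # reduce ~ the COUNT(*) aggregate
print(counts)   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}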

39 MapReduce complements DBMS technology [8]: MapReduce is more like an ETL system than a DBMS. It is faster at loading data than a parallel DB, but a parallel DB performs operations faster than MR once the data is loaded. [8] Stonebraker, M. et al.: MapReduce and Parallel DBMSs: Friends or Foes? (CACM, Jan. 2010)

40 CouchDB: a document-oriented database. A project of the Apache foundation, started in 2005, implemented in Erlang. Characteristics: built for the Web, scales, replication built in, schema-free, indexable, append-only storage, atomic updates, no locking of data (first to commit wins), eventually consistent.

41 Built for the Web. Access: HTTP, a REST-based interface (GET, PUT, POST, DELETE). Query language: JavaScript (as MapReduce jobs). Storage format: JSON.

42 CLI examples with curl.
Create a database: curl -X PUT ...
Get all databases: curl -X GET .../_all_dbs
Delete a DB: curl -X DELETE ...
Get info on a DB: curl -X GET ...
Create a doc: curl -X PUT ... -d '{"name": "olivier Cure", "login": "ocure", "follows": ["ianhorrocks", "bijanparsia"]}'
Get a doc: curl -X GET ...
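Since the database URLs did not survive the transcription above, here is a hedged sketch of the same operations against CouchDB's REST API using only the Python standard library. The base URL (CouchDB's default port 5984 on localhost, no authentication), the database name mydb and the document ID are assumptions for illustration.

import json
import urllib.request

BASE = "http://127.0.0.1:5984"     # assumed local, unauthenticated CouchDB

def couch(method, path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(BASE + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

print(couch("PUT", "/mydb"))                       # create a database
print(couch("GET", "/_all_dbs"))                   # list all databases
print(couch("PUT", "/mydb/ocure", {                # create a document
    "name": "olivier Cure", "login": "ocure",
    "follows": ["ianhorrocks", "bijanparsia"]}))
print(couch("GET", "/mydb/ocure"))                 # fetch the document back
# couch("DELETE", "/mydb") would delete the whole database.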

43 Futon: CouchDB's administration interface, available at /_utils.

44 Views. When a view is queried for the first time, CouchDB runs through every document in the database and runs the view function against it. It then takes the result of the view, which is stored in the form of rows of key/value pairs, and stores it in an individual B-tree file. Materialized (permanent) views (B-tree, stored in a design document) vs. temporary views (not for production systems). Views are written using a map/reduce approach: the map function uses the emit function to produce a result; emit accepts 2 arguments, a key and a value. The reduce function accepts 3 arguments (key, values, rereduce) and returns a single value as result.

45 Reduce If rereduce is false, the keys argument will be a list of keys and IDs for each row emitted by the map function, and the values argument will be an array of the values emitted by the map function. If rereduce is true, however, the keys argument will be null, and the values argument will be an array of the results produced by the previous invocations of the reduce function.
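A small sketch of these two modes of reduce, written as a rereduce-aware row counter; it only illustrates the calling convention described above, not CouchDB's actual view engine, and the sample rows are invented.

def count_reduce(keys, values, rereduce):
    # rereduce=False: called on rows emitted by map -> count the rows.
    # rereduce=True: called on earlier reduce outputs -> sum the partial counts.
    return sum(values) if rereduce else len(values)

partial1 = count_reduce([("mdavis", "id1"), ("mdavis", "id2")], [1, 1], False)  # 2
partial2 = count_reduce([("mdavis", "id3")], [1], False)                        # 1
print(count_reduce(None, [partial1, partial2], True))                           # 3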

46 Views can be created and invoked using Futon or the command line.
Create a temporary view: curl -X POST .../movies/_temp_view -d '{"map": "function(doc) { emit(doc._id, doc); }"}'
Call a permanent view: curl -X GET .../movies/_design/examples/_view/actors

47 Twitter example:
{ "name": "Miles Davis", "_id": "mdavis", "pwd": "sowhat",
  "follows": ["jcoltrane", "srollins", "wshorter"],
  "followedby": ["jcoltrane", "pmetheny"],
  "tweets": [
    { "date": "T11:40:52.280Z", "contents": "that is cool" },
    { "date": "T12:40:52.280Z", "contents": "tutu" },
    { "date": "T11:40:52.280Z", "contents": "good times!" } ] }

48 Some views:
function(doc) { emit(doc._id, doc); }
function(doc) { emit(doc.login, doc.follows); }
function(doc) { emit(doc._id, doc.tweets.length); }

49 More views:
function(doc) {
  for (i in doc.followedby) {
    tmp = doc.followedby[i];
    emit({followedby: tmp}, doc._id);
  }
}
function (key, values) { return values.length; }

50 Word counts in tweets:
function(doc) {
  for (i in doc.tweets) {
    var words = doc.tweets[i].contents.toLowerCase().replace(/[^a-z]+/g, ' ').split(' ');
    for (word in words) emit(words[word], 1);
  }
}
function(key, values, rereduce) { return sum(values); }
