Data Management Using MapReduce

Transcription

1 Data Management Using MapReduce M. Tamer Özsu University of Waterloo CS742-Distributed & Parallel DBMS M. Tamer Özsu 1 / 24

2 Basics For data analysis of very large data sets Highly dynamic, irregular, schemaless, etc. SQL too heavy Embarrassingly parallel problems New, simple parallel programming model Data structured as (key, value) pairs E.g. (doc-id, content), (word, count), etc. Functional programming style with two functions to be given: Map(k1,v1) list(k2,v2) Reduce(k2, list (v2)) list(v3) Implemented on a distributed file system (e.g., Google File System) on very large clusters CS742-Distributed & Parallel DBMS M. Tamer Özsu 2 / 24

3 Map Function User-defined function Processes input key/value pairs Produces a set of intermediate key/value pairs Map function I/O Input: read a chunk from distributed file system (DFS) Output: Write to intermediate file on local disk MapReduce library Executes function Groups together all intermediate values with the same key (i.e., generates a set of lists) Passes these lists to reduce functions Effect of function Processes and partitions input data Builds a distributed (trnasparent to user) Similar to group by operaton in SQL CS742-Distributed & Parallel DBMS M. Tamer Özsu 3 / 24

4 Reduce Function User-defined function Accepts one intermediate key and a set of values for that key (i.e., a list) Merges these values together to form a (possibly) smaller set Typically, zero or one output value is generated per invocation Reduce function I/O Input: read from intermediate files using remote reads on local files of corresponding per nodes Output: Each reducees writes its output as a file back to DFS Effect of function Similar to aggregation operaton in SQL CS742-Distributed & Parallel DBMS M. Tamer Özsu 4 / 24

5 MapReduce Processing Input data set. Map Map Map (k 1, v) (k 2, v) (k 2, v) (k 2, v) (k 1, v) Group by k Group by k (k 1, (v, v, v)) Reduce (k 1, (v, v, v, v)) Reduce Output data set Map (k 1, v) (k 2, v) CS742-Distributed & Parallel DBMS M. Tamer Özsu 5 / 24

6 Example 1 Assume you are reading the monthly average temperatures for each of the 12 months of a year for a bunch of cities, i.e., each input is the name of the city (key) and the average monthly temperature. Compute the average annual temperature for each city. Map: Input: City, Month, MonthAvgTemp 1. Create key/value pairs Output: City, MonthAvgTemp Reduce: Input: City, MonthAvgTemp 1. Sort to get City, list(monthavgtemp) (i.e., it combines in a list the monthly average temperatures for a given city) 2. Compute average over list(monthavgtemp) Output: City, AnnualAvgTemp CS742-Distributed & Parallel DBMS M. Tamer Özsu 6 / 24

7 Example 2 Consider EMP (ENAME, TITLE, CITY) Query: Map: SELECT CITY, COUNT( ) FROM EMP WHERE ENAME LIKE \%Smith GROUP BY CITY I n p u t : ( TID, emp ), Output : ( CITY, 1 ) i f emp.ename l i k e \%Smith return ( CITY, 1 ) Reduce: I n p u t : ( CITY, l i s t ( 1 ) ), Output : ( CITY,SUM( l i s t ( 1 ) ) ) return ( CITY,SUM( 1 ) ) CS742-Distributed & Parallel DBMS M. Tamer Özsu 7 / 24

8 MapReduce Architecture Master Worker Map Process Input Module Map Module Combine Module Partition Module Worker Map Process Input Module Map Module Combine Module Partition Module Worker Map Process Input Module Map Module Combine Module Partition Module Scheduler Worker Reduce Process Group Module Reduce Module Output Module Worker Reduce Process Group Module Reduce Module Output Module CS742-Distributed & Parallel DBMS M. Tamer Özsu 8 / 24

9 MapReduce: Simplified Data Processing on Large Clusters Execution Flow with Architecture (1) fork User Program (1) fork (1) fork (2) assign Master (2) assign reduce worker split 0 split 1 split 2 (3) read (4) local write worker (5) remote (5) read worker (6) write output file 0 split 3 split 4 worker output file 1 worker Input files Map phasr Intermediate files (on local disks) Reduce phase Output files Fig. 1. Execution overview. From: J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Comm. ACM, CS742-Distributed & Parallel DBMS M. Tamer Özsu 9 / 24

10 Characteristics Flexibility User can write any and reduce function code No need to know how to parallelize Scalability Elastic scalability Automatic load balancing Efficiency Simple: parallel scan No database loading Fault tolerance Worker failure Master pings workers periodically; assumes failure if no response Tasks (both and reduce) on failed workers scheduled on a different worker node Master failure Checkpoints of master state Recovery after failure progress can halt Replication on distributed data store CS742-Distributed & Parallel DBMS M. Tamer Özsu 10 / 24

11 Hadoop Most popular MapReduce implementation developed by Yahoo! Two components Processing engine HDFS: Hadoop Distributed Storage System others possible Can be deployed on the same machine or on different machines Processes Job tracker: hosted on the master node and implements the schedule Task tracker: hosted on the worker nodes and accepts tasks from job tracker and executes them HDFS Name node: stores how data are partitioned, monitors the status of data nodes, and data dictionary Data node: Stores and manages data chunks assigned to it Worker 1 Name Node Worker n MapReduce Task Tracker Job Tracker Task Tracker HDFS Data Node Name Node Data Node CS742-Distributed & Parallel DBMS M. Tamer Özsu 11 / 24

12 Hadoop UDF Functions Phase Name Function InputFormat::getSplit Partition the input data into different splits. Each split is processed by a per Map and may consist of several chunks. RecordReader::next Define how a split is divided into items. Each item is a key/value pair and used as the input for the function. Mapper:: Users can customize the function to process the input data. The input data are transformed into some intermediate key/value pairs. WritableComparable::compareTo The comparison function for the key/- value pairs. Job::setCombinerClass Specify how the key/value pair are aggregated locally. Shuffle Job::setPartitionerClass Specify how the intermediate key/value pairs are shuffled to different reducers. Job::setGroupingComparatorClass Specify how the key/value pairs are Reduce grouped in the reduce phase. Reducer::reduce Users write their own reduce functions to perform the corresponding jobs. CS742-Distributed & Parallel DBMS M. Tamer Özsu 12 / 24

13 MapReduce Implementations Name Language File System Index Master Server Multiple Job Support Hadoop Java HDFS No Name Node and Job Yes Tracker Disco Python and Erlang Distributed Index Disco Server Skynet Ruby MySQL or No Any node in the cluster No Unix File System FileMap Shell Unix File No Any node in the cluster No and Perl Scripts System Twister Java Unix File No One master node in broker Yes System network Cascading Java HDFS No Name Node and Job Yes Tracker No No CS742-Distributed & Parallel DBMS M. Tamer Özsu 13 / 24

14 MapReduce Languages Hive Pig Language Declarative SQL-like Dataflow Data model Nested Nested UDF Supported Supported Data partition Supported Not supported Interface Command line, web, Command line JDBC/ODBC server Query optimization Rule based Rule based Metastore Supported Not supported CS742-Distributed & Parallel DBMS M. Tamer Özsu 14 / 24

15 MapReduce Implementations of Database Operators Select and Project can be easily implemented in the function Aggregation is not difficult (see next slide) Join requires more work MapReduce join implementations Repartition join θ-join Equi-join Semi-join Broadcast join Map-only join Similarity join Partition join Multiple MapReduce jobs Multi-way join Replicated join CS742-Distributed & Parallel DBMS M. Tamer Özsu 15 / 24

16 Aggregation Extracting aggregation attribute (Aid) Grouping by Aid Applying the aggregation function for the tuples with the same Aid Mapper 1 R 1 R1 2 R2 Mapper 2 R 3 R3 4 R4 Aid Value 1 R1 2 R2 Aid Value 1 R3 2 R4 Partitioning by Aid (Round Robin) Reducer 1 Aid 1 Reducer 2 Aid 2 Value R1 R3 Value R2 R4 reduce reduce Result 1, f(r1, R3) Result 2, f(r2, R4) Map Phase Reduce Phase CS742-Distributed & Parallel DBMS M. Tamer Özsu 16 / 24

17 θ-join Baseline implementation of R(A, B) S(B, C) 1. Partition R and assign each partition to pers 2. Each per takes a, b tuples and converts them to a list of key/value pairs of the form (b, a, R ) 3. Each reducer pulls the pairs with the same key 4. Each reducer joins tuples of R with tuples of S CS742-Distributed & Parallel DBMS M. Tamer Özsu 17 / 24

18 θ-join If θ equals = (i.e., equijoin) Repartition join Semijoin-based join Map-only join If θ equals Randomly assigning bucket ID (Bid) Grouping by Bid Aggregate tuples based on origins Local θ-join (θ is ) Mapper 1 R 1 R1 2 R2 3 R3 4 R4 Mapper 1 S 1 S1 2 S2 3 S3 4 S4 Bid Tuple 1 R,1,R1 4 R,2,R2 3 R,3,R3 2 R,2,R4 Bid Tuple 4 S,1,S1 2 S,2,S2 3 S,3,S3 1 S,2,S4 Partitioning by Bid (Round Robin) Reducer 1 Bid Tuple R,1,R1 1 S,4,S4 R,3,R3 3 S,3,S3 Reducer 2 reduce Origin Tuple R 1,R1 R 4,R4 Origin Tuple S 3,S3 S 4,S4 Result R1 S3 R1 S4 R4 S3 Map Phase Reduce Phase CS742-Distributed & Parallel DBMS M. Tamer Özsu 18 / 24

19 Repartition Join Tagging origins Grouping by keys Local join Mapper 1 R 1 R1 2 R2 3 R3 4 R4 Mapper 1 S 1 S1 2 S2 3 S3 4 S4 1 R,R1 2 R,R2 3 R,R3 4 R,R4 1 S,S1 2 S,S2 3 S,S3 4 S,S4 Partitioning by key (Round Robin) Reducer 1 Key 1 3 Reducer 2 Key 2 4 Tuple R,R1 S,S1 R,R3 S,S3 Tuple R,R2 S,S2 R,R4 S,S4 reduce reduce Result R1 S1 R3 S3 Result R2 S2 R4 S4 Map Phase Reduce Phase CS742-Distributed & Parallel DBMS M. Tamer Özsu 19 / 24

20 Semijoin-based Join Extracting join keys Broadcasting keys of R to all the splits of S and join S with keys of R Broadcasting the results of the previous job (S) to all the splits of R, and locally joining R with S R 1 R1 3 R2 1 R3 4 R4 MapReduce Key Mapper 1 Key S 1 S1 2 S2 Mapper 2 Key S 3 S3 4 S4 1 S1 3 S3 4 S4 Mapper 1 S 1 S1 3 S3 4 S4 R 1 R1 2 R2 Mapper 2 Result R1 S1 R2 S3 Job 1 Full MapReduce job Job 2 Map-only job Job 3 Map-only job CS742-Distributed & Parallel DBMS M. Tamer Özsu 20 / 24

21 Map-only Join Broadcast join: If inner relation outer relation no shuffling Map phase similar to third job of semijoin-based join Each per loads the full inner table to build an in-memory hash; scan the outer relation Trojan join: Relations are already co-partitioned on the join key all tuples of both relations are co-located on the same node Co-partitioning implemented by one job Scheduler loads co-partitioned data chunks in the same per to perform a local join Co-partitioned Split Co-group Co-group H R Data R H S Data S H R Data R H S Data S Footer CS742-Distributed & Parallel DBMS M. Tamer Özsu 21 / 24

22 Multiway Join Multiple MapReduce jobs R S T (R S) T Each join implemented in one MapReduce job Join ordering problem Replicated join Single MapReduce job CS742-Distributed & Parallel DBMS M. Tamer Özsu 22 / 24

23 Replicated Join Generating keys and tagging origins Joining the three tables locally Mapper 1 R Rid Value 1 C1 2 C2 Mapper 2 T Rid Sid Value 1 1 L1 1 2 L2 2 2 L3 Mapper 3 S Sid Value 1 O1 2 O2 1, null C,C1 2, null C,C2 1,1 L,L1 1,2 L,L2 2,2 L,L3 Key null,1 null,2 Value O,O1 O,O2 Partitioning by key (Round Robin) Shuffling tuples to multiple reducers if necessary Reducer 1 1,null C, C1 1,1 L, L1 null,1 O, O1 Reducer 2 1,null C, C1 1,2 L, L2 null,2 O, O2 Reducer 3 Key 2,null null,1 Reducer 4 Value C, C2 O, O1 2,null C, C2 2,2 L, L3 null,2 O, O2 reduce reduce reduce reduce Result C1,L1,O1 Result C1,L2,O2 Result Result C2,L2,O2 Map Phase Reduce Phase CS742-Distributed & Parallel DBMS M. Tamer Özsu 23 / 24

24 DBMS on MapReduce HadoopDB Llama Cheetah Language SQL-like Simple interface SQL Storage Row store Column store Hybrid store Data compression No Yes Yes Indexing Query optimization parti- Horizontally tioned Data partition Local index in each database instance Rule based optimization plus local optimization by PostgreSQL Vertically partitioned Horizontally partitioned at chunk level No index Local index for each data chunk Column-based optimization, Multi-query optimiza- late tion, materialized materialization and views processing multiway join in one job CS742-Distributed & Parallel DBMS M. Tamer Özsu 24 / 24