Databases 2 (VU)

Transcription

1 Databases 2 (VU) ( ) MapReduce (Part 3) Mark Kröll KTI, TU Graz Nov. 14, 2016 Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41

2 Outline
1. Problems Suited for Map-Reduce
   - Matrix-Vector Multiplication
   - Relational-Algebra Operations
2. Hadoop Ecosystem
   - Big Data Storage Technologies
Slides are partially based on Mining Massive Datasets by Jure Leskovec.

5 MapReduce: Applications
- MapReduce computation makes sense when files are large and rarely updated in place
- not suitable for managing online sales (e.g. Amazon)
  - the principal operations on Amazon data involve responding to searches for products, recording sales, and so on: processes that involve relatively little calculation and that change the database
  - you won't see MapReduce used for handling Web requests (even with millions of users)
- however, you do want to use MapReduce for analytic queries on the data generated by, e.g., a Web application
  - finding users with similar buying patterns
  - ranking search results

7 MapReduce: Applications
- computations such as analytic queries typically involve matrix operations
- the original purpose of the MapReduce implementation was to execute large matrix-vector multiplications for calculating PageRank
- matrix operations such as matrix-matrix and matrix-vector multiplication fit nicely into the MapReduce programming model
- another important class of operations that can use MapReduce effectively are the relational-algebra operations

8 MapReduce: Applications: Matrix-Vector Multiplication
Suppose we have an n × n matrix M, whose element in row i and column j is denoted m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the matrix-vector product is the vector x of length n, whose ith element is given by
  x_i = \sum_{j=1}^{n} m_{ij} v_j
Outline a Map-Reduce program that calculates the vector x.
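
As a small worked instance of the formula, take n = 2. The numbers are chosen here purely for illustration and do not come from the slides:

```latex
M = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad
v = \begin{pmatrix} 5 \\ 6 \end{pmatrix}, \quad
x_1 = 1 \cdot 5 + 2 \cdot 6 = 17, \quad
x_2 = 3 \cdot 5 + 4 \cdot 6 = 39
```

Each product m_{ij} v_j contributes to exactly one component x_i, which is what makes the row index i a natural key.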

10 Matrix-Vector Multiplication
- let us first assume that the vector v is large but still fits into main memory
- the matrix M and the vector v are each stored in a file of the DFS
- assume that the row-column coordinates (indices) of each matrix element can be discovered
  - for example, each value is stored as a triple (i, j, m_ij)
- the position of each v_j can be discovered analogously

12 Matrix-Vector Multiplication
Map Function:
- the map function applies to one element of the matrix M
- the vector v is first read in its entirety and is available to all Map tasks at that compute node
- from each matrix element m_ij the map function produces the key-value pair (i, m_ij v_j)
- all terms of the sum that make up the component x_i of the matrix-vector product get the same key i
Reduce Function:
- the reduce function sums all the values associated with a given key i
- the result is a pair (i, x_i)
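
The Map and Reduce functions described on this slide can be sketched in plain Python. This is a minimal single-process simulation, not Hadoop code: the names `map_element`, `reduce_row` and `matrix_vector`, the in-memory "shuffle", and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

# Illustrative data (not from the slides): M as (i, j, m_ij) triples, v as a dict.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = {0: 5.0, 1: 6.0}

def map_element(triple, v):
    """Map: one matrix element (i, j, m_ij) -> one pair (i, m_ij * v_j)."""
    i, j, m_ij = triple
    yield i, m_ij * v[j]

def reduce_row(i, products):
    """Reduce: sum all products that share the key i, giving (i, x_i)."""
    return i, sum(products)

def matrix_vector(M, v):
    # Simulated shuffle: group the mapped values by their key i.
    groups = defaultdict(list)
    for triple in M:
        for key, value in map_element(triple, v):
            groups[key].append(value)
    return dict(reduce_row(i, values) for i, values in groups.items())

x = matrix_vector(M, v)
print(x)
```

Here the whole vector `v` is passed to every map call, which mirrors the assumption that v fits into the memory of a compute node.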

14 Matrix-Vector Multiplication
- however, the vector v might not fit into main memory
- v is not required to fit into the memory of a compute node, but if it does not, there will be a very large number of disk accesses as pieces of the vector are moved into main memory
- alternatively, we can divide the matrix M into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height
- use enough stripes so that the portion of the vector in one stripe fits into main memory at a compute node

15 Matrix-Vector Multiplication
Figure: Divide matrix M and vector v into stripes

16 Matrix-Vector Multiplication
- the ith stripe of the matrix M multiplies only components from the ith stripe of the vector
- we can divide the matrix M into one file per stripe and do the same for the vector v
- each Map task is assigned a chunk from one of the stripes of the matrix and gets the entire corresponding stripe of the vector
- Map and Reduce tasks can then act exactly as before
- the partial results of the stripe multiplications need to be summed up once more
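
The striping idea can be sketched as follows. This is a hedged, single-process illustration: the function names, the stripe width parameter and the sample data are assumptions, and real stripes would live in separate DFS files rather than Python dictionaries.

```python
from collections import defaultdict

def stripe_of(j, width):
    """Column index j belongs to stripe j // width (hypothetical layout)."""
    return j // width

def striped_matrix_vector(M, v, width):
    # Split M into vertical stripes (by column j) and v into matching stripes.
    matrix_stripes = defaultdict(list)
    vector_stripes = defaultdict(dict)
    for i, j, m_ij in M:
        matrix_stripes[stripe_of(j, width)].append((i, j, m_ij))
    for j, v_j in v.items():
        vector_stripes[stripe_of(j, width)][j] = v_j

    x = defaultdict(float)
    for s, elements in matrix_stripes.items():
        v_s = vector_stripes[s]            # only this stripe of v is needed
        partial = defaultdict(float)
        for i, j, m_ij in elements:        # map and reduce as before, per stripe
            partial[i] += m_ij * v_s[j]
        for i, value in partial.items():   # final sum over the stripe results
            x[i] += value
    return dict(x)

# Illustrative data: a 2 x 2 matrix split into two stripes of width 1.
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
v = {0: 5, 1: 6}
x = striped_matrix_vector(M, v, width=1)
print(x)
```

Each stripe touches only its own slice `v_s` of the vector, so the memory needed per task is bounded by the stripe width rather than by n.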

18 Relational-Algebra Operations
- many operations on data can be described easily in terms of the common database-query primitives
- the queries themselves need not be executed within a DBMS
- e.g. standard operations on relations such as selection
- a relation is a table with column headers called attributes
- the set of attributes of a relation R is called its schema: R(A_1, A_2, ..., A_n)

19 Relational-Algebra Operations: Relation Links
  From  To
  url1  url2
  url1  url3
  url2  url3
  url2  url
Table: The relation consists of the set of pairs of URLs such that the first has one or more links to the second

20 Relational-Algebra Operations: Relation Links
- a tuple is a pair of URLs such that there is at least one link from the first URL to the second
- the first row (url1, url2) states that the Web page at url1 points to the Web page at url2
- a similar relation is typically stored by a search engine (with billions of tuples)

21 Relational-Algebra Operations: Relational Algebra
Standard operations on relations are:
1. Selection (σ): apply a condition C to each tuple and output only the tuples that satisfy C
2. Projection (π): produce from each tuple only a subset S of the attributes
3. Union, Intersection, Difference: set operations on tuples
4. Natural Join (⋈): given two relations, compare each pair of tuples and output those that agree on all common attributes
5. Grouping and Aggregation (γ, θ): partition the tuples of a relation according to their values in a set of attributes; for each group, perform an operation such as Sum, Count, Avg, Min, or Max

22 Relational-Algebra Operations: Example 1: Paths of length 2
- find paths of length 2 in the Web using the Links relation
- in other words, find triples of URLs (u, v, w) such that there is a link from u to v and a link from v to w
- we want to take the natural join of Links with itself
- let us describe this with two copies of Links: L1(U1, U2) and L2(U2, U3)

23 Relational-Algebra Operations: Example 1: Paths of length 2
- now we compute L1(U1, U2) ⋈ L2(U2, U3)
- for each tuple t1 of L1 and each tuple t2 of L2, we check whether their U2 components are the same (these are the second component of t1 and the first component of t2)
- if these two components agree, we produce (U1, U2, U3) as a result
- if we only want to check for the existence of a path of length two, we can project onto U1 and U3: π_{U1,U3}(L1 ⋈ L2)
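
The self-join and the projection can be tried on a tiny relation. The sketch below uses plain Python rather than MapReduce, compares every pair of tuples exactly as described above, and uses only three illustrative Links tuples; the function names are assumptions.

```python
# Illustrative Links relation as (From, To) pairs.
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3")]

def paths_of_length_two(links):
    """Natural join of Links with itself on the shared attribute U2."""
    return [
        (u1, u2, u3)
        for (u1, u2) in links        # t1 from L1(U1, U2)
        for (v2, u3) in links        # t2 from L2(U2, U3)
        if u2 == v2                  # second component of t1 == first of t2
    ]

def project_endpoints(paths):
    """Projection onto U1 and U3, with duplicates eliminated."""
    return sorted({(u1, u3) for (u1, _, u3) in paths})

paths = paths_of_length_two(links)
endpoints = project_endpoints(paths)
print(paths)
print(endpoints)
```

The nested loop is quadratic in the number of tuples, which is one reason a key-based formulation is attractive for relations with billions of tuples.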

25 Relational-Algebra Operations: Example 2: Number of friends
- imagine that a social-networking site has a relation Friends(User, Friend)
- suppose we want to compute statistics about the number of friends of each user
- in terms of relational algebra we would perform grouping and aggregation: γ_{User, COUNT(Friend)}(Friends)
- this operation groups all tuples by the value of the first component and then counts the number of friends in each group
- the result has one tuple per group; a typical tuple would look like (Sally, 300) if user Sally has 300 friends

27 Relational-Algebra Operations: Selection by MapReduce
- given a relation R, we want to compute σ_C(R); this can be done most conveniently in the map part alone
Map Function:
- for each tuple t in R, test whether it satisfies C
- if so, produce the key-value pair (t, t)
Reduce Function:
- the reduce function is the identity
- it simply passes each key-value pair to the output
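
A minimal sketch of selection in this style, assuming a toy in-memory driver named `run` and a hypothetical condition C ("the link starts at url1"); neither the driver nor the condition comes from the slides.

```python
from collections import defaultdict

# Illustrative relation R as (From, To) pairs.
R = [("url1", "url2"), ("url1", "url3"), ("url2", "url3")]

def condition(t):
    """Hypothetical condition C: the link starts at url1."""
    return t[0] == "url1"

def select_map(t):
    """Map: emit (t, t) only if t satisfies C."""
    if condition(t):
        yield t, t

def identity_reduce(key, values):
    """Reduce: pass every key-value pair through unchanged."""
    for value in values:
        yield key, value

def run(records, mapper, reducer):
    """Toy driver: map, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

selected = [key for key, _ in run(R, select_map, identity_reduce)]
print(selected)
```

All the filtering happens in `select_map`; the reduce step does no work, which is why the slide calls the map part sufficient.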

29 Relational-Algebra Operations: Projection by Map-Reduce
- given a relation R, we want to compute π_S(R)
Map Function:
- for each tuple t in R, construct a tuple t' by eliminating from t those components that are not in the projection S
- output the key-value pair (t', t')
Reduce Function:
- for each key t' there will be one or more key-value pairs (t', t')
- the reduce function turns (t', [t', t', ..., t']) into (t', t')
- the Reduce operation amounts to duplicate elimination
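
Projection with duplicate elimination might look like the sketch below. The relation, the attribute positions in `S` and the toy driver `run` are assumptions chosen for illustration.

```python
from collections import defaultdict

# Illustrative relation R(A, B, C); S holds the positions of the kept attributes A and C.
R = [(1, "x", 10), (1, "y", 10), (2, "x", 20)]
S = (0, 2)

def project_map(t):
    """Map: build t' from the components of t listed in S and emit (t', t')."""
    t_prime = tuple(t[k] for k in S)
    yield t_prime, t_prime

def dedup_reduce(key, values):
    """Reduce: collapse (t', [t', ..., t']) into a single (t', t')."""
    yield key, key

def run(records, mapper, reducer):
    """Toy driver: map, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

projected = [key for key, _ in run(R, project_map, dedup_reduce)]
print(projected)
```

The first two tuples of `R` project to the same t', so grouping by key is what removes the duplicate.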

31 Relational-Algebra Operations: Natural Join by Map-Reduce
- given relations R(A, B) and S(B, C), we want to compute R ⋈ S
- we must find tuples that agree on their B components
Map Function:
- for each tuple (a, b) of R, produce the key-value pair (b, (R, a))
- for each tuple (b, c) of S, produce the key-value pair (b, (S, c))
Reduce Function:
- each key b will be associated with a list of pairs of the form (R, a) or (S, c)
- construct all triples (a, b, c) by combining each (R, a) with each (S, c)
- the challenge is to recast one's task so that it can be processed by MapReduce, i.e. so that it adheres to its internal key/value structure
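
The join can be sketched with tagged values. The input encoding (each record carries the name of its relation), the toy driver `run` and the sample tuples are assumptions made for this illustration.

```python
from collections import defaultdict

# Illustrative relations R(A, B) and S(B, C).
R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
S = [("b1", "c1"), ("b1", "c2"), ("b3", "c3")]

def join_map(record):
    """Map: key every tuple by its B value and tag it with its relation name."""
    relation, t = record
    if relation == "R":
        a, b = t
        yield b, ("R", a)
    else:
        b, c = t
        yield b, ("S", c)

def join_reduce(b, values):
    """Reduce: combine each (R, a) with each (S, c) that shares the key b."""
    r_side = [x for tag, x in values if tag == "R"]
    s_side = [x for tag, x in values if tag == "S"]
    for a in r_side:
        for c in s_side:
            yield a, b, c

def run(records, mapper, reducer):
    """Toy driver: map, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key, values in groups.items() for out in reducer(key, values)]

records = [("R", t) for t in R] + [("S", t) for t in S]
joined = run(records, join_map, join_reduce)
print(joined)
```

Keys that occur in only one relation (here b2 and b3) produce no output, because one side of the combination is empty.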

33 Relational-Algebra Operations: Grouping and Aggregation by Map-Reduce
- given a relation R(A, B, C), we want to compute γ_{A, θ(B)}(R)
Map Function:
- Map produces the grouping
- for each tuple (a, b, c), produce the key-value pair (a, b)
Reduce Function:
- Reduce produces the aggregation
- each key a represents a group
- apply the aggregation operator θ to the list [b_1, b_2, ..., b_n] of values associated with a
- the output is a pair (a, x), where x is the result of applying θ to the list
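
Grouping and aggregation can be sketched with θ passed in as an ordinary Python function. The helper `make_reduce`, the toy driver `run` and the sample relation are assumptions for illustration only.

```python
from collections import defaultdict

# Illustrative relation R(A, B, C).
R = [("x", 1, "p"), ("x", 2, "q"), ("y", 5, "r")]

def group_map(t):
    """Map: keep the grouping attribute A as key and B as value; C is dropped."""
    a, b, _c = t
    yield a, b

def make_reduce(theta):
    """Build a Reduce function that applies the aggregation operator theta."""
    def aggregate_reduce(a, values):
        yield a, theta(values)
    return aggregate_reduce

def run(records, mapper, reducer):
    """Toy driver: map, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

sums = dict(run(R, group_map, make_reduce(sum)))    # theta = SUM
counts = dict(run(R, group_map, make_reduce(len)))  # theta = COUNT
print(sums, counts)
```

With `theta=len` the reduce step plays the role of COUNT, as in the Friends(User, Friend) example.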

37 Hadoop Ecosystem: Hadoop Eco System (v1)
HBase
- open source, non-relational, distributed database modeled after Google's BigTable and written in Java
- provides a fault-tolerant way of storing large quantities of sparse data
Hive
- data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
- a data warehouse is a system used for reporting and data analysis
Pig
- high-level platform for creating programs that run on Apache Hadoop (the language is called Pig Latin)
- abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level

41 Hadoop Ecosystem: Hadoop Eco System (v2)
YARN (Yet Another Resource Negotiator)
- is a cluster management system for running Big Data applications on a cluster; it is not a data processing platform itself but enables such platforms to run their code in a cluster environment
Spark, Flink, Storm
- are frameworks for cluster computing specializing in batch processing, stream processing, or both (see next slide)
Giraph
- utilizes Apache Hadoop's MapReduce implementation to process graphs
Impala
- is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop

42 Hadoop Ecosystem: Data Processing

43 Hadoop Ecosystem: Java(-ish) is the Hadoop language

44 Hadoop Ecosystem: The good, the bad and the ugly
The Good
- easy to write parallel, highly scalable applications
- stream and batch processing
- seamless integration with other systems (e.g. RDBMS)
The Bad
- rapid development, hard to keep an overview
- 150 projects in (or near) the Hadoop Eco System
The Ugly
- Maven dependency hell if integrated with other systems
- Spark depends on > 50 libraries, each with a specific version!

45 Hadoop Ecosystem: History of Hadoop

46 Hadoop Ecosystem: System Design

50 Hadoop Ecosystem: Big Data Storage Technologies
File-based: HDFS
- distributed, permanent file storage
- tuned for large files
- no indexing
Key-Value based: HBase
- distributed Key/Value store
- fast look-ups
- based on HDFS
Message-based: Kafka
- distributed Producer/Consumer messaging system
- data partitioned in topics
- producer groups / consumer groups

51 Hadoop Ecosystem: Big Data Storage Technologies: HDFS

53 Hadoop Ecosystem: Big Data Storage Technologies: HBase

55 Hadoop Ecosystem: Big Data Storage Technologies: Kafka

59 Hadoop Ecosystem: To Sum Up
Part 1: handling big data
- key elements: MapReduce, distributed file system
Part 2: maximizing parallelism
- input data skew
Part 3: applications
- Hadoop ecosystem

60 The End
Next: Graph databases, Nov. 28th