MapReduce examples. CSE 344 section 8 worksheet. May 19, 2011



In today's section, we will cover some more examples of using MapReduce to implement relational queries. Recall how MapReduce works from the programmer's perspective:

1. The input is a set of (key, value) pairs.

2. The map function is run on each (key, value) pair, producing a bag of intermediate (key, value) pairs:

       map(inkey, invalue):
           // do some processing on (inkey, invalue)
           emit_intermediate(hkey1, hvalue1)
           emit_intermediate(hkey2, hvalue2)
           // ...

3. The MapReduce implementation groups the intermediate (key, value) pairs by the intermediate key. Despite the name, this grouping is very different from the grouping operator of relational algebra, or the GROUP BY clause of SQL. Instead of producing only the grouping key and the aggregate values, if any, MapReduce grouping also outputs a bag containing all the values associated with each value of the grouping key. In addition, grouping is separated from the aggregate computation, which goes in the reduce function.

4. The reduce function is run on each distinct intermediate key, along with a bag of all the values associated with that key. It produces a bag of final values:

       reduce(hkey, hvalues[]):
           // do some processing on hkey and each element of hvalues[]
           emit(fvalue1)
           emit(fvalue2)
           // ...

Suppose you have two relations, R(a, b) and S(b, c), whose contents are stored as files on disk. Before you can process these relations using MapReduce, you need to parse each tuple in each relation as a (key, value) pair. This task is relatively simple for our two relations, which have only two attributes each. Let's treat R.a as the key for tuples in R (note that this does not have to be a true key in the relational sense), and similarly, let's treat S.b as the key for S.

For the values of the (key, value) pairs, we won't use just the single non-key attribute. Instead, we'll use a composite value consisting of both attributes, plus a tag indicating which relation the value came from. So, for relation R, the values will have three components: value.tag, which is always the string "R", and value.a and value.b for the two attributes of R. Similarly, relation S's tuples will have MapReduce values containing value.tag, which is always "S", and value.b and value.c for S's attributes. This convention allows us to use the same map function for tuples from both relations; we'll see the use of that soon.

Given this notation for the (key, value) pairs of R and S, let's try to write pseudocode algorithms for the following relational operations:

1. Selecting tuples from R: σ_{a<10}(R)

In this simple example, all the work is done in the map function, where we copy the input to the intermediate data, but only for tuples that meet the selection condition:

    map(inkey, invalue):
        if inkey < 10:
            emit_intermediate(inkey, invalue)

Reduce then simply outputs all the values it is given:

    reduce(hkey, hvalues[]):
        for each t in hvalues:
            emit(t)
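The selection pseudocode above can be checked with a small Python simulation, using the worksheet's convention that the map key is R.a and the value is a tagged tuple. This is an illustrative sketch, not a real MapReduce program; the driver loop stands in for the framework.

```python
from collections import defaultdict

# map: keep only tuples satisfying a < 10.
def sel_map(inkey, invalue):
    if inkey < 10:
        yield (inkey, invalue)

# reduce: pass every value through unchanged.
def sel_reduce(hkey, hvalues):
    for t in hvalues:
        yield t

# R(a, b) as (key, value) pairs: key = a, value = (tag, a, b).
R = [(3, ("R", 3, "x")), (15, ("R", 15, "y")), (7, ("R", 7, "z"))]

# Inlined group-and-reduce driver, standing in for the framework.
groups = defaultdict(list)
for k, v in R:
    for ik, iv in sel_map(k, v):
        groups[ik].append(iv)
output = [t for ik in groups for t in sel_reduce(ik, groups[ik])]
print(output)  # only the tuples with a < 10 survive
```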

2. Eliminating duplicates from R: δ(R)

For this problem we will use a simple trick, described in your textbook: duplicate elimination in the bag relational algebra is equivalent to grouping on all attributes of the relation. MapReduce does grouping for us, so all we need to do is make the entire tuple the intermediate key to group on:

    map(inkey, invalue):
        // We won't use the intermediate value,
        // so we just put in a dummy value
        emit_intermediate(invalue, "abc")

Once we do that, we just output each intermediate key as a final value:

    reduce(hkey, hvalues[]):
        emit(hkey)

3. Natural join of R and S: R ⋈_{R.b=S.b} S

The map function outputs the same value as its input, but changes the key to always be the join attribute b:

    map(inkey, invalue):
        emit_intermediate(invalue.b, invalue)

Then, after the MapReduce system groups the intermediate data by the intermediate key, i.e. the b values, we use the reduce function to do a nested-loop join over each group. Because all the values in a group share the same join attribute, we don't need to check the join attribute in the nested loop. We do need to check which relation each tuple comes from, so that (for example) we don't join a tuple from R with itself, or with another R tuple:

    reduce(hkey, hvalues[]):
        for each r in hvalues:
            for each s in hvalues:
                if r.tag = "R" and s.tag = "S":
                    emit(r.a, r.b, s.c)

Note that the analogy with the nested-loop join breaks down in another way: instead of reading the relations directly from disk, we have used MapReduce to ship the entire contents of R and S grouped by the join attribute b, and the reduce function then compares every pair of tuples within each group. So this is actually a rather inefficient way to compute a join.
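The reduce-side join above can be sketched in plain Python. We represent the tagged values from the worksheet's convention as dicts; the driver loop again stands in for the framework, so treat this as an illustration, not a real MapReduce program.

```python
from collections import defaultdict

def join_map(_, v):
    # Both R and S values carry a "b" field, so one map function
    # works for tuples from either relation.
    yield (v["b"], v)

def join_reduce(b, values):
    # Nested-loop join within one group. Every value in the group has
    # the same b by construction, so only the tags need checking.
    for r in values:
        for s in values:
            if r["tag"] == "R" and s["tag"] == "S":
                yield (r["a"], r["b"], s["c"])

R = [{"tag": "R", "a": 1, "b": 10}, {"tag": "R", "a": 2, "b": 20}]
S = [{"tag": "S", "b": 10, "c": "x"}, {"tag": "S", "b": 10, "c": "y"}]

# Group by the join attribute, then reduce each group.
groups = defaultdict(list)
for v in R + S:
    for k, iv in join_map(None, v):
        groups[k].append(iv)
joined = [t for b in groups for t in join_reduce(b, groups[b])]
print(sorted(joined))  # [(1, 10, 'x'), (1, 10, 'y')]
```

The R tuple with b = 20 finds no S partner, so its group produces nothing, just as in a real join.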

4. Three-way natural join: R ⋈_{R.b=S.b} S ⋈_{S.c=T.c} T, where we introduce a new relation T(c, d).

One way to do this join is to split it into two MapReduce jobs. The first job does one of the two-way joins, using the previous algorithm for a single natural join. The second job then joins the result of the first job with the remaining relation. This approach gives two possible implementations, depending on which pair of relations we join first. One way to judge which implementation is better is to look at the number of tuples in each relation, and compare the sizes of the first job's output under each implementation. The smaller first-join output is considered better because, in our naive view, smaller is faster: smaller data sets mean less network traffic, which is the real bottleneck of a distributed system like MapReduce (although the gap is narrowing rapidly).

5. Grouped and aggregated join: γ_{a, sum(c) → s}(R ⋈_{R.b=S.b} S)

As with the three-way join, we'll probably want to use more than one MapReduce job for this problem. One approach uses two jobs: the first does the two-way join as in the problem above, and the second does all the grouping and aggregation. Grouping is done automatically by MapReduce; all we have to do is output the grouping attribute(s) as the intermediate key in map:

    map(inkey, invalue):
        emit_intermediate(invalue.a, invalue)

Then we can compute the sum aggregate in reduce:

    reduce(hkey, hvalues[]):
        sum := 0
        for each t in hvalues:
            sum += t.c
        emit(hkey, sum)
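The second job of the two-job plan (grouping on a and summing c) can be sketched in Python as follows. The input tuples (a, b, c) represent the output of the join job; the driver loop stands in for the framework's grouping step.

```python
from collections import defaultdict

def agg_map(_, t):
    a, b, c = t
    yield (a, t)   # the grouping attribute becomes the intermediate key

def agg_reduce(a, tuples):
    total = 0
    for (_, _, c) in tuples:
        total += c
    yield (a, total)

# Pretend output of the join job: (a, b, c) tuples.
join_output = [(1, 10, 5), (1, 20, 7), (2, 10, 3)]

groups = defaultdict(list)
for t in join_output:
    for k, v in agg_map(None, t):
        groups[k].append(v)
result = [r for a in groups for r in agg_reduce(a, groups[a])]
print(sorted(result))  # [(1, 12), (2, 3)]
```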

Another solution approach is based on the observation that you don't need to carry around all the tuples of the Cartesian product to compute the aggregate value sum(c). You do need all three distinct attributes, because the same tuple from S may appear many times in the join, due to several tuples in R having the same b value. But you don't need every possible combination of values of all three attributes. Instead, you can do the grouping and aggregate computation on S alone, grouping on S.b instead of R.a. Then join the grouped results with R. Finally, repeat the grouping and aggregation (to take care of those repeated S tuples), but with R.a as the (correct) grouping attribute. This sounds more complicated, and it is, using three MapReduce jobs instead of two. Can you think of a scenario where this 3-job algorithm might actually be faster than the 2-job algorithm?
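The three-job plan can be collapsed into ordinary Python to check that it computes the same answer as the two-job plan. Each dictionary pass below stands in for one MapReduce job; the data is made up for illustration.

```python
from collections import defaultdict

R = [("r1", 10), ("r2", 10), ("r3", 20)]   # (a, b) pairs
S = [(10, 1), (10, 2), (20, 5), (30, 9)]   # (b, c) pairs

# Job 1: group S on b and sum c, producing one partial sum per b.
partial = defaultdict(int)
for b, c in S:
    partial[b] += c

# Job 2: join the partial sums with R on b.
joined = [(a, partial[b]) for a, b in R if b in partial]

# Job 3: regroup on a, the correct grouping attribute. Each a may
# match several b values, so the sums must be combined once more.
answer = defaultdict(int)
for a, s in joined:
    answer[a] += s

print(sorted(answer.items()))  # [('r1', 3), ('r2', 3), ('r3', 5)]
```

Note that job 1 shrinks S to one tuple per distinct b before anything is joined, which hints at the answer to the question above: when S is very large but has few distinct b values, the 3-job plan moves far less data.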