Processing Joins over Big Data in MapReduce Christos Doulkeridis Department of Digital Systems School of Information and Communication Technologies University of Piraeus http://www.ds.unipi.gr/cdoulk/ cdoulk@unipi.gr 12th Hellenic Data Management Symposium (HDMS '14), 24 July 2014
Aims and Scope Provide an overview of join processing in MapReduce/Hadoop, focusing on complex join types besides equi-joins, e.g., theta-joins, similarity joins, top-k joins, k-NN joins On top of MapReduce Binary joins Identify common techniques (at a high level) that can be used as building blocks when designing a parallel join algorithm 2
Not Covered in this Tutorial Distributed platforms for Big Data management other than Hadoop (e.g., Storm, NoSQL stores, etc.) Approaches that rely on changing the internal mechanism of Hadoop (e.g., changing data layouts, loop-aware Hadoop, etc.) Multi-way joins (e.g., check out the work by Afrati et al.) Check out related tutorials: MapReduce Algorithms for Big Data Analytics [K. Shim@VLDB12] Efficient Big Data Processing in Hadoop MapReduce [J. Dittrich@VLDB12] 3
Related Research Projects CloudIX: Cloud-based Indexing and Query Processing Funding program: MC-IEF, 2011-2013 https://research.idi.ntnu.no/cloudix/ Roadrunner: Scalable and Efficient Analytics for Big Data Funding program: ARISTEIA II, GSRT, 2014-2015 http://www.ds.unipi.gr/projects/roadrunner/ Open positions: Post-doc, PhD/MSc 4
MapReduce basics Outline Map-side and Reduce-side join Join types and algorithms Equi-joins Theta-joins Similarity joins Top-k joins knn joins Important phases of join algorithms Pre-processing, pre-filtering, partitioning, replication, load balancing Summary & open research directions 5
MapReduce Basics 6
MapReduce vs. RDBMS Source: T.White Hadoop: The Definitive Guide 7
Hadoop Distributed File System (HDFS) 8 Source: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
MapReduce A programming framework for parallel and scalable processing of huge datasets A job in MapReduce is defined as two separate phases Function Map (also called mapper) Function Reduce (also called reducer) Data is represented as key-value pairs Map: (k1, v1) → List(k2, v2) Reduce: (k2, List(v2)) → List(k3, v3) The data output by the Map phase are partitioned and sort-merged in order to reach the machines that compute the Reduce phase This intermediate phase is called Shuffle 9
[Figure: word-count example. Five documents (Doc1-Doc5) containing country names, e.g. Doc1: Brazil, Germany, Argentine; each mapper (M) emits (country, 1) pairs, the shuffle groups the values by key, and each reducer (R) sums them, producing e.g. (Brazil, 4), (Germany, 4), (Chile, 3), (Argentine, 1).] 10
Implementation (High Level)

mapper(filename, file-contents):
    for each word in file-contents:
        output(word, 1)

reducer(word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    output(word, sum)

11
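Not part of the original slides: a minimal runnable sketch of the word-count pseudocode above in plain Python (a local simulation with illustrative names, not Hadoop code), showing how the map output is grouped by key in the shuffle before the reducer runs.

from collections import defaultdict

def mapper(filename, file_contents):
    for word in file_contents.split():
        yield (word, 1)

def reducer(word, values):
    yield (word, sum(values))

def run_job(documents):
    # Map phase: apply the mapper to every (filename, contents) pair.
    map_output = []
    for filename, contents in documents.items():
        map_output.extend(mapper(filename, contents))
    # Shuffle phase: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in map_output:
        groups[key].append(value)
    # Reduce phase: apply the reducer once per distinct key, in key order.
    result = []
    for key in sorted(groups):
        result.extend(reducer(key, groups[key]))
    return result

if __name__ == "__main__":
    docs = {"Doc1": "Brazil Germany Argentine",
            "Doc2": "Chile Germany Brazil"}
    print(run_job(docs))
    # [('Argentine', 1), ('Brazil', 2), ('Chile', 1), ('Germany', 2)]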
MapReduce/Hadoop: 6 Basic Steps Source: C.Doulkeridis and K.Nørvåg. A Survey of Large-Scale Analytical Query Processing in MapReduce. In the VLDB Journal, June 2014. 12
1. Input Reader Reads data from files and transforms them into key-value pairs It is possible to support different input sources, such as a database or main memory The data forms splits, which are the unit of data handled by a map task A typical size of a split is the size of a block (default: 64 MB in HDFS, but customizable) [Figure: documents Doc1-Doc5 grouped into splits, each handled by one map task (M).] 13
2. Map Function Takes as input a key-value pair from the Input Reader Runs the code (logic) of the Map function on the pair Produces as result a new key-value pair The results of the Map function are placed in a buffer in memory and written to disk when the buffer fills up (e.g. at 80%), producing spill files The spill files are then merged into one sorted file 14
3. Combiner Function Its use is optional Used when the Map function produces many intermediate pairs with repeated keys and the Reduce function is commutative (a + b = b + a) and associative ((a + b) + c = a + (b + c)) In this case, the Combiner function performs partial merging so that pairs with the same key will be processed as a group by a Reduce task (see the sketch below) [Figure: the combiner (C) collapses a mapper's output, e.g. two (Brazil, 1) pairs become (Brazil, 2), before the shuffle to the reducer (R).] 15
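Not part of the original slides: a minimal sketch in plain Python (a local illustration, not Hadoop code) of what the combiner does, namely running the reduce logic on one mapper's local output so that only partial sums travel to the reducers. This is valid for word count because summation is commutative and associative.

from collections import defaultdict

def combine(map_output):
    # Partial aggregation on a single mapper's output (same logic as the reducer).
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    return list(partial.items())

# Output of one mapper before combining: 6 pairs.
mapper_output = [("Brazil", 1), ("Germany", 1), ("Argentine", 1),
                 ("Chile", 1), ("Brazil", 1), ("Germany", 1)]

# After combining, only 4 pairs are shuffled instead of 6.
print(combine(mapper_output))
# [('Brazil', 2), ('Germany', 2), ('Argentine', 1), ('Chile', 1)]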
4. Partition Function Default: a hash function is used to partition the results of Map tasks among Reduce tasks Usually, this works well for load balancing However, it is often useful to employ other partition functions Such a function can be user-defined (see the sketch below) [Figure: the keys Brazil, Germany, Argentine, Chile, Greece are distributed across two reducers (R) by the partition function.] 16
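Not part of the original slides: a minimal sketch in plain Python (illustrative only; in Hadoop this role is played by a Partitioner class) contrasting the default hash-style partitioning with a hypothetical user-defined range partitioner.

def hash_partition(key, num_reducers):
    # Default behaviour, conceptually: hash the key and take it modulo the number of reducers.
    return hash(key) % num_reducers

def range_partition(key, boundaries):
    # Hypothetical user-defined partitioner: assign keys to reducers by alphabetic range,
    # so each reducer receives a contiguous key range (useful e.g. for producing globally sorted output).
    for i, upper in enumerate(boundaries):
        if key <= upper:
            return i
    return len(boundaries)

keys = ["Brazil", "Germany", "Argentine", "Chile", "Greece"]
print([(k, hash_partition(k, 2)) for k in keys])
print([(k, range_partition(k, ["Chile"])) for k in keys])
# Range partitioning: reducer 0 gets keys up to "Chile", reducer 1 gets the rest.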
5. Reduce Function The Reduce function is called once for each distinct key and is applied to all values associated with that key All pairs with the same key will be processed as a group The input to each Reduce task is given in increasing key order It is possible for the user to define a comparator which will be used for sorting 17
6. Output Writer Responsible for writing the output to secondary storage (disk) Usually, this is a file However, it is possible to modify the function to store data, e.g., in a database 18
MapReduce Advantages Scalability Fault-tolerance Flexibility Ease of programming 19
10 Weaknesses/Limitations of MapReduce 1. Efficient access to data 2. High communication cost 3. Redundant processing 4. Recomputation 5. Lack of early termination 6. Lack of iteration 7. Fast computation of approximate results 8. Load balancing 9. Real-time processing 10. Lack of support for operators with multiple inputs + Systematic classification of existing work until 2012 (over 110 references) + Comparison of offered features + Summary of techniques used Source: C.Doulkeridis and K.Nørvåg. A Survey of Large-Scale Analytical Query Processing in MapReduce. In the VLDB Journal, June 2014. 20
M-way Operations in MapReduce Not naturally supported Other systems support such operations by design (e.g., Hyracks@UCI, Storm) How in MapReduce? Two families of techniques (high-level classification) Map-side join Reduce-side join 21
Map-side Join [Figure: map-side join of L(Uid, Time, Action) and R(Uid, Uname, Status) on L.uid = R.uid. Each mapper (M) joins one partition of L with the corresponding partition of R and directly outputs the joined (Uname, Action) pairs, e.g. (cdoulk, Update), (cdoulk, Read).] 22
Map-side Join Operates on the map side, no reduce phase Requirements for each input: Divided into the same number of partitions Sorted by the join key All records for a given key reside in the same partition Advantages Very efficient (no intermediate data, no shuffling, simply scan and merge-join; see the sketch below) Disadvantages Very strict requirements, in practice an extra MR job is necessary to prepare the inputs High memory requirements (buffers both input partitions) Source: T. White, Hadoop: The Definitive Guide 23
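Not part of the original slides: a minimal sketch in plain Python (illustrative data layout, not Hadoop code) of the merge step that a single map task performs over one pair of co-partitioned, key-sorted inputs; no shuffle or reduce phase is needed.

def map_side_merge_join(l_part, r_part):
    # Both partitions are assumed sorted by the join key (first field) and
    # co-partitioned, as the map-side join requires.
    results, i, j = [], 0, 0
    while i < len(l_part) and j < len(r_part):
        l_key, r_key = l_part[i][0], r_part[j][0]
        if l_key < r_key:
            i += 1
        elif l_key > r_key:
            j += 1
        else:
            # Join all L tuples with this key against all R tuples with this key.
            i_end = i
            while i_end < len(l_part) and l_part[i_end][0] == l_key:
                i_end += 1
            j_end = j
            while j_end < len(r_part) and r_part[j_end][0] == r_key:
                j_end += 1
            for l_tuple in l_part[i:i_end]:
                for r_tuple in r_part[j:j_end]:
                    results.append((r_tuple[1], l_tuple[2]))   # (Uname, Action)
            i, j = i_end, j_end
    return results

L1 = [(1, "19:45", "Update"), (1, "21:12", "Read")]   # one partition of L, sorted by uid
R1 = [(1, "cdoulk", "Lecturer")]                      # the matching partition of R
print(map_side_merge_join(L1, R1))                    # [('cdoulk', 'Update'), ('cdoulk', 'Read')]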
Reduce-side Join [Figure: reduce-side join of L(Uid, Time, Action) and R(Uid, Uname, Status) on uid. Each mapper (M) tags its tuples with the source relation and emits them keyed by uid, e.g. (1, L:Update), (1, R:cdoulk); the shuffle brings all tuples with the same uid to the same reducer (R), which produces the joined (Uname, Action) pairs.] 24
Reduce-side Join a.k.a. repartition join The join computation is performed on the reduce side (see the sketch below) Advantages The most general method (similar to a traditional parallel join) Smaller memory footprint (only the tuples of one dataset with the same key are buffered) Disadvantages Cost of shuffling: I/O and communication costs for transferring data to reducers High memory requirements for skewed data Source: T. White, Hadoop: The Definitive Guide 25
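Not part of the original slides: a minimal sketch in plain Python (a local simulation, not Hadoop code) of a repartition join. Mappers tag each tuple with its source relation and emit it under the join key, the shuffle groups both relations' tuples by key, and the reducer buffers the R tuples of a key while streaming the L tuples.

from collections import defaultdict

def map_L(tup):                      # tup = (uid, time, action)
    yield (tup[0], ("L", tup[2]))    # tag with source relation

def map_R(tup):                      # tup = (uid, uname, status)
    yield (tup[0], ("R", tup[1]))

def reduce_join(key, tagged_values):
    # Buffer only the (usually smaller) R side, stream the L side.
    r_side = [v for tag, v in tagged_values if tag == "R"]
    for tag, v in tagged_values:
        if tag == "L":
            for uname in r_side:
                yield (uname, v)     # (Uname, Action)

L = [(1, "19:45", "Update"), (2, "21:11", "Read"), (1, "21:12", "Read")]
R = [(1, "cdoulk", "Lecturer"), (2, "jimper0", "Postgrad")]

groups = defaultdict(list)           # simulated shuffle: group tagged tuples by uid
for t in L:
    for k, v in map_L(t):
        groups[k].append(v)
for t in R:
    for k, v in map_R(t):
        groups[k].append(v)

for k in groups:
    print(list(reduce_join(k, groups[k])))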
Join Types and Algorithms 26
Example: Equi-Join

L: Uid  Time   Action          R: Uid  Uname      Status
   1    19:45  Update             1    cdoulk     Lecturer
   2    21:11  Read               2    jimper0    Postgrad
   1    21:12  Read               3    spirosoik  Postgrad
   3    23:09  Insert
   2    23:45  Write

Query: L ⋈ R on L.uid = R.uid 27
Equi-Joins: Broadcast Join Map-only job Assumes that |R| << |L| [Figure: R is read from the DFS and replicated to every mapper (M); each mapper joins its local split Li of L with the full copy of R, producing Li ⋈ R.] Blanas et al. SIGMOD 10 28
Equi-Joins: Broadcast Join Advantages Efficient: map phase only Load the small dataset R in a hash table: faster probes No pre-processing (e.g., sorting) Disadvantages One dataset must be quite small: it must fit in memory and be distributed to all mappers (see the sketch below) 29
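Not part of the original slides: a minimal sketch in plain Python (illustrative only) of the map-only broadcast join. The small relation R is loaded into an in-memory hash table on every mapper, and each map call over L simply probes it.

# Small relation R, assumed to fit in memory and to be shipped to every mapper.
R = [(1, "cdoulk", "Lecturer"), (2, "jimper0", "Postgrad"), (3, "spirosoik", "Postgrad")]
r_hash = {uid: uname for uid, uname, status in R}     # built once per mapper

def broadcast_join_mapper(l_tuple):
    # l_tuple = (uid, time, action); probe the in-memory hash table on the join key.
    uid, _, action = l_tuple
    if uid in r_hash:
        yield (r_hash[uid], action)                   # (Uname, Action)

L_split = [(1, "19:45", "Update"), (2, "21:11", "Read"), (3, "23:09", "Insert")]
for t in L_split:
    print(list(broadcast_join_mapper(t)))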
Equi-Joins: Semi-join [Figure: semi-join in three jobs. (1) Extract the distinct join keys L.uid; (2) filter R against these keys, so that the result R' contains only those tuples of R whose key appears in L.uid; (3) broadcast-join R' with the splits of L to produce Li ⋈ R.] Blanas et al. SIGMOD 10 30
Equi-Joins: Semi-join Advantages When R is large, reduces the size of intermediate data and the communication cost Useful when many tuples of one dataset do not join with tuples of the other dataset Disadvantages Three MR jobs: overheads of job initialization, communication, local and remote I/O Multiple accesses of the datasets (see the sketch below) 31
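Not part of the original slides: a minimal sketch in plain Python (a local simulation of the three jobs, each of which is a full MR job in practice) of the semi-join idea: extract the distinct join keys of L, filter R down to the tuples that actually have a match, and broadcast-join the reduced R with L.

# Job 1: extract the distinct join keys of L.
L = [(1, "19:45", "Update"), (2, "21:11", "Read"), (1, "21:12", "Read")]
l_keys = {uid for uid, _, _ in L}

# Job 2: filter R against the key set, keeping only tuples that will join.
R = [(1, "cdoulk", "Lecturer"), (2, "jimper0", "Postgrad"),
     (3, "spirosoik", "Postgrad"), (4, "nobody", "Postgrad")]
R_filtered = [r for r in R if r[0] in l_keys]          # R' = R semi-join L

# Job 3: broadcast-join the (now smaller) R' with the splits of L.
r_hash = {uid: uname for uid, uname, _ in R_filtered}
result = [(r_hash[uid], action) for uid, _, action in L if uid in r_hash]

print(R_filtered)   # the non-matching R tuples are never shipped to the mappers of L
print(result)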
Equi-Joins: Per-Split Semi-join [Figure: per-split semi-join. (1) Extract the distinct join keys Li.uid of each split Li of L; (2) for each split compute R_Li, the records of R whose key appears in Li.uid (mappers output records of R tagged with the splits they join with); (3) join each split Li with its R_Li to produce Li ⋈ R.] Blanas et al. SIGMOD 10 32
Equi-Joins: Per-Split Semi-join Advantages (compared to Semi-join) Makes the 3rd phase cheaper, as it moves only the records of R that will join with each split of L Disadvantages (compared to Semi-join) The first two phases are more complicated 33
Other Approaches for Equi-Joins Map-Reduce-Merge [Yang et al. SIGMOD 07] Map-Join-Reduce [Jiang et al. TKDE 11] Llama [Lin et al. SIGMOD 11] Multi-way Equi-Join [Afrati et al. TKDE 11] 34
Example: Theta-Join S(Uid, Time, Action) and T(Uid, Uname, Status), with the same contents as L and R in the equi-join example Query: S ⋈ T on S.uid <= T.uid Challenge: How to map the inequality join condition to a key-equality based computing paradigm 35
Join Matrix Representation M(i,j)=true, if the i-th tuple of S and the j-th tuple of T satisfy the join condition Problem: find a mapping of join matrix cells to reducers that minimizes job completion time Each join output tuple should be produced by exactly one reducer 36 Okcan et al. SIGMOD 11
Matrix-to-Reducer Mapping [Figure: a 6x6 join matrix over tuples with join-attribute values 5, 7, 7, 7, 8, 9, with its cells assigned to three reducer regions (1, 2, 3) under three strategies.] Standard mapping: max-reducer-input = 5, max-reducer-output = 6 Random mapping: max-reducer-input = 8, max-reducer-output = 4 Balanced mapping: max-reducer-input = 5, max-reducer-output = 4 Okcan et al. SIGMOD 11 37
MapReduce Algorithm: 1-Bucket-Theta Given a matrix-to-reducer mapping: For each incoming S-tuple, say x ∈ S Assign a random matrix row from 1..|S| For all regions that correspond to this row Output (regionID, (x, "S")) For each incoming T-tuple, say y ∈ T Assign a random matrix column from 1..|T| For all regions that correspond to this column Output (regionID, (y, "T")) (see the sketch below) Okcan et al. SIGMOD 11 38
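Not part of the original slides: a minimal sketch in plain Python (a hypothetical 2x2 tiling of a 6x6 join matrix; a local illustration, not the paper's implementation) of the 1-Bucket-Theta map logic above. Every S tuple is replicated to all regions intersecting a randomly chosen row and every T tuple to all regions intersecting a randomly chosen column, so each matrix cell, and hence each output pair, is covered by exactly one reducer, which then evaluates the actual theta condition.

import random

# Hypothetical tiling: region id -> (row range, column range) of the join matrix.
REGIONS = {0: (range(1, 4), range(1, 4)),
           1: (range(1, 4), range(4, 7)),
           2: (range(4, 7), range(1, 4)),
           3: (range(4, 7), range(4, 7))}

def map_S(x, s_size=6):
    row = random.randint(1, s_size)                    # random matrix row in 1..|S|
    for rid, (rows, _) in REGIONS.items():
        if row in rows:
            yield (rid, (x, "S"))                      # replicate x to all regions of this row

def map_T(y, t_size=6):
    col = random.randint(1, t_size)                    # random matrix column in 1..|T|
    for rid, (_, cols) in REGIONS.items():
        if col in cols:
            yield (rid, (y, "T"))

def reduce_theta(rid, tagged, theta=lambda s, t: s <= t):
    # Each reducer evaluates the join condition on its region's S and T tuples.
    s_side = [v for v, tag in tagged if tag == "S"]
    t_side = [v for v, tag in tagged if tag == "T"]
    return [(s, t) for s in s_side for t in t_side if theta(s, t)]

print(list(map_S(5)))   # the S tuple 5 is sent to the two regions covering its random row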
In the paper How to perform near-optimal partitioning with strong optimality guarantees for the cross-product S×T More efficient algorithms (M-Bucket-I, M-Bucket-O) when more statistics are available Okcan et al. SIGMOD 11 39
Other Approaches for Theta-Joins Multi-way Theta-joins [Zhang et al. VLDB 12] Binary Theta-joins [Koumarelas et al. EDBTW 14] 40
Example: (Set) Similarity Join L and R contain records with a set-valued attribute a (L: 1:{A,B,C}, 2:{D,E,F}, 3:{C,F}, 4:{E,C,D}, 5:{B,A,F}; R: 1:{G,H,I}, 2:{A,F,G}, 3:{B,D,F}) Query: L ⋈ R on sim(L.a, R.a) > τ Example (strings): S1: I will call back → [I, will, call, back] S2: I will call you soon → [I, will, call, you, soon] e.g. sim(): Jaccard, τ = 0.4 sim(S1, S2) = 3 / 6 = 0.5 > 0.4 41
Background: Set Similarity Joins Rely on filters for efficiency, such as prefix filtering Global token ordering: sort tokens in increasing order of frequency Jaccard(s1, s2) >= τ  ⇔  Overlap(s1, s2) >= α, where α = τ / (1 + τ) * (|s1| + |s2|) Similar records need to share at least one common token in their prefixes Example: s1 = {A, B, C, D, E} Prefix length of s1: |s1| - ceil(τ * |s1|) + 1 For τ = 0.9: 5 - ceil(0.9 * 5) + 1 = 1, prefix of s1 is {A} For τ = 0.8: 5 - ceil(0.8 * 5) + 1 = 2, prefix of s1 is {A, B} For τ = 0.6: 5 - ceil(0.6 * 5) + 1 = 3, prefix of s1 is {A, B, C} (see the sketch below) 42
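Not part of the original slides: a minimal sketch in plain Python of the prefix-filter reasoning above (the global token ordering here is a hypothetical one). Tokens are globally ordered by increasing frequency, each record keeps a prefix of length |s| - ceil(τ·|s|) + 1, and only record pairs whose prefixes share at least one token become candidates for exact Jaccard verification.

import math

def prefix(tokens, order, tau):
    # Sort the record's tokens by the global (increasing-frequency) order
    # and keep the first |s| - ceil(tau * |s|) + 1 of them.
    s = sorted(tokens, key=order.index)
    plen = len(s) - math.ceil(tau * len(s)) + 1
    return set(s[:plen])

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

order = ["A", "B", "C", "D", "E", "F"]        # hypothetical global token ordering
tau = 0.6
s1, s2 = {"A", "B", "C", "D", "E"}, {"B", "C", "D", "E", "F"}

# Candidate check: prefixes must overlap, otherwise the pair can be pruned safely.
if prefix(s1, order, tau) & prefix(s2, order, tau):
    print("candidate pair, verify:", jaccard(s1, s2) >= tau)
else:
    print("pruned without computing Jaccard")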
Set Similarity Joins Solves the problem of Set Similarity Self-Join in 3 stages Can be implemented as 3 (or more) MR jobs #1: Token Ordering #2: RID-pair Generation #3: Record Join Vernica et al. SIGMOD 10 43
Set Similarity Joins Stage 1: Token Ordering Vernica et al. SIGMOD 10 44
Set Similarity Joins Stage 2: RID-pair Generation Vernica et al. SIGMOD 10 45
Set Similarity Joins Stage 3: Record Join Vernica et al. SIGMOD 10 46
Other Approaches for Similarity Joins V-SMART-Join [Metwally et al. VLDB 12] SSJ-2R [Baraglia et al. ICDM 10] Silva et al. [Cloud-I 12] Top-k similarity join [Kim et al. ICDE 12] Fuzzy join [Afrati et al. ICDE 12] 47
Example: Top-k Join

L: Supplier  Product  Discount     R: Customer  Product  Price
   S1        Monitor  100             C1        Monitor  800
   S2        CPU      160             C2        CPU      240
   S3        CPU      120             C3        HDD      500
   S4        HDD      190             C4        CPU      550
   S5        HDD      90              C5        HDD      700
   S6        DVD-RW   200             C6        Monitor  600

Query: Top-k(L ⋈ R on L.Product = R.Product) 48
RanKloud Basic ideas Generate statistics at runtime (adaptive sampling) Calculate threshold that identifies candidate tuples for inclusion in the top-k set Repartition data to achieve load balancing Cannot guarantee completeness May produce fewer than k results Probably requires 2 MapReduce jobs Candan et al. IEEE Multimedia 2011 49
Top-k Joins: Our Approach Basic features Exploit histograms built seamlessly during data upload to HDFS Compute a bound and access only part of the input data (Map phase) Perform the top-k join in a load-balanced way (Reduce phase) (working paper) Doulkeridis et al. ICDE 12, CloudI 12@VLDB 50
Other Approaches for Top-k Joins BFHM Rank-Join in HBase [Ntarmos et al. VLDB 14] Tomorrow, HDMS session 5, 10:50-12:30 51
Example: kNN-Join

R: Rid  X   Y        S: Sid  X   Y
   1    10  11          1    16  2
   2    2   19          2    43  5
   3    22  1           3    45  76
   4    9   12
   5    45  3

knnJ(R, S) = {(r, knn(r, S)) for all r ∈ R} 52
kNN Joins in MapReduce Straightforward solution: block nested loop join (H-BNLJ) 1st MapReduce job Partition R and S into n equi-sized blocks each Every pair of blocks (one of R, one of S) is assigned to a bucket (n² buckets) Invoke r = n² reducers to do the kNN join (producing n·k candidates for each record of R) 2nd MapReduce job Partition the output records of the 1st phase by record ID In the reduce phase, sort in ascending order of distance Then, report the top-k results for each record (see the sketch below) Improvement (H-BRJ): build an index on the local S block in the reducer, to find kNN efficiently Zhang et al. EDBT 12 53
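Not part of the original slides: a minimal sketch in plain Python (a local simulation of the block-nested-loop idea, not the actual Hadoop implementation) of H-BNLJ. R and S are cut into n blocks each, every (R block, S block) pair would be handled by one reducer producing k candidate neighbours per R record, and a second grouping step per record id keeps the k closest of the n·k candidates.

from collections import defaultdict
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def knn_join(R, S, k, n):
    r_blocks = [R[i::n] for i in range(n)]            # n roughly equi-sized blocks
    s_blocks = [S[i::n] for i in range(n)]
    candidates = defaultdict(list)
    # Job 1: one "reducer" per (R block, S block) pair produces k local candidates per record.
    for rb in r_blocks:
        for sb in s_blocks:
            for rid, rx, ry in rb:
                local = sorted(sb, key=lambda s: dist((rx, ry), (s[1], s[2])))[:k]
                candidates[rid].extend((dist((rx, ry), (s[1], s[2])), s[0]) for s in local)
    # Job 2: group the n*k candidates by record id, sort by distance, keep the k nearest.
    return {rid: sorted(cands)[:k] for rid, cands in candidates.items()}

R = [(1, 10, 11), (2, 2, 19), (3, 22, 1), (4, 9, 12), (5, 45, 3)]
S = [(1, 16, 2), (2, 43, 5), (3, 45, 76)]
print(knn_join(R, S, k=2, n=2))   # for each Rid: the 2 nearest (distance, Sid) pairs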
H-zkNNJ Exact solution is too costly! Go for an approximate solution instead Solves the problem in 3 phases #1: Construction of shifted copies of R and S #2: Compute candidate k-NN for each record r ∈ R #3: Determine the actual k-NN for each r (the underlying z-value idea is sketched below) Zhang et al. EDBT 12 54
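Not part of the original slides: a minimal sketch in plain Python of the z-value (Morton code) idea that H-zkNNJ builds on; the shifted copies, the quantile-based partitioning and the accuracy guarantees of the actual algorithm are not shown. A z-value maps a multi-dimensional point to one dimension by interleaving the bits of its coordinates, so points close in space tend to be close in z-order, and approximate kNN candidates can be found by scanning a few neighbours along the z-order.

def z_value(x, y, bits=8):
    # Interleave the bits of the two coordinates (Morton code).
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def z_neighbours(r, S, gamma):
    # Approximate candidate set: the gamma points of S on either side of r in z-order.
    s_sorted = sorted(S, key=lambda p: z_value(p[1], p[2]))
    zr = z_value(r[1], r[2])
    pos = sum(1 for p in s_sorted if z_value(p[1], p[2]) < zr)
    return s_sorted[max(0, pos - gamma):pos + gamma]

R = [(1, 10, 11), (2, 2, 19), (4, 9, 12)]
S = [(1, 16, 2), (2, 43, 5), (3, 45, 76)]
print([(r[0], z_neighbours(r, S, gamma=1)) for r in R])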
H-zkNNJ: Phase 1 Construct the shifted copies Ri, Si (stored locally) and estimate quantiles A[0]..A[n] for partitioning Ri, Si Zhang et al. EDBT 12 55
H-zkNNJ: Phase 2 α*n reducers Zhang et al. EDBT 12 56
H-zkNNJ: Phase 3 Output of Phase 2: for each record r ∈ R, a set of candidates C(r) Compute knn(r, C(r)), which is trivial Zhang et al. EDBT 12 57
Other Approaches for kNN Joins [Lu et al. VLDB 12] Partitioning based on Voronoi diagrams 58
Important Phases of Join Algorithms How to boost the speed of my join algorithm? 59
Pre-processing Includes techniques such as Ordering [Vernica et al. SIGMOD 10] Global token ordering Sampling [Kim et al. ICDE 12] Upper bound on the distance of k-th closest pair Frequency computation [Metwally et al. VLDB 12] Partial results and cardinalities of items Columnar storage, sorted [Lin et al. SIGMOD 11] Enables map-side join processing 60
Pre-filtering Reduces the candidate input tuples to be processed Essential pairs [Kim et al. ICDE 12] Partitioning only those pairs necessary for producing top-k most similar pairs 61
Partitioning More advanced partitioning schemes than hashing (application-specific) Theta-joins [Okcan et al. SIGMOD 11] Map the join matrix to the reduce tasks knn joins [Lu et al. VLDB 12] Z-ordering for 1-dimensional mapping Range partitioning Top-k joins Partitioning schemes tailored to top-k queries Utility-aware partitioning [RanKloud] Angle-based partitioning [Doulkeridis et al. CloudI 12] 62
Replication/Duplication In certain parallel join algorithms, replication of records is unavoidable How can we control the amount of replication? Set-similarity join [Vernica et al. SIGMOD 10] Grouped tokens Multi-way theta joins [Zhang et al. VLDB 12] Minimize duplication of records to partitions Use a multidimensional space partitioning method to assign partitions of the join space to reducers 63
Load Balancing Data skew causes unbalanced allocation of work to reducers (lack of a built-in data-aware load balancing mechanism) Frequency-based assignment of keys to reducers [Vernica et al. SIGMOD 10] Cardinality-based assignment [Metwally et al. VLDB 12] Minimize job completion time [Okcan et al. SIGMOD 11] Approximate quantiles [Lu et al. VLDB 12] Partition the input relations into equi-sized partitions (see the sketch below) 64
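Not part of the original slides: a minimal sketch in plain Python (illustrative greedy heuristic, not any specific paper's algorithm) of frequency-based key assignment for load balancing. Instead of hash partitioning, keys are sorted by their (sampled) frequency and greedily assigned to the currently least-loaded reducer, so that a couple of heavy keys do not overload a single reducer.

import heapq

def balanced_assignment(key_frequencies, num_reducers):
    # Greedy longest-processing-time heuristic: heaviest keys first,
    # each key goes to the reducer with the smallest current load.
    loads = [(0, r) for r in range(num_reducers)]      # (current load, reducer id)
    heapq.heapify(loads)
    assignment = {}
    for key, freq in sorted(key_frequencies.items(), key=lambda kv: -kv[1]):
        load, reducer = heapq.heappop(loads)
        assignment[key] = reducer
        heapq.heappush(loads, (load + freq, reducer))
    return assignment, sorted(loads)

freqs = {"Brazil": 40, "Germany": 35, "Chile": 5, "Greece": 3, "Japan": 2}
print(balanced_assignment(freqs, num_reducers=2))
# Loads end up around 43 vs 42, far better than a hash partitioner that
# might co-locate the two heavy keys Brazil and Germany (75 vs 10).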
Analysis of Join Processing Source: C.Doulkeridis and K.Nørvåg. A Survey of Large-Scale Analytical Query Processing in MapReduce. In the VLDB Journal, June 2014. 65
Summary and Open Research Directions 66
Summary Processing joins over Big Data in MapReduce is a challenging task Especially for complex join types, large-scale data, skewed datasets, etc. However, focus on the techniques(!), not the platform New techniques are required Old techniques (parallel joins) need to be revisited to accommodate the needs of new platforms (such as MapReduce) Window of opportunity to conduct research in an interesting topic 67
Trends Scalable and efficient platforms MapReduce improvements, new platforms (lots of research projects worldwide), etc. Declarative querying instead of programming language Hive, Pig (over MapReduce), Trident (over Storm) Real-time and interactive processing Data visualization Main-memory databases 68
References [Blanas et al. SIGMOD 10] A comparison of join algorithms for log processing in MapReduce. [Yang et al. SIGMOD 07] Map-reduce-merge: simplified relational data processing on large clusters. [Jiang et al. TKDE 11] MAP-JOIN-REDUCE: Toward scalable and efficient data analysis on large clusters. [Lin et al. SIGMOD 11] Llama: leveraging columnar storage for scalable join processing in the MapReduce framework. [Okcan et al. SIGMOD 11] Processing Theta-Joins using MapReduce. [Zhang et al. VLDB 12] Efficient Multi-way Theta-Join Processing using MapReduce. [Koumarelas et al. EDBTW 14] Binary Theta-Joins using MapReduce: Efficiency Analysis and Improvements. [Candan et al. IEEE Multimedia 2011] RanKloud: Scalable Multimedia Data Processing in Server Clusters. [Doulkeridis et al. CloudI 12] On Saying "Enough Already!" in MapReduce. [Ntarmos et al. VLDB 14] Rank Join Queries in NoSQL Databases. [Zhang et al. EDBT 12] Efficient Parallel kNN Joins for Large Data in MapReduce. [Lu et al. VLDB 12] Efficient Processing of k Nearest Neighbor Joins using MapReduce. 69
Thank you for your attention More info: http://www.ds.unipi.gr/cdoulk/ Contact: cdoulk@unipi.gr https://research.idi.ntnu.no/cloudix/ http://www.ds.unipi.gr/projects/roadrunner/ 70