On Saying Enough Already! in MapReduce

From this document you will learn the answers to the following questions:

How many f2 f2 did the researchers use to sort data?

What did the work on MapReduce use?

What did the work on top - k queries provide?

Similar documents

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

An Approach to High-Performance Scalable Temporal Object Storage

Developing MapReduce Programs

Similarity Search in a Very Large Scale Using Hadoop and HBase

A Survey of Large-Scale Analytical Query Processing in MapReduce

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Processing Joins over Big Data in MapReduce

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Cloud Computing at Google. Architecture

Hadoop & its Usage at Facebook

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

Hadoop IST 734 SS CHUNG

Apache Hadoop FileSystem Internals

Implement Hadoop jobs to extract business value from large and varied data sets

Big Data With Hadoop

Apache Hadoop FileSystem and its Usage in Facebook

<Insert Picture Here> Oracle and/or Hadoop And what you need to know

Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM

Search and Real-Time Analytics on Big Data

Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Parallel Processing of cluster by Map Reduce

Big Data Technology CS , Technion, Spring 2013

Chapter 7. Using Hadoop Cluster and MapReduce

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Log Mining Based on Hadoop s Map and Reduce Technique

Analysis of MapReduce Algorithms

A Survey of Top-k Query Processing Techniques in Relational Database Systems

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Design of Electric Energy Acquisition System on Hadoop

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Massive Cloud Auditing using Data Mining on Hadoop

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

NoSQL and Hadoop Technologies On Oracle Cloud

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

Apache HBase. Crazy dances on the elephant back

GeoGrid Project and Experiences with Hadoop

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Fast Contextual Preference Scoring of Database Tuples

MS SQL Performance (Tuning) Best Practices:

Hadoop & its Usage at Facebook

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

CLOUD STORAGE USING HADOOP AND PLAY

Data warehousing with PostgreSQL

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Integrating Big Data into the Computing Curricula

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

Data: To BI or not to BI?

Snapshots in Hadoop Distributed File System

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900)

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

KEYWORD SEARCH IN RELATIONAL DATABASES

Vertica Live Aggregate Projections

Data Mining in the Swamp

Why Big Data in the Cloud?

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

In Memory Accelerator for MongoDB

QuickDB Yet YetAnother Database Management System?

Data Doesn t Communicate Itself Using Visualization to Tell Better Stories

How To Handle Big Data With A Data Scientist

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Actian Vector in Hadoop

2015 The MathWorks, Inc. 1

Data Semantics Aware Cloud for High Performance Analytics

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

SQLSaturday #399 Sacramento 25 July, Big Data Analytics with Excel

Management & Analysis of Big Data in Zenith Team

Design and Evolution of the Apache Hadoop File System(HDFS)

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Fault Tolerance in Hadoop for Work Migration

Transcription:

On Saying Enough Already! in MapReduce Christos Doulkeridis and Kjetil Nørvåg Department of Computer and Information Science (IDI) Norwegian University of Science and Technology (NTNU) Trondheim, Norway

Outline Motivation Preliminaries MapReduce Top-k queries and top-k joins Rank-aware query processing in MapReduce Sorted access Intelligent data placement Data synopses Related work Conclusions and outlook 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 2

Motivation Business intelligence (BI) technology is essential for effective decision-making for the enterprise The challenges and needs of BI applications explode in the era of Big Data Data collection from various sources Retail, banking, RFID tags, email, query logs, blogs, reviews, etc. Cloud Intelligence for data analysis of massive data sets The only scalable solution to-date Popularity of MapReduce and its open-source implementation Hadoop Important to support advanced BI operators over MapReduce Focus of this work: efficient rank-aware query processing 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 3

MapReduce Overview Map: (k 1, v 1 ) [(k 2,v 2 )] Reduce: (k 2, [v 2 ]) [v 3 ] Salient features Scalability, fault-tolerance, ease of use, flexibility, Limitations Performance!!! Lack of early termination 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 4

Rank-aware Processing Important tool for BI applications Decision-making based on top-k results matching the user s constraints, and ranked according to user preferences Inspection of a bounded set of k tuples only Rather than retrieval and display of a huge result set Challenging to report the top-k result, without exhaustive access to the underlying input data (early termination) 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 5

Top-k Queries ID NAME CITY PRICE SELECT * FROM hotels WHERE hotels.city= Trondheim ORDER BY hotels.price STOP AFTER k; h1 BEST Trondheim 1400 h2 ABC London 1200 h3 HOLIDAY Trondheim 1500 h4 CHEAP Trondheim 1200 h5 HILTON London 1800 h1 BEST Trondheim 1400 h3 HOLIDAY Trondheim 1500 h4 CHEAP Trondheim 1200 k = 1 h4 CHEAP Trondheim 1200 h1 BEST Trondheim 1400 h3 HOLIDAY Trondheim 1500 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 6

Top-k Queries A top-k query q k ( f ) is defined by a user-specified scoring function f, which aggregates the objects characteristics into a single score E.g., f = w 1 *X + w 2 *Y where Σw i =1 defines a total ordering Y p 7 Given a positive integer k and a user-defined weighting vector w Find p 3 p 4 the k data points p with the minimum f(p) scores p 1 w p2 p 6 p 5 X 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 7

Top-k Queries SELECT * FROM hotels HOLIDAY WHERE hotels.city = Trondheim ORDER BY 0.5*hotels.price+0.5*hotels.dist STOP AFTER 1; SELECT * FROM hotels CHEAP WHERE hotels.city = Trondheim ORDER BY 0.9*hotels.price+0.1*hotels.dist STOP AFTER 1; ID NAME CITY PRICE DIST h1 BEST Trondheim 1400 2000 h2 ABC London 1200 3000 h3 HOLIDAY Trondheim 1500 1500 h4 CHEAP Trondheim 1200 2200 h5 HILTON London 1800 800 SCORE 1460 1700 1500 1345 1700 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 8

Top-k Joins ID NAME CITY PRICE SELECT * FROM hotels, flights WHERE hotels.city = flights.to_city ORDER BY (0.5*hotels.price + 0.5*flights.price ) STOP AFTER k h1 BEST Trondheim 1400 h2 ABC London 1200 h3 HOLIDAY Trondheim 1500 h4 CHEAP Trondheim 1200 h5 HILTON London 1800 ID AIRLINE TO_CITY PRICE f1 KLM Trondheim 5000 f2 KLM London 4200 f3 SAS Trondheim 3000 f4 SAS Trondheim 3500 f5 KLM London 2000 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 9

Top-k Joins ID NAME CITY PRICE h1 BEST Trondheim 1400 h2 ABC London 1200 h3 HOLIDAY Trondheim 1500 h4 CHEAP Trondheim 1200 h5 HILTON London 1800 ID AIRLINE TO_CITY PRICE f1 KLM Trondheim 5000 f2 KLM London 4200 f3 SAS Trondheim 3000 f4 SAS Trondheim 3500 f5 KLM London 2000 ID NAME CITY PRICE ID AIRLINE TO_CITY PRICE h1 BEST Trondheim 1400 f1 KLM Trondheim 5000 h1 BEST Trondheim 1400 f3 SAS Trondheim 3000 h1 BEST Trondheim 1400 f4 SAS Trondheim 3500 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 10

Rank-aware Processing in MapReduce Access the complete input data! No support for early termination! 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 11

Rank-aware Processing in MapReduce Our contributions: Sorted access for top-k queries Intelligent data placement Use of data synopses 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 12

1. Sorted Access for Top-k Queries In centralized DBs, efficient processing of topk queries relies on sorted access to data Directly: data is stored sorted on disk Indirectly: provided by a secondary index How can we provide sorted access in MapReduce to support top-k queries q k ( f )? Two alternative techniques Always sort data before top-k processing based on f (query-dependent sorting) Store data sorted based on a scoring function F f (query-independent sorting) 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 13

Query-dependent Sorting Use a separate MR job to sort the data based on f before processing the top-k query q k ( f ) Easy to find the top-k results Simply report the k first tuples High overhead to sort data for each incoming query Sorting is query-dependent on scoring function f MR1 MR2 MR1 MR2 MR1 MR2 DataNodes 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 14

Query-independent Sorting Store data sorted on DataNodes based on a scoring function F (different than f) To process query q k ( f ) it suffices to access only the K first tuples of the stored data (where K > k) Sorting based on F is a one-time cost Again, can be performed using a separate MR job The difference (extra cost) in the number of accessed tuples K-k increases when F differs much from f MR1 MR2 MR3 MR1 MR2 MR3 MR1 MR2 MR3 DataNodes 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 15

2. Data Placement Data placement on DataNodes affects performance Determined by a Partitioner class in Hadoop Existing partitioning schemes (e.g., range or hash-based) used for data placement are oblivious to the nature of top-k queries Example: RanKloud [IEEE Multimedia 11] Desiderata Balance the useful work to DataNodes Avoid redundant processing 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 16

Intelligent Data Placement for Top-k Queries Angle-based partitioning [SIGMOD 08] (proposed by our group in the context of skyline queries) Splits the useful work fairly Splits the region near the origin of the axes to all partitions Advantages More intuitive for top-k queries Easy to generalize in higher dimensions Can be combined with sorting Sort based on distance to origin 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 17

3. Data Synopses Main idea Create and store metadata together with data Exploit the metadata during query processing to access only those blocks that may contain top-k results Metadata in the form of multidimensional histograms 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 18

Metadata for Top-k Queries Number of tuples Block IDs Can be constructed seamlessly during data upload Score range for d 2 Score range for d 1 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 19

Top-k Query Processing Exploiting Data Synopses Progressively access histogram bins Until it is guaranteed that the top-k tuples are enclosed in the bins Use upper bound on score Retrieve block IDs Use random access to retrieve only these blocks Example (k=5) Access bins up to d 13 and d 22 Total of 34 tuples (> 5) Block IDs = {1,2,3} 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 20

Optimizations Cost model for deciding when the cost of random access of few blocks is smaller than sequential access of many blocks Use optimized histograms (e.g., equi-depth) Use more advanced methods for histogram bin exploration E.g., examine more bins from the dimension that is more promising to produce the top-k results faster The use of data synopses can be combined with sorting and intentional data placement to boost the performance of query processing 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 21

Related Work on MapReduce RanKloud [IEEE Multimedia 11] Proposed for top-k join queries Cannot guarantee retrieval of k results CoHadoop [PVLDB 11] Co-location of files on the same DataNode Useful for joins EARL [PVLDB 12] Mechanism to stop execution of MapReduce jobs on demand 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 22

Conclusions & Outlook An overview of techniques for supporting rankaware processing in MapReduce Sorting, Data placement, Use of data synopses Currently, we evaluate these techniques Future work Analytical cost models Optimal partitioning scheme for top-k queries More complex query functions Extend the techniques to be applicable for intermediate results produced by other MapReduce jobs 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 23

More information: http://www.idi.ntnu.no/~cdoulk/ cdoulk@idi.ntnu.no