YANG, Lin COMP 6311 Spring 2012 CSE HKUST





Outline
- Background
- Overview of Big Data Management
- Comparison between parallel databases (PDB) and MapReduce (MR)
- DB Solution on MapReduce
- Conclusion

Data-driven World
- Science: databases from astronomy, genomics, environmental data, transportation data, ...
- Humanities and Social Sciences: scanned books, historical documents, social interactions data, ...
- Business & Commerce: corporate sales, stock market transactions, census, airline traffic, ...
- Entertainment: Internet images, Hollywood movies, MP3 files, ...
- Medicine: MRI & CT scans, patient records, ...

A Case of Big Data
Statistics from 2009:
- 350 million named users
- 175 million active users in one day
- 35 million users updating status each day
- 55 million status updates each day
- 2.5 billion photos/month
- 1.6 million active pages
- Growth: 12 TB/day, 2 PB/year
- Global data volume: 8.7 PB

What can we do with these data?
What can we do?
- Scientific breakthroughs
- Business process efficiencies
- Realistic special effects
- Improve quality of life: healthcare, transportation, environmental disasters, daily life, ...
Could we do more?
- YES, but we need major advances in our capability to analyze those data.

Challenges
Basic requirements:
- Capacity
- Elasticity
[Figures: data volume vs. capacity (Moore) over time; resource demand vs. capacity over time]

Challenges (Cont.)
More requirements:
- High performance
- Fault tolerance
- Load balance
- Cost-efficient parallelization
- High availability and disaster recovery


The State of the Art
Parallel DBMS technologies
- Proposed in the late eighties
- Matured over the last two decades
- Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
MapReduce
- Pioneered by Google
- Popularized by Yahoo! (Hadoop)

Parallel DBMS
- Popularly used for more than two decades
- Research projects: Gamma, Grace, ...
- Commercial: multi-billion dollar industry, but accessible to only a privileged few
- Relational data model
- Indexing
- SQL interface
- Advanced query optimization

MapReduce
Dean et al., OSDI '04
Key contributions:
- Proposes a simple but powerful programming model that provides:
  - Parallelization
  - Fault tolerance
  - Load balancing
- Implements a framework for this model and provides a simple interface to users.

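MapReduce (Cont.)
Data flow of MR:
- The input data is divided into M splits (Partition_1).
- Map tasks produce M pieces of intermediate output.
- The intermediate output is divided into R splits (Partition_2).
- Reduce tasks produce R pieces of output.
M = input size / 64 MB
R = #machines * 2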
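The two sizing rules above can be worked through in a few lines. This is only a sketch: the function names are hypothetical, rounding the split count up is an assumption (the slide gives a plain division), and the CRC-based partitioner merely stands in for whatever function assigns intermediate keys to the R reduce splits.

```python
import math
import zlib

SPLIT_SIZE_MB = 64  # split size quoted on the slide


def num_map_tasks(input_size_mb: float, split_size_mb: int = SPLIT_SIZE_MB) -> int:
    """M = input size / 64 MB, rounded up (rounding is an assumption)."""
    return max(1, math.ceil(input_size_mb / split_size_mb))


def num_reduce_tasks(num_machines: int) -> int:
    """R = number of machines * 2, as stated on the slide."""
    return num_machines * 2


def partition(key: str, r: int) -> int:
    """Assign an intermediate key to one of r reduce splits.

    A deterministic hash is used so the same key always reaches
    the same reducer; the choice of CRC32 is illustrative only.
    """
    return zlib.crc32(key.encode("utf-8")) % r


if __name__ == "__main__":
    m = num_map_tasks(1024)      # 1 GB of input
    r = num_reduce_tasks(10)     # 10 machines
    print(m, r, partition("hadoop", r))
```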

MapReduce (Cont.)
Execution overview (figure)

MapReduce (Cont.)
Word counter example (figure)
Credit: Martijn van Groningen. Introduction to Hadoop.
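Since the word counter figure is not reproduced here, the following single-process sketch may help show the shape of the computation. It is not Hadoop code: `map_fn`, `reduce_fn` and `run_word_count` are hypothetical names, and the in-memory grouping step only imitates the shuffle that a real framework performs between the Map and Reduce phases.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    """Map: emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_word_count(documents):
    """Run map, group by key (the 'shuffle'), then reduce."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in map_fn(doc_id, text):
            groups[word].append(one)
    return dict(reduce_fn(word, counts) for word, counts in groups.items())


if __name__ == "__main__":
    docs = {"d1": "the quick fox", "d2": "the lazy dog the end"}
    print(run_word_count(docs))
```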

MapReduce (Cont.)
Parallelization
- Map and Reduce
Fault tolerance
- The master pings workers periodically.
- The master writes periodic checkpoints.
- Re-execute the unfinished Reduce tasks and all Map tasks of a failed machine.
Load balance
- Break tasks into small granularity, scheduled by the master.
Optimization
- Locality: try to schedule each Map task on the node that holds the corresponding data.
- Backup tasks: when the job is close to completion, run backup executions of the remaining tasks; the first one to complete wins.

Introduction to Hadoop
Hadoop is a popular open-source MapReduce implementation that is used as an alternative for storing and processing extremely large data sets on commodity hardware.
- Inspired by Google MapReduce and GFS
- Architecture of Hadoop (figure)

Trend of Big Data Management (figure)
Credit: Philippe Julio. Big Data Architecture.


Comparison between PDB & MR
Pavlo et al., SIGMOD '09
Candidates
- Hadoop: an open-source implementation of MapReduce.
- DBMS-X: a parallel relational database.
- Vertica: a parallel DBMS that stores data in a column-based format.
Task 1: Grep
- Search for a specific pattern string in a large set of records.
- 5.6 million records, 100 bytes per record
- Test 1: fixes the size of the data per node at 535 MB
- Test 2: fixes the total dataset size at 1 TB
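As a rough illustration of Task 1, the sketch below shows a map-only grep and checks the data-size arithmetic. The names are hypothetical and the (key, value) record layout is an assumption. The size check further assumes that the 5.6 million records are per node: at 100 bytes each they come to about 534 MiB, in line with the roughly 535 MB per node used in Test 1.

```python
def grep_map(records, pattern):
    """Map-only grep: emit every (key, value) record whose value contains pattern.

    No Reduce phase is needed because matching records are
    simply written out as they are found.
    """
    for key, value in records:
        if pattern in value:
            yield key, value


def dataset_size_mib(num_records: int, bytes_per_record: int) -> float:
    """Size of a fixed-width record set in MiB (2**20 bytes)."""
    return num_records * bytes_per_record / 2**20


if __name__ == "__main__":
    sample = [("k1", "abcXYZdef"), ("k2", "no match here"), ("k3", "XYZ")]
    print(list(grep_map(sample, "XYZ")))
    print(round(dataset_size_mib(5_600_000, 100)))
```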

Comparison (Cont.)
Data loading
Observations:
1. Hadoop outperforms both PDBs.
2. For DBMS-X, the data was actually loaded on each node sequentially, while the additional housekeeping could be done in parallel across nodes.

Comparison (Cont.)
Grep
Observations:
1. In Figure 4, so little data is processed that Hadoop's start-up cost (10-25 seconds) becomes the limiting factor in its performance.
2. In Figure 5, as the data volume increases, Hadoop performs almost as fast as the PDBs.

Comparison (Cont.)
Task 2: analytical workloads
Data schema & volume

HTML Doc (600,000 records)
  Attribute     Type
  url           VARCHAR(100)
  contents      TEXT

Rankings (18 M records, 1 GB/node)
  Attribute     Type
  pageurl       VARCHAR(100)
  pagerank      INT
  avgduration   INT

UserVisits (155 M records, 20 GB/node)
  Attribute     Type
  srcip         VARCHAR(16)
  dsturl        VARCHAR(100)
  visitdate     DATE
  adrevenue     FLOAT
  useragent     VARCHAR(64)
  countrycode   VARCHAR(3)
  langcode      VARCHAR(6)
  searchword    VARCHAR(32)
  duration      INT

Comparison (Cont.)
Data loading (figure)
Selection
Find the pageurl values in the Rankings table with a pagerank above a user-defined threshold.
Observations:
1. With the help of an index, the PDBs outperform Hadoop significantly.
2. As the number of nodes increases, Hadoop's start-up cost increases too.
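The selection task can be sketched from both sides. The version below uses Python's built-in `sqlite3` module as a stand-in for a database node (an assumption; the study used DBMS-X and Vertica), with an index on the pagerank column, and compares it with a map-style full scan. Table contents, the index name and the function names are hypothetical.

```python
import sqlite3


def select_with_index(rows, threshold):
    """Database side: an index on pageRank lets the engine avoid a full scan."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Rankings (pageURL VARCHAR(100), pageRank INT, avgDuration INT)"
        )
        conn.execute("CREATE INDEX idx_pagerank ON Rankings (pageRank)")
        conn.executemany("INSERT INTO Rankings VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT pageURL, pageRank FROM Rankings WHERE pageRank > ? ORDER BY pageURL",
            (threshold,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


def select_with_scan(rows, threshold):
    """MapReduce side: every record is read and tested by the Map function."""
    return sorted((url, rank) for url, rank, _duration in rows if rank > threshold)


if __name__ == "__main__":
    rankings = [("a.com", 5, 10), ("b.com", 42, 7), ("c.com", 17, 3)]
    print(select_with_index(rankings, 10))
    print(select_with_scan(rankings, 10))
```

Both functions return the same rows; the difference the slide points to is how much data each approach has to read to get there.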

Comparison (Cont.)
Aggregation
Calculate the total adrevenue generated for each srcip in the UserVisits table (20 GB/node), grouped by the srcip (Fig. 7) or by the 7-character prefix of srcip (Fig. 8).
Observations:
1. The PDBs outperform Hadoop.
2. For the PDBs, communication cost dominates the execution time.
3. Since Vertica is a column store, it performs better because it does not read the unused parts of the table.
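A minimal sketch of the aggregation task in MapReduce style follows. The function names are hypothetical, and the per-node combine step is an assumption added to show how partial sums reduce the data sent over the network; the slide does not say whether the benchmark code was written this way.

```python
from collections import defaultdict


def agg_map(visits, prefix_len=None):
    """Map: emit (srcip or its prefix, adrevenue) for each UserVisits record."""
    for source_ip, ad_revenue in visits:
        key = source_ip if prefix_len is None else source_ip[:prefix_len]
        yield key, ad_revenue


def combine(pairs):
    """Combine: pre-aggregate on the node before anything crosses the network."""
    partial = defaultdict(float)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def agg_reduce(per_node_partials):
    """Reduce: add up the partial sums received from every node."""
    totals = defaultdict(float)
    for partial in per_node_partials:
        for key, value in partial:
            totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    node1 = [("10.0.0.1", 1.0), ("10.0.0.2", 2.5)]
    node2 = [("10.0.0.1", 0.5)]
    by_ip = agg_reduce([combine(agg_map(n)) for n in (node1, node2)])
    by_prefix = agg_reduce([combine(agg_map(n, 7)) for n in (node1, node2)])
    print(by_ip, by_prefix)
```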

Comparison (Cont.)
Join
Find the srcip that generated the most revenue within a particular date range, then calculate the average pagerank of all the pages visited during this interval.

SQL
    SELECT INTO Temp sourceIP, AVG(pageRank) AS avgRank, SUM(adRevenue) AS totalRevenue
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
      AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
    GROUP BY UV.sourceIP;

    SELECT sourceIP, totalRevenue, avgRank
    FROM Temp
    ORDER BY totalRevenue DESC LIMIT 1;

MR
1. Filter UserVisits and join with Rankings.
2. Compute the total adrevenue and average pagerank for each srcip.
3. Find the srcip with the largest total adrevenue.

Observations:
1. The PDBs outperform Hadoop again.
2. The PDBs use indexes, while Hadoop can only perform a complete scan.
3. UserVisits and Rankings in the PDBs are partitioned by the join key, so the PDBs can do the join locally on each node without any network overhead.
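The three MR phases listed for the join task can be sketched in a single process as below. This is a simplification under stated assumptions: all names are hypothetical, phase 1 uses an in-memory lookup table in place of a distributed join, and the date range is inclusive, matching SQL's BETWEEN.

```python
from collections import defaultdict
from datetime import date


def phase1_filter_and_join(user_visits, rankings, start, end):
    """Phase 1: keep visits inside the date range and attach the page rank."""
    rank_by_url = dict(rankings)  # assumes pageURL is unique in Rankings
    for source_ip, dest_url, visit_date, ad_revenue in user_visits:
        if start <= visit_date <= end and dest_url in rank_by_url:
            yield source_ip, rank_by_url[dest_url], ad_revenue


def phase2_aggregate(joined):
    """Phase 2: total revenue and average page rank per source IP."""
    acc = defaultdict(lambda: [0.0, 0, 0])  # [revenue, rank sum, visit count]
    for source_ip, page_rank, ad_revenue in joined:
        entry = acc[source_ip]
        entry[0] += ad_revenue
        entry[1] += page_rank
        entry[2] += 1
    return [(ip, rev, rank_sum / n) for ip, (rev, rank_sum, n) in acc.items()]


def phase3_top(aggregated):
    """Phase 3: pick the source IP with the largest total revenue."""
    return max(aggregated, key=lambda row: row[1])


if __name__ == "__main__":
    rankings = [("a.com", 10), ("b.com", 20)]
    visits = [
        ("1.1.1.1", "a.com", date(2000, 1, 16), 1.0),
        ("1.1.1.1", "b.com", date(2000, 1, 20), 2.0),
        ("2.2.2.2", "a.com", date(2000, 1, 17), 2.5),
        ("3.3.3.3", "b.com", date(2000, 2, 1), 100.0),  # outside the range
    ]
    joined = phase1_filter_and_join(visits, rankings, date(2000, 1, 15), date(2000, 1, 22))
    print(phase3_top(phase2_aggregate(joined)))
```

Each phase consumes the full output of the previous one, which is one reason the MR version pays for extra scans and data movement that the partitioned PDBs avoid.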

Comparison (Cont.)
UDF Aggregation
Scan the HTML documents, search for all the URLs that appear in them, and count the number of references to each unique URL across the entire set.

SQL
    SELECT INTO Temp F(contents) FROM Documents;
    SELECT url, SUM(value) FROM Temp GROUP BY url;
F is a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database.

MR
Similar to the Grep task.

Observations:
1. The bottom segment is the time to execute the UDF, and the upper segment represents the query time.
2. Both DBMS-X and Hadoop have approximately constant performance.
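The role of the user-defined function F can be sketched as follows. The regular expression is an illustrative assumption (the slide does not say how URLs are recognized), and `extract_urls` and `count_url_references` are hypothetical names.

```python
import re
from collections import Counter

# Illustrative pattern only: an http(s) URL up to whitespace, a quote or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(contents):
    """Plays the role of F: parse one document and emit every URL found in it."""
    return URL_PATTERN.findall(contents)


def count_url_references(documents):
    """Map emits (url, 1) for each hit; the Counter acts as the Reduce-side sum."""
    counts = Counter()
    for contents in documents:
        for url in extract_urls(contents):
            counts[url] += 1
    return dict(counts)


if __name__ == "__main__":
    docs = [
        '<a href="http://a.com/x">x</a> see http://b.com',
        "link http://a.com/x again",
    ]
    print(count_url_references(docs))
```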

Quick Summary
- At the scale of the experiments above, parallel DBMSs perform much better than Hadoop. This is the result of a number of technologies developed over the past 25 years: B-tree indexes, column stores, compression algorithms, sophisticated query engines, etc.
- Hadoop achieves better fault tolerance, at the cost of a performance penalty.
- Hadoop is easier to set up and use than a PDB on a large number of nodes.
- PDBs do not do a good job in UDF aggregation.
- The trend is for both kinds of systems to move toward each other.


Hive
Thusoo et al., VLDB '09
Problem
- MapReduce requires developers to write custom programs, which are hard to maintain and reuse.
- Analysts are familiar with SQL.
Key contribution
- Proposes an open-source data warehouse solution built on top of Hadoop.
- Hive provides HiveQL, a SQL-like query language that supports select, join, aggregate, union all, and sub-queries in the FROM clause.

Hive (Cont.)
- Metastore: the system catalog.
- Thrift Server: a framework for cross-language services.
- Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution.

Hive (Cont.)
    FROM (SELECT a.status, b.school, b.gender
          FROM status_updates a JOIN profiles b
          ON (a.userid = b.userid AND a.ds = '2009-03-20')
         ) subq1
    INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
      SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
    INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
      SELECT subq1.school, COUNT(1) GROUP BY subq1.school
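The point of the multi-insert form is that the joined sub-query is computed once and feeds both summary tables. The plain-Python sketch below imitates that behaviour in a single pass; it is not how Hive executes the statement, the function name is hypothetical, and it assumes userid is unique in profiles.

```python
from collections import Counter


def multi_insert_summaries(status_updates, profiles, ds):
    """Join status updates to profiles once, then fill both summaries in one pass.

    status_updates: iterable of (userid, status, ds)
    profiles:       iterable of (userid, school, gender)
    """
    profile_by_user = {userid: (school, gender) for userid, school, gender in profiles}
    gender_summary = Counter()
    school_summary = Counter()
    for userid, _status, update_ds in status_updates:
        if update_ds != ds or userid not in profile_by_user:
            continue  # outside the partition, or no matching profile (inner join)
        school, gender = profile_by_user[userid]
        gender_summary[gender] += 1
        school_summary[school] += 1
    return dict(gender_summary), dict(school_summary)


if __name__ == "__main__":
    profiles = [(1, "HKUST", "F"), (2, "MIT", "M"), (3, "HKUST", "M")]
    updates = [
        (1, "hi", "2009-03-20"),
        (1, "again", "2009-03-20"),
        (2, "yo", "2009-03-20"),
        (3, "old", "2009-03-19"),
        (4, "ghost", "2009-03-20"),
    ]
    print(multi_insert_summaries(updates, profiles, "2009-03-20"))
```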

Hive (Cont.)
Future work
- Make HiveQL subsume SQL
- Add a cost-based optimizer
- Columnar storage and more intelligent data placement
- Performance enhancement
- Integrate with commercial BI tools
- Multi-query optimization and generic n-way joins in a single MR job

HadoopDB
Azza Abouzeid et al., VLDB '09
Problem
- It's the age of big data.
- Properties needed for large data analysis:
  - Performance
  - Flexible query interface
  - Fault tolerance
  - Load balance
- Candidates: parallel database, MapReduce
- Can we have both of them?

HadoopDB (Cont.)
Key contribution: builds a hybrid system that achieves all the required properties.
Basic idea:
- Connect multiple single-node databases, using Hadoop as the task coordinator and network communication layer (fault tolerance & load balance).
- Queries are parallelized across nodes using the MapReduce framework.
- Queries are pushed into the corresponding database nodes as much as possible (performance & multi-interface).

HadoopDB (Cont.)
- SMS Planner: translates HiveQL to MapReduce, and translates some of the MR back to SQL.
- Data Loader: repartitions data on a given key, or breaks single-node data apart into chunks.
- Catalog: maintains meta-information about the databases.
- Database Connector: connects to a database, executes SQL, and returns key/value pairs.

HadoopDB (Cont.)
    SELECT YEAR(saleDate), SUM(revenue)
    FROM sales
    GROUP BY YEAR(saleDate);
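To make the push-down idea concrete, the sketch below runs the per-year aggregation inside each node's local database and merges the partial sums in a Reduce step. It is an illustration under assumptions, not HadoopDB code: `sqlite3` stands in for the single-node databases, SQLite has no YEAR() function so `strftime('%Y', ...)` is swapped in, and all function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# SQL pushed into each node's database (strftime replaces YEAR for SQLite).
LOCAL_SQL = (
    "SELECT CAST(strftime('%Y', saleDate) AS INTEGER) AS yr, SUM(revenue) "
    "FROM sales GROUP BY yr"
)


def make_node_db(rows):
    """Create one in-memory database holding a node's share of the sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_on_node(conn):
    """Map: run the pushed-down SQL locally and emit (year, partial sum) pairs."""
    return conn.execute(LOCAL_SQL).fetchall()


def reduce_partials(partials):
    """Reduce: add the per-node partial sums for each year."""
    totals = defaultdict(float)
    for year, revenue in partials:
        totals[year] += revenue
    return dict(totals)


def run_query(node_rows):
    """Run the Map step on every node, then merge the results."""
    partials = []
    for rows in node_rows:
        conn = make_node_db(rows)
        try:
            partials.extend(map_on_node(conn))
        finally:
            conn.close()
    return reduce_partials(partials)


if __name__ == "__main__":
    nodes = [
        [("2010-01-05", 10.0), ("2011-03-01", 5.0)],
        [("2010-07-09", 2.5), ("2011-12-31", 1.5), ("2011-06-01", 1.0)],
    ]
    print(run_query(nodes))
```

Only one row per year leaves each node, so most of the work happens inside the databases and the MapReduce layer is left to coordinate and merge.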



Conclusion
- It is the age of big data now.
- Major technologies for big data management:
  - Parallel DBMS: well studied and widely used.
  - MapReduce: an attractive new technology with a bright future, though it still has many drawbacks, especially in performance.
- Hive and HadoopDB offer a good paradigm for building DB solutions on top of MapReduce.
- There is still a lot of work to be done in the field of big data management.

References
- Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI '04.
- Andrew Pavlo, Erik Paulson, Alexander Rasin. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD '09.
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain. Hive - A Warehousing Solution Over a Map-Reduce Framework. VLDB '09.
- Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB '09.
- Divyakant Agrawal, Sudipto Das, Amr El Abbadi. Big Data and Cloud Computing: Current State and Future Opportunities. EDBT '11.
