YANG, Lin COMP 6311 Spring 2012 CSE HKUST


2 Outline
  Background
  Overview of Big Data Management
  Comparison between parallel databases (PDB) and MapReduce (MR)
  DB Solution on MapReduce
  Conclusion

3 Data-driven World
  Science: databases from astronomy, genomics, environmental data, transportation data, ...
  Humanities and Social Sciences: scanned books, historical documents, social interactions data, ...
  Business & Commerce: corporate sales, stock market transactions, census, airline traffic, ...
  Entertainment: Internet images, Hollywood movies, MP3 files, ...
  Medicine: MRI & CT scans, patient records, ...

4 A Case of Big Data
  Statistics from
    Million named users
    175 million active users in one day
    35 million users updating status each day
    55 million status updates each day
    2.5 billion photos/month
    1.6 million active pages
  Growth: 12 TB/day, 2 PB/year
  Global data volume: 8.7 PB

5 What can we do with these data?
  What can we do?
    Scientific breakthroughs
    Business process efficiencies
    Realistic special effects
    Improve quality of life: healthcare, transportation, environmental disasters, daily life, ...
  Could we do more?
    YES, but we need major advances in our capability to analyze those data

6 Challenges
  Basic requirements
    Capacity
    Elasticity
  [Figure: two plots over time, with axes labeled Volume and Resources; curve labels: Data, Moore, Capacity, Demand]

7 Challenges (Cont.)
  More requirements
    High performance
    Fault tolerance
    Load balance
    Cost-efficient parallelization
    High availability and disaster recovery

9 The State of the Art
  Parallel DBMS technologies
    Proposed in the late eighties
    Matured over the last two decades
    Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
  MapReduce
    Pioneered by Google
    Popularized by Yahoo! (Hadoop)

10 Parallel DBMS
  Popularly used for more than two decades
    Research projects: Gamma, Grace, ...
    Commercial: multi-billion dollar industry, but accessible to only a privileged few
  Relational data model
  Indexing
  SQL interface
  Advanced query optimization

11 MapReduce
  Dean et al., OSDI '04
  Key contribution:
    Proposes a simple but powerful programming model that provides
      Parallelization
      Fault tolerance
      Load balancing
    Implements a framework for this model and provides a simple interface to users

12 MapReduce (Cont.)
  Data flow of MR
    Input data -> Partition_1 -> M splits -> Map -> M pieces of intermediate output
    Intermediate output -> Partition_2 -> R splits -> Reduce -> R pieces of output
  M = input size / 64 MB
  R = number of machines * 2
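The two sizing rules on this slide can be written down directly. The following is a minimal sketch, not Hadoop's or Google's actual code: the function names are hypothetical, rounding M up to a whole number of splits is an assumption, and the byte-sum hash used to route keys to reduce splits is chosen only to keep the example deterministic.

```python
import math

SPLIT_SIZE_MB = 64  # split size stated on the slide


def num_map_tasks(input_size_mb: float) -> int:
    """M = input size / 64 MB (rounded up: an assumption, so a partial split still gets a task)."""
    return max(1, math.ceil(input_size_mb / SPLIT_SIZE_MB))


def num_reduce_tasks(num_machines: int) -> int:
    """R = number of machines * 2, as stated on the slide."""
    return num_machines * 2


def partition(key: str, r: int) -> int:
    """Partition_2: route an intermediate key to one of R reduce splits.

    The byte-sum hash is a stand-in chosen for this sketch only.
    """
    return sum(key.encode("utf-8")) % r
```

Under these assumptions, a 1 TB input (1,048,576 MB) on 100 machines gives M = 16384 map tasks and R = 200 reduce tasks.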

13 MapReduce (Cont.)
  Execution overview (figure)

14 MapReduce (Cont.)
  Word count example
  Credit: Martijn van Groningen. Introduction to Hadoop.
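Since the figure for the word count example did not survive, here is a minimal single-process sketch of the same idea. It only imitates the map, shuffle, and reduce phases in memory; the function names are hypothetical and this is not Hadoop's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())
```

In a real cluster the map calls run on different input splits and the shuffle moves data across the network; here everything runs in one process for clarity.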

15 MapReduce (Cont.)
  Parallelization
    Map and Reduce tasks
  Fault tolerance
    The master pings workers periodically
    The master writes periodic checkpoints
    On a machine failure, re-execute its unfinished Reduce tasks and all of its Map tasks
  Load balance
    Break the job into fine-grained tasks, scheduled by the master
  Optimization
    Locality: try to schedule each Map task on the node that holds the corresponding data
    Backup tasks: when the job is close to completion, run backup executions of the remaining tasks; the first one to complete wins
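The re-execution rule for a failed machine can be sketched as a small filter over the master's task table. In the MapReduce paper, completed Map output lives on the failed machine's local disk and is therefore lost, while completed Reduce output is already in the global file system. The `Task` record and its field names below are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    task_id: str
    kind: str    # "map" or "reduce"
    state: str   # "in_progress" or "completed"
    worker: str  # worker the task was assigned to


def tasks_to_reexecute(tasks: List[Task], failed_worker: str) -> List[str]:
    """Return the ids of tasks that must be rescheduled after a worker fails."""
    rerun = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        if task.kind == "map":
            # All Map tasks of the failed machine, completed or not.
            rerun.append(task.task_id)
        elif task.kind == "reduce" and task.state == "in_progress":
            # Only unfinished Reduce tasks.
            rerun.append(task.task_id)
    return rerun
```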

16 Introduction to Hadoop
  Hadoop is a popular open-source MapReduce implementation that is used as an alternative for storing and processing extremely large data sets on commodity hardware.
  Inspired by Google MapReduce and GFS
  Architecture of Hadoop (figure)

17 Trend of Big Data Management
  Credit: Philippe Julio. Big Data Architecture.

19 Comparison between PDB & MR
  Pavlo et al., SIGMOD '09
  Candidates
    Hadoop: an open-source implementation of MapReduce
    DBMS-X: a parallel relational database
    Vertica: a parallel DBMS that stores data in a column-based format
  Task 1
    Grep: search for a specific pattern string in a large set of records
    5.6 million records, 100 bytes per record
    Test 1: fixes the size of the data per node at 535 MB
    Test 2: fixes the total dataset size at 1 TB
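The Grep task maps naturally onto a map-only job: each split is scanned independently and no Reduce step is needed. The sketch below assumes plain substring matching on string records; the function names are hypothetical and the benchmark's actual record layout is not modeled.

```python
from typing import Iterable, Iterator, List


def grep_map(records: Iterable[str], pattern: str) -> Iterator[str]:
    """Map: emit every record of one split that contains the pattern."""
    for record in records:
        if pattern in record:
            yield record


def grep_job(splits: Iterable[Iterable[str]], pattern: str) -> List[str]:
    """Run grep_map over every split and concatenate the matches (no Reduce)."""
    matches: List[str] = []
    for split in splits:
        matches.extend(grep_map(split, pattern))
    return matches
```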

20 Comparison (Cont.)
  Data loading
  Observations:
    1. Hadoop outperforms both parallel databases.
    2. For DBMS-X, the data was actually loaded on each node sequentially, while the additional housekeeping can be done in parallel across nodes.

21 Comparison (Cont.)
  Grep
  Observations:
    1. In Figure 4, so little data is being processed that the Hadoop start-up costs become the limiting factor in its performance ( seconds).
    2. In Figure 5, as the data volume increases, Hadoop performs almost as fast as PDB.

22 Comparison (Cont.)
  Task 2: analytical tasks
  Data schema & volume

  HTML Doc (600,000 records)
    Attribute     Type
    url           VARCHAR(100)
    contents      TEXT

  Rankings (18 M records, 1 GB/node)
    Attribute     Type
    pageurl       VARCHAR(100)
    pagerank      INT
    avgduration   INT

  UserVisits (155 M records, 20 GB/node)
    Attribute     Type
    srcip         VARCHAR(16)
    dsturl        VARCHAR(100)
    visitdate     DATE
    adrevenue     FLOAT
    useragent     VARCHAR(64)
    countrycode   VARCHAR(3)
    langcode      VARCHAR(6)
    searchword    VARCHAR(32)
    duration      INT

23 Comparison (Cont.)
  Data loading
  Selection
    Find the pageurl in the Rankings table with a pagerank above a user-defined threshold.
  Observations:
    1. With the help of indexes, PDB outperforms Hadoop significantly.
    2. As the number of nodes increases, the start-up cost increases too.

24 Comparison (Cont.)
  Aggregation
    Calculate the total adrevenue generated for each srcip in the UserVisits table (20 GB/node), grouped by the srcip (Fig. 7) or by the 7-character prefix of the srcip (Fig. 8).
  Observations:
    1. PDB outperforms Hadoop.
    2. For PDB, communication cost dominates the execution time.
    3. Since Vertica is a column store, it performs better by not reading unused parts of the table.
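The aggregation task can be sketched as a Map step that chooses the grouping key and a Reduce step that sums adrevenue per key. This is a single-process illustration with hypothetical function names; each visit is reduced to a `(srcip, adrevenue)` pair, and `prefix_len=7` selects the 7-character-prefix variant.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def map_visit(visit: Tuple[str, float], prefix_len: Optional[int] = None) -> Tuple[str, float]:
    """Map: emit (group key, adrevenue); the key is the srcip or its prefix."""
    srcip, adrevenue = visit
    key = srcip if prefix_len is None else srcip[:prefix_len]
    return key, adrevenue


def aggregate_revenue(
    visits: Iterable[Tuple[str, float]], prefix_len: Optional[int] = None
) -> Dict[str, float]:
    """Reduce: sum adrevenue for every group key."""
    totals: Dict[str, float] = defaultdict(float)
    for visit in visits:
        key, revenue = map_visit(visit, prefix_len)
        totals[key] += revenue
    return dict(totals)
```

Grouping by prefix produces fewer distinct keys, so less intermediate data has to be merged across nodes than when grouping by the full srcip.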

25 Comparison (Cont.)
  Join
    Find the srcip that generated the most revenue within a particular date range, then calculate the average pagerank of all the pages visited during this interval.
  SQL
    SELECT INTO Temp sourceIP, AVG(pageRank) AS avgRank, SUM(adRevenue) AS totalRevenue
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
      AND UV.visitDate BETWEEN Date( ) AND Date( )
    GROUP BY UV.sourceIP;

    SELECT sourceIP, totalRevenue, avgRank
    FROM Temp
    ORDER BY totalRevenue DESC LIMIT 1;
  MR
    1. Filter UserVisits and join with Rankings
    2. Compute the total adrevenue and average pagerank for each srcip
    3. Find the srcip with the largest total adrevenue
  Observations:
    1. PDB outperforms Hadoop again!
    2. PDB uses indexes, while Hadoop can only perform a complete scan.
    3. UserVisits and Rankings in PDB are partitioned by the join key, so PDB can do the join locally on each node without any network overhead.
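The three MR steps of the join task can be illustrated in one process as follows. This is a sketch under simplifying assumptions: Rankings is small enough to hold in a dictionary (an in-memory hash join, which is not necessarily how the benchmark's MR program joins), tuples carry only the columns the query needs, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Iterable, Optional, Tuple


def join_task(
    rankings: Iterable[Tuple[str, int]],
    user_visits: Iterable[Tuple[str, str, date, float]],
    start: date,
    end: date,
) -> Optional[Tuple[str, float, float]]:
    """Return (srcip, total adrevenue, average pagerank) for the top-earning srcip."""
    # Step 1: filter UserVisits by date range and join with Rankings on the URL.
    pagerank = {pageurl: rank for pageurl, rank in rankings}
    joined = [
        (srcip, pagerank[dsturl], adrevenue)
        for srcip, dsturl, visitdate, adrevenue in user_visits
        if start <= visitdate <= end and dsturl in pagerank
    ]

    # Step 2: per srcip, accumulate [total revenue, pagerank sum, visit count].
    stats = defaultdict(lambda: [0.0, 0, 0])
    for srcip, rank, revenue in joined:
        entry = stats[srcip]
        entry[0] += revenue
        entry[1] += rank
        entry[2] += 1

    # Step 3: pick the srcip with the largest total revenue.
    if not stats:
        return None
    best = max(stats, key=lambda ip: stats[ip][0])
    total, rank_sum, count = stats[best]
    return best, total, rank_sum / count
```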

26 Comparison (Cont.)
  UDF Aggregation
    Scan the HTML documents, search for all the URLs that appear in them, and count the number of references to each unique URL across the entire set.
  SQL
    SELECT INTO Temp F(contents) FROM Documents;
    SELECT url, SUM(value) FROM Temp GROUP BY url;
    F is a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database.
  MR
    Similar to the Grep task
  Observations:
    1. The bottom segment is the time to execute the UDF and the upper segment represents the query time.
    2. DBMS-X and Hadoop have approximately the same performance.
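A rough single-process analogue of the UDF aggregation is shown below. `extract_urls` plays the role of the user-defined function F, and the regular expression is a simplifying assumption about what counts as a URL; neither is taken from the benchmark's implementation.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Simplified URL matcher (an assumption for this sketch): http(s) scheme,
# then any run of characters that are not whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(contents: str) -> List[str]:
    """Stand-in for F: parse one document and return the URLs it references."""
    return URL_PATTERN.findall(contents)


def count_url_references(documents: Iterable[str]) -> Dict[str, int]:
    """Count how often each unique URL is referenced across all documents."""
    counts: Counter = Counter()
    for contents in documents:
        counts.update(extract_urls(contents))
    return dict(counts)
```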

27 Quick Summary
  At the scale of the experiments above, parallel DBMSs perform much better than Hadoop
    This is the result of a number of technologies developed over the past 25 years: B-tree indexes, column stores, compression algorithms, sophisticated query engines, etc.
  Hadoop achieves better fault tolerance, at the cost of a performance penalty.
  Hadoop is easier to set up and use than PDB on a large number of nodes.
  PDB does not do a good job on UDF aggregation.
  The trend is for both systems to move toward each other.


29 Hive
  Thusoo et al., VLDB '09
  Problem
    MapReduce requires developers to write custom programs, which are hard to maintain and reuse.
    Analysts are familiar with SQL.
  Key contribution
    Proposed an open-source data warehouse solution built on top of Hadoop.
    Hive provides HiveQL, a SQL-like query language that supports select, join, aggregate, union all, and sub-queries in the FROM clause.

30 Hive (Cont.)
  Metastore: system catalog.
  Thrift Server: a framework for cross-language services.
  Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution.

31 Hive (Cont.)
  FROM (SELECT a.status, b.school, b.gender
        FROM status_updates a JOIN profiles b
        ON (a.userid = b.userid and a.ds=' ')) subq1
  INSERT OVERWRITE TABLE gender_summary PARTITION(ds=' ')
    SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
  INSERT OVERWRITE TABLE school_summary PARTITION(ds=' ')
    SELECT subq1.school, COUNT(1) GROUP BY subq1.school

32 Hive (Cont.)
  Future work
    Make HiveQL subsume SQL
    Add a cost-based optimizer
    Columnar storage and more intelligent data placement
    Performance enhancement
    Integrate with commercial BI tools
    Multi-query optimization and generic n-way joins in a single MR job

33 HadoopDB
  Azza Abouzeid et al., VLDB '09
  Problem
    It's the age of big data.
    Properties for large data analysis:
      Performance
      Flexible query interface
      Fault tolerance
      Load balance
    Parallel database, MapReduce: can we have both of them?

34 HadoopDB (Cont.)
  Key contribution:
    Build a hybrid system that achieves all the required properties.
  Basic idea:
    Connect multiple single-node databases, using Hadoop as the task coordinator and network communication layer (fault tolerance & load balance)
    Queries are parallelized across nodes using the MapReduce framework
    Queries are pushed inside the corresponding database nodes as much as possible (performance & multi-interface)

35 HadoopDB (Cont.)
  Translates HiveQL to MapReduce, and translates some of the MR back to SQL
  Repartitions data on a given key, or breaks apart single-node data into chunks
  Maintains meta information about the databases
  Connects to a database, executes SQL, and returns key/value pairs

36 HadoopDB (Cont.)
  SELECT YEAR(saleDate), SUM(revenue)
  FROM sales
  GROUP BY YEAR(saleDate);
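The idea of pushing this query into the single-node databases can be imitated with a few in-memory SQLite databases standing in for the nodes. This is only an illustrative sketch, not HadoopDB's code: the helper names are hypothetical, SQLite's `strftime('%Y', ...)` replaces `YEAR(...)`, and the sales rows are made-up sample data. Each node runs the aggregation locally (Map), and a Reduce step merges the per-node partial sums.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# SQL pushed down to every node; SQLite has no YEAR(), so strftime is used instead.
LOCAL_SQL = """
SELECT CAST(strftime('%Y', saleDate) AS INTEGER) AS sale_year, SUM(revenue)
FROM sales
GROUP BY sale_year
"""


def make_node(rows: Iterable[Tuple[str, float]]) -> sqlite3.Connection:
    """Create one single-node database holding a chunk of the sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_phase(node: sqlite3.Connection) -> List[Tuple[int, float]]:
    """Map: execute the pushed-down SQL locally and return (year, partial sum) pairs."""
    return node.execute(LOCAL_SQL).fetchall()


def reduce_phase(pairs: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Reduce: merge the partial sums from all nodes per year."""
    totals: Dict[int, float] = defaultdict(float)
    for year, revenue in pairs:
        totals[year] += revenue
    return dict(totals)


def yearly_revenue(nodes: Iterable[sqlite3.Connection]) -> Dict[int, float]:
    pairs = [pair for node in nodes for pair in map_phase(node)]
    return reduce_phase(pairs)
```

Because each node returns at most one row per year, only small partial results cross the network, which is the point of pushing work into the database layer.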



39 Conclusion
  It is the age of big data now
  Major technologies for big data management
    Parallel DBMS: well studied and widely used
    MapReduce: an attractive new technology with a bright future, but it still has many drawbacks, especially in performance
  Hive and HadoopDB provide a good paradigm for building DB solutions on top of MapReduce
  There is still a lot of work to be done in the field of big data management.

40 Reference
  Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI '04.
  Andrew Pavlo, Erik Paulson, Alexander Rasin. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD '09.
  Ashish Thusoo, Joydeep Sen Sarma, Namit Jain. Hive - A Warehousing Solution Over a Map-Reduce Framework. VLDB '09.
  Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB '09.
  Divyakant Agrawal, Sudipto Das, Amr El Abbadi. Big Data and Cloud Computing: Current State and Future Opportunities. EDBT '11.
