A Comparison of Approaches to Large-Scale Data Analysis




A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009

MapReduce (MR) vs. Databases. Our goal: understand the performance and architectural differences. Both are suitable for large-scale data processing, i.e., analytical processing workloads: bulk loads and queries over large amounts of data, not transactional workloads.

Why Compare Them? Reason 1: CACM '09 argued that MapReduce is a new way of thinking about programming large distributed systems. To DBMS researchers the programming model doesn't feel new, but other communities don't know this, so it is important to evangelize. Reason 2: Facebook is using Hive (a SQL layer) on Hadoop (MapReduce) to manage a 4 TB data warehouse.* The database community risks losing hearts and minds. It is important to show where database systems are the right choice, and to educate other communities, e.g., inform Hive developers about how to architect a parallel DBMS. * http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/

This Talk: comparison of the two architectures; a benchmark of tasks that either system should execute well; results on two shared-nothing DBMSs and Hadoop; navel gazing.

Architectural Differences. MapReduce operates on in-situ data, without requiring transformation or loading. Schemas: MapReduce doesn't require them, DBMSs do; this makes it easy to write simple MR programs, but there is no logical data independence. Indexes: MR provides no built-in support.

Architectural Differences: Programming Model. It is common to write multiple MapReduce jobs to perform a complex task; Google's search index construction uses more than 10 MR jobs. This is analogous to multiple joins/subqueries in a DBMS, e.g., SELECT ... FROM (subquery), but there is no built-in optimizer in MR to order or unnest these, since semantic analysis of arbitrary imperative programs is hard. MR intermediate results go to disk and are pulled by the next task, so multiple rounds of redistribution are likely slow.
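To make the chaining concrete, here is a minimal sketch, not from the talk, of running two Hadoop jobs back to back with the 0.19-era mapred API; the identity mapper/reducer and the /input, /tmp/pass1, and /output paths are placeholders chosen for illustration. The point is simply that job 2 re-reads job 1's materialized output from HDFS, and nothing reorders or merges the two passes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        // Pass 1: reads /input, materializes its output in /tmp/pass1 on HDFS.
        JobConf pass1 = new JobConf(ChainedJobs.class);
        pass1.setJobName("pass-1");
        pass1.setMapperClass(IdentityMapper.class);
        pass1.setReducerClass(IdentityReducer.class);
        pass1.setOutputKeyClass(LongWritable.class);
        pass1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(pass1, new Path("/input"));
        FileOutputFormat.setOutputPath(pass1, new Path("/tmp/pass1"));
        JobClient.runJob(pass1);   // blocks until pass 1 completes

        // Pass 2: pulls pass 1's output back off disk and processes it again.
        JobConf pass2 = new JobConf(ChainedJobs.class);
        pass2.setJobName("pass-2");
        pass2.setMapperClass(IdentityMapper.class);
        pass2.setReducerClass(IdentityReducer.class);
        pass2.setOutputKeyClass(LongWritable.class);
        pass2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(pass2, new Path("/tmp/pass1"));
        FileOutputFormat.setOutputPath(pass2, new Path("/output"));
        JobClient.runJob(pass2);
    }
}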

Architectural Differences: Expressiveness. SQL plus UDFs is almost as expressive as MR, but the impedance mismatch is no fun and the programming model is inelegant; DBMSs make this much harder than they should, and UDFs complicate optimization.

Architectural Differences: Fault Tolerance. MR supports mid-query fault tolerance; DBMSs typically don't. This only matters as the number of nodes gets large: assuming 1 failure per node per month and 1-hour queries, Pr(mid-query failure with 10 nodes) = 1%, with 100 nodes = 13%, with 1000 nodes = 75%. MR doesn't provide transactions, disaster recovery, etc., but that is not fundamental.
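As a sanity check, these percentages follow from a simple independence model, which I am assuming is the one behind the slide: each node fails about once per 720-hour month, so its chance of failing during a 1-hour query is roughly p = 1/720, and with n independent nodes

\[
\Pr[\text{mid-query failure}] = 1 - (1 - p)^n, \qquad p \approx \tfrac{1}{720} \approx 0.0014,
\]
\[
n = 10: \approx 1\%, \qquad n = 100: \approx 13\%, \qquad n = 1000: \approx 75\%.
\]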

The Benchmark. Goals: understand differences in load and query time for some common data processing tasks. Choose a representative set of tasks: some that both systems should excel at, some that MapReduce should excel at, and some that databases should excel at. Ran on a 100-node Linux cluster at U. Wisconsin.

Software.
Hadoop: version 0.19.0, Java 1.6.
DBMS-X: a parallel shared-nothing row-store from a major vendor; hash partitioned, sorted, and indexed beneficially; compression enabled; a great deal of manual tuning required.
Vertica: a parallel shared-nothing column-oriented database (version 2.6); compression enabled; default parameters, except hints that only one query runs at a time; no secondary indices, tables sorted beneficially.

Grep. Used in the original MapReduce paper. Look for a 3-character pattern in the 90-byte field of 100-byte records with schema: key VARCHAR(10) PRIMARY KEY, field VARCHAR(90). The pattern occurs in 0.01% of records. SELECT * FROM T WHERE field LIKE '%XYZ%'. 1 TB of data spread across 25, 50, or 100 nodes: ~10 billion records, 10-40 GB per node. Expected Hadoop to perform well.
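For reference, a minimal sketch of what the MR side of the grep task can look like in the 0.19-era mapred API; this is an illustration rather than the benchmark's actual code, and it assumes records arrive as Text key/value pairs (e.g., via KeyValueTextInputFormat) and that matching records are emitted unchanged.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GrepMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
    private static final String PATTERN = "XYZ";   // stand-in for the 3-character pattern

    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        // Emit the record only if the 90-byte field contains the pattern,
        // mirroring SELECT * FROM T WHERE field LIKE '%XYZ%'.
        if (value.toString().indexOf(PATTERN) >= 0) {
            output.collect(key, value);
        }
    }
}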

1 TB Grep Load Times. Databases don't scale linearly*; Hadoop does. [Chart: load time in seconds, up to ~30,000, for Hadoop, Vertica, and DBMS-X across the 25x40GB, 50x20GB, and 100x10GB configurations.] * Vertica 3.0 adds improved parallel loading; early experiments show linear scalability.

1 TB Grep Query Times. Vertica's compression works better than DBMS-X here; near-linear speedup. [Chart: query time in seconds, up to ~1,600, for Hadoop, Vertica, and DBMS-X across the 25x40GB, 50x20GB, and 100x10GB configurations.]

Analytical Tasks. Simple web processing schema (see the paper for the complete schema):
CREATE TABLE Documents (
    url VARCHAR(100) PRIMARY KEY,
    contents TEXT
);
CREATE TABLE UserVisits (
    sourceIP VARCHAR(16),
    adRevenue FLOAT,
    -- fields omitted
);
600,000 randomly generated documents per node, whose contents embed URLs: ~8 GB per node. 155 million user visits per node: ~20 GB per node. Loading performance is similar to grep.

Aggregation Task. A simple aggregation query to find adRevenue by IP prefix:
SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM UserVisits
GROUP BY SUBSTR(sourceIP, 1, 7);
A typical parallel analytics query for DBMSs: compute a partial aggregate on each node, then merge the answers to produce the result. Yields 2,000 records (24 KB).

Hadoop Code (Map).

public void map(Text key, Text value,
                OutputCollector<Text, DoubleWritable> output,
                Reporter reporter) throws IOException {
    //
    // Split the value using VALUE_DELIMITER into separate fields.
    // The *third* field should be our revenue.
    //
    String fields[] = value.toString().split("\\" + BenchmarkBase.VALUE_DELIMITER);
    if (fields.length > 0) {
        try {
            Double revenue = Double.parseDouble(fields[2]);
            key = new Text(key.toString().substring(0, 7));
            output.collect(key, new DoubleWritable(revenue));
        } catch (ArrayIndexOutOfBoundsException ex) {
            System.err.println("ERROR: Invalid record for key '" + key + "'");
        } catch (NumberFormatException ex) {
            System.err.println("ERROR: Invalid revenue for key '" + key + "'");
        }
    }
    return;
}

The reduce code simply sums the revenue for the assigned key. Note the implicit schema and the record parsing and validation, work a DBMS does once at load time.
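The talk does not show the reduce side; here is a minimal sketch, again a reconstruction rather than the benchmark's code, of a reduce method that sums the revenue for each 7-character sourceIP prefix in the same old mapred API:

public void reduce(Text key, Iterator<DoubleWritable> values,
                   OutputCollector<Text, DoubleWritable> output,
                   Reporter reporter) throws IOException {
    // Sum the ad revenue of every record sharing this IP-prefix key.
    double total = 0.0;
    while (values.hasNext()) {
        total += values.next().get();
    }
    output.collect(key, new DoubleWritable(total));
}

(Uses java.util.Iterator, matching the Reducer interface of the 0.19 API.)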

Performance Results. All systems show linear scale-up, but the DBMSs perform better: no parsing overheads, plus compression. Vertica wins because of its column orientation. [Chart: aggregation query time in seconds, up to ~1,400, for Hadoop, Vertica, and DBMS-X on 1, 10, 25, 50, and 100 nodes.]

Additional DBMS-y Tasks. The paper also reports selection and join tasks. Hadoop also performs poorly on these, and the Hadoop code for the join gets hairy. Vertica does very well on some tasks thanks to its column orientation.

Discussion. Hadoop is much easier to set up. Hadoop load times are faster (loading is basically non-existent). Hadoop query times are a lot slower, by 1-2 orders of magnitude, due to parsing and indexing, compression, and the execution model.

When to Choose MapReduce. MapReduce is designed for one-off processing tasks where fast load times are important and there is no repeated access; as such, the lack of a schema, lack of indexing, etc., is not so alarming. However, there is no compelling reason to choose MR over a database for traditional database workloads; e.g., scalability appears to be comparable despite claims to the contrary.

Conclusion. MapReduce and database systems fill different niches: one-off processing vs. repeated re-access.
MapReduce goodness: ease of use and out-of-the-box experience; a single programming language; attractive fault tolerance properties; fast load times.
Database goodness: fast query times, due to indexing and compression; supporting tools.
Many recently proposed hybrids: Hive, Pig, SCOPE, DryadLINQ, HadoopDB, etc. It will be interesting to see how their performance and usability compare, and how much can be layered on top of MapReduce.
Call to arms: because Hadoop is open source and massively parallel, people are using it; an open-source shared-nothing parallel DBMS is needed.

Osprey: MapReduce-Style Fault Tolerance for DBMSs. Christopher Yang, Christine Yen, Ceryen Tan. To appear in ICDE 2010.

Motivation. Long-running queries on distributed databases may crash mid-flight, especially as more nodes are added. Workload skew, machine load, and other factors may lead to an imbalanced distribution of work. It would be good to allow queries to tolerate faults and slow workers, a la MapReduce.

Approach. Osprey is a middleware layer built on top of N Postgres installations. Tables are horizontally partitioned across machines, and partitions are replicated k times using chained declustering, so each node is the backup for the previous k nodes. Queries are decomposed into fragments, each of which runs on one partition; the coordinator assembles the answers to the fragments to answer the query. A job scheduler tracks the liveness of replicas and dispatches fragments; fragments on slow or failed nodes are reissued. Several different scheduling policies are supported.
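To make the chained-declustering layout concrete, here is a small sketch of one common placement convention consistent with "each node is the backup for the previous k nodes"; it is an illustration, not Osprey code, and the exact partition-to-node mapping is an assumption.

import java.util.ArrayList;
import java.util.List;

public class ChainedDeclustering {
    // Nodes holding copies of a partition: the primary, then k backups on the
    // following nodes in the chain (wrapping around), so that each node stores
    // backup copies of the partitions primarily owned by the k nodes before it.
    static List<Integer> nodesForPartition(int partition, int numNodes, int k) {
        List<Integer> nodes = new ArrayList<Integer>();
        int primary = partition % numNodes;
        for (int i = 0; i <= k; i++) {
            nodes.add((primary + i) % numNodes);
        }
        return nodes;
    }

    public static void main(String[] args) {
        int numNodes = 3, k = 1;   // e.g., 3 Postgres nodes, one backup per partition
        for (int p = 0; p < numNodes; p++) {
            System.out.println("partition " + p + " -> nodes "
                               + nodesForPartition(p, numNodes, k));
        }
        // With N = 3 and k = 1: partition 0 -> [0, 1], 1 -> [1, 2], 2 -> [2, 0].
    }
}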

Architecture. [Diagram: the user sends a query to, and receives an answer from, a coordinator containing a query transformer, a result merger, and a query manager/scheduler. The coordinator sends SQL fragments to, and collects results from, workers 1..N, each running Postgres; with chained declustering, Postgres 1 holds partitions 1 and 2, Postgres 2 holds partitions 2 and 3, and Postgres 3 holds partitions 3 and 1.]

Experiments. Questions: How well does it scale? How high are the overheads? How well does it balance load? How well does it tolerate faults? Setup: 8 machines running Postgres 8.3; k = 1; SSB scale 1 (5.5 GB, 1.1 GB per node); 512 MB buffer pool per node.

Scaleup

Stressing 1-3 Nodes (N = 4); replication factor k.

Failed Node

Conclusion. Osprey is a middleware-based solution for mid-query fault tolerance: it decomposes queries into sub-jobs and schedules them with different policies. It achieves linear speedup (for small numbers of nodes) and good fault tolerance properties. Results are not on an MR cluster because we don't use Hadoop. Next talk: a similar idea with a slightly different architecture that scales to many more nodes and has a much more robust query processor.