UC BERKELEY
Conquering Big Data with BDAS (the Berkeley Data Analytics Stack)
Ion Stoica, UC Berkeley / Databricks / Conviva
Extracting Value from Big Data
Insights, diagnosis, e.g.:
» Why is user engagement dropping?
» Why is the system slow?
» Detect spam, DDoS attacks
Decisions, e.g.:
» Decide what features to add to a product
» Personalized medical treatment
» Decide when to change an aircraft engine part
» Decide what ads to show
Data is only as useful as the decisions it enables.
What do We Need?
Interactive queries: enable faster decisions
» E.g., identify why a site is slow and fix it
Queries on streaming data: enable decisions on real-time data
» E.g., fraud detection, detecting DDoS attacks
Sophisticated data processing: enable better decisions
» E.g., anomaly detection, trend analysis
Our Goal
» Support batch, streaming, and interactive computations in a single, unified framework
» Make it easy to develop sophisticated algorithms (e.g., graph and ML algorithms)
The Berkeley AMPLab (Algorithms, Machines, People)
January 2011 - 2017
» 8 faculty
» > 40 students
» 3-person software engineering team
Organized for collaboration:
» AMPCamp 3 (August 2013): 220 campers from 100+ companies
» 3-day retreats (twice a year)
The Berkeley AMPLab
Governmental and industrial funding.
Goal: build the next generation of the open source data analytics stack for industry & academia: the Berkeley Data Analytics Stack (BDAS)
Data Processing Stack
» Data Processing Layer
» Resource Management Layer
» Storage Layer
Hadoop Stack
» Data Processing Layer: Hive, Pig, Impala, Storm, Hadoop MR
» Resource Management Layer: Hadoop YARN
» Storage Layer: HDFS, S3, …
BDAS Stack
» Data Processing Layer: Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLBase/MLlib
» Resource Management Layer: Mesos
» Storage Layer: Tachyon, over HDFS, S3, …
How do BDAS & Hadoop fit together?
» The BDAS components (Spark Streaming, BlinkDB, Shark SQL, GraphX, MLBase/MLlib) run alongside the Hadoop-stack components (Hive, Pig, Impala, Storm, Hadoop MR)
» Resource management: Mesos alongside Hadoop YARN
» Storage: Tachyon, over HDFS, S3, …
Apache Mesos
Enables multiple frameworks to share the same cluster resources (e.g., Hadoop, Storm, …)
Twitter's large-scale deployment:
» 10,000+ servers
» 500+ engineers running jobs on Mesos
Third-party Mesos schedulers:
» Airbnb's Chronos
» Twitter's Aurora
Mesosphere: startup to commercialize Mesos
Apache Spark
Distributed execution engine:
» Fault-tolerant, efficient in-memory storage
» Low-latency, large-scale task scheduler
» Powerful programming model and APIs: Python, Java, Scala
Fast: up to 100x faster than Hadoop MR
» Can run sub-second jobs on hundreds of nodes
Easy to use: 2-5x less code than Hadoop MR
General: supports interactive & iterative apps
Fault Tolerance
Need to achieve:
» High-throughput reads and writes
» Efficient memory usage
Replication?
» Writes bottlenecked by the network
» Inefficient: stores multiple replicas
Persisting update logs?
» Big data processing can generate massive logs
Our solution: Resilient Distributed Datasets (RDDs)
» Partitioned collections of immutable records
» Use lineage to reconstruct lost data
RDD Example
Two-partition RDD A = {A1, A2} stored on disk:
1) filter and cache → RDD B
2) join → RDD C
3) aggregate → RDD D
Lineage: A1 → filter → B1 → join → C1; A2 → filter → B2 → join → C2; {C1, C2} → agg. → D
RDD Example (failure)
C1 is lost due to a node failure before the aggregate finishes.
Using lineage, C1 is eventually reconstructed (A1 → filter → B1 → join → C1), possibly on a different node.
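The recovery mechanism above can be sketched in plain Python. This is a toy model, not Spark's implementation: each partition simply remembers the transformation and parent partition that produced it, so a lost partition can be recomputed from lineage instead of being restored from a replica (the join step is simplified to a per-partition map).

```python
# Toy model of RDD lineage: each partition records the function and
# parent partition that produced it, so a lost partition is recomputed
# rather than read from a replica.

class Partition:
    def __init__(self, data=None, fn=None, parent=None):
        self.data = data      # None if the partition is lost / not cached
        self.fn = fn          # transformation that produced this partition
        self.parent = parent  # lineage pointer to the parent partition

    def get(self):
        if self.data is None:              # lost: recompute from lineage
            self.data = self.fn(self.parent.get())
        return self.data

# Build the chain A1 -> filter -> B1 -> map -> C1
a1 = Partition(data=[1, 2, 3, 4, 5, 6])
b1 = Partition(fn=lambda xs: [x for x in xs if x % 2 == 0], parent=a1)
c1 = Partition(fn=lambda xs: [x * 10 for x in xs], parent=b1)

c1.get()         # materialize C1
c1.data = None   # simulate losing C1 to a node failure
print(c1.get())  # recomputed from lineage: [20, 40, 60]
```

Because the records are immutable and the transformations deterministic, recomputation yields exactly the data that was lost, with no coordination protocol needed.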
Spark Streaming
Existing solutions: record-by-record processing (records flow through per-node operators, e.g., filter on nodes 1-2, aggregate on node 3)
» Low latency
» But hard to provide fault tolerance
» And hard to mitigate stragglers
Spark Streaming
Implemented as a sequence of micro-batch jobs (<1 s each):
» Fault tolerant
» Mitigates stragglers
» Ensures exactly-once semantics
Live data stream → batches of X seconds → Spark → processed results
Spark & Spark Streaming: batch, interactive, and streaming computations in one engine
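The micro-batch idea can be sketched in a few lines of Python (an illustration of the model, not Spark Streaming's API): chop a timestamped stream into fixed-interval batches and run a deterministic batch job on each. A failed or slow batch job can simply be re-run on its input batch, which is what makes fault tolerance and straggler mitigation straightforward.

```python
# Toy micro-batch model: a live stream is chopped into small batches,
# and each batch is processed by a deterministic batch job.

def micro_batches(records, batch_interval):
    """Group (timestamp, value) records into batches of `batch_interval` seconds."""
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // batch_interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

def batch_job(batch):
    """Deterministic per-batch computation: filter, then aggregate."""
    return sum(v for v in batch if v > 0)

stream = [(0.1, 5), (0.4, -2), (0.9, 3), (1.2, 7), (1.8, -1), (2.5, 4)]
results = [batch_job(b) for b in micro_batches(stream, batch_interval=1.0)]
print(results)  # one aggregate per 1-second batch: [8, 7, 4]
```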
Shark
Hive over Spark: full support for HiveQL and UDFs
» Up to 100x faster when the input is in memory
» Up to 5-10x faster when the input is on disk
Running on hundreds of nodes at Yahoo!
Not Only General, but Fast
[Benchmarks: streaming throughput (MB/s/node), Storm vs. Spark Streaming; interactive SQL response time (s), Hive vs. Impala (disk/mem) vs. Shark (disk/mem); time per iteration (s) on batch ML workloads, Hadoop vs. Spark]
Spark Distribution
Includes:
» Spark (core)
» Spark Streaming
» GraphX (alpha release)
» MLlib
In the future:
» Shark
» Tachyon
Explosive Growth
» 2,500+ meetup users
» 180+ contributors from 30+ companies (more contributors in the past year than Giraph, Storm, or Tez)
» 1st Spark Summit: 450+ attendees, 140+ companies
» 2nd Spark Summit: June 30 - July 2
Explosive Growth
Databricks: founded in 2013 to commercialize the Spark platform
Included in all major Hadoop distributions:
» Cloudera
» MapR
» Hortonworks (technical preview)
Enterprise support: Cloudera, MapR, Datastax
Spark and Shark available on Amazon's EMR
BlinkDB
Trade off query performance and accuracy using sampling
Why?
» In-memory processing doesn't guarantee interactive processing: e.g., ~10 sec just to scan 512 GB of RAM at 40-60 GB/s (16 cores)
» The gap between memory capacity and transfer rate keeps increasing: capacity doubles every 18 months, bandwidth only every 36 months
Key Insight
Computations don't always need exact answers:
» Input is often noisy: exact computations do not guarantee exact answers
» Error is often acceptable if small and bounded
Examples:
» Best scale: ± 200 g error
» Speedometers: ± 2.5% error (edmunds.com)
» OmniPod insulin pump: ± 0.96% error (www.ncbi.nlm.nih.gov/pubmed/22226273)
Approach: Sampling
Compute results on samples instead of the full data
» Typically, the error depends on the sample size n, not on the original data size: error ∝ 1/√n
Can trade off an answer's latency, accuracy, and cost
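The 1/√n behavior is easy to see empirically. The sketch below (illustrative numbers, not from the slides) measures the standard error of a sample mean over a fixed population: quadrupling the sample size should roughly halve the error, regardless of how large the population is.

```python
# Demonstrate error ~ 1/sqrt(n): the spread of a sample-mean estimate
# shrinks with the sample size n, not with the population size.
import random
import statistics

random.seed(0)
population = [random.gauss(100, 20) for _ in range(100_000)]

def stderr_of_mean(n, trials=200):
    """Empirical std. deviation of the sample mean over many samples of size n."""
    means = [statistics.fmean(random.sample(population, n)) for _ in range(trials)]
    return statistics.stdev(means)

# Expect roughly 20/sqrt(n): each 4x increase in n halves the error.
for n in (100, 400, 1600):
    print(n, round(stderr_of_mean(n), 3))
```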
BlinkDB Interface

SELECT avg(sessionTime)
FROM Table
WHERE city = 'San Francisco' AND dt = '2012-9-2'
WITHIN 2 SECONDS
→ 234.23 ± 15.32

SELECT avg(sessionTime)
FROM Table
WHERE city = 'San Francisco' AND dt = '2012-9-2'
ERROR 0.1 CONFIDENCE 95.0%
→ 239.46 ± 4.96
Quick Results
Dataset:
» 365 million rows, 204 GB on disk
» 600+ GB in memory (deserialized format)
Query: computes the 95th percentile
25 EC2 instances, each with:
» 4 cores
» 15 GB RAM
Query Response Time
[Chart: response time (seconds) vs. fraction of full data. Full data: 1020 s; samples of 10^-1 down to 10^-5 of the data: 103 s, 18 s, 13 s, 10 s, 8 s, with error bars of 0.02%, 0.07%, 1.1%, 3.4%, and 11% respectively. Only ~10x faster on the largest sample, as response time is dominated by I/O]
BlinkDB Overview
» Sampling module (offline): computes an optimal set of samples across different dimensions (columns or sets of columns) to support ad hoc exploratory queries
» Sample placement: samples striped over 100s or 1,000s of machines, both on disk and in memory
» Sample selection (online): given a HiveQL/SQL query (e.g., SELECT foo(*) FROM TABLE WITHIN 2 SECONDS), picks the best sample(s) based on the query's latency and accuracy requirements
» Execution: parallel query execution (on Hadoop/Spark) over multiple samples striped across multiple machines, returning results with error bars & confidence intervals, e.g., 182.23 ± 5.56 (95% confidence)
BlinkDB Challenges
» Which set of samples to build, given a storage budget?
» Which sample to run the query on?
» How to accurately estimate the error?
How to Accurately Estimate Error?
Closed-form formulas exist for a limited number of operators:
» E.g., count, mean, percentiles
What about user-defined functions (UDFs)?
» Use the bootstrap technique
Bootstrap
Quantifies the accuracy of a sample estimator f():
» X: the full data; S: a random sample of X with |S| = N
» We can't compute f(X), since we don't have X; so what is f(S)'s error?
» Resample S with replacement to obtain S1, …, Sk, each with |Si| = N
» Compute f(S1), …, f(Sk); use them to form the estimator, e.g., mean(f(Si)), and its error, e.g., stdev(f(Si))
Bootstrap for BlinkDB
Quantifies the accuracy of a query Q on a sample table:
» Q(T) on the original table T takes too long
» S: a sample of T with |S| = N; what is Q(S)'s error?
» Resample S with replacement to obtain S1, …, Sk, each with |Si| = N
» Compute Q(S1), …, Q(Sk); use them for estimator quality assessment, e.g., stdev(Q(Si))
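The procedure above can be sketched directly in Python. This is a minimal illustration, not BlinkDB's implementation; the session-time data and query are made up for the example.

```python
# Minimal bootstrap sketch: resample S with replacement k times, run the
# query Q on each resample, and report mean(Q(S_i)) as the estimate and
# stdev(Q(S_i)) as its error, as on the slide.
import random
import statistics

def bootstrap(sample, query, k=100, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    estimates = [query([sample[rng.randrange(n)] for _ in range(n)])
                 for _ in range(k)]
    return statistics.fmean(estimates), statistics.stdev(estimates)

# Example: Q = avg(sessionTime) on a hypothetical sample of session times.
rng = random.Random(1)
session_times = [rng.expovariate(1 / 240.0) for _ in range(2_000)]
estimate, error = bootstrap(session_times, statistics.fmean)
print(f"{estimate:.2f} +/- {error:.2f}")
```

The key property is that `query` is treated as a black box: any UDF that can run on the sample can be run on the resamples, with no closed-form error formula required.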
How Do You Know Error Estimation is Correct?
Assumption: f() is Hadamard differentiable
» How do you know a UDF is Hadamard differentiable?
» This is a sufficient, not a necessary, condition
We only get approximations of the true error distribution (true for closed-form formulas as well)
Previous work doesn't address the correctness of error estimation
How Bad Is It?
Workloads:
» Conviva: 268 real-world queries, 113 with custom user-defined functions
» Facebook
Closed forms / the bootstrap fail for:
» 3 in 10 Conviva queries
» 4 in 10 Facebook queries
Need runtime diagnosis!
Error Diagnosis
» Compare the bootstrap with ground truth on small samples
» Check whether the error improves as the sample size increases
Ground Truth (Approximation)
» Draw p independent samples S1, …, Sp from the original table T, and compute Q(S1), …, Q(Sp)
» Estimator quality assessment: ξ = stdev(Q(Sj))
Ground Truth and Bootstrap
» Bootstrap on each individual sample: from Si, draw k resamples Si1, …, Sik and compute ξ*i = stdev(Q(Sij))
Ground Truth vs. Bootstrap
» Compare the ground-truth error ξ with the average of the per-sample bootstrap estimates, mean(ξ*i), where ξ*i = stdev(Q(Sij))
» [Chart: relative error vs. sample size (n1, n2, n3), bootstrap vs. ground truth]
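The comparison can be sketched as follows. This is an illustration under simplifying assumptions, not BlinkDB's actual diagnostic: the table, query, sample sizes, and agreement tolerance are all made up, and the ground truth is approximated by the spread of Q over many fresh samples, as on the slides.

```python
# Sketch of the runtime diagnosis: at a few small sample sizes, compare
# the bootstrap's error estimate against a ground-truth proxy, and check
# that the estimated error shrinks as the sample size grows.
import random
import statistics

rng = random.Random(0)
table = [rng.gauss(50, 10) for _ in range(50_000)]
Q = statistics.fmean  # the query under test (a well-behaved one here)

def ground_truth_error(n, p=50):
    """xi = stdev(Q(S_j)) over p independent samples of size n."""
    return statistics.stdev(Q(rng.sample(table, n)) for _ in range(p))

def bootstrap_error(n, k=50):
    """xi* = stdev(Q(S_ij)) over k resamples of one size-n sample."""
    s = rng.sample(table, n)
    return statistics.stdev(
        Q([s[rng.randrange(n)] for _ in range(n)]) for _ in range(k))

def diagnose(sizes=(100, 400, 1600), tolerance=0.5):
    gt = [ground_truth_error(n) for n in sizes]
    bs = [bootstrap_error(n) for n in sizes]
    agrees = all(abs(b - g) / g < tolerance for b, g in zip(bs, gt))
    shrinks = all(b2 < b1 for b1, b2 in zip(bs, bs[1:]))
    return agrees and shrinks

print(diagnose())  # whether the bootstrap can be trusted for this query
```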
How Well Does It Work in Practice?
Evaluated on the Conviva query workload
The diagnostic predicted that 207 (77%) of the queries can be approximated:
» False negatives: 18
» False positives: 3 (conditional UDFs)
Overhead
Bootstrap and diagnostic overheads can be very large:
» The diagnostic requires running ~30,000 small queries!
Optimizations:
» Filter pushdown
» One-pass execution
Can reduce the overhead by orders of magnitude
Query + Bootstrap + Diagnosis
[Chart: response time (seconds) vs. fraction of full data (1 down to 10^-5), broken into query time, bootstrap overhead, and diagnosis overhead: 1020, 723, 467, 442, 436, 426 s. Bootstrap + diagnosis can cost 53x more than the query itself]
Optimization: Filter Pushdown
Perform filtering before resampling:
» Can dramatically reduce I/O
» Instead of resampling N elements and then filtering each resample down to n elements, filter once and resample only the n surviving elements (assume n << N)
Optimization: One-Pass Execution
Compute all k resample results in one pass!
Poissonized resampling: construct all k resamples in a single pass:
» For each resample, add a new column specifying how many times each row has been selected
» The query then generates results for all k resamples in one pass
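The trick rests on a standard fact: in a bootstrap resample of n rows drawn from n rows, each row's multiplicity is approximately Poisson(1). So instead of materializing k resamples, each row can be annotated with k Poisson(1) counts (the "new columns" on the slide) and all k results accumulated in a single scan. The sketch below is illustrative, not BlinkDB's code; the query is fixed to the mean for simplicity.

```python
# Sketch of Poissonized resampling: one scan over the sample produces
# all k bootstrap estimates, using per-row Poisson(1) multiplicities
# instead of materialized resamples.
import random
import statistics

def poisson1(rng):
    """Sample Poisson(lambda=1) by multiplying uniforms (Knuth's method)."""
    count, product = 0, rng.random()
    while product >= 0.36787944117144233:  # e^-1
        product *= rng.random()
        count += 1
    return count

def one_pass_bootstrap(sample, k=100, seed=0):
    """One scan over `sample` -> k bootstrap estimates of the mean."""
    rng = random.Random(seed)
    sums = [0.0] * k
    counts = [0] * k
    for value in sample:          # the single pass over the data
        for j in range(k):        # k Poisson count "columns" per row
            c = poisson1(rng)
            sums[j] += c * value
            counts[j] += c
    return [s / c for s, c in zip(sums, counts) if c > 0]

rng = random.Random(1)
sample = [rng.gauss(240, 60) for _ in range(1_000)]
estimates = one_pass_bootstrap(sample)
print(round(statistics.fmean(estimates), 1), round(statistics.stdev(estimates), 2))
```

The spread of `estimates` plays the role of stdev(Q(Si)) from the bootstrap slides, but the data is read only once.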
Query + Bootstrap + Diagnosis (with Filter Pushdown and Single-Pass Execution)
[Chart: response time (seconds) vs. fraction of full data (1 down to 10^-5): 1020, 120, 26, 19, 15, 13 s. Less than 70% overhead, an ~80x reduction]
Open Source BlinkDB is open-sourced and released at http://blinkdb.org
Open Source
Used regularly by 100+ engineers at Facebook Inc.
Lessons Learned
» Focus on novel usage scenarios
» Be paranoid about simplicity: it's very hard to build real, complex systems in academia
» Basic functionality first, performance optimizations next
Example: Mesos
Focus on novel usage scenarios:
» Let multiple frameworks share the same cluster
Be paranoid about simplicity:
» Just enforce the allocation policy across frameworks, e.g., fair sharing
» Let frameworks decide which slots they accept, which tasks to run, and when to run them
» First release: 10K lines of code
Basic functionality first, performance optimizations next:
» First, support arbitrary scheduling, even if inefficient
» Later, add filters to improve performance
Example: Spark
Focus on novel usage scenarios:
» Interactive queries and iterative (ML) algorithms
Be paranoid about simplicity:
» Immutable data; avoid complex consistency protocols
» First release: 2K lines of code
Basic functionality first, performance optimizations next:
» First, no automatic checkpointing
» Later, add automatic checkpointing
Example: BlinkDB
Focus on novel usage scenarios:
» Approximate computations; error diagnosis
Be paranoid about simplicity:
» Use the bootstrap as a generic technique; no support for closed-form formulas
Basic functionality first, performance optimizations next:
» First, straightforward but very expensive error estimation; later, optimizations to reduce the overhead
» First, manual sample generation; later, automatic sample generation (to do)
Summary
BDAS: addresses the next Big Data challenges
» Unifies batch, interactive, and streaming computations
» Enables users to trade off response time, accuracy, and cost
Explosive adoption:
» 30+ companies, 180+ individual contributors (Spark)
» The Spark platform is included in all major Hadoop distros
» Enterprise-grade support and professional services for both the Mesos and Spark platforms