The Berkeley Data Analytics Stack: Present and Future


The Berkeley Data Analytics Stack: Present and Future
Michael Franklin
27 March 2014, Technion Big Day on Big Data
UC Berkeley

BDAS in the Big Data Context

Sources Driving Big Data: It's All Happening On-line
Every: click, ad impression, billing event, fast-forward/pause, friend request, transaction, network message, fault
User generated (web & mobile), Internet of Things / M2M, scientific computing

Beyond Volume, Velocity and Variety
Step 1: Run faster (spend money and time on massive, diverse, and growing data)
Step 2: Trade off money, time, and answer quality

AMPLab: Integrating Diverse Resources
Algorithms: machine learning, statistical methods; prediction, business intelligence
Machines: clusters and clouds; warehouse-scale computing
People: crowdsourcing, human computation; data scientists, analysts

AMPLab
Launched January 2011, 6-year duration
60+ students, postdocs, faculty and staff at UC Berkeley
In-house applications: cancer genomics, mobile sensing, IoT (smartphones)
Funded by industry sponsors, foundations, and the White House Big Data program
Faculty: Franklin (DB), Jordan (ML), Stoica (Sys), Patterson (Sys), Shenker (Net), Recht (ML), Katz (Sys), Joseph (Sec), Goldberg (HCI)

Carat: Big Data at Work. 715,000+ downloads

Open Source Engagement
SF Spark Meetup: 1700+ members; also Boston, Hyderabad, ...
Bootcamps: Berkeley, Strata, on-line
Meetup talk on MLbase at Twitter HQ (Aug 6, 2013)

Big Data Systems Today
General batch processing: MapReduce
Specialized systems (iterative, interactive and streaming apps): Pregel, Giraph, Dremel, Impala, Storm, Drill, Tez, GraphLab, S4

BDAS Philosophy: Don't specialize MapReduce. Generalize it!
Two additions to Hadoop MR can enable all the models on the previous slide:
1. General task DAGs
2. Data sharing
For users: fewer systems, less copying
(Spark at the core, with Shark, Streaming, GraphX, and MLbase on top)

For Developers: Code Size
[Bar chart, non-test, non-example source lines: Hadoop MapReduce, Storm (streaming), Impala (SQL), and Giraph (graph) are each large, separate codebases; Spark, even after adding Streaming, Shark*, and GraphX, remains far smaller than Hadoop MapReduce alone. * Shark also calls into Hive]

Berkeley Data Analytics Stack
Applications: Traffic, Carat, Genomics, ...; 3rd-party tools: IPython, visualization, data cleaning, ...
Analytics frameworks and data management: Shark (SQL), BlinkDB, GraphX, MLbase, Spark Streaming, MLlib, Spark-R, PySpark
Processing: Apache Spark
Storage: Tachyon, HDFS / Hadoop storage
Resource manager: Apache Mesos, YARN
(Components are marked AMP alpha/coming soon, AMP released under BSD/Apache, or 3rd-party open source)

Spark
Fast, MapReduce-like engine
» In-memory storage abstraction for iterative/interactive queries
» General execution graphs
» Up to 100x faster than Hadoop MR (2-10x even for on-disk data)
Compatible with Hadoop's storage APIs
» Can access HDFS, HBase, S3, SequenceFiles, etc.
Great example of ML/Systems/DB collaboration

Example: Logistic Regression
Goal: find the best line separating two sets of points
[Figure: a random initial line iteratively converging to the target separator]
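The iterative computation behind this example can be sketched in plain Python (no Spark required). The toy data, learning rate, and iteration count are illustrative; the gradient update mirrors the Scala code shown later in the talk.

```python
import math
import random

def logistic_regression(points, iterations=100, lr=0.1):
    """Batch gradient descent for a linear separator w.x (no intercept),
    using the same update rule as the talk's later Scala example."""
    dim = len(points[0][0])
    rng = random.Random(42)
    w = [2 * rng.random() - 1 for _ in range(dim)]  # random initial line
    for _ in range(iterations):
        gradient = [0.0] * dim
        for x, y in points:                          # y is +1 or -1
            margin = sum(wi * xi for wi, xi in zip(w, x))
            scale = (1.0 / (1.0 + math.exp(-y * margin)) - 1.0) * y
            for i in range(dim):
                gradient[i] += scale * x[i]
        for i in range(dim):
            w[i] -= lr * gradient[i]
    return w

# Two toy point clouds, separated by the sign of the first coordinate.
points = [((1.0 + 0.1 * i, 1.0), 1) for i in range(5)] + \
         [((-1.0 - 0.1 * i, 1.0), -1) for i in range(5)]
w = logistic_regression(points)
```

Each pass over `points` corresponds to one Spark iteration; keeping `points` cached in memory is what makes the later iterations fast.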

ML and Queries in Hadoop
[Diagram: each query or iteration reads the input from HDFS and writes its result back, paying an HDFS read/write per step]

In-Memory Data Sharing
[Diagram: one-time processing loads the input into distributed memory; subsequent queries and iterations share it without returning to HDFS]
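The benefit of in-memory sharing can be sketched in plain Python (names are illustrative, not Spark's API): the dataset is materialized once, and every later query reuses it instead of re-reading storage.

```python
reads = {"count": 0}  # counts simulated HDFS scans

def read_from_storage():
    """Stand-in for an expensive HDFS scan."""
    reads["count"] += 1
    return list(range(1_000_000))

class CachedDataset:
    """Load once, then serve every query from memory (the idea behind cache())."""
    def __init__(self, loader):
        self._loader = loader
        self._data = None

    def get(self):
        if self._data is None:       # one-time processing
            self._data = self._loader()
        return self._data

ds = CachedDataset(read_from_storage)
q1 = sum(ds.get())                   # query 1: triggers the single load
q2 = max(ds.get())                   # query 2: served from memory
```

After both queries, `reads["count"]` is still 1: the storage layer was touched exactly once.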

Logistic Regression Performance
[Chart: running time (min) vs. number of iterations (1-30), 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each). Hadoop: 110 s per iteration. Spark: 80 s for the first iteration, 1 s for further iterations]

Challenge: a distributed memory abstraction that is both efficient and fault-tolerant

Resilient Distributed Datasets (RDDs)
API: coarse-grained transformations (map, group-by, join, sort, filter, sample, ...) on immutable collections
Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions of an RDD on failure
» No cost if nothing fails
Rich enough to capture many models:
» Data flow models: MapReduce, Dryad, SQL, ...
» Specialized models: Pregel, Hama, ...
M. Zaharia et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," NSDI 2012. Best Paper Award.

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count      // action
cachedMsgs.filter(_.contains("bar")).count

[Diagram: the driver ships tasks to workers, each of which caches its block of messages in memory and returns results]

Fault Tolerance with RDDs
RDDs track the series of transformations used to build them (their lineage), which enables per-node recomputation of lost data:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split("\t")(2))

Lineage chain: HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...))
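The lineage idea can be sketched in a few lines of plain Python (a toy model, not Spark's implementation): each dataset records only the operation that produced it, so a lost partition is rebuilt by replaying that operation on the parent, without replicating the data.

```python
class RDD:
    """Toy lineage-tracking dataset: partitions can be dropped and rebuilt."""
    def __init__(self, parent, op):
        self.parent, self.op = parent, op
        self.partitions = None

    def compute(self):
        if self.partitions is None:
            self.partitions = [self.op(p) for p in self.parent.compute()]
        return self.partitions

    def filter(self, pred):
        return RDD(self, lambda part: [x for x in part if pred(x)])

    def lose_partition(self, i):
        self.partitions[i] = None  # simulate a node failure

    def recover(self, i):
        # Replay only the lost partition's lineage; others are untouched.
        self.partitions[i] = self.op(self.parent.compute()[i])

class Source(RDD):
    def __init__(self, partitions):
        self.partitions = partitions

    def compute(self):
        return self.partitions

logs = Source([["ERROR a", "ok"], ["ERROR b", "ok"]])
errors = logs.filter(lambda line: line.startswith("ERROR"))
errors.compute()
errors.lose_partition(1)
errors.recover(1)      # rebuilt from lineage: no cost unless something fails
```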

Spark Status
Current release (0.9) includes Java and Python APIs
» Apache Top-Level Project (as of Feb 2014)
» Includes: streaming, MLlib, YARN integration, EC2, GraphX
» 100+ developers from 20 organizations have contributed code
Supported by Cloudera, WANdisco & others TBA; Spark apps to be certified by Databricks (AMPLab spinout)
Sample use cases:
» In-memory analytics on Hive data (Conviva)
» Interactive queries on data streams (Quantifind)
» Business intelligence (Yahoo!)
» DNA sequence alignment (SNAP)

Berkeley Data Analytics Stack
[Stack diagram shown again: applications and 3rd-party tools over Shark (SQL), BlinkDB, GraphX, MLbase, Spark Streaming, MLlib, Spark-R, and PySpark; Apache Spark for processing; Tachyon and HDFS/Hadoop for storage; Mesos and YARN as resource managers]

Shark

Shark = Spark + Hive
Uses Spark's in-memory RDD caching and lineage
» Result reuse and low latency
» Scalable, fault-tolerant, fast
Query-compatible with Hive
» Run HiveQL queries (w/ UDFs, UDAFs, ...) without modifications
» Converts the logical query plan generated by Hive into a Spark execution graph
Data-compatible with Hive
» Uses existing HDFS data and Hive metadata, without modifications
C. Engle et al., "Shark: Fast Data Analysis Using Coarse-Grained Distributed Memory," SIGMOD 2012 (system demonstration). Best Demo Award.
R. Xin et al., "Shark: SQL and Rich Analytics at Scale," SIGMOD 2013.

Shark Optimizations
» Fast task start-up
» Optimized column-oriented storage
» Dynamic (mid-query) join algorithm selection based on statistical properties of the data
» Runtime selection of the number of reducers
» Partition pruning using range statistics
» Controllable table partitioning across
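Partition pruning with range statistics can be sketched as follows (plain Python; the data and names are illustrative, not Shark's internals): each partition keeps a (min, max) summary of a column, and any partition whose range cannot satisfy the predicate is skipped without being scanned.

```python
# Each partition stores its rows plus (min, max) range statistics
# for the filtered column, collected when the data was loaded.
partitions = [
    {"rows": [3, 7, 9],    "min": 3,   "max": 9},
    {"rows": [12, 15, 18], "min": 12,  "max": 18},
    {"rows": [101, 150],   "min": 101, "max": 150},
]

def scan_greater_than(parts, threshold):
    """Return rows > threshold, skipping partitions the stats rule out."""
    scanned, result = 0, []
    for p in parts:
        if p["max"] <= threshold:    # prune: no row here can possibly match
            continue
        scanned += 1
        result.extend(r for r in p["rows"] if r > threshold)
    return result, scanned

rows, scanned = scan_greater_than(partitions, 100)
# Only the third partition is actually read.
```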

Running Time: Join + Order By

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS totalRev
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
  AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
GROUP BY UV.sourceIP
ORDER BY totalRev DESC LIMIT 1

[Charts: running time (sec) at 485K and 533M UserVisits; see https://amplab.cs.berkeley.edu/benchmark/]

A Unified System for SQL & ML
Deep integration of Shark and Spark: both share the same set of workers and caches, so you can move seamlessly between the SQL and machine learning worlds.

def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}
val trainedVector = logRegress(features.cache())

GraphX: Adding Graphs to the Mix
Tables and graphs are composable views of the same physical data, held in a unified representation. Each view has its own operators that exploit the semantics of the view to achieve efficient execution.
R. Xin, J. Gonzalez, M. Franklin, I. Stoica, "GraphX: In-situ Graph Computation Made Easy," GRADES Workshop at SIGMOD, June 2013.
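The table/graph duality can be sketched in plain Python (a toy illustration, not the GraphX API): the same edge table supports both relational operators (filter) and graph operators (out-degree) without copying the underlying data.

```python
from collections import Counter

# One physical dataset: an edge table (src, dst, relationship).
edges = [("alice", "bob",   "follows"),
         ("alice", "carol", "follows"),
         ("bob",   "carol", "likes")]

# Table view: relational operators over the rows.
follows = [(s, d) for (s, d, rel) in edges if rel == "follows"]

# Graph view: graph operators over the very same rows.
out_degree = Counter(s for (s, _, _) in edges)

# Both views read the same physical data; nothing was duplicated.
```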

Speed/Accuracy Trade-off
[Chart: error vs. execution time. Interactive queries run in ~5 sec; executing on the entire dataset takes ~30 mins]

BlinkDB
Fast, approximate answers with error bars, by executing queries on small, pre-collected samples of the data
Compatible with Apache Hive (storage, SerDes, UDFs, types, metadata) and HiveQL (with minor modifications)
Agarwal et al., "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data," ACM EuroSys 2013. Best Paper Award.
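The core idea can be sketched in plain Python (toy data; the 1.96 factor gives an approximate 95% confidence interval under the Central Limit Theorem): estimate an aggregate from a small random sample and report an error bar, instead of scanning everything.

```python
import math
import random

def approx_avg(data, fraction, seed=0):
    """Estimate the mean from a random sample, with a ~95% error bar."""
    rng = random.Random(seed)
    n = int(len(data) * fraction)
    sample = rng.sample(data, n)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    error_bar = 1.96 * math.sqrt(var / n)      # CLT-based 95% interval
    return mean, error_bar

# Synthetic column: 100,000 values around 100 with std dev 15.
data = [random.Random(i).gauss(100.0, 15.0) for i in range(100_000)]
estimate, err = approx_avg(data, fraction=0.01)  # scan only 1% of the rows
```

Shrinking the sample makes the query faster but widens `err`, which is exactly the speed/accuracy trade-off the previous slide plots.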

Sampling vs. No Sampling
[Chart: query response time (sec) vs. fraction of full data (1 down to 10^-5). Full data: ~1020 s; samples: 103, 18, 13, 10, and 8 s, with error bars of 0.02%, 0.07%, 1.1%, 3.4%, and 11% respectively. Roughly 10x faster, as response time is dominated by I/O]

Speed/Accuracy Trade-off (revisited)
[Chart: the same error vs. execution time curve, now showing the pre-existing noise floor in the data; interactive queries at ~5 sec vs. ~30 mins on the entire dataset]

Supporting Data Scientists
Interactive analytics, visual analytics, collaboration; statistics, metadata, and people as resources
Hybrid human-machine computation: data cleaning, active learning, handling the last 5%
[CrowdSQL architecture diagram: parser, optimizer, and executor over disks, files, and access methods, extended with UI creation, a Turker relationship manager, a form editor, a UI template manager, and a HIT manager]
Franklin, Kossmann et al., "CrowdDB: Answering Queries with Crowdsourcing," SIGMOD 2011.
Wang et al., "CrowdER: Crowdsourcing Entity Resolution," VLDB 2012.
Trushkowsky et al., "Crowdsourcing Enumeration Queries," ICDE 2013. Best Paper Award.

Working with the Crowd
Incentives; fatigue, fraud, & other failure modes; latency & prediction; work conditions; interface impacts on answer quality; task structuring; task routing

Less is More? Data Cleaning + Sampling (J. Wang et al., work in progress)

Other Things We're Working On
» MLbase: declarative scalable machine learning
» OLTP and serving workloads: MDCC (Multi-Data-Center Consistency), HAT (Highly Available Transactions), PBS (Probabilistically Bounded Staleness), PLANET (Predictive Latency-Aware Networked Transactions)
» Fast matrix manipulation libraries
» Cold storage, partitioning, distributed caching
» Machine learning pipelines, GPUs, ...

Summary The Berkeley AMPLab is integrating Algorithms, Machines and People to make sense of data at scale. BDAS is our main delivery vehicle. Open Source has been a great way to have real industry impact with academic research. Our direction: advanced analytics, extreme elasticity, and support for people in all phases of Big Data analytics.

For More Information
UC Berkeley AMPLab: amplab.cs.berkeley.edu
franklin@berkeley.edu
Thanks to NSF CISE Expeditions in Computing, DARPA XData, founding sponsors Amazon Web Services, Google, and SAP, the Thomas and Stacy Siebel Foundation, and all our industrial sponsors and partners.