Making Sense at Scale with the Berkeley Data Analytics Stack
Michael Franklin, UC Berkeley
WSDM 2015, Shanghai, February 3, 2015
Agenda A Little Bit on Big Data AMPLab Overview BDAS Philosophy: Unification not Specialization Spark, GraphX, and other BDAS components Wrap Up: Thoughts on Big Data Software
Sources Driving Big Data: It's All Happening Online
Every: click, ad impression, billing event, fast-forward/pause, friend request, transaction, network message, fault, ...
User generated (web & mobile), Internet of Things / M2M, scientific computing
Big Data: A Bad Definition
"Data sets, typically consisting of billions or trillions of records, that are so vast and complex that they require new and powerful computational resources to process." - Dictionary.com
(Big Data as a Problem!)
Big Data as a Resource
"For a long time, we thought that Tamoxifen was roughly 80% effective for breast cancer patients. But now we know much more: we know that it's 100% effective in 70% to 80% of the patients, and ineffective in the rest." - Tim O'Reilly et al., How Data Science is Transforming Health Care
With enough of the right data, you can determine precisely who the treatment will work for. Thus, even a 1%-effective drug could save lives.
A Big Data Pattern: 750,000+ downloads
AMPLab: Integrating 3 Resources
Algorithms: machine learning, statistical methods; prediction, business intelligence
Machines: clusters and clouds; warehouse-scale computing
People: crowdsourcing, human computation; data scientists, analysts
AMPLab Overview
"Berkeley's AMPLab has already left an indelible mark on the world of information technology, and even the web. But we haven't yet experienced the full impact of the group... Not even close." - Derrick Harris, GigaOM, Aug 2, 2014
70+ students, postdocs, faculty, and staff from databases, machine learning, systems, security, and networking: Franklin, Jordan, Stoica, Culler, Goldberg, Joseph, Katz, Patterson, Recht, Shenker
50/50 split: industry sponsors and government (White House Big Data Program: NSF CISE Expeditions in Computing and DARPA XData)
Fixed timeline (ends Dec 2016); collaborative working space
See Dave Patterson, "How to Build a Bad Research Center," CACM, March 2014
A Nexus of Industrial Engagement: industrial-strength open source software; twice-yearly 3-day offsite retreats; AMPCamp training
Open Source Community Building MeetUp on MLBase @Twitter (Aug 6, 2013) Spark Summit SF (June 30, 2014)
Big Data Ecosystem Evolution
MapReduce: general batch processing
Specialized systems for iterative, interactive, and streaming apps: Pregel, Giraph, Dremel, Impala, Storm, Drill, S4, Tez, GraphLab
AMPLab Unification Philosophy
Don't specialize MapReduce; generalize it!
Two additions to Hadoop MR can enable all the models shown earlier:
1. General task DAGs
2. Data sharing
For users: fewer systems to use, less data movement
(Spark at the core, with SparkSQL, Streaming, GraphX, and MLbase on top)
Berkeley Data Analytics Stack (Apache and BSD open source)
Layers, top to bottom: In-house Apps; Access and Interfaces; Processing Engine; Storage; Resource Virtualization
"It's only September, but it's already clear that 2014 will be the year of Apache Spark" -- Datanami, 9/15/14
In-memory dataflow system, developed in the AMPLab and its predecessor, the RADLab
Alternative to Hadoop MapReduce: 10-100x speedup for ML and interactive queries
Central component of the BDAS stack; graduated to the Apache Foundation -> Apache Spark
M. Zaharia, M. Chowdhury, M. Franklin, I. Stoica, S. Shenker, "Spark: Cluster Computing with Working Sets," USENIX HotCloud, 2010.
Apache Spark Contributors (chart: contributors per year, 2011 through 2014, rising toward 100)
400+ contributors to the current release
Apache Spark: Compared to Other Projects
(Charts: commits and lines of code changed over the past 6 months for MapReduce, YARN, HDFS, Storm, and Spark)
2-3x more activity than Hadoop, Storm, MongoDB, NumPy, D3, Julia, ...
Iteration in MapReduce (diagram): training data is fed through Map and Reduce stages that refine the model from the initial w(0) through w(1), w(2), w(3) to the learned model.
Cost of Iteration in MapReduce (diagram): the same training data is repeatedly loaded on every iteration.
Cost of Iteration in MapReduce (diagram): output is redundantly saved between stages.
Dataflow View (diagram): training data in HDFS flows through successive Map and Reduce stages.
Memory-Optimized Dataflow (diagram): training data is loaded from HDFS once, cached, and reused across the subsequent Map and Reduce stages.
Memory-Optimized Dataflow View (diagram): data moves efficiently between stages; Spark is 10-100x faster than Hadoop MapReduce.
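To make the caching point concrete, the following is a minimal sketch (not from the slides) of an iterative job that loads the training data once and reuses the cached RDD on every pass; the HDFS path, the toy update rule, and the application name are illustrative assumptions.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._   // implicit RDD conversions (needed on older Spark versions)

  object CachedIteration {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("CachedIteration"))

      // Load and parse once, then pin the partitions in cluster memory.
      val trainingData = sc.textFile("hdfs://.../training-data")   // placeholder path
        .map(_.split("\t").map(_.toDouble))
        .cache()

      var w = 0.0
      for (i <- 1 to 10) {
        // Each pass reads the cached partitions, not HDFS.
        val update = trainingData.map(row => row(0) * w - row(1)).sum()
        w -= 0.01 * update
      }
      println(s"learned w = $w")
      sc.stop()
    }
  }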
Resilient Distributed Datasets (RDDs)
API: coarse-grained transformations (map, group-by, join, sort, filter, sample, ...) on immutable collections
» Collections of objects that can be stored in memory or on disk across a cluster
» Built via parallel transformations (map, filter, ...)
» Automatically rebuilt on failure
Rich enough to capture many models:
» Dataflow models: MapReduce, Dryad, SQL, ...
» Specialized models: Pregel, Hama, ...
M. Zaharia, et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," NSDI 2012.
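As a quick illustration (a sketch with made-up data, assuming a SparkContext named sc created as in the sketch above), transformations build new RDDs lazily, persistence is optional, and actions trigger execution:

  val nums = sc.parallelize(1 to 1000000)        // RDD built from a local collection
  val evens = nums.filter(_ % 2 == 0)            // lazy transformation
  val squares = evens.map(n => n.toLong * n)     // another lazy transformation
  squares.persist()                              // may be kept in memory or spilled to disk
  println(squares.count())                       // action: triggers the computation
  println(squares.take(5).mkString(", "))        // small sample back to the driver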
Abstraction: Dataflow Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
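A few of these operators in action on key-value RDDs (again a sketch with toy data, assuming the same SparkContext sc; sortByKey is used here as the concrete form of sort):

  import org.apache.spark.SparkContext._         // pair-RDD operators (older Spark versions)

  val sales = sc.parallelize(Seq(("us", 3), ("cn", 5), ("us", 2)))
  val names = sc.parallelize(Seq(("us", "United States"), ("cn", "China")))

  val totals  = sales.reduceByKey(_ + _)         // ("us", 5), ("cn", 5)
  val grouped = sales.groupByKey()               // key -> all values for that key
  val joined  = totals.join(names)               // ("us", (5, "United States")), ...
  val sorted  = totals.sortByKey()
  sorted.collect().foreach(println)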
Fault Tolerance with RDDs
RDDs track the series of transformations used to build them (their lineage)
» Log one operation to apply to many elements
» No cost if nothing fails
Enables per-node recomputation of lost data

  messages = textFile(...).filter(_.contains("error")).map(_.split("\t")(2))

Lineage: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...))
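The lineage can be inspected directly: toDebugString prints the chain of RDDs behind the snippet above, and losing a partition causes just these transformations to be re-run on the corresponding input split (a sketch; the log path is a placeholder and sc is assumed):

  val messages = sc.textFile("hdfs://.../logs")      // HadoopRDD
    .filter(_.contains("error"))                     // FilteredRDD
    .map(_.split("\t")(2))                           // MappedRDD
  println(messages.toDebugString)                    // prints the lineage chain
  println(messages.count())                          // lost partitions are recomputed from lineage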
A Unified System for SQL & ML
Deep integration of SQL and Spark: both share the same set of workers and caches

  def logRegress(points: RDD[Point]): Vector = {
    var w = Vector(D, _ => 2 * rand.nextDouble - 1)
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map { p =>
        val denom = 1 + exp(-p.y * (w dot p.x))
        (1 / denom - 1) * p.y * p.x
      }.reduce(_ + _)
      w -= gradient
    }
    w
  }

  val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")

  val features = users.mapRows { row =>
    new Vector(extractFeature1(row.getInt("age")),
               extractFeature2(row.getStr("country")), ...) }
  val trainedVector = logRegress(features.cache())

Can move seamlessly between the SQL and machine learning worlds
R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, I. Stoica, "Shark: SQL and Rich Analytics at Scale," SIGMOD 2013.
Spark Streaming
Microbatch approach provides low latency
Additional operators provide windowed operations
M. Zaharia, et al., "Discretized Streams: Fault-Tolerant Streaming Computation at Scale," SOSP 2013.
Batch/Streaming Unification
Batch and streaming code is virtually the same
» Easy to develop and maintain consistency

  // count words from a file (batch)
  val file = sc.textFile("hdfs://.../pagecounts-*.gz")
  val words = file.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
  wordCounts.collect().foreach(println)

  // count words from a network stream, every 10s (streaming)
  val ssc = new StreamingContext(args(0), "NetCount", Seconds(10), ...)
  val lines = ssc.socketTextStream("localhost", 3456)
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
  wordCounts.print()
  ssc.start()
Apache Spark v1.2 (12/14) includes:
» Spark (core)
» Spark Streaming
» GraphX (alpha release)
» MLlib
» SparkSQL query processing
Wide range of interfaces:
» Python / interactive IPython
» Scala / interactive Scala shell
» R / interactive R shell
» Java
Now included in all major Hadoop distributions
Graph Analytics (example pipeline): raw Wikipedia XML is parsed into a link table and a discussion table; the link table yields a hyperlink graph for PageRank (top 20 pages), and the discussion table yields an editor graph for community detection (top communities, user community).
Separate Systems: Tables vs. Graphs (diagram): dataflow systems operate on tables (rows in, result rows out), while graph systems operate on a dependency graph; today they live in separate systems.
Difficult to Use: users must learn, deploy, and manage multiple systems; leads to brittle and often complex interfaces.
Efficiencies are Possible
Runtime in seconds (PageRank, 10 iterations, LiveJournal graph): Hadoop 1340, Spark 354, GraphLab 22
A specialized graph system can be faster than general MapReduce computation.
But... extensive data movement and duplication across the network and file system (XML and HDFS copies between every stage), and limited reuse of internal data structures across stages.
GraphX: Adding Graphs to the Mix
Tables and graphs are composable views of the same physical data (GraphX unified representation: table view and graph view)
Each view has its own operators that exploit the semantics of the view to achieve efficient execution
J. Gonzalez, R. Xin, A. Dave, D. Crankshaw, M. Franklin, I. Stoica, "GraphX: Graph Processing in a Distributed Dataflow Framework," OSDI, Oct 2014.
Representation (diagram): distributed graphs are encoded as horizontally partitioned tables joined at runtime; vertex programs map to dataflow operators; advances in graph processing systems map to optimizations such as distributed join optimization and materialized view maintenance.
Property Graph Data Model (diagram: example graph with vertices A through F)
Vertex properties: e.g., user profile, current PageRank value
Edge properties: e.g., weights, timestamps
Encoding Property Graphs as Tables (diagram): the property graph is split across machines with a vertex cut; a Vertex Table (RDD) stores vertex properties, a Routing Table (RDD) records which edge partitions reference each vertex, and an Edge Table (RDD) stores the partitioned edges.
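In the released GraphX API this encoding is hidden behind the Graph constructor. A small sketch (toy data, assuming a SparkContext sc) of building a property graph from two RDD "tables" and reading back its table views:

  import org.apache.spark.graphx._
  import org.apache.spark.rdd.RDD

  val vertices: RDD[(VertexId, String)] =
    sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C")))
  val edges: RDD[Edge[Double]] =
    sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 1.0), Edge(3L, 1L, 0.7)))

  val graph = Graph(vertices, edges)       // unified representation

  graph.vertices.collect()                 // table view of vertex properties
  graph.edges.collect()                    // table view of edge properties
  graph.triplets.collect()                 // (src property, dst property, edge property) per edge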
Graph Operators (Scala)

  class Graph[V, E] {
    def Graph(vertices: Table[(Id, V)], edges: Table[(Id, Id, E)])
    // Table Views -----------------
    def vertices: Table[(Id, V)]
    def edges: Table[(Id, Id, E)]
    def triplets: Table[((Id, V), (Id, V), E)]
    // Transformations ------------------------------
    def reverse: Graph[V, E]
    def subgraph(pV: (Id, V) => Boolean,
                 pE: Edge[V, E] => Boolean): Graph[V, E]
    def mapV(m: (Id, V) => T): Graph[T, E]
    def mapE(m: Edge[V, E] => T): Graph[V, T]
    // Joins ----------------------------------------
    def joinV(tbl: Table[(Id, T)]): Graph[(V, T), E]
    def joinE(tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
    // Computation ----------------------------------
    def mrTriplets(mapF: (Edge[V, E]) => List[(Id, T)],
                   reduceF: (T, T) => T): Graph[T, E]
  }
The mrTriplets operator captures the Gather-Scatter pattern from GraphLab.
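In the released API, the slide's mrTriplets corresponds to mapReduceTriplets (superseded by aggregateMessages as of Spark 1.2). A sketch of the gather-scatter pattern on the graph built in the previous sketch, plus the packaged PageRank:

  // "Scatter": every edge sends a message to its destination vertex;
  // "Gather": messages arriving at a vertex are combined with a reduce function.
  val inDegrees: VertexRDD[Int] =
    graph.aggregateMessages[Int](
      edgeCtx => edgeCtx.sendToDst(1),
      (a, b) => a + b
    )
  inDegrees.collect().foreach(println)

  // The same primitives underlie the built-in algorithms, e.g. PageRank:
  val ranks = graph.pageRank(0.001).vertices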
Graph System Optimizations: specialized data structures, vertex-cut partitioning, remote caching / mirroring, message combiners, active set tracking.
PageRank Benchmark (EC2 cluster of 16 m2.4xlarge nodes + 1GigE)
Charts: runtime in seconds on the Twitter graph (42M vertices, 1.5B edges; roughly a 7x spread across systems) and the UK web graph (106M vertices, 3.7B edges; roughly an 18x spread)
GraphX performs comparably to state-of-the-art graph processing systems.
The GraphX Stack (Lines of Code)
Algorithms: PageRank (20), Connected Components (20), K-core (60), Triangle Count (50), LDA (220), SVD++ (110); GraphLab/Pregel API (34); GraphX (2,500); Spark (30,000)
Some algorithms are more naturally expressed using the GraphX primitive operators.
Graphs are just one stage. What about a pipeline?
A Small Pipeline in GraphX: raw Wikipedia XML (HDFS) -> hyperlinks -> PageRank -> top 20 pages
Total end-to-end runtime in seconds (Spark preprocess + compute + Spark postprocess): Spark 1492, Giraph + Spark 605, GraphLab + Spark 375, GraphX 342
Timed end-to-end, GraphX is the fastest.
MLBase: Distributed ML Made Easy
DB query language analogy: specify what, not how
MLBase chooses: algorithms/operators, ordering and physical placement, parameter and hyperparameter settings, featurization
Leverages Spark for speed and scale
T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. Franklin, M. Jordan, "MLbase: A Distributed Machine Learning System," CIDR 2013.
ML Pipeline Generation & Optimization
Example: image classification pipeline
MLBase Optimizer focus: better resource utilization, algorithmic speedups, reduced model search time
E. Sparks, S. Venkataraman, M. Franklin, B. Recht, ML Pipelines, in progress (see the AMPLab blog for an overview).
Introducing Velox: Model Serving
Data -> Training -> Model. Where do models go? Into conference papers and sales reports, or they can drive actions.
Driving Actions: suggesting items at checkout, low-latency fraud detection, personalized cognitive assistance, the Internet of Things; all rapidly changing.
Problem: Separate Systems
Offline analytics systems: sophisticated ML on static data
Online serving systems (e.g., MongoDB): low-latency data serving
How do we serve low-latency predictions and train on live data?
Velox Model Serving System [CIDR '15]
Decompose personalized predictive models into a feature model and a personalization model
Techniques: feature caching, approximate features, batch/online split training, online updates, active learning
Order-of-magnitude reductions in prediction latencies.
Berkeley Data Analytics Stack (Apache and BSD open source)
Layers, top to bottom: In-house Apps; Access and Interfaces; Processing Engine; Storage; Resource Virtualization
Big Data Architecture: Open Questions & Research Issues
What is the role of a unified stack in a fast-changing software landscape?
Single-node vs. cluster; elastic cloud vs. HPC
GPUs, FPGAs, XYZs, ???
New memory hierarchies: SSDs, RDMA, ...
Serving vs. analytics workloads
What correctness guarantees are really needed?
Summary
Big Data: yes, there's hype, but it is a Big Deal
AMPLab project: 6 years, cross-disciplinary team, industry engagement, open source development and community building
BDAS philosophy: unification (Spark + SQL + graphs + ML + ...)
The Big Data landscape continues to evolve
Open source enables academic research to play a huge role
To find out more or get involved: UC BERKELEY amplab.berkeley.edu franklin@berkeley.edu Thanks to NSF CISE Expeditions in Computing, DARPA XData, Founding Sponsors: Amazon Web Services, Google, and SAP, the Thomas and Stacy Siebel Foundation, all our industrial sponsors and partners, and all the members of the AMPLab Team.