The Berkeley Data Analytics Stack: Present and Future. Michael Franklin. 27 March 2014, Technion Big Day on Big Data. UC Berkeley.
BDAS in the Big Data Context
Sources Driving Big Data
It's all happening online. Every: click, ad impression, billing event, fast-forward/pause, friend request, transaction, network message, fault.
User generated (web & mobile), Internet of Things / M2M, scientific computing.
Beyond Volume, Velocity and Variety
Step 1: run faster, spending money and time to get answers over massive, diverse, and growing data.
Step 2: tradeoffs among money, time, and answer quality.
AMPLab: Integrating Diverse Resources
Algorithms: machine learning, statistical methods; prediction, business intelligence.
Machines: clusters and clouds; warehouse-scale computing.
People: crowdsourcing, human computation; data scientists, analysts.
AMPLab
Launched January 2011, 6-year duration; 60+ students, postdocs, faculty and staff.
In-house applications: cancer genomics, mobile sensing, IoT (smartphones).
Industry sponsors, foundations + White House Big Data Program.
Faculty: Franklin (DB), Jordan (ML), Stoica (Sys), Patterson (Sys), Shenker (Net), Recht (ML), Katz (Sys), Joseph (Sec), Goldberg (HCI).
Carat: Big Data at Work. 715,000+ downloads.
Open Source Engagement
SF Spark Meetup: 1700+ members; also Boston, Hyderabad.
Bootcamps: Berkeley, Strata, online.
Meetup talk on MLbase at Twitter HQ (Aug 6, 2013).
Big Data Systems Today
General batch processing: MapReduce.
Specialized systems (iterative, interactive and streaming apps): Pregel, Giraph, Dremel, Impala, Storm, Drill, Tez, GraphLab, S4.
BDAS Philosophy
Don't specialize MapReduce: generalize it! Two additions to Hadoop MR can enable all the models on the previous slide:
1. general task DAGs
2. data sharing
For users: fewer systems, less copying. Spark provides these, with Shark, Streaming, GraphX, and MLbase layered on top.
For Developers: Code Size
(Chart: non-test, non-example source lines for Hadoop MapReduce, Storm (streaming), Impala (SQL), and Giraph (graph) versus Spark. Hadoop MapReduce and each specialized system is a large standalone codebase, while Spark is much smaller, and its Streaming, Shark*, and GraphX layers each add only a modest number of lines on top.)
* Shark also calls into Hive
Berkeley Data Analytics Stack
Applications: Traffic, Carat, Genomics, ...; 3rd-party tools: IPython, visualization, data cleaning, ...
Processing (analytics frameworks and data management): Shark (SQL), BlinkDB, GraphX, MLbase, ML-lib, Spark Streaming, Spark-R, PySpark, all on Apache Spark.
Storage: Tachyon; HDFS / Hadoop storage.
Resource manager: Apache Mesos, YARN.
(Components are a mix of AMP alpha or soon-to-be-released code, AMP-released BSD/Apache code, and 3rd-party open source.)
Spark: a fast, MapReduce-like engine
» in-memory storage abstraction for iterative/interactive queries
» general execution graphs
» up to 100x faster than Hadoop MR (2-10x even for on-disk data)
Compatible with Hadoop's storage APIs
» can access HDFS, HBase, S3, SequenceFiles, etc.
A great example of ML/Systems/DB collaboration.
Example: Logistic Regression
Goal: find the best line separating two sets of points.
(Figure: a random initial line converging toward the target separator.)
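The iteration behind this example can be sketched as a single-machine Python loop (an illustrative sketch, not Spark code; the data, clamping, step size, and iteration count below are made up for the demo):

```python
import math
import random

def log_reg(points, dims, iterations=100, seed=42):
    """Batch gradient descent for logistic regression; labels y are +/-1.

    Each iteration computes the summed gradient over all points and
    updates the weight vector, mirroring the iterative structure that
    Spark keeps in memory across iterations.
    """
    rnd = random.Random(seed)
    w = [2 * rnd.random() - 1 for _ in range(dims)]  # random initial line
    for _ in range(iterations):
        grad = [0.0] * dims
        for x, y in points:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            margin = max(-30.0, min(30.0, margin))   # avoid math.exp overflow
            coeff = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
            for d in range(dims):
                grad[d] += coeff * x[d]
        w = [wi - gi for wi, gi in zip(w, grad)]
    return w

# Two small, clearly separable clusters (made-up data).
pts = [((-2 + 0.1 * i, -2.0), -1) for i in range(5)] + \
      [(( 2 + 0.1 * i,  2.0),  1) for i in range(5)]
w = log_reg(pts, dims=2)
correct = sum(1 for x, y in pts
              if (sum(wi * xi for wi, xi in zip(w, x)) > 0) == (y > 0))
```

On this toy data the learned line separates both clusters; the point of the slides is that Spark runs exactly this loop shape with the points cached in cluster memory.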
ML and Queries in Hadoop
(Diagram: each query performs its own HDFS read of the input before producing its result, and each ML iteration performs both an HDFS read and an HDFS write.)
In-Memory Data Sharing
(Diagram: one-time processing loads the input into distributed memory; subsequent queries and iterations then run directly against memory.)
Logistic Regression Performance
Hadoop: 110 s / iteration. Spark: 80 s for the first iteration, 1 s for further iterations.
(29 GB dataset on 20 EC2 m1.xlarge machines, 4 cores each.)
Challenge: a distributed memory abstraction that is both efficient and fault-tolerant.
Resilient Distributed Datasets (RDDs)
API: coarse-grained transformations (map, group-by, join, sort, filter, sample, ...) on immutable collections.
Efficient fault recovery using lineage
» log one operation to apply to many elements
» recompute lost partitions of an RDD on failure
» no cost if nothing fails
Rich enough to capture many models:
» data flow models: MapReduce, Dryad, SQL, ...
» specialized models: Pregel, Hama, ...
M. Zaharia et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012. Best Paper Award.
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count     // actions
cachedMsgs.filter(_.contains("bar")).count

(Diagram: the driver ships tasks to workers, each caching a block of messages, and collects results.)
Fault Tolerance with RDDs
RDDs track the series of transformations used to build them (their lineage), which enables per-node recomputation of lost data:

messages = textFile(...).filter(_.contains("error")).map(_.split("\t")(2))

Lineage: HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...))
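Lineage-based recovery can be illustrated with a toy in-memory sketch (a hypothetical MiniRDD class, not Spark's API): a dataset stores the function that builds any one of its partitions, so a lost partition can be rebuilt independently instead of being restored from replicas.

```python
class MiniRDD:
    """Toy RDD: stores lineage (how to compute partition i), not the data."""

    def __init__(self, num_parts, compute_part):
        self.num_parts = num_parts
        self.compute_part = compute_part  # lineage: partition index -> list

    @classmethod
    def parallelize(cls, data, num_parts=2):
        data = list(data)
        return cls(num_parts, lambda i: data[i::num_parts])

    def map(self, f):
        return MiniRDD(self.num_parts,
                       lambda i: [f(x) for x in self.compute_part(i)])

    def filter(self, p):
        return MiniRDD(self.num_parts,
                       lambda i: [x for x in self.compute_part(i) if p(x)])

    def collect(self):
        return [x for i in range(self.num_parts) for x in self.compute_part(i)]

rdd = MiniRDD.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
full = rdd.collect()              # [0, 4, 16, 36, 64]
recovered = rdd.compute_part(0)   # partition 0 "fails": rebuilt from lineage alone
```

Note the "no cost if nothing fails" property: the lineage is just a chain of closures; nothing is logged or replicated per element.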
Spark Status
Current release (0.9) includes Java and Python APIs
» Apache Top-Level Project (as of Feb 2014)
» includes streaming, MLlib, YARN integration, EC2, GraphX
» 100+ developers from 20 organizations have contributed code
Supported by Cloudera, WANdisco & others TBA; Spark apps to be certified by Databricks (AMPLab spinout).
Sample use cases:
» in-memory analytics on Hive data (Conviva)
» interactive queries on data streams (Quantifind)
» business intelligence (Yahoo!)
» DNA sequence alignment (SNAP)
Shark
Shark = Spark + Hive
Uses Spark's in-memory RDD caching
» result reuse and low latency
» scalable, fault-tolerant, fast
Query compatible with Hive
» run HiveQL queries (with UDFs, UDAFs, ...) without modifications
» convert the logical query plan generated from Hive into a Spark execution graph
Data compatible with Hive
» use existing HDFS data and Hive metadata, without modifications
C. Engle et al., Shark: Fast Data Analysis Using Coarse-Grained Distributed Memory, SIGMOD 2012 (system demonstration). Best Demo Award.
R. Xin et al., Shark: SQL and Rich Analytics at Scale, SIGMOD 2013.
Shark Optimizations
» fast task start-up
» optimized column-oriented storage
» dynamic (mid-query) join algorithm selection based on statistical properties of the data
» runtime selection of the number of reducers
» partition pruning using range statistics
» controllable table partitioning across ...
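Partition pruning with range statistics can be sketched as follows (a simplified single-column illustration with a made-up function name, not Shark's implementation):

```python
def scan_with_pruning(partitions, lo, hi):
    """Answer a range predicate while skipping partitions whose
    [min, max] statistics prove they contain no matching rows."""
    stats = [(min(p), max(p)) for p in partitions]   # gathered once at load time
    rows, scanned = [], 0
    for part, (mn, mx) in zip(partitions, stats):
        if mx < lo or mn > hi:
            continue                                  # pruned: no row can match
        scanned += len(part)
        rows.extend(x for x in part if lo <= x <= hi)
    return rows, scanned

parts = [[1, 2, 3], [10, 11, 12], [20, 21, 22]]
rows, scanned = scan_with_pruning(parts, 10, 13)
# rows == [10, 11, 12]; only 3 of the 9 rows were scanned
```

The payoff grows with data size: partitions whose ranges fall outside the predicate are never read at all.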
Running Time: Join + Order By

SELECT UV.sourceIP, AVG(R.pageRank), SUM(UV.adRevenue) AS totalRev
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
  AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
GROUP BY UV.sourceIP
ORDER BY totalRev DESC LIMIT 1

(Charts: running time in seconds at 485K and 533M UserVisits rows; see https://amplab.cs.berkeley.edu/benchmark/)
A Unified System for SQL & ML
Deep integration of Shark and Spark: both share the same set of workers and caches, so you can move seamlessly between the SQL and machine learning worlds:

def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...) }
val trainedVector = logRegress(features.cache())
GraphX: Adding Graphs to the Mix
Tables and graphs are composable views of the same physical data (GraphX keeps one unified representation).
Each view has its own operators that exploit the semantics of the view to achieve efficient execution.
R. Xin, J. Gonzalez, M. Franklin, I. Stoica, GraphX: In-Situ Graph Computation Made Easy, GRADES Workshop at SIGMOD, June 2013.
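The "composable views" idea can be illustrated with plain tables (a toy sketch with made-up data, not the GraphX API): vertices and edges are ordinary rows, a graph operator such as the triplets view is just a join, and a table operator such as group-by works on the same rows.

```python
# Vertex and edge "tables" over the same physical data.
vertices = {1: "alice", 2: "bob", 3: "carol"}   # id -> attribute
edges = [(1, 2), (1, 3), (2, 3)]                # (src, dst) rows

def triplets(vertices, edges):
    """Graph view: join each edge row with its endpoint attributes."""
    return [(vertices[s], vertices[d]) for s, d in edges]

def out_degrees(edges):
    """Table view: the same edge rows answer a group-by aggregation."""
    deg = {}
    for src, _dst in edges:
        deg[src] = deg.get(src, 0) + 1
    return deg
```

Because both operators read the same rows, no data is copied between a "graph system" and a "table system"; each view merely exploits its own semantics.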
Speed/Accuracy Trade-off
(Chart: error vs. execution time, from interactive queries at ~5 sec to executing on the entire dataset at ~30 mins.)
BlinkDB: fast, approximate answers with error bars, obtained by executing queries on small, pre-collected samples of the data.
Compatible with Apache Hive (storage, SerDes, UDFs, types, metadata) and HiveQL (with minor modifications).
Agarwal et al., BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data, ACM EuroSys 2013. Best Paper Award.
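The mechanism behind such error bars can be sketched with the central limit theorem (a generic illustration, not BlinkDB's actual estimator; the data and sample fraction are made up):

```python
import math
import random

def approx_avg(data, fraction, seed=0):
    """Estimate AVG from a uniform sample, with a ~95% confidence half-width."""
    rnd = random.Random(seed)
    n = max(2, int(len(data) * fraction))
    sample = rnd.sample(data, n)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)   # sample variance
    half_width = 1.96 * math.sqrt(var / n)                 # CLT-based error bar
    return mean, half_width

data = list(range(100000))          # true mean: 49999.5
est, err = approx_avg(data, 0.01)   # scan 1% of the rows instead of 100%
```

Shrinking the sample shrinks the work roughly linearly while the error bar grows only as 1/sqrt(n), which is exactly the trade the following slides quantify.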
Sampling vs. No Sampling
Query response time (seconds) by fraction of full data, with error bars:
full data: 1020 s; 10^-1: 103 s (±0.02%); 10^-2: 18 s (±0.07%); 10^-3: 13 s (±1.1%); 10^-4: 10 s (±3.4%); 10^-5: 8 s (±11%).
Sampling buys about a 10x speedup, after which response time is dominated by I/O.
Speed/Accuracy Trade-off (revisited)
(Chart: the same error vs. execution time curve, from ~5 sec to ~30 mins, now annotated with the pre-existing noise in the data.)
Supporting Data Scientists
Interactive analytics, visual analytics, and collaboration, drawing on statistics, metadata, people, and machine resources.
Hybrid human-machine computation: data cleaning, active learning, handling the last 5%.
CrowdSQL (CrowdDB architecture: parser, optimizer, and executor over disks and files, plus crowd components: UI creation, form editor, UI template manager, Turker relationship manager, HIT manager).
Franklin, Kossmann et al., CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011.
Wang et al., CrowdER: Crowdsourcing Entity Resolution, VLDB 2012.
Trushkowsky et al., Crowdsourcing Enumeration Queries, ICDE 2013. Best Paper Award.
Working with the Crowd
Incentives; fatigue, fraud, & other failure modes; latency & prediction; work conditions; interface impacts on answer quality; task structuring; task routing.
Less is More? Data Cleaning + Sampling J. Wang et al., Work in Progress
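One way to read "less is more" (my guess at the in-progress idea, sketched with made-up data and a hypothetical clean() function): clean only a random sample of the dirty data and use it to estimate aggregates over the cleaned data, instead of paying to clean everything.

```python
import random

def clean(x):
    """Hypothetical cleaner: some values were accidentally stored x100."""
    return x // 100 if x >= 1000 else x

def estimate_clean_avg(dirty, fraction, seed=1):
    """Clean only a sample and average it, approximating the fully cleaned AVG."""
    rnd = random.Random(seed)
    n = max(1, int(len(dirty) * fraction))
    return sum(clean(x) for x in rnd.sample(dirty, n)) / n

dirty = [10] * 90 + [1000] * 10    # the true (clean) values are all 10
est = estimate_clean_avg(dirty, 0.2)
# the naive AVG over dirty data is 109.0; cleaning a 20% sample yields 10.0
```

The sampling error bar machinery from the BlinkDB slide applies unchanged, so the cost of cleaning can shrink with the sample while accuracy stays bounded.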
Other Things We're Working On
MLbase: declarative scalable machine learning.
OLTP and serving workloads: MDCC (Multi-Data-Center Consistency), HAT (Highly Available Transactions), PBS (Probabilistically Bounded Staleness), PLANET (Predictive Latency-Aware Networked Transactions).
Fast matrix manipulation libraries.
Cold storage, partitioning, distributed caching.
Machine learning pipelines, GPUs, ...
Summary The Berkeley AMPLab is integrating Algorithms, Machines and People to make sense of data at scale. BDAS is our main delivery vehicle. Open Source has been a great way to have real industry impact with academic research. Our direction: advanced analytics, extreme elasticity, and support for people in all phases of Big Data analytics.
For More Information
amplab.cs.berkeley.edu
franklin@berkeley.edu
Thanks to NSF CISE Expeditions in Computing, DARPA XData, Founding Sponsors: Amazon Web Services, Google, and SAP, the Thomas and Stacy Siebel Foundation, and all our industrial sponsors and partners.