What s next for the Berkeley Data Analytics Stack?



Similar documents
Big Data Research in the AMPLab: BDAS and Beyond

The Berkeley AMPLab - Collaborative Big Data Research

The Berkeley Data Analytics Stack: Present and Future

CS 294: Big Data System Research: Trends and Challenges

Conquering Big Data with BDAS (Berkeley Data Analytics)

How Companies are! Using Spark

Making Sense at Scale with the Berkeley Data Analytics Stack

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

An Open Source Memory-Centric Distributed Storage System

Next-Gen Big Data Analytics using the Spark stack

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Ali Ghodsi Head of PM and Engineering Databricks

Moving From Hadoop to Spark

Berkeley Data Analytics Stack:! Experience and Lesson Learned

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

From Spark to Ignition:

How To Create A Data Visualization With Apache Spark And Zeppelin

Tachyon: A Reliable Memory Centric Storage for Big Data Analytics

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Why Spark on Hadoop Matters

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Analytics on Spark &

How To Make Sense Of Data With Altilia

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Databricks. A Primer

Architectures for massive data management

Dell In-Memory Appliance for Cloudera Enterprise

Databricks. A Primer

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

Pla7orms for Big Data Management and Analysis. Michael J. Carey Informa(on Systems Group UCI CS Department

Machine- Learning Summer School

Tachyon: Reliable File Sharing at Memory- Speed Across Cluster Frameworks

Unified Big Data Processing with Apache Spark. Matei

1. The orange button 2. Audio Type 3. Close apps 4. Enlarge my screen 5. Headphones 6. Questions Pane. SparkR 2

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Tachyon: memory-speed data sharing

Beyond Hadoop with Apache Spark and BDAS

Challenges for Data Driven Systems

Apache Hadoop in the Enterprise. Dr. Amr Awadallah,

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

BIG DATA ANALYTICS For REAL TIME SYSTEM

Unified Batch & Stream Processing Platform

TECHNOLOGY TRANSFER PRESENTS INTERNATIONAL. Rome, December Residenza di Ripetta Via di Ripetta, 231 CONFERENCE BIG DATA

SAP HANA Vora : Gain Contextual Awareness for a Smarter Digital Enterprise

Big Data Analytics Hadoop and Spark

Making big data simple with Databricks

CSE-E5430 Scalable Cloud Computing Lecture 11

Interactive data analytics drive insights

Spark: Making Big Data Interactive & Real-Time

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Hadoop Ecosystem B Y R A H I M A.

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Gain Contextual Awareness for a Smarter Digital Enterprise with SAP HANA Vora

Unified Big Data Analytics Pipeline. 连 城

The 4 Pillars of Technosoft s Big Data Practice

Streaming items through a cluster with Spark Streaming

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

BIG DATA IN BUSINESS ENVIRONMENT

Second Credit Seminar Presentation on Big Data Analytics Platforms: A Survey

Apache Spark and Distributed Programming

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Real Time Big Data Processing

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

A Brief Introduction to Apache Tez

Big Data for Big Intel

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.

Brave New World: Hadoop vs. Spark

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Archiving and Sharing Big Data Digital Repositories, Libraries, Cloud Storage

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

BIG DATA What it is and how to use?

NoSQL for SQL Professionals William McKnight

Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Conquering Big Data with Apache Spark

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

CRITEO INTERNSHIP PROGRAM 2015/2016

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

STREAM ANALYTIX. Industry s only Multi-Engine Streaming Analytics Platform

Self-service BI for big data applications using Apache Drill

Customer Case Study. Automatic Labs

Soma: Linked Data Infrastructure

HDP Hadoop From concept to deployment.

Data Warehousing in the Age of Big Data

Enterprise Operational SQL on Hadoop Trafodion Overview

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Large-Scale Data Processing

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Significantly Speed up real world big data Applications using Apache Spark

Transcription:

What s next for the Berkeley Data Analytics Stack? Michael Franklin June 30th 2014 Spark Summit San Francisco UC BERKELEY

AMPLab: Collaborative Big Data Research 60+ Students, Postdocs, Faculty and Staff Systems, Networking, Databases, and Machine Learning UC BERKELEY Franklin Jordan Stoica Goldberg Joseph Katz Patterson Recht Shenker Industry Sponsors + White House Big Data Program (NSF and Darpa) In-house Apps: Cancer Genomics, Mobile Sensing, Collaborative Energy Saving

AMPLab: Integrating 3 Resources Algorithms Machine Learning, Statistical Methods Prediction, Business Intelligence Machines Clusters and Clouds Warehouse Scale Computing People Crowdsourcing, Human Computation Data Scientists, Analysts

Extreme Elasticity Algorithms Approximate Answers: Trading time for error ML Libraries and Ensemble Methods Active Learning Machines Cloud Computing; Multi- tenancy Deep and dynamic storage hierarchies Relaxed (eventual) consistency/ Multi- version methods People Dynamic Task and Microtask Marketplaces Visual analytics Manipulative interfaces and mixed mode operation

Big BDAS Bets Leverage commodity trends: in-memory processing, elastic compute and scalable storage (HDFS) Optimize for common patterns: MR, graphs, SQL, iterative machine learning, matrix algebra Tradeoff accuracy and runtime with sampling and relaxed consistency Integration: expose a consistent API to support batch, stream and interactive computation Build and support an Open Source community around it on an academic research budget

Berkeley Data Analytics Stack (Spark is just part of what we do) Genomics (ADAM, SNAP), Energy (Carat), Sensing (MM, SDB) BlinkDB Spark Streaming Shark SQL GraphX MLBase SparkR MLlib Apache Spark Tachyon HDFS, S3, Apache Mesos Yarn

WHAT S NEW IN BDAS?

Tachyon In-memory, fault-tolerant storage system Easily and efficiently share data across frameworks Spark (inst. 1) Spark (inst. 2) Hadoop MR Tachyon Off-heap allocation; HDFS compatible API See Haoyuan Li s talk at 3pm this afternoon

Fast, approximate answers with error bars by executing queries on small, pre-collected samples of data Recent focus on diagnostics for reliable error reporting S. Agarwal et al., Knowing When You re Wrong: Building Fast and Reliable Approximate Query Processing Systems ACM SIGMOD Conf., June 2014.

Sampling + Data Cleaning using less data for better answers S. Krishnan, J. Wang et al., A Sample-and-Clean Framework for Fast and Accurate Query Processing over Dirty Data, ACM SIGMOD Conf., June 2014.

GraphX Adding Graphs to the Mix Table View GraphX Unified Representation Graph View Tables and Graphs are composable views of the same physical data Each view has its own operators that exploit the semantics of the view to achieve efficient execution R. Xin, J. Gonzalez, M. Franklin, I. Stoica, GraphX: In-situ Graph Computation Made Easy GRADES Workshop at SIGMOD, June 2013.

Distributed Graphs as Distributed Tables Property Graph Vertex Table Routing Table Edge Table B C Part. 1 A A 1 2 A A B C B B 1 B C A D 2D Vertex Cut Heuristic C D C D 1 1 2 C A D E F E E E 2 A E F D Part. 2 F F 2 E F

GraphX: Graphs à Dataflow 1. Encode graphs as distributed tables 2. Express graph computation in relational ops. 3. Recast graph systems optimizations as: 1. Distributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance parity with specialized systems.

WHAT S NEXT IN BDAS?

MLBase: Distributed ML Made Easy DB Query Language Analogy: Specify What not How MLBase chooses: Algorithms/Operators Ordering and Physical Placement Parameter and Hyperparameter Settings Featurization Leverages Spark for Speed and Scale

ML Pipeline Generation & Optimization Ex: Image Classification Pipeline MLBase Optimizer Focus: Better Resource Utilization Algorithmic Speedups Reduced Model Search Time Work in Progress: See Evan Spark s talk at 1pm tomorrow afternoon

What about Updates? Big Data Analytics assume Append-mostly data Many applications require Point Updates Real-time ML and Analytics require fast Model Updates We have several projects exploring this space MDCC: Mutli Data Center Consistency HAT/RAMP: Highly-Available Transactions/Read Atomicity PBS: Probabilistically Bounded Staleness PLANET: Predictive Latency-Aware Networked Transactions (see SIGMOD 2014 for details on RAMP and PLANET)

The Big Picture: Serving Models & Data BDAS Serving Services Report Spark MLlib + BlinkDB + GraphX Data Models Apps Tachyon + HDFS Apps Enables real-time Analytics and Model Updates

Looking Further The big breakthrough of the Big Data ecosystem isn t scalability It s not really speed either It s flexibility!

The (Traditional) Big Picture 20

Database Philosophy One way in/out: through the SQL door at the top From Mike Carey, UCI 21

Open Source Big Data Stack Notes: Giant byte sequence at the bottom Map, sort, shuffle, reduce layer in middle Possible storage layer in middle as well HLLs now at the top From Mike Carey, UCI 22

The (Traditional) Big Picture Extract Transform Load 23

The Structure Spectrum Structured (schema-first) Semi-Structured (schema-on-read) Unstructured (schema-never) Schema-as-needed ETL Becomes a Continuous/On-Demand Process But, may lose access to some data in the meantime

Where will the next Come From? Easy-to-use Machine Learning (a la MLBase) Integrating Serving and Analytics more than the old OLTP + OLAP story serving Data and Models; supporting Real-time Continuous Data Integration and ETL Renewed focus on Single-Node performance Pushing processing to the Edge e.g., for IoT Tracking new server/data center technologies

To find out or (better, yet) get involved: UC BERKELEY amplab.berkeley.edu franklin@berkeley.edu Thanks to NSF CISE Expeditions in Computing, DARPA XData, Founding Sponsors: Amazon Web Services, Google, and SAP, the Thomas and Stacy Siebel Foundation, and all our industrial sponsors and partners.