Big Data Research in the AMPLab: BDAS and Beyond




Big Data Research in the AMPLab: BDAS and Beyond
Michael Franklin, UC Berkeley
1st Spark Summit, December 2, 2013

AMPLab: Collaborative Big Data Research
Launched: January 2011, 6-year planned duration
Personnel: ~60 students, postdocs, faculty and staff
Expertise: systems, networking, databases and machine learning
In-house apps: crowdsourcing, mobile sensing, cancer genomics

AMPLab: Integrating Diverse Resources
Algorithms: machine learning, statistical methods, prediction, business intelligence
Machines: clusters and clouds, warehouse-scale computing
People: crowdsourcing, human computation, data scientists, analysts

Big Data Landscape: Our Corner

Berkeley Data Analytics Stack
Legend: AMP Alpha or Soon; AMP Released (BSD/Apache); 3rd Party Open Source
Applications and libraries: Shark (SQL), BlinkDB, GraphX, MLBase, Spark Streaming, MLlib
Processing engine: Apache Spark
Storage: Tachyon, HDFS / Hadoop Storage
Resource manager: Apache Mesos, YARN

Our View of the Big Data Challenge: Something's Gotta Give
Massive, diverse and growing data; time; money; answer quality

Speed/Accuracy Trade-off
[Figure: error vs. execution time. Labels: interactive queries at 5 sec; time to execute on entire dataset at 30 mins; pre-existing noise.]

BlinkDB: a data analysis (warehouse) system that
- builds on Shark and Spark
- returns fast, approximate answers with error bars by executing queries on small samples of data
- is compatible with Apache Hive (storage, serdes, UDFs, types, metadata) and supports Hive's SQL-like query structure with minor modifications
Agarwal et al., BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. ACM EuroSys 2013, Best Paper Award
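The idea of answering from a small sample and attaching an error bar can be sketched in a few lines of Python. This is only an illustration under simple assumptions (uniform random sampling, a normal-approximation confidence interval, synthetic data); the function name `approximate_mean` and its parameters are hypothetical and are not BlinkDB's API or its sampling strategy.

```python
import math
import random
import statistics


def approximate_mean(values, sample_fraction, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width). With z=1.96 the half-width is a
    normal-approximation 95% confidence bound on the sampling error.
    """
    rng = random.Random(seed)
    n = max(2, int(len(values) * sample_fraction))
    sample = rng.sample(values, n)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return estimate, z * std_err


# Synthetic column standing in for a large table (an assumption for the demo).
data_rng = random.Random(42)
data = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]
exact = statistics.fmean(data)

for fraction in (1e-1, 1e-2, 1e-3):
    est, err = approximate_mean(data, fraction)
    print(f"fraction={fraction:g}  estimate={est:.2f} +/- {err:.2f}  exact={exact:.2f}")
```

Smaller fractions touch less data, so they answer sooner, but the reported half-width grows roughly with the inverse square root of the sample size.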

Sampling vs. No Sampling
[Figure: query response time (seconds) vs. fraction of full data, with error bars. The 10^-1 sample is about 10x faster than the full data, as response time is dominated by I/O.]

Fraction of full data   Response time (s)   Error bar
1                       1020                n/a
10^-1                   103                 0.02%
10^-2                   18                  0.07%
10^-3                   13                  1.1%
10^-4                   10                  3.4%
10^-5                   8                   11%

People Resources
Hybrid human-machine computation: data cleaning, active learning, handling the last 5%
Supporting data scientists: interactive analytics, visual analytics, collaboration
[Figure: CrowdDB architecture. Labels: CrowdSQL, Results, Statistics, MetaData, Parser, Optimizer, Executor, Files Access Methods, Disk 1, Disk 2, Turker Relationship Manager, UI Creation, Form Editor, UI Template Manager, HIT Manager]
Franklin et al., CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011
Wang et al., CrowdER: Crowdsourcing Entity Resolution, VLDB 2012
Trushkowsky et al., Crowdsourcing Enumeration Queries, ICDE 2013, Best Paper Award

Less is More? Data Cleaning + Sampling (J. Wang et al., work in progress)

Working with the Crowd
- Incentives
- Fatigue, fraud, and other failure modes
- Latency and prediction
- Work conditions
- Interface impacts answer quality
- Task structuring
- Task routing

The 3 E's of Big Data: Extreme Elasticity Everywhere
Algorithms: approximate answers; ML libraries and ensemble methods; active learning
Machines: cloud computing, esp. spot instances; multi-tenancy; relaxed (eventual) consistency / multi-version methods
People: dynamic task and microtask marketplaces; visual analytics; manipulative interfaces and mixed-mode operation

The Research Challenge Integration + Extreme Elasticity + Tradeoffs + More Sophisticated Analytics = Extreme Complexity

Can We Take a Declarative Approach?
Can reduce complexity through automation.
End users tell the system what they want, not how to get it.
SQL -> Result
MQL -> Model

Goals of MLbase
[Figure labels: ML Insights, MLbase, Systems Insights]
1. Easy scalable ML development (ML Developers)
2. Easy/user-friendly ML at scale (End Users)
Along the way, we gain insight into data-intensive computing.

A Declarative Approach
End users tell the system what they want, not how to get it.
Example: Supervised Classification
var X = load("als_clinical", 2 to 10)
var y = load("als_clinical", 1)
var (fn-model, summary) = doClassify(X, y)

MLBase Query Compilation

Query Optimizer: A Search Problem
System is responsible for searching through model space.
Opportunities for physical optimization.
[Figure labels: SVM, Boosting, 5min]
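Treating classification as a search over a model space can be sketched as follows. The caller supplies only data and a time budget, and a stand-in "optimizer" tries candidate learners and returns the best one with a summary. Everything here is an assumption made for illustration: the function `do_classify`, the three toy learners, the 80/20 validation split, and the synthetic data are hypothetical and are not MLbase's optimizer or API.

```python
import random
import time


def train_majority(X, y):
    """Baseline learner: always predict the most common training label."""
    label = max(set(y), key=y.count)
    return lambda row: label


def train_nearest_centroid(X, y):
    """Predict the label whose feature centroid is closest to the row."""
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
        )

    return predict


def train_one_nn(X, y):
    """Predict the label of the single nearest training row."""
    def predict(row):
        nearest = min(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(row, X[i])),
        )
        return y[nearest]

    return predict


def accuracy(model, X, y):
    return sum(model(row) == lab for row, lab in zip(X, y)) / len(y)


def do_classify(X, y, time_budget_s=5.0):
    """Search the candidate learners within a time budget; return (model, summary)."""
    split = int(0.8 * len(X))
    X_train, y_train, X_val, y_val = X[:split], y[:split], X[split:], y[split:]
    candidates = {
        "majority": train_majority,
        "nearest_centroid": train_nearest_centroid,
        "one_nn": train_one_nn,
    }
    deadline = time.monotonic() + time_budget_s
    best_model, best_acc, summary = None, -1.0, {}
    for name, trainer in candidates.items():
        if time.monotonic() > deadline:
            break  # budget exhausted: stop exploring the model space
        model = trainer(X_train, y_train)
        summary[name] = accuracy(model, X_val, y_val)
        if summary[name] > best_acc:
            best_model, best_acc = model, summary[name]
    return best_model, summary


# Synthetic two-class data: class 0 near (0, 0), class 1 near (4, 4).
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(4.0 * lab, 1.0), rng.gauss(4.0 * lab, 1.0)] for lab in y]

model, summary = do_classify(X, y, time_budget_s=5.0)
print(summary)
```

The caller never names an algorithm. A real optimizer would also tune hyperparameters and choose physical execution strategies, which this sketch leaves out.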

MLbase: Progress MQL Parser ML Library ML Developer API Released July 2013 (Contracts) Query Planner / Optimizer Runtime initial release: Spring 2014

Other Things We're Working On
GraphX: Unifying Graph Parallel & Data Parallel Analytics
OLTP and Serving Workloads
  MDCC: Multi-Data Center Consistency
  HAT: Highly-Available Transactions
  PBS: Probabilistically Bounded Staleness
  PLANET: Predictive Latency-Aware Networked Transactions
Fast Matrix Manipulation Libraries
Cold Storage, Partitioning, Distributed Caching
Machine Learning Pipelines, GPUs,

It's Been a Busy 3 Years

Be Sure to Join Us for the Next 3
UC Berkeley, amplab.cs.berkeley.edu, @amplab