The Berkeley AMPLab - Collaborative Big Data Research




The Berkeley AMPLab - Collaborative Big Data Research
UC Berkeley
Anthony D. Joseph
LASER Summer School, September 2013

About Me
Education: MIT SB, MS, PhD
Joined Univ. of California, Berkeley in 1998
Current research areas:
» Cloud computing (Mesos): http://mesos.apache.org/
» Secure Machine Learning (SecML): http://radlab.cs.berkeley.edu/wiki/secml
» DETER security testbed: http://deter-project.org/
» Intel Science and Technology Center for User Security: http://scrub.cs.berkeley.edu/
Other: Peer-to-Peer networking (Tapestry), Mobile computing, Wireless/Cellular networking

Sources Driving Big Data
It's All Happening Online
Every: click, ad impression, billing event, fast forward, pause, friend request, transaction, network message, fault
» User Generated (Web & Mobile)
» Internet of Things / M2M
» Scientific Computing

Challenge 1: Data is Big
[Figure: projected growth (increase over 2010) from 2010 to 2015 for Moore's Law, overall data, particle accelerators, and DNA sequencers]
Data grows faster than Moore's Law [IDC report, Kathy Yelick, LBNL]

Challenge 2: Data is Dirty
» Variety of diverse sources
» Uncurated
» No schema
» Inconsistent syntax and semantics
Dirty Data worse than Big Data

Challenge 3: Complex Questions
Hard questions
» What is the impact on traffic and home prices of building a new ramp?
Real-time questions
» Is there a cyber attack going on?
Open-ended questions
» How many supernovae happened last year?
Big Data Must Enable Decisions

Requires Multifaceted Approach
Three dimensions to improve data analysis
» Improving scale, efficiency, and quality of algorithms running in datacenters (Algorithms)
» Scaling up datacenters (Machines)
» Leveraging human activity and intelligence (People)
Need to adaptively and flexibly combine all three dimensions

Algorithms, Machines, People (AMP)
Today's apps: fixed point in solution space
[Figure: solution space with axes Algorithms, Machines, and People; example points labeled Watson/IBM and search]
Need techniques to dynamically pick best operating point

The AMP Lab
Make sense of data at scale by tightly integrating algorithms, machines, and people
[Figure: solution space with axes Algorithms, Machines, and People; example points labeled Watson/IBM and search]

AMP Lab Faculty
» Alex Bayen (mobile sensing platforms)
» Armando Fox (systems)
» Michael Franklin (databases): Director
» Michael Jordan (machine learning): Co-director
» Anthony Joseph (secure machine learning & privacy)
» Randy Katz (systems)
» David Patterson (systems)
» Ion Stoica (systems): Co-director
» Scott Shenker (networking)

Algorithms
State-of-the-art Machine Learning (ML) algorithms do not scale
» Prohibitive to process all data points
How do you know when to stop?
[Figure: estimate plotted against # of data points, with the true answer marked]

Algorithms
Given any problem, data and a budget
» Immediate results with continuous improvement
» Calibrate answer: provide error bars
Error bars on every answer!
[Figure: estimate plotted against # of data points, with the true answer marked]

Algorithms
Given any problem, data and a budget
» Immediate results with continuous improvement
» Calibrate answer: provide error bars
Stop when error smaller than a given threshold
[Figure: estimate plotted against # of data points and time, with the true answer marked]
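The "stop when error smaller than a given threshold" idea can be illustrated with a small sketch. This is not AMPLab code: the function name `estimate_mean`, the batch size, and the use of a normal-approximation 95% interval (z = 1.96) are all assumptions made for illustration. It estimates a mean from a growing random sample, reports an error bar after each batch, and stops as soon as the error bar is tight enough.

```python
import math
import random


def estimate_mean(data, threshold, batch_size=100, z=1.96, seed=0):
    """Sample `data` in batches until the error bar drops below `threshold`.

    Returns (estimate, half_width, points_used). The error bar is a
    normal-approximation confidence interval (an assumption of this sketch).
    """
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)  # random order, so every prefix is a uniform sample

    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    half_width = math.inf
    for start in range(0, len(order), batch_size):
        for i in order[start:start + batch_size]:
            n += 1
            delta = data[i] - mean
            mean += delta / n
            m2 += delta * (data[i] - mean)
        if n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
            if half_width < threshold:
                break  # error bar is tight enough: stop early
    return mean, half_width, n


# Usage: synthetic data standing in for a large dataset.
gen = random.Random(42)
values = [gen.gauss(50.0, 10.0) for _ in range(100_000)]
estimate, err, used = estimate_mean(values, threshold=0.5)
print(f"{estimate:.2f} +/- {err:.2f} using {used} of {len(values)} points")
```

A tighter threshold makes the loop read more points; a looser one returns sooner with a wider error bar, which is the budget/quality trade-off the slide describes.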

Algorithms
Given any problem, data and a time budget
» Automatically pick a solution on ML algorithm spectrum
[Figure: estimate plotted against time for a simple and a sophisticated algorithm, with the true answer marked; labels: error too high, pick sophisticated, pick simple]

Machines
The datacenter as a computer still in its infancy
» Special purpose clusters, e.g., Hadoop cluster
» Highly variable performance
» Hard to program
» Hard to debug

Machines: Problem
Rapid innovation in cloud computing
[Figure: frameworks shown include Dryad, Hypertable, Cassandra, and Pregel]
No single framework optimal for all applications
Want to run multiple frameworks in a single cluster
» to maximize utilization
» to share data between frameworks

Machines: A Solution
Apache Mesos: a resource sharing layer supporting diverse frameworks
» Fine-grained sharing: Improves utilization, latency, and data locality
» Resource offers: Simple, scalable application-controlled scheduling mechanism
[Figure: Hadoop and Pregel running on top of Mesos across a shared set of nodes]
B. Hindman, et al., Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011, March 2011. http://mesos.apache.org/
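To make "resource offers" concrete, here is a toy simulation. It is a sketch under stated assumptions, not the Mesos API: the classes `Offer` and `Framework`, the function `run_offers`, and the rule that a framework launches at most one task per offer are all hypothetical. The point it illustrates is the division of labor: the master only offers free resources, and each framework decides for itself whether to accept.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """Free resources on one node, as offered by the (toy) master."""
    node: str
    cpus: int
    mem_gb: int


@dataclass
class Framework:
    """Hypothetical framework scheduler with a queue of identical tasks."""
    name: str
    task_cpus: int
    task_mem_gb: int
    pending: int
    launched: list = field(default_factory=list)

    def on_offer(self, offer):
        """Accept the offer by launching one task on it, or decline."""
        fits = offer.cpus >= self.task_cpus and offer.mem_gb >= self.task_mem_gb
        if self.pending == 0 or not fits:
            return False  # decline; the resources get offered elsewhere
        self.pending -= 1
        self.launched.append(offer.node)
        return True


def run_offers(free, frameworks):
    """Keep offering each node's free resources until nobody accepts."""
    progress = True
    while progress:
        progress = False
        for node in list(free):
            for fw in frameworks:
                cpus, mem_gb = free[node]
                if fw.on_offer(Offer(node, cpus, mem_gb)):
                    # Deduct what the accepted task uses before the next offer.
                    free[node] = (cpus - fw.task_cpus, mem_gb - fw.task_mem_gb)
                    progress = True
    return free


# Usage: two frameworks sharing two nodes, each node given as (cpus, mem_gb).
cluster = {"node1": (4, 8), "node2": (4, 8)}
hadoop = Framework("hadoop", task_cpus=1, task_mem_gb=2, pending=3)
pregel = Framework("pregel", task_cpus=2, task_mem_gb=2, pending=2)
run_offers(cluster, [hadoop, pregel])
print(hadoop.launched, pregel.launched, cluster)
```

Because each offer leads to at most one small task, tasks from both frameworks end up interleaved on the same nodes, which is the fine-grained sharing the slide refers to. A real allocator would also decide which framework to offer to first; that policy is left out of this sketch.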

People
Humans can make sense of messy data!

People
Make people an integrated part of the system!
» Leverage human activity
» Leverage human intelligence (crowdsourcing): curate and clean dirty data, answer imprecise questions, test and improve algorithms
Challenge
» Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)
[Figure: people supply data and activity to Machines + Algorithms and exchange questions and answers with them]
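One common way to cope with inconsistent answer quality is to ask several workers the same question and combine their answers. The sketch below uses simple majority voting, which is a swapped-in illustrative technique and not necessarily what the AMPLab systems do; the function name `aggregate` and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter


def aggregate(answers, min_agreement=0.6):
    """Combine redundant crowd answers by majority vote.

    Returns (label, agreement). The label is None when the winning answer
    has less than `min_agreement` of the votes, signalling that the
    question should go to more workers or to an expert.
    """
    if not answers:
        return None, 0.0
    label, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    return (label if agreement >= min_agreement else None), agreement


# Usage: three workers clean the same dirty record.
print(aggregate(["Berkeley", "Berkeley", "Berkley"]))
print(aggregate(["yes", "no"]))
```

The agreement fraction acts as a rough error bar on the human answer: asking more workers raises cost and latency but can lift agreement above the threshold.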

Our Vision: A Necessary Synergy
Algorithms, Machines, People
» Challenge 1: Data is Big
» Challenge 2: Data is Dirty
» Challenge 3: Questions are complex

Berkeley Data Analytics Stack (layers, top to bottom)
» Shark and BlinkDB (SQL), Spark Streaming, GraphX, MLBase
» Apache Spark
» Storage: HDFS / Hadoop Storage, Tachyon
» Resource Manager: Apache Mesos / YARN

Big Data in 2020
Almost Certainly:
» Create a new generation of big data scientists
» A real datacenter OS
» ML becoming an engineering discipline
If We're Lucky:
» System will know what to throw away
» Come up with answers in minutes no one knows
» People deeply integrated in big data analysis pipeline

Summary
Goal: Tame Big Data Problem
» Get results with right quality at the right time
Approach: Holistically integrate Algorithms, Machines, and People
Huge research issues across many domains

My Talks at LASER 2013
1. AMP Lab introduction (this talk)
2. The Datacenter Needs an Operating System
3. Mesos, part one
4. Dominant Resource Fairness
5. Mesos, part two
6. Spark