The Berkeley AMPLab - Collaborative Big Data Research


1 The Berkeley AMPLab - Collaborative Big Data Research
UC Berkeley
Anthony D. Joseph
LASER Summer School, September 2013

2 About Me
Education: MIT SB, MS, PhD
Joined Univ. of California, Berkeley in 1998
Current research areas:
» Cloud computing (Mesos)
» Secure Machine Learning (SecML)
» DETER security testbed
» Intel Science and Technology Center for User Security
» Other: Peer-to-Peer networking (Tapestry), Mobile computing, Wireless/Cellular networking

3 Sources Driving Big Data
It's All Happening Online
Every:
» Click
» Ad impression
» Billing event
» Fast forward, pause
» Friend request
» Transaction
» Network message
» Fault
User Generated (Web & Mobile)
Internet of Things / M2M
Scientific Computing

4 Challenge 1: Data is Big
[Figure: Projected Growth; series: Moore's Law, Overall Data, Particle Accel., DNA Sequencers]
Data grows faster than Moore's Law [IDC report, Kathy Yelick, LBNL]

5 Challenge 2: Data is Dirty
» Variety of diverse sources
» Uncurated
» No schema
» Inconsistent syntax and semantics
Dirty Data worse than Big Data

6 Challenge 3: Complex Questions
Hard questions
» What is the impact on traffic and home prices of building a new ramp?
Real-time questions
» Is there a cyber attack going on?
Open-ended questions
» How many supernovae happened last year?
Big Data Must Enable Decisions

7 Requires Multifaceted Approach
Three dimensions to improve data analysis
» Improving scale, efficiency, and quality of algorithms running in datacenters (Algorithms)
» Scaling up datacenters (Machines)
» Leverage human activity and intelligence (People)
Need to adaptively and flexibly combine all three dimensions

8 Algorithms, Machines, People (AMP)
Today's apps: fixed point in solution space
[Figure: solution space with axes Algorithms, Machines, People; labeled points: Watson/IBM, search]
Need techniques to dynamically pick best operating point

9 The AMP Lab
Make sense of data at scale by tightly integrating algorithms, machines, and people
[Figure: solution space with axes Algorithms, Machines, People; labeled points: Watson/IBM, search]

10 AMP Lab Faculty
» Alex Bayen (mobile sensing platforms)
» Armando Fox (systems)
» Michael Franklin (databases): Director
» Michael Jordan (machine learning): Co-director
» Anthony Joseph (secure machine learning & privacy)
» Randy Katz (systems)
» David Patterson (systems)
» Ion Stoica (systems): Co-director
» Scott Shenker (networking)

11 Algorithms
State-of-the-art Machine Learning (ML) algorithms do not scale
» Prohibitive to process all data points
[Figure: estimate vs. # of data points, with the true answer marked]
How do you know when to stop?

12 Algorithms
Given any problem, data and a budget
» Immediate results with continuous improvement
» Calibrate answer: provide error bars
[Figure: estimate vs. # of data points, with the true answer marked]
Error bars on every answer!

13 Algorithms
Given any problem, data and a budget
» Immediate results with continuous improvement
» Calibrate answer: provide error bars
[Figure: estimate vs. # of data points (time), with the true answer marked]
Stop when error smaller than a given threshold
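One way to picture "stop when error smaller than a given threshold" is a running estimate of a mean that reports an error bar after every data point. The sketch below is only an illustration under assumptions not in the slides: the quantity being estimated is a simple mean, the error bar is a normal-approximation 95% interval, and the names `estimate_mean`, `threshold`, and `min_samples` are hypothetical.

```python
import math
import random


def estimate_mean(stream, threshold, z=1.96, min_samples=30):
    """Consume values until the error-bar half-width drops below threshold.

    Returns (estimate, half_width, samples_used). If the stream ends first,
    the last estimate and its (wider) error bar are returned.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    half_width = math.inf
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_samples:
            # Standard error of the mean from the sample variance.
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
            if half_width < threshold:
                break  # error bar is tight enough: stop reading data
    return mean, half_width, n


if __name__ == "__main__":
    rng = random.Random(0)
    data = (rng.gauss(10.0, 2.0) for _ in range(1_000_000))
    est, err, used = estimate_mean(data, threshold=0.1)
    print(f"estimate = {est:.3f} +/- {err:.3f} after {used} of 1000000 points")
```

The point of the sketch is that only a small prefix of the data is read before the error bar becomes acceptable; the rest is never touched.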

14 Algorithms
Given any problem, data and a time budget
» Automatically pick a solution on ML algorithm spectrum
[Figure: estimate vs. time, with the true answer marked and an algorithm spectrum from simple to sophisticated; annotations: error too high, pick sophisticated, pick simple]
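The slide does not say how the choice is made. One possible reading, sketched here purely as an assumption, is to try methods from simple to sophisticated and move up only while the reported error is too high and time remains. The function `pick_solution` and its parameters are hypothetical names, not part of any AMPLab system.

```python
import time


def pick_solution(candidates, max_error, time_budget_s, clock=time.monotonic):
    """Try candidates in order, from simple to sophisticated.

    candidates: list of (name, run) pairs, where run() returns (answer, error).
    Returns (name, answer, error) for the lowest-error result obtained before
    the time budget ran out, or None if there are no candidates.
    """
    deadline = clock() + time_budget_s
    best = None
    for name, run in candidates:
        if best is not None and clock() >= deadline:
            break  # out of time: keep the best answer so far
        answer, error = run()
        if best is None or error < best[2]:
            best = (name, answer, error)
        if error <= max_error:
            break  # good enough: no need for a more sophisticated method
    return best


if __name__ == "__main__":
    # Stand-in methods with made-up answers and errors, for illustration only.
    methods = [
        ("simple", lambda: (1.0, 0.5)),
        ("sophisticated", lambda: (1.2, 0.05)),
    ]
    print(pick_solution(methods, max_error=0.1, time_budget_s=10.0))
```

The first candidate always runs, so the caller gets some answer even with a zero budget, which matches the "immediate results with continuous improvement" goal stated earlier.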

15 Machines
The datacenter as a computer still in its infancy
» Special purpose clusters, e.g., Hadoop cluster
» Highly variable performance
» Hard to program
» Hard to debug

16 Machines: Problem
Rapid innovation in cloud computing
[Figure: framework logos: Dryad, Hypertable, Cassandra, Pregel]
No single framework optimal for all applications
Want to run multiple frameworks in a single cluster
» to maximize utilization
» to share data between frameworks

17 Machines: A Solution
Apache Mesos: a resource sharing layer supporting diverse frameworks
» Fine-grained sharing: Improves utilization, latency, and data locality
» Resource offers: Simple, scalable application-controlled scheduling mechanism
[Figure: Hadoop and Pregel frameworks, Mesos, and cluster nodes]
B. Hindman, et al., Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011, March
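The resource-offer idea can be illustrated with a toy simulation: a sharing layer offers each node's free resources to the frameworks, and each framework decides for itself how much of an offer to accept. This is a simplified sketch and not the Mesos API; the `Framework` class, `on_offer`, and `offer_round` are hypothetical names, only CPUs are modeled, and offers go to frameworks in a fixed order rather than by any fairness policy.

```python
from dataclasses import dataclass, field


@dataclass
class Framework:
    name: str
    task_cpus: int       # CPUs each task needs
    pending_tasks: int   # tasks still waiting to be launched
    launched: list = field(default_factory=list)  # node name per launched task

    def on_offer(self, node, free_cpus):
        """Accept as much of the offer as is useful; return CPUs taken (0 declines)."""
        n = min(self.pending_tasks, free_cpus // self.task_cpus)
        self.pending_tasks -= n
        self.launched.extend([node] * n)
        return n * self.task_cpus


def offer_round(free, frameworks):
    """Offer each node's free CPUs to the frameworks in turn, updating free in place."""
    for node in free:
        for fw in frameworks:
            if free[node] == 0:
                break  # nothing left on this node to offer
            free[node] -= fw.on_offer(node, free[node])
    return free


if __name__ == "__main__":
    free = {"node1": 4, "node2": 4}
    hadoop = Framework("hadoop", task_cpus=1, pending_tasks=3)
    pregel = Framework("pregel", task_cpus=2, pending_tasks=2)
    offer_round(free, [hadoop, pregel])
    print(free, hadoop.launched, pregel.launched)
```

The scheduling decision lives in `on_offer`, inside the framework, which is the "application-controlled" part of the slide: the sharing layer only decides what to offer, never where a task must run.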

18 People
Humans can make sense of messy data!

19 People
Make people an integrated part of the system!
» Leverage human activity
» Leverage human intelligence (crowdsourcing): curate and clean dirty data, answer imprecise questions, test and improve algorithms
[Figure: Machines + Algorithms exchanging Questions, Answers, data, and activity with people]
Challenge
» Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)
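A common way to cope with inconsistent answer quality in crowdsourcing is to ask several workers the same question and keep the majority answer only when enough of them agree. The slides do not name a technique, so the following is a generic sketch; `aggregate_answers` and `min_agreement` are hypothetical names.

```python
from collections import Counter


def aggregate_answers(answers, min_agreement=0.6):
    """Majority-vote the workers' answers to one question.

    Returns (label, agreement). label is None when there are no answers or
    when the winning answer's share of votes is below min_agreement.
    """
    if not answers:
        return None, 0.0
    label, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    if agreement < min_agreement:
        return None, agreement  # too much disagreement: ask more workers
    return label, agreement


if __name__ == "__main__":
    print(aggregate_answers(["cat", "cat", "dog"]))
    print(aggregate_answers(["a", "b"]))
```

Returning the agreement level alongside the label lets the caller trade cost against quality, for example by paying for more answers only on questions where agreement is low.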

20 Our Vision: A Necessary Synergy
Challenge 1: Data is Big
Challenge 2: Data is Dirty
Challenge 3: Questions are complex
Algorithms, Machines, People

21 Berkeley Data Analytics Stack
[Figure: stack layers, top to bottom]
» Shark / BlinkDB (SQL), Spark Streaming, GraphX, MLBase
» Apache Spark
» HDFS / Hadoop Storage / Tachyon
» Apache Mesos / YARN Resource Manager

22 Big Data in 2020
Almost Certainly:
» Create a new generation of big data scientists
» A real datacenter OS
» ML becoming an engineering discipline
If We're Lucky:
» System will know what to throw away
» Come up with answers in minutes no one knows
» People deeply integrated in big data analysis pipeline

23 Summary
Goal: Tame Big Data Problem
» Get results with right quality at the right time
Approach: Holistically integrate Algorithms, Machines, and People
Huge research issues across many domains

24 My Talks at LASER
1. AMP Lab introduction (this talk)
2. The Datacenter Needs an Operating System
3. Mesos, part one
4. Dominant Resource Fairness
5. Mesos, part two
6. Spark
