The Berkeley AMPLab - Collaborative Big Data Research
UC Berkeley
Anthony D. Joseph
LASER Summer School, September 2013
About Me
Education: MIT SB, MS, PhD
Joined Univ. of California, Berkeley in 1998
Current research areas:
» Cloud computing (Mesos): http://mesos.apache.org/
» Secure Machine Learning (SecML): http://radlab.cs.berkeley.edu/wiki/secml
» DETER security testbed: http://deter-project.org/
» Intel Science and Technology Center for User Security: http://scrub.cs.berkeley.edu/
Other: Peer-to-Peer networking (Tapestry), Mobile computing, Wireless/Cellular networking
Sources Driving Big Data
It's All Happening Online
Every: click, ad impression, billing event, fast forward, pause, friend request, transaction, network message, fault
User Generated (Web & Mobile)
Internet of Things / M2M
Scientific Computing
Challenge 1: Data is Big
[Figure: projected growth (increase over 2010), 2010 to 2015, for Moore's Law, overall data, particle accelerators, and DNA sequencers]
Data grows faster than Moore's Law [IDC report, Kathy Yelick, LBNL]
Challenge 2: Data is Dirty
Variety of diverse sources
Uncurated
No schema
Inconsistent syntax and semantics
Dirty Data worse than Big Data
Challenge 3: Complex Questions
Hard questions
» What is the impact on traffic and home prices of building a new ramp?
Real-time questions
» Is there a cyber attack going on?
Open-ended questions
» How many supernovae happened last year?
Big Data Must Enable Decisions
Requires Multifaceted Approach
Three dimensions to improve data analysis
» Improving scale, efficiency, and quality of algorithms running in datacenters (Algorithms)
» Scaling up datacenters (Machines)
» Leveraging human activity and intelligence (People)
Need to adaptively and flexibly combine all three dimensions
Algorithms, Machines, People (AMP)
Today's apps: fixed point in solution space
[Figure: solution space spanned by Algorithms, Machines, and People; labeled points: Watson/IBM, search]
Need techniques to dynamically pick best operating point
The AMP Lab
Make sense of data at scale by tightly integrating algorithms, machines, and people
[Figure: solution space spanned by Algorithms, Machines, and People; labeled points: Watson/IBM, search]
AMP Lab Faculty
» Alex Bayen (mobile sensing platforms)
» Armando Fox (systems)
» Michael Franklin (databases): Director
» Michael Jordan (machine learning): Co-director
» Anthony Joseph (secure machine learning & privacy)
» Randy Katz (systems)
» David Patterson (systems)
» Ion Stoica (systems): Co-director
» Scott Shenker (networking)
Algorithms
State-of-the-art Machine Learning (ML) algorithms do not scale
» Prohibitive to process all data points
[Figure: estimate vs. # of data points, with the true answer marked]
How do you know when to stop?
Algorithms
Given any problem, data, and a budget
» Immediate results with continuous improvement
» Calibrate answer: provide error bars
[Figure: estimate vs. # of data points, with the true answer marked]
Error bars on every answer!
Algorithms
Given any problem, data, and a budget
» Immediate results with continuous improvement
» Calibrate answer: provide error bars
[Figure: estimate vs. # of data points (time), with the true answer marked]
Stop when error smaller than a given threshold
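A minimal sketch of the stopping rule on this slide, assuming the query is a simple mean and that a normal-approximation confidence interval is an acceptable error bar. The function name `estimate_mean`, the batch size, and the 1.96 multiplier are illustrative assumptions, not something the talk specifies.

```python
import math
import random


def estimate_mean(data, threshold, batch_size=100, z=1.96, seed=0):
    """Estimate the mean of `data` from a growing random sample.

    Illustrative only: samples without replacement and stops as soon as
    the approximate 95% error bar (half-width) drops below `threshold`.
    Returns (estimate, half_width, points_used).
    """
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)  # visit the data in random order

    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and squared deviations
    half_width = math.inf      # error bar from the most recent check
    for idx in order:
        x = data[idx]
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # Re-check the error bar once per batch rather than on every point.
        if n >= 2 and n % batch_size == 0:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
            if half_width < threshold:
                break  # answer is good enough: stop early
    return mean, half_width, n


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
    est, err, used = estimate_mean(data, threshold=0.5)
    print(f"estimate={est:.2f} +/- {err:.2f} using {used} of {len(data)} points")
```

The caller trades accuracy for work through `threshold`: a looser threshold returns sooner with a wider error bar, and a tighter one reads more data.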
Algorithms
Given any problem, data, and a time budget
» Automatically pick a solution on ML algorithm spectrum
[Figure: estimate vs. time for simple and sophisticated algorithms, with the true answer marked; annotations: error too high, pick sophisticated, pick simple]
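One way to picture picking a point on the algorithm spectrum is the sketch below. It assumes each candidate comes with a runtime estimate and an expected error, both hypothetical inputs. The names `Candidate` and `pick_algorithm` are made up for illustration, and the talk does not describe how such estimates would be obtained.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    est_seconds: float  # assumed runtime estimate (hypothetical input)
    est_error: float    # assumed expected error (hypothetical input)


def pick_algorithm(candidates: Sequence[Candidate],
                   time_budget: float) -> Optional[Candidate]:
    """Return the lowest-error candidate that fits the time budget, if any."""
    feasible = [c for c in candidates if c.est_seconds <= time_budget]
    if not feasible:
        return None  # nothing can finish in time
    return min(feasible, key=lambda c: c.est_error)


if __name__ == "__main__":
    spectrum = [
        Candidate("simple", est_seconds=1.0, est_error=0.20),
        Candidate("sophisticated", est_seconds=60.0, est_error=0.05),
    ]
    print(pick_algorithm(spectrum, time_budget=5.0))    # tight budget
    print(pick_algorithm(spectrum, time_budget=120.0))  # generous budget
```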
Machines
The datacenter as a computer is still in its infancy
» Special purpose clusters, e.g., Hadoop cluster
» Highly variable performance
» Hard to program
» Hard to debug
Machines: Problem
Rapid innovation in cloud computing
[Figure: framework logos, including Dryad, Hypertable, Cassandra, Pregel]
No single framework optimal for all applications
Want to run multiple frameworks in a single cluster
» to maximize utilization
» to share data between frameworks
Machines: A Solution
Apache Mesos: a resource sharing layer supporting diverse frameworks
» Fine-grained sharing: Improves utilization, latency, and data locality
» Resource offers: Simple, scalable application-controlled scheduling mechanism
[Figure: Hadoop and Pregel running on top of Mesos across a shared set of nodes]
B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," NSDI 2011, March 2011. http://mesos.apache.org/
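The resource-offer idea can be illustrated with a toy simulation. This is not the Mesos API: `Offer`, `Framework`, and `run_offer_round` are hypothetical names, and the fixed offer order is an assumption made for brevity. The point is only that the sharing layer offers free resources and each framework decides for itself how much to take.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    node: str
    cpus: int
    mem_gb: int


@dataclass
class Framework:
    name: str
    task_cpus: int
    task_mem_gb: int
    pending_tasks: int
    launched: list = field(default_factory=list)  # node name per launched task

    def consider(self, offer: Offer) -> int:
        """Application-controlled decision: how many tasks to launch on this
        offer. Returning 0 declines the offer."""
        fit = min(offer.cpus // self.task_cpus,
                  offer.mem_gb // self.task_mem_gb)
        return min(fit, self.pending_tasks)


def run_offer_round(nodes: dict, frameworks: list) -> None:
    """Offer each node's free [cpus, mem_gb] to the frameworks in list order.

    Assumption: a fixed order stands in for whatever allocation policy a
    real resource manager would apply.
    """
    for node, free in nodes.items():
        for fw in frameworks:
            count = fw.consider(Offer(node, free[0], free[1]))
            if count == 0:
                continue  # declined; the same resources go to the next framework
            free[0] -= count * fw.task_cpus
            free[1] -= count * fw.task_mem_gb
            fw.pending_tasks -= count
            fw.launched.extend([node] * count)


if __name__ == "__main__":
    nodes = {"n1": [4, 8], "n2": [4, 8]}
    hadoop = Framework("hadoop", task_cpus=1, task_mem_gb=2, pending_tasks=3)
    pregel = Framework("pregel", task_cpus=2, task_mem_gb=2, pending_tasks=2)
    run_offer_round(nodes, [hadoop, pregel])
    print(hadoop.launched, pregel.launched, nodes)
```

Because the placement logic lives in `consider`, a framework with different needs (for example, data locality) could decline offers from unsuitable nodes without any change to the sharing layer.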
People
Humans can make sense of messy data!
People
Make people an integrated part of the system!
» Leverage human activity
» Leverage human intelligence (crowdsourcing): curate and clean dirty data, answer imprecise questions, test and improve algorithms
Challenge
» Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)
[Figure: people supply data and activity to Machines + Algorithms, which send them questions and receive answers]
Our Vision: A Necessary Synergy
[Figure: Algorithms, Machines, and People, surrounded by the three challenges]
Challenge 1: Data is Big
Challenge 2: Data is Dirty
Challenge 3: Questions are complex
Berkeley Data Analytics Stack
[Figure: layered stack, top to bottom]
Shark and BlinkDB (SQL), Spark Streaming, GraphX, MLBase
Apache Spark
HDFS / Hadoop Storage / Tachyon
Apache Mesos / YARN Resource Manager
Big Data in 2020
Almost Certainly:
» Create a new generation of big data scientists
» A real datacenter OS
» ML becoming an engineering discipline
If We're Lucky:
» System will know what to throw away
» Come up with answers in minutes no one knows
» People deeply integrated in big data analysis pipeline
Summary
Goal: Tame Big Data Problem
» Get results with right quality at the right time
Approach: Holistically integrate Algorithms, Machines, and People
Huge research issues across many domains
My Talks at LASER 2013
1. AMP Lab introduction (this talk)
2. The Datacenter Needs an Operating System
3. Mesos, part one
4. Dominant Resource Fairness
5. Mesos, part two
6. Spark