Making Sense at Scale with Algorithms, Machines & People!

Size: px

Start display at page:

Download "Making Sense at Scale with Algorithms, Machines & People!"

George Dennis
8 years ago
Views:

1 UC BERKELEY Making Sense at Scale with Algorithms, Machines & People PI: Michael Franklin University of California, Berkeley Expeditions in Computing PI Meeting May 15, 2013

2 2 The Berkeley AMPLab

3 Sources Driving Big Data It s All Happening On- line Every: Click Ad impression Billing event Fast Forward, pause, Friend Request Transaction Network message Fault User Generated (Web & Mobile).. Internet of Things / M2M ScienCfic CompuCng

Friend Request Transaction Network message Fault User

4 Challenge 1: Data is Big 60 Projected Growth Increase over Moore's Law Overall Data Par8cle Accel. DNA Sequencers Data Grows faster than Moore s Law [IDC report, Kathy Yelick, LBNL]

DNA Sequencers 0 2010 2011 2012 2013 2014 2015 Data Grows

5 Challenge 2: Data is Dirty Variety of diverse sources Uncurated No schema Inconsistent syntax and semantics Dirty Data worse than Big Data

6 Challenge 3: Complex Questions Hard questions What is the impact on traffic and home prices of building a new onramp? Detect real-time events Is there a cyber attack going on? Open-ended questions How many supernovae happened last year?

Detect real-time events Is there a cyber attack going on?

7 Our Vision: A Necessary Synergy lgorithms achines eople Challenge 1: Data is Big Challenge 2: Data is Dirty Challenge 3: Ques8ons are complex

8 The AMPLab Big Bets Traditional intellectual borders hinder Big Data stacks Need Machine Learning/Systems/Database Co-Design Requires Cohabitation and Real Collaboration Now is a unique opportunity to rethink fundamental design points: Changing Latency Demands Changing Consistency Requirements Cloud-based Elastic Resources Huge Desire for New Solutions in the Marketplace Open Source is the key to Tech Transfer in Big Data Need to consider role of people throughout the entire analytics lifecycle 8

Latency Demands Changing Consistency Requirements Cloud-based Elastic Resources Huge Desire for New Solutions in the

9 AMPLab: Collaborative Research An integration of Faculty Interests (*Directors): Alex Bayen (Mobile Sensing) Ken Goldberg (Crowdsourcing) *Michael Franklin (Databases) Anthony Joseph (Sec./ Privacy) Randy Katz (Systems) Dave Pa`erson (Systems) Armando Fox (Systems) *Ion Stoica (Systems) *Mike Jordan (Machine Learning) Sco` Shenker (Networking) 50+ amazing grad students, post-docs, undergrads, developers, staff & visitors Twice-Yearly Research Retreats (industry & sponsors): 9

/ Privacy) Randy Katz (Systems) Dave Pa`erson (Systems) Armando Fox (Systems) *Ion Stoica (Systems) *Mike Jordan

10 10 Co-Located for Collaboration

11 Collaboration: Industry + Government AMPLab Launched January 2011 (5 yr plan) Founding Sponsors: Sponsors and Affiliates: Federal Grants and Contracts: 11 Expeditions in Computing XData Program

12 Collaboration: Applications Participatory Sensing Mobile Millenium - Traffic Collective Discovery Opinion Space - Opinions Carat Smartphone energy Urban Planning and Simulation UrbanSim data integration Cancer Genomics/Personalized Medicine (w/ UCSF and UCSC) SNAP: Fast Sequence Alignment Genome Data Warehouse 12

Planning and Simulation UrbanSim data integration Cancer Genomics/Personalized

13 13 Shared Deliverable: Berkeley Data Analytics Stack (BDAS)

14 BDAS: Current Snapshot Spark Streaming Spark Graph ML base BlinkDB Shark HIVE Pig Storm MPI Data Processing Spark Hadoop Tachyon HDFS Data Mgmt. Mesos Resource Mgmt. Released (BDAS) In development (BDAS) Exis8ng open source stack BDAS Components being released under BSD or Apache Open Source License

15 Big Data Landscape Our Corner 15

16 Impact (so far) Open Source Release of BDAS components: Mesos: Cluster Virtualization Business critical services on servers at Twitter see How Twitter Rebuilt Google s Secret Weapon Wired 3/13 Spark: In-memory Computation Framework & Shark: Hive-Compatible SQL Query Engine on Spark in use at large companies, start ups, and govt. agencies 100x Performance Improvement over Hadoop/Apache Hive available on Amazon Elastic Map Reduce 700+ member Meetup group Best Paper Awards: Eurosys 13, ICDE 13, NSDI 12, SIGCOMM 12 and Best Demo Award: SIGMOD 12 Students in high-demand in academia and industry 16

17 Spark: Sys/ML Collaboration at Work Technical Challenge: disk-oriented Hadoop Map Reduce inefficient for iterative Machine Learning iter. 1 iter Research Challenge Addressed: How to design a distributed memory abstraction that is both fault-tolerant and efficient? Logistic Regression Performance Solution: Resilient Distributed Datasets (RDDs) 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)

.. Research Challenge Addressed: How to design a distributed memory abstraction that is both

18 Impact: Carat Smartphone App 18 Over 500,000 downloads

19 MLBase Declarative ML Vision: Make Machine Learning usable by mere mortals Allow high-level (declarative) specification of ML tasks Use Database-style query optimization to generate efficient execution strategy 19

specification of ML tasks Use Database-style query

20 Hybrid Human/Machine Systems Use machines for bulk data processing Leverage human activity for data collection and event detection Leverage human knowledge, reasoning and perception for: subjective entity comparisons complex predicates finding missing data disambiguating questions CrowdSQL MetaData Statistics Disk 1 Disk 2 Parser Optimizer Executor Files Access Methods Results Turker Relationship Manager UI Creation Form Editor UI Template Manager HIT Manager e.g., CrowdDB Architecture 20

missing data disambiguating questions CrowdSQL MetaData Statistics Disk 1 Disk 2 Parser Optimizer Executor Files Access

21 Outreach AMPCamp Berkeley, August 2012 AMPCamp Strata Conf., Feb 2013 AMPCamp Berkeley, August 2013 AMPCamp Online: ampcamp.berkeley.edu 21

22 What do we get from Expeditions? Simply put the ability to swing for the fences 22

23 For More Information amplab.cs.berkeley.edu Papers and Project Pages News updates and Blogs Github and Apache 23

The Berkeley AMPLab - Collaborative Big Data Research

The Berkeley AMPLab - Collaborative Big Data Research UC BERKELEY Anthony D. Joseph LASER Summer School September 2013 About Me Education: MIT SB, MS, PhD Joined Univ. of California, Berkeley in 1998 Current