CS 294: Big Data System Research: Trends and Challenges

1 CS 294: Big Data System Research: Trends and Challenges
Fall 2015 (MW 9:30-11:00, 310 Soda Hall)
Ion Stoica and Ali Ghodsi

2 Big Data
First papers:
» 2003: The Google File System paper
» 2004: The MapReduce paper
Today every major systems & networking conference has Big Data sessions

3 Big Data Impact
Already helped create new businesses
Already helped disrupt existing businesses
» Retail
» Rental
» Taxi
» Home appliances

4 Big Data Stack
Data Processing Layer
Resource Management Layer
Storage Layer

5 Hadoop Stack
Data Processing Layer: Hive, Pig, Impala, Storm, Hadoop MR
Resource Management Layer: Hadoop Yarn
Storage Layer: HDFS, S3, ...

6 The Berkeley AMPLab
Algorithms, Machines, People
January
» 8 faculty
» > 40 students
» Software engineering team of 3
Organized for collaboration
» 3-day retreats (twice a year)
» AMPCamp3 (August 2013): 220 campers (100+ companies)

7 The Berkeley AMPLab
Governmental and industrial funding
Goal: Next generation of open source data analytics stack for industry & academia: Berkeley Data Analytics Stack (BDAS)

8 BDAS Stack
Data Processing Layer: Spark Streaming, BlinkDB, Shark SQL, GraphX, MLBase, MLlib, Spark
Resource Management Layer: Mesos
Storage Layer: Tachyon, HDFS, S3, ...

9 BDAS & Hadoop fitting together
Data processing: Spark Streaming, BlinkDB, GraphX, Shark SQL, MLBase, MLlib, Spark
Resource management: Mesos, Hadoop Yarn
Storage: Tachyon, HDFS, S3, ...

10 How do BDAS & Hadoop fit together?
BDAS: Spark Streaming, BlinkDB, Shark SQL, GraphX, MLBase, MLlib, Spark; Mesos; Tachyon; HDFS, S3, ...
Hadoop: Hive, Pig, Impala, Storm, Hadoop MR; Hadoop Yarn; HDFS, S3, ...


11 How do BDAS & Hadoop fit together?
Data processing: Spark Streaming, BlinkDB, Shark SQL, GraphX, MLBase, MLlib, Hive, Pig, Impala, Storm, Spark, Hadoop MR
Resource management: Mesos, Hadoop Yarn
Storage: Tachyon, HDFS, S3, ...

12 This Class
Learn about state-of-the-art research in Big Data
Work on an exciting project
Hopefully start the next generation of impactful projects

13 Grading
Project: 60%
Class presentations: 40%
» Around 2 papers per student
» See Randy's guidelines for leading discussion on papers: http://bnrg.eecs. ... CS294.F07/LeadingPapers.pdf

14 Administrative Information
Class website:
Office Hours (Soda 465D):
» TBA
Create an (anonymized) blog account for paper reviews if you don't have one yet (e.g., www.blogger.)
Send me an email by Monday, August 31, with your blog URL
» Preferred email for the class list

15 Papers
Is the problem real?
What is the solution's main idea (nugget)?
Why is the solution different from previous work?
» Are system assumptions different?
» Is workload different?
» Is problem new?
Does the paper (or do you) identify any fundamental/hard trade-offs?

16 Papers (cont'd)
Do you think the work will be influential in 10 years?
» Why or why not?
Predicting the future is hard, but worth a try
» Look at past examples for inspiration

17 Streaming Over TCP
Countless papers:
» Why it cannot be done
» New protocols to do it
Today
» Virtually all streaming is over TCP
» Trend to stream over HTTP!

18 Why Did It Succeed?

19 Multicast
Countless papers:
» Why the world will come to a standstill without multicast
» New protocols to do it
Today
» Multicast is used only in enterprise settings at best
» Overlay multicast is widely used in the Internet
  CDN based, e.g., World Cup, March Madness, inaugurations, ...
  P2P, mostly popular outside the US (e.g., China)

20 Why Did It Fail?

21 Shared Memory
Countless papers:
» How shared memory simplifies programming parallel computers
» Many, many systems proposed and built
Today:
» Message passing (MPI) took over as the de facto standard for writing parallel applications

22 Why Did It Fail?

23 Network Computer
Big in the 90s
» Promoted by an alliance of Sun, Oracle, Acorn
Promise: many of the advantages of cloud computing
» Easy to manage
» Application sharing
Failed miserably

24 Why Did It Fail?

25 Coming Back: ChromeOS
Will it succeed this time?

26 What are Hard/Fundamental Tradeoffs?
Brewer's CAP conjecture: Consistency, Availability, Partition-tolerance; you can have only two in a distributed system
In an in-order, reliable communication protocol, you cannot minimize overhead and latency simultaneously
Hard to simultaneously maximize evolvability and performance
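The CAP trade-off can be made concrete with a toy model. The sketch below is not from the slides: the names `Replica`, `Unavailable`, and `demo` are hypothetical, and the model assumes just two replicas of a single value with a link that is either up or cut. During a partition, a "CP" replica rejects writes (it stays consistent but is unavailable), while an "AP" replica accepts them locally (it stays available but the replicas diverge).

```python
class Unavailable(Exception):
    """Raised when a replica rejects a request in order to stay consistent."""


class Replica:
    """One of two copies of a single value (illustrative only)."""

    def __init__(self):
        self.value = None
        self.peer = None
        self.partitioned = False  # True when the link to the peer is cut

    def write(self, value, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        if not self.partitioned:
            # Link is up: update both copies, so they stay identical.
            self.value = value
            self.peer.value = value
            return
        if mode == "CP":
            # Prefer consistency: refuse the write rather than diverge.
            raise Unavailable("peer unreachable, write rejected")
        # Prefer availability: accept locally, the peer keeps the old value.
        self.value = value


def demo(mode):
    """Return (available, consistent) for a write issued during a partition."""
    a, b = Replica(), Replica()
    a.peer, b.peer = b, a
    a.write(1, mode)                       # replicated to both copies
    a.partitioned = b.partitioned = True   # cut the link
    try:
        a.write(2, mode)
        available = True
    except Unavailable:
        available = False
    return available, a.value == b.value


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        print(mode, demo(mode))
```

In this model neither mode gets both properties once the link is cut, which is the point of the conjecture; real systems differ in where and how they make that choice.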


Conquering Big Data with BDAS (Berkeley Data Analytics)
Ion Stoica, UC Berkeley / Databricks / Conviva
Extracting Value from Big Data
Insights, diagnosis, e.g.,
» Why is user engagement dropping?