CS 294: Big Data System Research: Trends and Challenges



Similar documents
Conquering Big Data with BDAS (Berkeley Data Analytics)

Berkeley Data Analytics Stack:! Experience and Lesson Learned

The Berkeley AMPLab - Collaborative Big Data Research

Big Data Research in the AMPLab: BDAS and Beyond

What s next for the Berkeley Data Analytics Stack?

Ali Ghodsi Head of PM and Engineering Databricks

How Companies are! Using Spark

Next-Gen Big Data Analytics using the Spark stack

Beyond Hadoop with Apache Spark and BDAS

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Conquering Big Data with Apache Spark

Hadoop Ecosystem B Y R A H I M A.

The Berkeley Data Analytics Stack: Present and Future

Case Study : 3 different hadoop cluster deployments

Dell In-Memory Appliance for Cloudera Enterprise

Moving From Hadoop to Spark

Why Spark on Hadoop Matters

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Unified Big Data Analytics Pipeline. 连 城

How To Create A Data Visualization With Apache Spark And Zeppelin

Archiving and Sharing Big Data Digital Repositories, Libraries, Cloud Storage

Tachyon: memory-speed data sharing

Unified Big Data Processing with Apache Spark. Matei

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

A Brief Introduction to Apache Tez

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

An Open Source Memory-Centric Distributed Storage System

Hadoop in the Enterprise

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Tachyon: A Reliable Memory Centric Storage for Big Data Analytics

Second Credit Seminar Presentation on Big Data Analytics Platforms: A Survey

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Hadoop Big Data for Processing Data and Performing Workload

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Analytics on Spark &

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Shark Installation Guide Week 3 Report. Ankush Arora

Information Builders Mission & Value Proposition

6.S897 Large-Scale Systems

The Internet of Things and Big Data: Intro

Big Data and Industrial Internet

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Spark: Making Big Data Interactive & Real-Time

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Architectures for massive data management

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

Big Data Analytics. Lucas Rego Drumond

Oracle Big Data SQL Technical Update

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Enabling High performance Big Data platform with RDMA

Big Data Analytics Hadoop and Spark

Native Connectivity to Big Data Sources in MSTR 10

Big Data Frameworks Course. Prof. Sasu Tarkoma

CSE-E5430 Scalable Cloud Computing Lecture 11

StratioDeep. An integration layer between Cassandra and Spark. Álvaro Agea Herradón Antonio Alcocer Falcón

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

A survey on platforms for big data analytics

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler

TRAINING PROGRAM ON BIGDATA/HADOOP

DANIEL EKLUND UNDERSTANDING BIG DATA AND THE HADOOP TECHNOLOGIES NOVEMBER 2-3, 2015 RESIDENZA DI RIPETTA - VIA DI RIPETTA, 231 ROME (ITALY)

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Spark and the Big Data Library

Processing NGS Data with Hadoop-BAM and SeqPig

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Next presentation starting soon Business Analytics using Big Data to gain competitive advantage

Machine- Learning Summer School

Dominik Wagenknecht Accenture

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Big Data. Lyle Ungar, University of Pennsylvania

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

HiBench Introduction. Carson Wang Software & Services Group

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Large scale processing using Hadoop. Ján Vaňo

Making Sense at Scale with the Berkeley Data Analytics Stack

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Hadoop & Spark Using Amazon EMR

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Big Data Processing. Patrick Wendell Databricks

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Workshop on Hadoop with Big Data

Spark Application on AWS EC2 Cluster:

Big Data Realities Hadoop in the Enterprise Architecture

HADOOP VENDOR DISTRIBUTIONS THE WHY, THE WHO AND THE HOW? Guruprasad K.N. Enterprise Architect Wipro BOTWORKS

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Azure Data Lake Analytics

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Self-service BI for big data applications using Apache Drill

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Transcription:

CS 294: Big Data System Research: Trends and Challenges Fall 2015 (MW 9:30-11:00, 310 Soda Hall) Ion Stoica and Ali Ghodsi (http://www.cs.berkeley.edu/~istoica/classes/cs294/15/) 1

Big Data First papers:» 2003: The Google file system paper» 2004: The MapReduce paper Today every major system & networking conference has Big Data sessions

Big Data Impact Already helped create new business Already helped disrupt existing businesses» Retail» Rental» Taxi» home appliances»

Big Data Stack Data Processing Layer Resource Management Layer Storage Layer

Hadoop Stack Hive Pig Data Processing ImpalaLayer Storm Hadoop MR Resource Hadoop Management Yarn Layer Storage HDFS, S3, Layer

The Berkeley AMPLab lgorithms January 2011 2017» 8 faculty» > 40 students» 3 software engineer team Organized for collaboration achines eople AMPCamp3 (August, 2013) 3 day retreats (twice a year) 220 campers (100+ companies)

The Berkeley AMPLab Governmental and industrial funding: Goal: Next generation of open source data analytics stack for industry & academia: Berkeley Data Analytics Stack (BDAS)

Spark Streaming BDAS Stack BlinkDB Data Processing Layer Shark SQL Spark Resource Management Mesos Layer Tachyon GraphX Storage Layer HDFS, S3, MLBase MLlib

BDAS & Hadoop fitting together Spark Streaming BlinkDB GraphX Shark SQL Spark Hadoop Mesos Yarn Tachyon HDFS, S3, HDFS, S3, MLBase MLlib

How do BDAS & Hadoop fit together? Spark Spark Streaming Straming BlinkDB Shark SQL BlinkDB Shark SQL Spark Graph X MLbase ML library Spark GraphX Hive Pig Hadoop MR MLBase MLlib Impala Storm Hadoop Mesos Yarn Tachyon HDFS, S3, HDFS, S3,

How do BDAS & Hadoop fit together? Spark Straming BlinkDB Shark SQL Graph X MLbase ML library Hive Pig Impala Storm Spark Hadoop MR Hadoop Mesos Yarn Tachyon HDFS, S3, HDFS, S3,

This Class Learn about state-of-art research in Big Data Work on an exciting project Hopefully start next generation of impactful projects

Grading Project: 60% Class presentations: 40%» Around 2 papers per student» See Randy s guidelines for leading discussion on papers http://bnrg.eecs.berkeley.edu/~randy/courses/ CS294.F07/LeadingPapers.pdf 13

Administrative Information Class website: http://www.cs.berkeley.edu/~istoica/classes/cs294/15/ Office Hours (Soda 465D):» TBA Create an (anonymized) blog account for paper reviews if you don t have one yet (e.g., www.blogger.com)» Sent me an e-mail by Monday, August 31, with your blog url» Preferred e-mail for the class e-mail list 14

Is the problem real? Papers What is the solution s main idea (nugget)? Why is solution different from previous work?» Are system assumptions different?» Is workload different?» Is problem new? Does the paper (or do you) identify any fundamental/hard trade-offs? 15

Papers (cont d) Do you think the work will be influential in 10 years?» Why or why not? Predicting the future hard, but worth a try» Look at past examples for inspiration 16

Streaming Over TCP Countless papers:» Why cannot be done» New protocols to do it Today» Virtually all streaming over TCP» Trend to stream over HTTP! 17

Why did it Succeed? 18

Multicast Countless papers:» Why world will come to a standstill without multicast» New protocols to do it Today» Multicast is used only in enterprise settings at best» Overlay multicast widely used in the Internet CDN based, e.g., WorldCup, March Madness, Iinagurations,... P2P, mostly popular outside US (e.g., China) 19

Why Did it Fail? 20

Shared Memory Countless papers:» How shared memory simplifies programming parallel computers» Many, many systems proposed and build Today:» Message passing (MPI) took over as the de facto standard for writing parallel applications 21

Why Did it Fail? 22

Network Computer Big in 90s» Promoted by an alliance of Sun, Oracle, Acorn Promise: many of advantages of cloud computing» Easy to manage» Application sharing» Failed miserably 23

Why Did it Fail? 24

Coming Back: ChromeOS Will it succeed this time? 25

What are Hard/Fundamental Tradeoffs? Brewer s CAP conjecture: Consistency, Availability, Partition-tolerance, you can have only two in a distributed system In a in-order, reliable communication protocol cannot minimize overhead and latency simultaneously Hard to simultaneously maximize evolvability and performance 26