Big Data Processing. Patrick Wendell Databricks


Big Data Processing Patrick Wendell Databricks

About me Committer and PMC member of Apache Spark. Former PhD student at Berkeley; left Berkeley to help found Databricks. Now managing open source work at Databricks. Focus is on networking and operating systems.

Outline Overview of today The Big Data problem The Spark Computing Framework

Straw poll. Have you: - Written code in a numerical programming environment (R/Matlab/Weka/etc)? - Written code in a general programming language (Python/Java/etc)? - Written multi-threaded or distributed computer programs?

Today's workshop Overview of trends in large-scale data analysis. Introduction to the Spark cluster-computing engine, with a focus on numerical computing. Hands-on workshop with TAs covering Scala basics and using Spark for machine learning.

Hands-on Exercises You'll be given a cluster of 5 machines* (a) Text mining of the full text of Wikipedia (b) Movie recommendations with ALS *5 machines x 200 people = 1,000 machines

Outline Overview of today The Big Data problem The Spark Computing Framework

The Big Data Problem Data is growing faster than computation speeds. Growing data sources» web, mobile, scientific, ... Cheap storage» capacity doubling every 18 months. Stalling CPU speeds. Storage bottlenecks.

Examples Facebook's daily logs: 60 TB. 1000 Genomes Project: 200 TB. Google web index: 10+ PB. Cost of 1 TB of disk: $50. Time to read 1 TB from disk: 6 hours (at 50 MB/s).

The Big Data Problem A single machine can no longer process, or even store, all the data! The only solution is to distribute the data over large clusters.

Google Datacenter How do we program this thing?

What's hard about cluster computing? How to divide work across nodes? Must consider network and data locality; moving data may be very expensive. How to deal with failures? If 1 server fails every 3 years, then 10K nodes see ~10 faults/day. Even worse: stragglers (a node that is not failed, but slow).
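
The fault-rate figure above is simple arithmetic; a quick sanity check in plain Python, using the numbers from the slide:

```python
# If one server fails about every 3 years, a 10,000-node cluster
# sees roughly ten failures per day.
nodes = 10_000
mtbf_days = 3 * 365            # mean time between failures per server, in days
faults_per_day = nodes / mtbf_days
print(round(faults_per_day, 1))  # → 9.1, i.e. ~10 faults/day
```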

A comparison: How to divide work across nodes? What if certain elements of a matrix/array became 1,000 times slower to access than others? How to deal with failures? What if, with 10% probability, certain variables in your program became undefined?

Outline Overview of today The Big Data problem The Spark Computing Framework

The Spark Computing Framework Provides a programming abstraction and parallel runtime to hide this complexity. "Here's an operation, run it on all of the data.» I don't care where it runs (you schedule that).» In fact, feel free to run it twice on different nodes."

Resilient Distributed Datasets Key programming abstraction in Spark. Think "parallel data frame". Supports collection/set APIs. # get variance of 5th field of a tab-delimited dataset rdd = spark.hadoopFile("big-file.txt") rdd.map(line => line.split("\t")(4)).map(cell => cell.toDouble()).variance()
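
The pipeline above can be simulated locally in plain Python, with no Spark; the toy lines and values below are invented for illustration:

```python
# Same computation as the slide's RDD pipeline, over an in-memory list:
# take the 5th tab-delimited field of each line, parse it, compute variance.
lines = ["a\tb\tc\td\t1.0",
         "a\tb\tc\td\t3.0",
         "a\tb\tc\td\t5.0"]
values = [float(line.split("\t")[4]) for line in lines]  # 5th field = index 4
mean = sum(values) / len(values)
# population variance, matching what the RDD variance() method computes
variance = sum((v - mean) ** 2 for v in values) / len(values)
print(variance)  # 8/3 ≈ 2.667
```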

RDD Execution Automatically split work into many small, idempotent tasks Send tasks to nodes based on data locality Load-balance dynamically as tasks finish
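
A toy sketch of the load-balancing idea in plain Python (the worker names and the 3:1 speed ratio are invented): idle workers pull the next task from a shared queue, so faster workers naturally end up running more of the small tasks.

```python
from collections import defaultdict
from queue import Queue

tasks = Queue()
for t in range(12):                  # 12 small, independent tasks
    tasks.put(t)

# "fast" pulls tasks three times as often as "slow" -- a stand-in
# for a node that finishes its tasks three times faster
pull_order = ["fast", "fast", "fast", "slow"]
assigned = defaultdict(list)
i = 0
while not tasks.empty():
    worker = pull_order[i % len(pull_order)]
    assigned[worker].append(tasks.get())
    i += 1
print(len(assigned["fast"]), len(assigned["slow"]))  # 9 3
```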

Fault Recovery 1. If a task crashes:» Retry on another node. OK for a map task because it has no dependencies; OK for a reduce task because the map outputs are on disk. Requires user code to be deterministic.

Fault Recovery 2. If a node crashes:» Relaunch its current tasks on other nodes» Relaunch any maps the node previously ran Necessary because their output files were lost

Fault Recovery 3. If a task is going slowly (straggler):» Launch second copy of task on another node» Take the output of whichever copy finishes first, and kill the other one
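
The straggler trick can be sketched in plain Python (the node names and finish times below are hypothetical). Because tasks are deterministic and idempotent, launching a second copy and keeping whichever finishes first is safe:

```python
def run_task(data):
    # a deterministic, idempotent task: same input -> same output,
    # so running it twice on different nodes is harmless
    return sum(data)

copies = [
    {"node": "n1", "finish_time": 40.0},   # the straggler
    {"node": "n2", "finish_time": 5.0},    # speculative second copy
]
winner = min(copies, key=lambda c: c["finish_time"])  # first to finish wins
result = run_task([1, 2, 3])       # both copies compute this same value
print(winner["node"], result)  # n2 6
```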

Spark Compared with Earlier Approaches Higher-level, declarative API. Built from the ground up for performance, including support for leveraging distributed cluster memory. Optimized for iterative computations (e.g., machine learning).

Spark Platform Spark RDD API

Spark Platform: GraphX graph = Graph(vertexRDD, edgeRDD) graph.connectedComponents() # returns a new RDD [Stack diagram: GraphX (alpha), RDD-based graphs, on the Spark RDD API]
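
As a local sketch of what connectedComponents() computes (plain Python union-find; the toy graph is invented), each vertex is labeled with the smallest vertex id in its connected component:

```python
def connected_components(vertices, edges):
    # union-find: label every vertex with the smallest id in its component
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # merge into smaller root
    return {v: find(v) for v in vertices}

# two components: {1, 2} and {3, 4}
print(connected_components([1, 2, 3, 4], [(1, 2), (3, 4)]))
# {1: 1, 2: 1, 3: 3, 4: 3}
```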

Spark Platform: MLLib model = LogisticRegressionWithSGD.train(trainRDD) dataRDD.map(point => model.predict(point)) [Stack diagram adds MLLib (machine learning), RDD-based matrices]
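
A minimal sketch of the algorithm behind LogisticRegressionWithSGD, written in plain Python over local lists instead of RDDs (the toy data, learning rate, and epoch count are invented for illustration):

```python
import math

def train(points, epochs=200, lr=0.5):
    # stochastic gradient descent on the logistic loss
    w = [0.0] * len(points[0][0])
    for _ in range(epochs):
        for x, label in points:
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            grad = p - label                       # gradient scale for this point
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# (features, label); feature 0 is a constant bias term
train_data = [([1.0, 2.0], 1), ([1.0, -2.0], 0)]
w = train(train_data)
print([predict(w, x) for x, _ in train_data])  # [1, 0]
```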

Spark Platform: Streaming dstream = spark.networkInputStream() dstream.countByWindow(Seconds(30)) [Stack diagram adds Streaming, RDD-based streams]
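
Locally, the windowed count can be sketched like this in plain Python (the toy timestamps are invented): count the events whose timestamps fall inside the trailing 30-second window.

```python
def count_by_window(events, now, window=30):
    # count events with timestamps in the half-open interval (now - window, now]
    return sum(1 for t in events if now - window < t <= now)

events = [1, 5, 12, 40, 55, 58]          # event timestamps, in seconds
print(count_by_window(events, now=60))   # events at 40, 55, 58 -> 3
```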

Spark Platform: SQL rdd = sql("select * from rdd1 where age > 10") [Stack diagram adds SQL, SchemaRDDs]
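
The query above amounts to a filter; a local sketch in plain Python (the rows are invented), with dicts standing in for a SchemaRDD:

```python
# rows of a toy "rdd1" table with name and age columns
rdd1 = [
    {"name": "ann", "age": 9},
    {"name": "bob", "age": 23},
    {"name": "cat", "age": 41},
]
# equivalent of: select * from rdd1 where age > 10
result = [row for row in rdd1 if row["age"] > 10]
print([row["name"] for row in result])  # ['bob', 'cat']
```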

Performance [Benchmark charts: SQL response time (s) for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem) [1]; streaming throughput (MB/s/node) for Storm vs. Spark Streaming [2]; graph-processing response time (min) for Hadoop, Giraph, GraphX [3]] [1] https://amplab.cs.berkeley.edu/benchmark/ [2] Discretized Streams: Fault-Tolerant Streaming Computation at Scale. SOSP 2013. [3] https://amplab.cs.berkeley.edu/publication/graphx-grades/

Composition Most large-scale data programs involve parsing, filtering, and cleaning data. Spark allows users to compose these patterns elegantly. E.g.: select an input dataset with SQL, then run machine learning on the result.

Spark Community One of the largest open source projects in big data: 150+ developers contributing, 30+ companies contributing. [Chart: contributors in the past year]

Community Growth Spark 0.6 (Oct '12): 17 contributors. Spark 0.7 (Feb '13): 31 contributors. Spark 0.8 (Sept '13): 67 contributors. Spark 0.9 (Feb '14): 83 contributors.

Databricks Primary developers of Spark today. Founded by Spark project creators. A nexus of several research areas:» OS and computer architecture» networking and distributed systems» statistics and machine learning

Today's Speakers Holden Karau: worked on large-scale storage infrastructure at Google. "Intro to Spark and Scala." Hossein Falaki: data scientist at Apple. "Numerical Programming with Spark." Xiangrui Meng: PhD from Stanford ICME, lead contributor on MLLib. "Deep dive on Spark's MLLib."

Questions?