Conquering Big Data with Apache Spark
Ion Stoica, November 1st, 2015, UC Berkeley
The Berkeley AMPLab
January 2011 - 2017. 8 faculty, > 50 students, 3-person software engineering team. Organized for collaboration around Algorithms, Machines, People (AMP). AMPCamp (since 2012): 3-day retreats twice a year, 400+ campers from hundreds of companies.
The Berkeley AMPLab
Government and industry funding. Goal: the next generation of the open source data analytics stack for industry & academia: the Berkeley Data Analytics Stack (BDAS).
Generic Big Data Stack Processing Layer Resource Management Layer Storage Layer
Hadoop Stack
Processing layer: Hive, Pig, Hadoop MR, Storm, Impala, Giraph
Resource management layer: YARN
Storage layer: HDFS
BDAS Stack
Processing layer: Spark Core, with Spark Streaming, SparkSQL, BlinkDB, SampleClean, SparkR, GraphX, MLbase/MLlib, Velox
Resource management layer: Mesos (also Hadoop YARN)
Storage layer: Tachyon, Succinct (3rd party: HDFS, S3, Ceph, ...)
Today's Talk
The Spark components of BDAS: Spark Core, Spark Streaming, SparkSQL, SparkR, GraphX, MLlib.
Overview 1. Introduction 2. RDDs 3. Generality of RDDs (e.g. streaming) 4. DataFrames 5. Project Tungsten
A Short History Started at UC Berkeley in 2009 Open Source: 2010 Apache Project: 2013 Today: most popular big data project
What Is Spark?
Parallel execution engine for big data processing.
Easy to use: 2-5x less code than Hadoop MR; high-level APIs in Python, Java, and Scala.
Fast: up to 100x faster than Hadoop MR; can exploit in-memory computation when available; low-overhead scheduling, optimized engine.
General: supports multiple computation models.
Analogy
First cellular phones, then specialized devices (a better phone, better games, better GPS), then a unified device: the smartphone.
Analogy
Batch processing, then specialized systems, then a unified system enabling better apps: real-time analytics, instant fraud detection.
General
Unifies batch, interactive, and streaming computation. Easy to build sophisticated applications: supports iterative and graph-parallel algorithms; powerful APIs in Scala, Python, Java, R. (Spark Core, with SparkSQL, Spark Streaming, MLlib, GraphX, SparkR on top.)
Easy to Write Code
WordCount in 3 lines of Spark vs. 50+ lines of Java MapReduce.
Fast: Time to Sort 100 TB
2013 record (Hadoop): 2100 machines, 72 minutes. 2014 record (Spark): 207 machines, 23 minutes. Spark also sorted 1 PB in 4 hours. Source: Daytona GraySort benchmark, sortbenchmark.org.
Community Growth

                     June 2014    June 2015
total contributors        255          730
contributors/month         75          135
lines of code         175,000      400,000
Meetup Groups: January 2015 source: meetup.com
Meetup Groups: October 2015 source: meetup.com
Community Growth

                          2014     2015
Summit attendees         1,100    3,900
Meetup members             12K      42K
Developers contributing    350      600
Large-Scale Usage Largest cluster: 8000 nodes Largest single job: 1 petabyte Top streaming intake: 1 TB/hour 2014 on-disk sort record
Spark Ecosystem Distributions Applications
Overview 1. Introduction 2. RDDs 3. Generality of RDDs (e.g. streaming) 4. DataFrames 5. Project Tungsten
RDD: Resilient Distributed Datasets
Collections of objects distributed across a cluster, stored in RAM or on disk, and automatically rebuilt on failure. Operations: transformations and actions. Execution model: data-parallel (similar to SIMD).
Operations on RDDs
Transformations: f(RDD) => RDD. Lazy (not computed immediately), e.g. map, filter, groupBy.
Actions: trigger computation, e.g. count, collect, saveAsTextFile.
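The laziness contract above can be sketched in plain Python, with no Spark required: transformations only record work, and nothing runs until an action forces it. The MiniRDD class here is an illustrative toy, not Spark's actual API.

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are recorded lazily,
    and only an action (count/collect) runs the pipeline."""
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded operations, not yet executed

    def map(self, f):                 # transformation: returns a new lazy RDD
        return MiniRDD(self.data, self.ops + [("map", f)])

    def filter(self, f):              # transformation: also lazy
        return MiniRDD(self.data, self.ops + [("filter", f)])

    def collect(self):                # action: triggers the computation
        out = list(self.data)
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):                  # action
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
# nothing has executed yet; only the action below runs the pipeline:
n = rdd.count()
```

Building `rdd` touches no data at all; the whole chain runs only inside `count()`, which is what lets Spark plan and schedule the full pipeline at once.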
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

  lines = spark.textFile("hdfs://...")                     # base RDD
  errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD
  messages = errors.map(lambda s: s.split("\t")[2])
  messages.cache()

  messages.filter(lambda s: "mysql" in s).count()          # action
  messages.filter(lambda s: "php" in s).count()            # answered from cache

Execution (driver plus workers, each holding one HDFS block):
1. The first count() is an action, so the driver ships tasks to the workers.
2. Each worker reads its HDFS block, applies the filter and map transformations, caches its partition of messages, and returns its results to the driver.
3. The second query runs against the cached partitions: no HDFS reads, so only the new filter and count execute.

Cache your data => faster results: full-text search of 60 GB of Wikipedia on 20 EC2 machines takes ~0.5 s from memory vs. ~20 s on-disk.
Language Support

Python:
  lines = sc.textFile(...)
  lines.filter(lambda s: "ERROR" in s).count()

Scala:
  val lines = sc.textFile(...)
  lines.filter(x => x.contains("ERROR")).count()

Java:
  JavaRDD<String> lines = sc.textFile(...);
  lines.filter(new Function<String, Boolean>() {
    Boolean call(String s) { return s.contains("ERROR"); }
  }).count();

Standalone programs: Python, Scala & Java
Interactive shells: Python & Scala
Performance: Java & Scala are faster due to static typing, but Python is often fine
Expressive API
MapReduce gives you: map, reduce.
Spark adds: filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
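To make the richer API concrete, here is a pure-Python rendering of what two of these operators compute (reduceByKey and join) on plain lists of key-value pairs; Spark runs the same logic, but partitioned across a cluster. The helper names are illustrative, not Spark code.

```python
from collections import defaultdict

def reduce_by_key(pairs, f):
    """What Spark's reduceByKey computes: merge all values per key with f."""
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

def join(left, right):
    """What Spark's join computes: inner join of (k, v) pairs on the key."""
    rights = defaultdict(list)
    for k, v in right:
        rights[k].append(v)
    return sorted((k, (v, w)) for k, v in left for w in rights.get(k, []))

counts = reduce_by_key([("a", 1), ("b", 1), ("a", 1)], lambda x, y: x + y)
joined = join([("a", 1), ("b", 2)], [("a", "x"), ("c", "y")])
```

Expressing these as built-in operators, rather than hand-written MapReduce jobs, is what makes a 3-line WordCount possible.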
Fault Recovery: Design Alternatives
Replication: slow (data must be written over the network) and memory inefficient.
Backup on persistent storage: persistent storage is still (much) slower than memory, and data must still go over the network to protect against machine failures.
Spark's choice, lineage: track the sequence of operations so that lost RDD partitions can be reconstructed efficiently.
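A minimal sketch of the lineage idea: instead of replicating a partition, record its parent and the function that produced it, and recompute only what was lost. The Partition class is illustrative, not Spark internals.

```python
class Partition:
    """A partition remembers its parent and the function that produced it,
    so it can be rebuilt from lineage instead of from a replica."""
    def __init__(self, compute, parent=None):
        self.compute = compute        # how to (re)build this partition
        self.parent = parent
        self.data = None              # None == lost / not yet materialized

    def get(self):
        if self.data is None:         # lost: walk lineage and recompute
            src = self.parent.get() if self.parent else []
            self.data = self.compute(src)
        return self.data

base = Partition(lambda _: [1, 2, 3, 4])   # e.g. read from stable storage
filtered = Partition(lambda xs: [x for x in xs if x % 2 == 0], parent=base)

assert filtered.get() == [2, 4]
filtered.data = None                        # simulate losing the partition
recovered = filtered.get()                  # rebuilt from lineage, same result
```

Nothing was copied over the network ahead of time: the cost of fault tolerance is paid only when a failure actually happens.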
Fault Recovery Example
Two-partition RDD A = {A1, A2} stored on disk:
1) filter and cache => RDD B
2) join => RDD C
3) aggregate => RDD D
Lineage: A1 -> filter -> B1 -> join -> C1 -> agg -> D, and A2 -> filter -> B2 -> join -> C2.
If C1 is lost to a node failure before the aggregation finishes, Spark reconstructs C1 from its lineage, eventually, on a different node, and the job completes.
Fault Recovery Results
Iteration times (s) over 10 iterations of a job: the first iteration takes 119 s, subsequent iterations ~56-59 s; the iteration in which the failure happens takes 81 s while partitions are recomputed from lineage, after which times return to ~57-59 s.
Overview 1. Introduction 2. RDDs 3. Generality of RDDs (e.g. streaming) 4. DataFrames 5. Project Tungsten
Spark Streaming: Motivation
Process large data streams at second-scale latencies: site statistics, intrusion detection, online ML. To build and scale these apps, users want:
Integration with the offline analytical stack
Fault tolerance, both for crashes and for stragglers
Traditional Streaming Systems
Event-driven, record-at-a-time processing: each node holds mutable state and, for each record, updates that state and sends new records downstream. State is lost if a node dies. Making stateful stream processing fault-tolerant is challenging.
Spark Streaming: How Does It Work?
Data streams are chopped into batches; a batch is an RDD holding a few hundred milliseconds' worth of data. Each batch is processed in Spark, and results are pushed out in batches.
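The micro-batch model can be sketched in plain Python, with no Spark needed: chop an incoming stream into fixed-size batches and apply the same batch function to each one. Here batches are sized by record count for simplicity; Spark Streaming sizes them by time interval.

```python
from itertools import islice

def batches(stream, batch_size):
    """Chop an iterator into fixed-size batches: the analogue of
    Spark Streaming turning a stream into a series of RDDs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def word_counts(batch):
    """Per-batch processing: the same code you would run on a static RDD."""
    counts = {}
    for line in batch:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

stream = ["a b a", "b c", "a a", "c c c"]
results = [word_counts(b) for b in batches(stream, 2)]
# each element of `results` is the output of one micro-batch
```

Because each micro-batch is just an ordinary (deterministic) batch computation, the same lineage-based fault recovery used for RDDs applies to streaming as well.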
Streaming Word Count

  val lines = ssc.socketTextStream("localhost", 9999)        // create DStream from data over a socket
  val words = lines.flatMap(_.split(" "))                    // split lines into words
  val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) // count the words
  wordCounts.print()                                         // print some counts on screen
  ssc.start()                                                // start processing the stream
Word Count

  object NetworkWordCount {
    def main(args: Array[String]) {
      val sparkConf = new SparkConf().setAppName("NetworkWordCount")
      val ssc = new StreamingContext(sparkConf, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)
      val words = lines.flatMap(_.split(" "))
      val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
      ssc.awaitTermination()
    }
  }
Word Count: Spark Streaming vs. Storm

Spark Streaming:

  object NetworkWordCount {
    def main(args: Array[String]) {
      val sparkConf = new SparkConf().setAppName("NetworkWordCount")
      val ssc = new StreamingContext(sparkConf, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)
      val words = lines.flatMap(_.split(" "))
      val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
      ssc.awaitTermination()
    }
  }

Storm:

  public class WordCountTopology {
    public static class SplitSentence extends ShellBolt implements IRichBolt {
      public SplitSentence() { super("python", "splitsentence.py"); }
      @Override public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
      }
      @Override public Map<String, Object> getComponentConfiguration() { return null; }
    }

    public static class WordCount extends BaseBasicBolt {
      Map<String, Integer> counts = new HashMap<String, Integer>();
      @Override public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
      }
      @Override public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
      }
    }

    public static void main(String[] args) throws Exception {
      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("spout", new RandomSentenceSpout(), 5);
      builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
      builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
      Config conf = new Config();
      conf.setDebug(true);
      if (args != null && args.length > 0) {
        conf.setNumWorkers(3);
        StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
      } else {
        conf.setMaxTaskParallelism(3);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", conf, builder.createTopology());
        Thread.sleep(10000);
        cluster.shutdown();
      }
    }
  }
Machine Learning Pipelines

  tokenizer = Tokenizer(inputCol="text", outputCol="words")
  hashingTF = HashingTF(inputCol="words", outputCol="features")
  lr = LogisticRegression(maxIter=10, regParam=0.01)
  pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

  df = sqlCtx.load("/path/to/data")
  model = pipeline.fit(df)

Pipeline model: df0 -> tokenizer -> df1 -> hashingTF -> df2 -> lr.model -> df3.
Powerful Stack, Agile Development
Non-test, non-example source lines: the specialized systems (Hadoop MapReduce; Storm for streaming; Impala for SQL; Giraph for graphs) are each large standalone codebases, running to well over 100,000 lines, while Spark's Streaming, SparkSQL, and GraphX libraries are each a small fraction of that, built on top of Spark Core. Your app could be next.
Benefits for Users
High-performance data sharing: data sharing is the bottleneck in many environments, and RDDs provide in-place sharing through memory.
Applications can compose models: run a SQL query and then PageRank the results; ETL your data and then run graph/ML on it.
Everyone benefits from investment in shared functionality, e.g. reusable components (the shell) and performance optimizations.
Overview 1. Introduction 2. RDDs 3. Generality of RDDs (e.g. streaming) 4. DataFrames 5. Project Tungsten
Beyond Hadoop Users
Spark's early adopters: data engineers who understand MapReduce and functional APIs. The next wave of users: data scientists, statisticians, R users, the PyData community.
DataFrames in Spark
Distributed collection of data grouped into named columns (i.e., an RDD with a schema). Domain-specific functions designed for common tasks: metadata, sampling, project/filter/aggregation/join, UDFs. Available in Python, Scala, Java, and R.
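A pure-Python sketch of what "an RDD with a schema" buys: because columns are named, operations like project, filter, and aggregate can be expressed declaratively rather than as opaque closures. The MiniDF class below is illustrative only, not Spark's DataFrame API.

```python
class MiniDF:
    """Toy DataFrame: rows are dicts sharing a schema of named columns."""
    def __init__(self, rows):
        self.rows = rows

    def select(self, *cols):                 # projection by column name
        return MiniDF([{c: r[c] for c in cols} for r in self.rows])

    def where(self, col, pred):              # filter on a named column
        return MiniDF([r for r in self.rows if pred(r[col])])

    def agg_sum(self, col):                  # simple aggregation over a column
        return sum(r[col] for r in self.rows)

df = MiniDF([{"name": "a", "age": 30}, {"name": "b", "age": 45}])
adults = df.where("age", lambda v: v > 40).select("name")
total = df.agg_sum("age")
```

Because each operation names the columns it touches, an engine can inspect and optimize the query, something it cannot do with an arbitrary lambda over raw records.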
Spark DataFrame
Similar APIs to single-node tools (Pandas, dplyr), i.e. easy to learn:

  > head(filter(df, df$waiting < 50))  # an example in R
  ##  eruptions waiting
  ##1     1.750      47
  ##2     1.750      47
  ##3     1.867      48
Spark RDD Execution
The Java/Scala frontend targets the JVM backend; the Python frontend targets a separate Python backend. In both cases user-defined functions are opaque closures that the engine cannot inspect.
Spark DataFrame Execution
Python, Java/Scala, and R DataFrames are simple wrappers that build a logical plan, an intermediate representation of the computation; the Catalyst optimizer then turns the logical plan into physical execution.
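A toy illustration of why a logical plan matters: the optimizer can rewrite the plan before anything executes, here by pushing a filter below a projection so fewer rows are materialized. This is a hand-rolled sketch of the idea, not Catalyst's actual rule set.

```python
# A logical plan as nested tuples:
#   ("scan", rows)
#   ("project", cols, child)
#   ("filter", col, pred, child)

def optimize(plan):
    """One rewrite rule: push a filter below a projection whenever the
    filter's column survives the projection, so filtering happens first."""
    if plan[0] == "filter":
        _, col, pred, child = plan
        child = optimize(child)
        if child[0] == "project" and col in child[1]:
            return ("project", child[1], ("filter", col, pred, child[2]))
        return ("filter", col, pred, child)
    if plan[0] == "project":
        return ("project", plan[1], optimize(plan[2]))
    return plan

def execute(plan):
    """Naive interpreter for the toy plans above."""
    if plan[0] == "scan":
        return plan[1]
    if plan[0] == "project":
        return [{c: r[c] for c in plan[1]} for r in execute(plan[2])]
    _, col, pred, child = plan
    return [r for r in execute(child) if pred(r[col])]

rows = [{"x": 1, "y": 2}, {"x": 5, "y": 6}]
plan = ("filter", "x", lambda v: v > 2, ("project", ["x"], ("scan", rows)))
opt = optimize(plan)
result = execute(opt)   # same answer as execute(plan), but filters earlier
```

Because every language frontend emits the same plan representation, this optimization benefits Python, R, and Scala users identically, which is exactly the performance-parity argument made below.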
Benefit of the Logical Plan: Simpler Frontends
Python: ~2000 lines of code (built over a weekend). R: ~1000 lines of code. It is much easier to add new language bindings (Julia, Clojure, ...).
Benefit of the Logical Plan: Performance Parity Across Languages
Runtime for an example aggregation workload (secs, scale 0-10): with the RDD API, Python is markedly slower than Java/Scala; with SQL and the DataFrame API, R, Python, and Java/Scala all achieve the same performance.
Overview 1. Introduction 2. RDDs 3. Generality of RDDs (e.g. streaming) 4. DataFrames 5. Project Tungsten
Hardware Trends

          2010            2015
Storage   50+ MB/s (HDD)  500+ MB/s (SSD)   (10x)
Network   1 Gbps          10 Gbps           (10x)
CPU       ~3 GHz          ~3 GHz            (no change)
Project Tungsten Substantially speed up execution by optimizing CPU efficiency, via: (1) Runtime code generation (2) Exploiting cache locality (3) Off-heap memory management
From DataFrame to Tungsten
Python, Java/Scala, and R DataFrames feed a common logical plan, which Tungsten executes. Initial phase shipped in Spark 1.5; more work coming in 2016.
Project Tungsten: Fully Managed Memory
Spark's core API uses raw Java objects for aggregations and joins: GC overhead, memory overhead (4-8x more memory than the serialized format), and computation overhead (little memory locality). DataFrames use a custom binary format and off-heap managed memory: GC-free, no memory overhead, cache locality.
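A small pure-Python illustration of the overhead argument: boxed objects cost several times more memory than the same values in a packed binary layout, which is the analogue of Tungsten's custom format (the exact ratio varies by runtime version, and the JVM's boxing overhead differs from CPython's).

```python
import sys
from array import array

n = 1000
boxed = [i for i in range(n)]    # list of boxed integer objects
packed = array("q", range(n))    # packed 8-byte ints in one contiguous buffer

# total footprint of the boxed representation: the list plus every int object
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(i) for i in boxed)
packed_bytes = sys.getsizeof(packed)
ratio = boxed_bytes / packed_bytes
# the boxed representation is several times larger than the packed one
```

The packed buffer is also contiguous in memory, which is what gives the cache locality that boxed objects scattered across the heap cannot provide.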
Example: Hash Table Data Structure
Keep data close to the CPU cache.
Example: Aggregation Operation
Unified API, One Engine, Automatically Optimized
Language frontends (SQL, Python, Java/Scala, R) -> DataFrame logical plan -> Tungsten backend (JVM, LLVM, SIMD, 3D XPoint).
Refactoring Spark Core
SQL, Python, SparkR, Streaming, and advanced analytics all build on the DataFrame (& Dataset) API, which runs on Tungsten execution.
Summary
General engine with libraries for many data analysis tasks: streaming, SQL, ML, graph. Access to diverse data sources. Simple, unified API. Major focus going forward: ease of use (DataFrames) and performance (Tungsten).