Spark: Making Big Data Interactive & Real-Time




Spark: Making Big Data Interactive & Real-Time Matei Zaharia UC Berkeley / MIT www.spark-project.org

What is Spark? Fast and expressive cluster computing system compatible with Apache Hadoop Improves efficiency through:» General execution graphs» In-memory storage Improves usability through:» Rich APIs in Scala, Java, Python» Interactive shell Up to 10× faster on disk, 100× in memory; 2-5× less code

Project History Spark started in 2009, open sourced 2010 In use at Yahoo!, Intel, Adobe, Quantifind, Conviva, Ooyala, Bizo and others 24 companies now contributing code spark.incubator.apache.org

A Growing Stack Shark (SQL), Spark Streaming (real-time), GraphX (graph), and MLbase (machine learning), all built on Spark Part of the Berkeley Data Analytics Stack (BDAS) project to build an open source next-gen analytics system

This Talk Spark introduction & use cases Shark: SQL on Spark Spark Streaming The power of unification

Why a New Programming Model? MapReduce greatly simplified big data analysis But as soon as it got popular, users wanted more:» More complex, multi-pass analytics (e.g. ML, graph)» More interactive ad-hoc queries» More real-time stream processing All 3 need faster data sharing across parallel jobs

Data Sharing in MapReduce Each iteration (iter. 1, iter. 2, ...) and each ad-hoc query (query 1, 2, 3, ...) must read its input from HDFS and write its result back to HDFS Slow due to replication, serialization, and disk IO

Data Sharing in Spark The input is processed once, then iterations (iter. 1, iter. 2, ...) and queries (query 1, 2, 3, ...) share data through distributed memory 10-100× faster than network and disk

Spark Programming Model Key idea: resilient distributed datasets (RDDs)» Distributed collections of objects that can be cached in memory across cluster» Manipulated through parallel operators» Automatically recomputed on failure Programming interface» Functional APIs in Scala, Java, Python» Interactive use from Scala & Python shells

Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")          // Base RDD
errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
messages = errors.map(_.split('\t')(2))
messages.cache()
messages.filter(_.contains("foo")).count      // Action: driver ships tasks to workers, which cache their blocks
messages.filter(_.contains("bar")).count
...

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Fault Tolerance RDDs track lineage information to rebuild lost data on failure:

file.map(rec => (rec.type, 1))
    .reduceByKey(_ + _)
    .filter { case (type, count) => count > 10 }

(lineage: input file → map → reduce → filter)

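The lineage idea can be sketched in plain Python. This is a toy model, not Spark's API: each derived dataset records only the operation and the parent that produced it, so a lost result is recovered by recomputation rather than by replicating the data.

```python
# Toy sketch of lineage-based recovery (hypothetical, not Spark's API):
# each dataset remembers how it was derived from its parent, so any
# lost partition can simply be recomputed from the base data.

class ToyRDD:
    def __init__(self, compute, parent=None):
        self.compute = compute   # function: () -> list of records
        self.parent = parent     # lineage pointer (for illustration)

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self.compute()], parent=self)

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.compute() if pred(x)], parent=self)

    def collect(self):
        # "Recovery" is just re-running the chain of compute functions.
        return self.compute()

base = ToyRDD(lambda: [1, 2, 3, 4, 5])
result = base.map(lambda x: x * 10).filter(lambda x: x > 20)
print(result.collect())  # [30, 40, 50]
```

Because every step is deterministic, recomputing a lost dataset from its lineage yields exactly the same result, which is what makes this cheaper than replication.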
Example: Logistic Regression Goal: find the best line separating two sets of points, starting from a random initial line and converging to the target

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
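The same update rule can be run in plain Python without Spark. The small `data` list, the dimension `D`, and the iteration count below are made up for illustration; the gradient expression mirrors the slide's.

```python
# Plain-Python sketch of the slide's logistic-regression update
# (no Spark; `data` stands in for the cached RDD of points).
import math
import random

D = 2            # number of features
ITERATIONS = 50
random.seed(0)

# (features, label) pairs; labels are +1 / -1, as p.y is in the slide
data = [((1.0, 2.0), 1.0), ((2.0, 1.5), 1.0),
        ((-1.0, -2.0), -1.0), ((-2.0, -1.0), -1.0)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

w = [random.random() for _ in range(D)]  # random initial line

for _ in range(ITERATIONS):
    # gradient = sum over points of (1 / (1 + exp(-y * (w . x))) - 1) * y * x
    gradient = [0.0] * D
    for x, y in data:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j in range(D):
            gradient[j] += scale * x[j]
    w = [w[j] - gradient[j] for j in range(D)]

print("final w:", w)
```

The point of the Spark version is that `data` is read and parsed once, cached in memory, and then scanned on every iteration, which is why later iterations run so much faster than the first.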

Logistic Regression Performance [Chart: running time (s) vs number of iterations (1-30) for Hadoop and Spark] Hadoop: 110 s / iteration Spark: first iteration 80 s, further iterations 1 s

Supported Operators map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...

Spark in Python and Java

# Python:
lines = sc.textFile(...)
lines.filter(lambda x: "ERROR" in x).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

User Community 1000+ meetup members 80+ contributors 24 companies contributing

Generality of RDDs RDD model was general enough to let us implement several previous models:» MapReduce, Dryad» Pregel [200 LOC]» Iterative MapReduce [200 LOC]» GraphLab [600 LOC]» SQL (Shark) [6000 LOC] Allows apps to intermix these models

This Talk Spark introduction & use cases Shark: SQL over Spark Spark Streaming The power of unification

Conventional Wisdom MPP databases 10-100x faster than MapReduce

Why Databases Were Faster Data representation» Schema-aware, column-oriented, etc.» Compare with MapReduce's record model Indexing Lack of mid-query fault tolerance» MR's pull model costly compared to push

Shark Provides fast SQL queries on a MapReduce-like engine (Spark), and retains full fault tolerance 10-100× faster than Hive; within 2-3× of Amazon Redshift (1.7 TB data, 100 nodes)

But What About Data representation?» Can do it inside MapReduce records» In Shark, a record is a set of rows» Implemented columnar storage within these Indexing?» Can still read indices in MR Mid-query fault tolerance?» We have it, but we still perform OK» Also enables elasticity, straggler mitigation
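The "columnar storage within a record" idea can be illustrated in a few lines of Python. This is a hypothetical sketch, not Shark's implementation: a record holds a batch of rows, transposed into per-column arrays so each column can be stored compactly.

```python
# Hypothetical sketch of columnar storage inside a record:
# one "record" holds a batch of rows, stored column-by-column.
rows = [("a", 1), ("b", 2), ("c", 3)]

# Row-oriented batch -> column-oriented record
columns = [list(col) for col in zip(*rows)]
print(columns)  # [['a', 'b', 'c'], [1, 2, 3]]

# Reading row i back is just indexing into each column
def row(i):
    return tuple(col[i] for col in columns)

print(row(1))  # ('b', 2)
```

Storing each column contiguously is what enables type-specific compression and avoids deserializing fields a query never touches, even though the engine underneath still just moves records around.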

Implementation Port of Apache Hive» Supports unmodified Hive data and warehouses Efficient mixing with Spark code (e.g. machine learning functions) in the same system

This Talk Spark introduction & use cases Shark: SQL over Spark Spark Streaming The power of unification

Motivation Many important apps must process large data streams at second-scale latencies» Site statistics, intrusion detection, online ML To scale these apps to 100s of nodes, require:» Fault-tolerance: both for crashes and stragglers» Efficiency: low cost beyond base processing Current streaming systems don't meet both goals

Traditional Streaming Systems Continuous operator model» Each node has mutable state» For each record, update state & send new records (diagram: input records pushed through node 1 and node 2, each holding mutable state, into node 3)

Traditional Streaming Systems Fault tolerance via replication or upstream backup:» Replication (every node mirrored, with synchronization): fast recovery, but 2× hardware cost» Upstream backup (buffer and replay into a standby): only needs one standby, but slow to recover Neither approach can handle stragglers

Observation Batch processing models, like MapReduce, provide fault tolerance efficiently» Divide job into deterministic tasks» Rerun failed/slow tasks in parallel on other nodes Idea: run streaming computations as a series of small, deterministic batch jobs» Same recovery schemes at much smaller timescale» To make latency low, store state in RDDs

Discretized Stream Processing Each interval's input (t = 1, t = 2, ...) is pulled in as an immutable dataset (stored reliably); each batch operation then produces an immutable output or state dataset, stored in memory as an RDD, and downstream streams (stream 1, stream 2) are built the same way
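The discretization step can be sketched in plain Python. This is a toy model, not Spark Streaming's API: the stream is chopped into fixed-interval batches, and each batch is processed as a small deterministic job whose result updates running state.

```python
# Toy sketch of discretized streams: treat the input as a series of
# small deterministic batches and fold each one into running state.
from collections import Counter

def micro_batches(stream, interval):
    """Group (timestamp, record) pairs into fixed-interval batches."""
    batches = {}
    for t, rec in stream:
        batches.setdefault(t // interval, []).append(rec)
    return [batches[k] for k in sorted(batches)]

# Made-up event stream: (timestamp, word) pairs
stream = [(0, "spark"), (0, "hadoop"), (1, "spark"),
          (2, "spark"), (2, "flink")]

state = Counter()
for batch in micro_batches(stream, interval=1):
    state.update(batch)  # each batch is a small deterministic job

print(state["spark"])  # 3
```

Because every batch job is deterministic, a lost interval's state can be rebuilt by re-running that batch, which is exactly the recovery scheme batch systems already use, just at a much smaller timescale.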

Parallel Fault Recovery Checkpoint state RDDs periodically If a node fails or straggles, rebuild lost dataset partitions in parallel on other nodes (re-running the map from the input dataset to the output dataset) Faster recovery than upstream backup, without the cost of replication

How Fast Can It Go? Can process over 60M records/s (6 GB/s) on 100 nodes at sub-second latency [Charts: maximum throughput (records/s, millions) vs nodes in cluster for Grep and Top K Words, with latency under 1 sec]

Comparison with Storm [Charts: MB/s per node vs record size (100-1000 bytes) for Grep and Top K Words, Spark vs Storm] Storm limited to 100K records/s/node Commercial systems report O(500K) total

How Fast Can It Recover? Recovers from faults/stragglers within 1 second [Chart: interval processing time (s) over time; processing time spikes to ~1.5 s when the fault happens, then returns to ~0.5 s] Sliding WordCount on 20 nodes with 10s checkpoint interval

Programming Interface Simple functional API:

views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

Interoperates with RDDs:

// Join stream with static RDD
counts.join(historicCounts).map(...)

// Ad-hoc queries on stream state
counts.slice("21:00", "21:05").topK(10)
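The behavior of a running reduce can be shown with ordinary dictionaries. This is a toy version under made-up data, not the real API: each batch's counts are folded into the totals carried over from earlier batches.

```python
# Toy version of runningReduce: fold each batch's (key, value) pairs
# into the totals accumulated from all earlier batches.
batches = [
    [("a.com", 1), ("a.com", 1), ("b.com", 1)],  # t = 1
    [("a.com", 1), ("b.com", 1)],                # t = 2
]

counts = {}
for batch in batches:
    for url, n in batch:
        counts[url] = counts.get(url, 0) + n     # running reduce with +

print(counts)  # {'a.com': 3, 'b.com': 2}
```

In the real system the per-interval state lives in RDDs, which is what lets the same `counts` object be joined with static datasets or queried ad hoc from the shell.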

This Talk Spark introduction & use cases Shark: SQL over Spark Spark Streaming The power of unification

The Power of Unification One of the things that makes Spark unique is unification of multiple programming models» SQL, graphs, streaming, etc. on the same engine This had powerful benefits:» For the engine» For users (stack: Shark, Streaming, GraphX, MLbase on top of Spark)

Engine Perspective Let s compare Spark with leading frameworks in» Code size» Performance

Code Size [Chart: non-test, non-example source lines for Hadoop MapReduce, Impala (SQL), Storm (Streaming), and Giraph (Graph), each compared with the much smaller Spark core plus its Shark, Streaming, and GraphX components]

Performance [Charts: SQL response time (s) for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem); streaming throughput (MB/s/node) for Storm vs Spark; graph response time (min) for Hadoop, Giraph, GraphX]

User Perspective Unified engine means that: 1. Apps can easily compose models (e.g. run a SQL query, then PageRank on the result) 2. Composition is fast (no writing to disk) 3. All models get the Spark shell for free (e.g. use it to inspect your stream's state, or your graph)

Long-Term Direction As big data apps grow in complexity, combining processing models is essential» E.g. run SQL query then machine learning on result Data sharing between models is often slower than the computations themselves» E.g. HDFS write versus in-memory iteration Thus, unified engines increasingly important

Getting Started Visit spark-project.org for videos, tutorials, and hands-on exercises Easy to run in local mode, private clusters, EC2 Compatible with any Hadoop storage system Online training camp: ampcamp.berkeley.edu

Conclusion Big data analytics is evolving to include:» More complex analytics (e.g. machine learning)» More interactive ad-hoc queries» More real-time stream processing Spark is a platform that unifies these models, enabling sophisticated apps More info: spark-project.org

Backup Slides

Behavior with Not Enough RAM [Chart: iteration time (s) vs % of working set in memory] Cache disabled: 68.8 s; 25%: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; fully cached: 11.5 s