Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data


Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin
UC Berkeley
spark-project.org

What is Spark?
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

What is Shark?
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x

Project History
Spark project started in 2009, open sourced 2010
Shark started summer 2011, alpha April 2012
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research & others
200+ member meetup, 500+ watchers on GitHub

This Talk
Spark programming model
User applications
Shark overview
Demo
Next major addition: Streaming Spark

Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
Both multi-stage and interactive apps require faster data sharing across parallel jobs

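Data Sharing in MapReduce
[Figure: an iterative job reads its input from HDFS and writes back to HDFS around every iteration (iter. 1, iter. 2, ...); interactive queries (query 1, query 2, query 3, ...) each read the input from HDFS again to produce result 1, result 2, result 3, ...]
Slow due to replication, serialization, and disk I/O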

Data Sharing in Spark
[Figure: the input goes through one-time processing into distributed memory; iterations (iter. 1, iter. 2, ...) and queries (query 1, query 2, query 3, ...) then share data through memory.]
10-100x faster than network and disk

Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure
Interface
» Clean language-integrated API in Scala
» Can be used interactively from Scala console

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")           // base RDD
    errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count     // action
    cachedMsgs.filter(_.contains("bar")).count
    ...

[Figure: the driver sends tasks to three workers and collects their results; each worker reads one block of the file (Block 1 to Block 3) and keeps its part of the messages in memory (Cache 1 to Cache 3).]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
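The pattern on this slide (build a filtered and mapped dataset lazily, cache it, then run several searches against the cached copy) can be imitated without a cluster. What follows is a minimal plain-Python sketch, not Spark's API: `MiniRDD`, its methods, and the three sample log lines are hypothetical and exist only for illustration.

```python
class MiniRDD:
    """Toy stand-in for an RDD: lazily evaluated, optionally cached. Hypothetical."""

    def __init__(self, compute):
        self._compute = compute       # zero-argument function returning an iterator
        self._cache_enabled = False
        self._cached = None
        self.computations = 0         # how often this dataset was actually evaluated

    def _rows(self):
        if self._cached is not None:  # serve from memory when a cached copy exists
            return iter(self._cached)
        self.computations += 1
        rows = list(self._compute())
        if self._cache_enabled:
            self._cached = rows
        return iter(rows)

    def filter(self, pred):
        return MiniRDD(lambda: (r for r in self._rows() if pred(r)))

    def map(self, fn):
        return MiniRDD(lambda: (fn(r) for r in self._rows()))

    def cache(self):
        self._cache_enabled = True    # nothing is computed until an action runs
        return self

    def count(self):                  # an "action": forces evaluation
        return sum(1 for _ in self._rows())


# Invented sample data: tab-separated level, time, message.
log = [
    "ERROR\t12:00\tfoo failed",
    "INFO\t12:01\tall good",
    "ERROR\t12:02\tbar timed out",
]
lines = MiniRDD(lambda: iter(log))
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
cached_msgs = messages.cache()

foo_count = cached_msgs.filter(lambda m: "foo" in m).count()
bar_count = cached_msgs.filter(lambda m: "bar" in m).count()
```

The second search never touches the log again: `lines.computations` stays at 1 because the messages were kept in memory after the first action. Without the `cache()` call, every search would re-read and re-filter the input.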

Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.:

    messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...))
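Lineage-based recovery can be sketched in a few lines. The plain-Python example below is an illustration under stated assumptions, not Spark's implementation: `LineageRDD` and its methods are hypothetical names, the sample records are invented, and a "lost" partition is simulated by deleting it from a dictionary.

```python
class LineageRDD:
    """Toy dataset that records how it was derived. Hypothetical, for illustration."""

    def __init__(self, name, parent=None, transform=None, source=None):
        self.name = name
        self.parent = parent          # upstream dataset, or None for the base
        self.transform = transform    # function applied to one parent partition
        self.source = source          # base data only: a list of partitions
        self.partitions = {}          # materialized partitions, keyed by index

    def filter(self, pred, name="FilteredRDD"):
        return LineageRDD(name, self, lambda part: [r for r in part if pred(r)])

    def map(self, fn, name="MappedRDD"):
        return LineageRDD(name, self, lambda part: [fn(r) for r in part])

    def compute(self, index):
        """Return partition `index`, rebuilding it from lineage if it is missing."""
        if index not in self.partitions:
            if self.parent is None:
                self.partitions[index] = list(self.source[index])
            else:
                self.partitions[index] = self.transform(self.parent.compute(index))
        return self.partitions[index]

    def lineage(self):
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))


# Invented sample input split into two partitions.
base = LineageRDD("HadoopRDD", source=[
    ["error\tdisk\tsda1 full", "info\tnet\tlink up"],
    ["error\tnet\ttimeout"],
])
messages = (base.filter(lambda s: "error" in s)
                .map(lambda s: s.split("\t")[2]))

before = messages.compute(1)
del messages.partitions[1]        # simulate losing one in-memory partition
after = messages.compute(1)       # rebuilt by replaying the recorded transformations
```

Only the missing partition is recomputed, and only from the partitions it depends on. This is why the approach needs no replica of the derived data: the recipe is stored instead of a second copy.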

Example: Logistic Regression

    val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once
    var w = Vector.random(D)                                 // initial parameter vector

    for (i <- 1 to ITERATIONS) {
      // repeated MapReduce steps to do gradient descent
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }

    println("Final w: " + w)
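The same update rule can be run locally to see what each iteration computes. This is a hedged plain-Python sketch of the slide's formula with labels in {-1, +1}; the four sample points are invented, the helper names are hypothetical, and the weights start at zero rather than at a random vector so that the run is reproducible.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, x, y):
    # Same expression as on the slide: (1 / (1 + exp(-y * (w . x))) - 1) * y * x
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, iterations, dims):
    w = [0.0] * dims                     # assumption: zero start instead of random
    for _ in range(iterations):
        gradient = [0.0] * dims
        for x, y in points:              # "map": one gradient term per point
            g = point_gradient(w, x, y)
            gradient = [a + b for a, b in zip(gradient, g)]   # "reduce": sum
        w = [wi - gi for wi, gi in zip(w, gradient)]
    return w


# Invented, linearly separable sample data: (features, label).
points = [
    ([1.0, 2.0], 1),
    ([2.0, 1.5], 1),
    ([-1.0, -1.5], -1),
    ([-2.0, -1.0], -1),
]
w_first = train(points, 1, 2)
w_final = train(points, 10, 2)
```

Every iteration scans the full dataset, which is why keeping `data` in memory pays off: only the small vector `w` changes between passes.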

Logistic Regression Performance
[Figure: running time (s) versus number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark.]
Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s
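The per-iteration figures can be turned into totals with simple arithmetic. The sketch below assumes a purely linear cost model built from the three numbers on the slide (an assumption; the measured curve may deviate), and the function names are hypothetical.

```python
def hadoop_total(iterations, per_iteration=127):
    """Total seconds if every iteration costs the same."""
    return per_iteration * iterations


def spark_total(iterations, first=174, later=6):
    """Total seconds with a slow first iteration (loading) and fast later ones."""
    if iterations <= 0:
        return 0
    return first + later * (iterations - 1)


totals = {n: (hadoop_total(n), spark_total(n)) for n in (1, 2, 30)}
speedup_30 = hadoop_total(30) / spark_total(30)
```

Under this model Spark is slower for a single iteration (174 s against 127 s), ahead from the second iteration on (180 s against 254 s), and about 11 times faster over 30 iterations (348 s against 3810 s).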

Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...

Other Engine Features
General graphs of operators (e.g. map-reduce-reduce)
Hash-based reduces (faster than Hadoop's sort)
Controlled data partitioning to lower communication
PageRank Performance, iteration time (s):
    Hadoop                                171
    Basic Spark                            72
    Spark + Controlled Partitioning        23
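Controlled partitioning helps because two datasets that are split by the same function on the same key can be joined partition by partition, with no records moving between partitions. The following plain-Python sketch illustrates that idea only; it is not Spark's partitioner, the function names are hypothetical, the tiny link graph is invented, and `zlib.crc32` is used simply as a deterministic hash.

```python
import zlib


def partition_of(key, num_partitions):
    # Deterministic hash so that both datasets agree on where a key lives.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_by(pairs, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def local_join(left_parts, right_parts):
    """Join partition i with partition i only. Valid only when both sides
    were partitioned with the same function and partition count."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        index = {}
        for key, value in right:
            index.setdefault(key, []).append(value)
        for key, value in left:
            for other in index.get(key, []):
                joined.append((key, (value, other)))
    return joined


# Invented PageRank-style inputs: outgoing links and current ranks per page.
links = [("a", ["b", "c"]), ("b", ["a"]), ("c", ["a"])]
ranks = [("a", 1.0), ("b", 1.0), ("c", 1.0)]

link_parts = partition_by(links, 4)     # partitioned once, reused every iteration
rank_parts = partition_by(ranks, 4)
joined = dict(local_join(link_parts, rank_parts))
```

In an iterative job such as PageRank the large `links` dataset would be partitioned once and kept in place, so each iteration only has to route the much smaller rank updates.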

Spark Users

User Applications
In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
...

Conviva GeoReport
[Figure: bar chart of running time in hours. Hive: 20; Spark: 0.5.]
Group aggregations on many keys w/ same filter
40x gain over Hive from avoiding repeated reading, deserialization and filtering

Mobile Millennium Project
Estimate city traffic from crowdsourced GPS data
Iterative EM algorithm scaling to 160 nodes
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu

Shark: Hive on Spark

Motivation
Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes
Scala is good for programmers, but many data users only know SQL
Can we extend Hive to run on Spark?

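Hive Architecture
[Figure: a client (CLI, JDBC) talks to the driver, which contains the SQL parser, query optimizer, physical plan and execution stages and consults the meta store; execution runs on MapReduce over HDFS.]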

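Shark Architecture
[Figure: a client (CLI, JDBC) talks to the driver, which contains the SQL parser, query optimizer, physical plan and execution stages plus a cache manager and consults the meta store; execution runs on Spark over HDFS.]
[Engle et al, SIGMOD 2012]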

Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Shark employs column-oriented storage using arrays of primitive types

Row Storage:
    1  john   4.1
    2  mike   3.5
    3  sally  6.4

Column Storage:
    1     2     3
    john  mike  sally
    4.1   3.5   6.4

Benefit: similarly compact size to serialized data, but >5x faster to access
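The row-versus-column layout can be tried on the three records from the slide. This is a plain-Python analogy, not Shark's storage code: Python's `array` module stands in for Java arrays of primitive types, and `to_columns` is a hypothetical helper.

```python
from array import array

# Row storage: one object (here a tuple) per record.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]


def to_columns(records):
    """Turn a list of (id, name, score) records into one sequence per column."""
    ids = array("q", (r[0] for r in records))      # packed 64-bit integers
    names = [r[1] for r in records]                # strings stay ordinary objects here
    scores = array("d", (r[2] for r in records))   # packed 64-bit floats
    return ids, names, scores


ids, names, scores = to_columns(rows)

# A scan over one column touches only that column's contiguous values.
mean_score = sum(scores) / len(scores)
bytes_per_score = scores.itemsize
```

Each numeric column is a single contiguous buffer with a fixed 8 bytes per value, instead of one boxed object per field per row. That is the source of both the compact size and the faster scans the slide reports.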

Using Shark

    CREATE TABLE mydata_cached AS SELECT ...

Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported
Can also call from Scala to mix with Spark
Early alpha release at shark.cs.berkeley.edu

Benchmark Query 1

    SELECT * FROM grep WHERE field LIKE '%XYZ%';

Execution time (secs):
    Shark (cached)   12
    Shark           182
    Hive            207

Benchmark Query 2

    SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
    FROM rankings AS R, uservisits AS V ON R.pageURL = V.destURL
    WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
    GROUP BY V.sourceIP
    ORDER BY earnings DESC
    LIMIT 1;

Execution time (secs):
    Shark (cached)  126
    Shark           270
    Hive            447

Demo

What's Next?
Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)
Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.

Streaming Spark
Extends Spark to perform streaming computations
Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
Intermix seamlessly with batch and ad-hoc queries

    tweetStream.flatMap(_.toLower.split).map(word => (word, 1)).reduceByWindow(5, _ + _)

[Figure: at each time step (T=1, T=2, ...) a batch of input passes through map and then reduceByWindow.]
Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency
Alpha coming this summer
[Zaharia et al, HotCloud 2012]
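The "series of small batch jobs" idea can be sketched locally: count the words of each batch on its own, then combine the counts of the most recent batches into a sliding window. This plain-Python example is an illustration only, not the Streaming Spark API; the function names are hypothetical and the tweets are invented.

```python
from collections import Counter, deque


def batch_counts(tweets):
    """Word counts for one small batch (the flatMap + map + reduce of one step)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts


def windowed_counts(batches, window):
    """Yield, after every batch, the word counts over the last `window` batches."""
    recent = deque(maxlen=window)        # per-batch state kept in memory
    for tweets in batches:
        recent.append(batch_counts(tweets))
        total = Counter()
        for counts in recent:
            total.update(counts)         # Counter.update adds counts together
        yield total


# Invented input: one list of tweets per time step T=1, T=2, T=3.
batches = [
    ["Spark is fast"],
    ["spark streaming"],
    ["Shark on Spark"],
]
results = list(windowed_counts(batches, window=2))
```

With a window of two batches, "spark" is counted twice at both T=2 and T=3, while "fast" drops out of the window at T=3. Each batch is processed independently and deterministically, so a lost batch result could be recomputed from its input in the same way as a lost RDD partition.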

Conclusion
Spark and Shark speed up your interactive and complex analytics on Hadoop data
Download and docs: www.spark-project.org
» Easy to run locally, on EC2, or on Mesos and soon YARN
User meetup: meetup.com/spark-users
Training camp at Berkeley in August!
matei@berkeley.edu / @matei_zaharia

Behavior with Not Enough RAM
Iteration time (s) by % of working set in memory:
    Cache disabled   68.8
    25%              58.1
    50%              40.7
    75%              29.7
    Fully cached     11.5

Software Stack
[Figure: layered stack. Top layer: Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark. Middle layer: Spark. Bottom layer: Local mode, EC2, Apache Mesos, YARN.]