Spark: Fast, Interactive, Language-Integrated Cluster Computing

Spark: Fast, Interactive, Language-Integrated Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica. UC Berkeley

Background. MapReduce and its variants greatly simplified big data analytics by hiding the details of scaling and fault recovery. However, these systems provide a restricted programming model. Can we design similarly powerful abstractions for a broader class of applications?

Motivation. Most current cluster programming models are based on acyclic data flow from stable storage to stable storage. [Diagram: input in stable storage flows through Map and Reduce stages to output in stable storage.]

Motivation. Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

Motivation. Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query.

Spark Goal. Efficiently support apps with working sets:
» Let them keep data in memory
Retain the attractive properties of MapReduce:
» Fault tolerance (for crashes & stragglers)
» Data locality
» Scalability
Solution: extend the data flow model with resilient distributed datasets (RDDs).

Outline: Spark programming model, Applications, Implementation, Demo.

Programming Model. Resilient distributed datasets (RDDs):
» Immutable, partitioned collections of objects
» Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
» Can be cached for efficient reuse
Actions on RDDs:
» count, reduce, collect, save, ...
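The slides illustrate this model with Scala examples. As a complement, here is a minimal self-contained sketch (not from the talk), assuming the current Apache Spark API (the org.apache.spark package, which postdates these slides) and a local master:

import org.apache.spark.{SparkConf, SparkContext}

object RDDBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // Transformations are lazy: they only describe how to derive a new RDD.
    val nums    = sc.parallelize(1 to 1000)           // partitioned collection of objects
    val squares = nums.map(n => n * n)                // map transformation
    val evens   = squares.filter(_ % 2 == 0).cache()  // cache for efficient reuse

    // Actions trigger computation and return a result to the driver.
    println(evens.count())        // 500
    println(evens.reduce(_ + _))  // sum of the even squares

    sc.stop()
  }
}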

Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count    // action
cachedMsgs.filter(_.contains("bar")).count
...

[Diagram: the driver sends tasks to three workers, each holding a block of the file (Block 1-3); each worker keeps its partition of the cached messages in memory (Cache 1-3) and returns results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

RDD Fault Tolerance. RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex: cachedMsgs = textFile(...).filter(_.contains("error")).map(_.split('\t')(2)).cache()

[Lineage graph: HdfsRDD (path: hdfs://...) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(...)) -> CachedRDD]
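In current Spark (an API detail not covered in the talk), the lineage chain can be inspected directly; a brief sketch, assuming a SparkContext named sc and a placeholder HDFS path:

// Build the same chain as above; nothing is read until an action runs.
val cachedMsgs = sc.textFile("hdfs://...")
  .filter(_.contains("error"))
  .map(_.split('\t')(2))
  .cache()

// Prints the chain of parent RDDs (text file -> filtered -> mapped -> cached),
// which is exactly the information used to recompute a lost partition.
println(cachedMsgs.toDebugString)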

Example: Logistic Regression. Goal: find the best line separating two sets of points. [Figure: two classes of points in the plane, with a random initial line converging toward the target separating line.]

Example: Logistic Regression.

val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
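The slide code relies on helpers defined elsewhere (readPoint, Vector, D, ITERATIONS). A self-contained sketch of the same computation, using a hypothetical Point case class, plain arrays, a tiny in-memory dataset, and an assumed SparkContext named sc (e.g. in the Spark shell):

import scala.math.exp

// Hypothetical point type: feature vector x and label y in {-1, +1}.
case class Point(x: Array[Double], y: Double)

val D = 2            // number of features (assumed)
val ITERATIONS = 10

val data = sc.parallelize(Seq(
  Point(Array( 1.0,  2.0),  1.0),
  Point(Array( 2.0,  1.5),  1.0),
  Point(Array(-1.0, -2.0), -1.0),
  Point(Array(-2.0, -1.0), -1.0)
)).cache()           // working set stays in memory across iterations

var w = Array.fill(D)(scala.util.Random.nextDouble())

for (_ <- 1 to ITERATIONS) {
  // Same gradient expression as on the slide, written with plain arrays.
  val gradient = data.map { p =>
    val margin = p.y * p.x.zip(w).map { case (xi, wi) => xi * wi }.sum
    val scale  = (1.0 / (1.0 + exp(-margin)) - 1.0) * p.y
    p.x.map(_ * scale)
  }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}

println("Final w: " + w.mkString(", "))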

Logistic Regression Performance. [Chart: running time (s) vs. number of iterations (1-30) for Hadoop and Spark.] Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for further iterations.

Spark Applications:
» Twitter spam classification (Monarch)
» In-memory analytics on Hive data (Conviva)
» Traffic prediction using EM (Mobile Millennium)
» K-means clustering
» Alternating Least Squares matrix factorization
» Network simulation

Conviva GeoReport. [Chart: time (hours): Hive 20, Spark 0.5.] Aggregations on many group keys with the same WHERE clause. Gains come from:
» Not re-reading unused columns
» Not re-reading filtered records
» No repeated decompression
» In-memory storage of deserialized objects
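A hypothetical illustration of this pattern (not Conviva's actual query): apply the shared WHERE clause and column projection once, cache the result, then run several aggregations over the in-memory working set. Field names and paths are made up; sc is an assumed SparkContext:

// Hypothetical tab-separated log: country \t city \t sessionTimeSec
val sessions = sc.textFile("hdfs://.../sessions.tsv")
  .map(_.split('\t'))
  .filter(f => f(0) == "US")          // the shared WHERE clause, applied once
  .map(f => (f(1), f(2).toDouble))    // keep only the columns that are needed
  .cache()                            // later aggregations read from memory

// Several aggregations over the same cached data.
val sessionsPerCity = sessions.mapValues(_ => 1L).reduceByKey(_ + _)
val timePerCity     = sessions.reduceByKey(_ + _)
println(sessionsPerCity.take(5).mkString(", "))
println(timePerCity.take(5).mkString(", "))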

Generality of RDDs. RDDs can efficiently express many proposed cluster programming models:
» MapReduce => map and reduceByKey operations
» Dryad => Spark runs general DAGs of tasks
» Pregel iterative graph processing => Bagel
» SQL => Hive on Spark (Shark)
Can also express apps that none of these can.
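For example, the classic MapReduce word count translates directly into these operations; a minimal sketch with placeholder paths, assuming a SparkContext named sc:

// flatMap/map play the role of the map phase;
// reduceByKey plays the role of the shuffle + reduce phase.
val counts = sc.textFile("hdfs://.../input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://.../wordcounts")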

Implementation. Spark runs on the Mesos cluster manager, letting it share resources with Hadoop and other apps. [Diagram: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes.] Can read from any Hadoop input source (e.g. HDFS). ~7,000 lines of code; no changes to the Scala compiler.

Language Integration. Scala closures are Serializable Java objects:
» Serialize on master, load & run on workers
Not quite enough:
» Nested closures may reference the entire outer scope
» May pull in non-Serializable variables not used inside
» Solution: bytecode analysis + reflection
Other tricks use custom serialized forms (e.g. accumulators as syntactic sugar for counters).
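A small illustration (not from the talk) of what gets captured: the closure below references a variable defined on the driver, and Spark serializes the function together with that captured value before shipping it to the workers. Assumes a SparkContext named sc:

val threshold = 4                      // driver-side variable captured by the closure
val nums = sc.parallelize(1 to 10)
val big  = nums.filter(_ > threshold)  // closure + captured threshold are serialized to workers
println(big.collect().mkString(", "))  // 5, 6, 7, 8, 9, 10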

Interactive Spark. Modified the Scala interpreter to allow Spark to be used interactively from the command line. Required two changes:
» Modified wrapper code generation so that each line typed has references to objects for its dependencies
» Distribute generated classes over the network
Enables in-memory exploration of big data.

Demo

Conclusion. Spark's resilient distributed datasets are a simple programming model for a wide range of apps. Download our open source release at www.spark-project.org. matei@berkeley.edu

Related Work.
DryadLINQ:
» Builds queries through language-integrated SQL operations on lazy datasets
» Cannot have a dataset persist across queries
Relational databases:
» Lineage/provenance, logical logging, materialized views
Piccolo:
» Parallel programs with shared distributed hash tables; similar to distributed shared memory
Iterative MapReduce (Twister and HaLoop):
» Cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively

Related Work.
Distributed shared memory (DSM):
» Very general model allowing random reads/writes, but hard to implement efficiently (needs logging or checkpointing)
RAMCloud:
» In-memory storage system for web applications
» Allows random reads/writes and uses logging like DSM
Nectar:
» Caching system for DryadLINQ programs that can reuse intermediate results across jobs
» Does not provide in-memory caching, explicit control over which data is cached, or control over partitioning
SMR (functional Scala API for Hadoop)