Processing Big Data with Small Programs Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica spark-project.org UC BERKELEY
Outline The big data problem MapReduce Spark How it works Users' experience
The Big Data Problem Data is growing faster than computation speeds Growing data sources» Web, mobile, scientific, ... Cheap storage» Doubling every 18 months Stalling CPU speeds» Even multicores not enough
Examples Facebook's daily logs: 60 TB 1000 Genomes Project: 200 TB Google web index: 10+ PB Cost of 1 TB of disk: $50 Time to read 1 TB from disk: 6 hours (50 MB/s)
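A quick back-of-the-envelope check on that last figure: 1 TB / (50 MB/s) = 1,000,000 MB / (50 MB/s) = 20,000 s ≈ 5.6 hours, roughly the 6 hours quoted. A single disk cannot even scan data at this scale quickly, let alone process it.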
The Big Data Problem Single machine can no longer process or even store all the data! Only solution is to distribute over large clusters
Google Datacenter How do we program this thing?
Traditional Network Programming Message-passing between nodes Really hard to do at scale:» How to split problem across nodes? Important to consider network and data locality» How to deal with failures? If a typical server fails every 3 years, a 10,000-node cluster sees 10 faults / day!» Even worse: stragglers (node is not failed, but slow) Almost nobody does this!
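The fault figure is simple arithmetic: with one failure per server every 3 years, a 10,000-node cluster sees about 10,000 / (3 × 365) ≈ 9 failures per day, roughly the 10 faults/day quoted above, so failure handling has to be part of the programming model rather than an afterthought.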
Data-Parallel Models Restrict the programming interface so that the system can do more automatically "Here's an operation, run it on all of the data"» I don't care where it runs (you schedule that)» In fact, feel free to run it twice on different nodes Biggest example: MapReduce
MapReduce First widely popular programming model for data-intensive apps on clusters Published by Google in 2004» Processes 20 PB of data / day Popularized by open-source Hadoop project» 40,000 nodes at Yahoo!, 70 PB at Facebook
MapReduce Programming Model Data type: key-value records Map function: (K_in, V_in) → list(K_inter, V_inter) Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)
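Before the Google C++ API on the next slides, here is a minimal Scala sketch of the word-count instantiation of these two signatures; the function names and the string-encoded counts are illustrative only, not part of any particular API:

// Map: (docId, text) -> list of (word, "1")
def map(key: String, value: String): Seq[(String, String)] =
  value.split("\\s+").filter(_.nonEmpty).map(word => (word, "1"))

// Reduce: (word, list of counts) -> (word, total)
def reduce(key: String, values: Seq[String]): Seq[(String, String)] =
  Seq((key, values.map(_.toInt).sum.toString))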
Example: Word Count

class SplitWords: public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i])) i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;
      if (start < i)
        Emit(text.substr(start, i-start), "1");
    }
  }
};

(Google MapReduce API)
Example: Word Count

class Sum: public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};

(Google MapReduce API)
Word Count Execution [Diagram: the inputs "the quick brown fox", "the fox ate the mouse", and "how now brown cow" each go through a Map task emitting (word, 1) pairs; the pairs are shuffled and sorted by key; two Reduce tasks produce the outputs brown 2, fox 2, how 1, now 1, the 3 and ate 1, cow 1, mouse 1, quick 1]
MapReduce Execution Automatically split work into many small tasks Send map tasks to nodes based on data locality Load-balance dynamically as tasks finish
Fault Recovery If a task fails, re-run it and re-fetch its input If a node fails, re-run its map tasks on others If a task is slow, launch 2 nd copy on other node
Summary By providing a data-parallel model, MapReduce greatly simplified cluster programming:» Automatic division of job into tasks» Locality-aware scheduling» Load balancing» Recovery from failures & stragglers But the story doesn't end here!
When an Abstraction is Useful People want to compose it! Most real applications require multiple MR steps» Google indexing pipeline: 21 steps» Analytics queries (e.g. count clicks & top K): 2-5 steps» Iterative algorithms (e.g. PageRank): 10s of steps
Problems with MapReduce 1. Programmability» Multi-step jobs create spaghetti code 21 MR steps -> 21 mapper and reducer classes» Lots of boilerplate wrapper code per step» API doesn't provide type safety Can pass the wrong Mapper class for a given data type
Problems with MapReduce 2. Performance» MR only provides acyclic data flow (read from disk -> process -> write to disk)» Expensive for applications that need to reuse data Iterative algorithms (e.g. PageRank) Interactive data mining (repeated queries on same data)» Users often hand-optimize by merging steps together
Spark Aims to address both problems Programmability: embedded DSL in Scala» Functional transformations on collections» Type-safe, automatically optimized» 5-10x less code than MR» Interactive use from Scala shell Performance: in-memory computing primitives» Can run 10-100x faster than MR
Spark Programmability Full Google WordCount:

#include "mapreduce/mapreduce.h"

// User's map function
class SplitWords: public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while (i < n && isspace(text[i])) i++;
      // Find word end
      int start = i;
      while (i < n && !isspace(text[i])) i++;
      if (start < i)
        Emit(text.substr(start, i-start), "1");
    }
  }
};
REGISTER_MAPPER(SplitWords);

// User's reduce function
class Sum: public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Sum);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;
  for (int i = 1; i < argc; i++) {
    MapReduceInput* in = spec.add_input();
    in->set_format("text");
    in->set_filepattern(argv[i]);
    in->set_mapper_class("SplitWords");
  }

  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Sum");

  // Do partial sums within map
  out->set_combiner_class("Sum");

  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}
Spark Programmability Spark WordCount:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.save("out.txt")
Spark Performance Iterative machine learning apps: [Charts] K-means Clustering: Spark 4.1 sec, Hadoop MR 121 sec. Logistic Regression: Spark 0.96 sec, Hadoop MR 80 sec.
Outline The big data problem MapReduce Spark How it works Users' experience
Spark In More Detail Three concepts:» Resilient distributed datasets (RDDs) Immutable, partitioned collections of objects May be cached in memory for fast reuse» Operations on RDDs Transformations (define RDDs), actions (compute results)» Restricted shared variables (broadcast, accumulators) Goal: make parallel programs look like local ones
Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")            // Base RDD
errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count
...

[Diagram: the driver sends tasks to workers; each worker reads its input block (Block 1-3), keeps the cached partition in memory (Cache 1-3), and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery RDDs track lineage information that can be used to efficiently reconstruct lost partitions

Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split("\t")(2))

[Diagram: HDFS File → filter (func = _.startsWith(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
Fault Recovery Results [Chart: iteration time (s) over 10 iterations: 119 s for iteration 1, then 57, 56, 58, 58 s; a failure happens during iteration 6 (81 s); iterations 7-10 take 57, 59, 57, 59 s]
Example: Logistic Regression Goal: find best line separating two sets of points [Figure: two point sets, a random initial line, and the target separating line]
Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)

(w is automatically shipped to the cluster)
Logistic Regression Performance [Chart: running time (min) vs. number of iterations (1-30) for Hadoop and Spark. Hadoop: 110 s / iteration. Spark: 80 s for the first iteration, 1 s for further iterations]
Other RDD Operations Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, cogroup, flatMap, union, join, cross, mapValues, ... Actions (output a result): collect, reduce, take, fold, count, saveAsTextFile, saveAsHadoopFile, ...
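As a rough sketch of how these compose (assuming sc is a SparkContext as on later slides; the input paths, the tab-separated layout, and the variable names are made up for the example):

val visits = sc.textFile("hdfs://.../visits")       // lines of "userId<TAB>pageId"
  .map(line => { val f = line.split("\t"); (f(0), f(1)) })
val users = sc.textFile("hdfs://.../users")         // lines of "userId<TAB>country"
  .map(line => { val f = line.split("\t"); (f(0), f(1)) })

val visitsPerCountry = visits.join(users)           // (userId, (pageId, country))
  .map { case (_, (_, country)) => (country, 1) }
  .reduceByKey(_ + _)                                // transformations: nothing runs yet

println(visitsPerCountry.count())                   // action: triggers the computation
visitsPerCountry.saveAsTextFile("hdfs://.../out")   // action: writes the result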
Demo
Beyond RDDs So far we've seen that RDD operations can use variables from outside their scope By default, each task gets a read-only copy of each variable (no sharing) Good place to enable other sharing patterns!» Broadcast variables» Accumulators
Example: Collaborative Filtering Goal: predict users' movie ratings based on past ratings of other movies [Figure: a sparse Movies × Users ratings matrix R with a few known entries (values 1-5) and many unknowns marked "?"]
Model and Algorithm Model R as the product of user and movie feature matrices A and B of size U × K and M × K: R = A Bᵀ Alternating Least Squares (ALS)» Start with random A & B» Optimize user vectors (A) based on movies» Optimize movie vectors (B) based on users» Repeat until converged
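In symbols, ALS minimizes a least-squares objective over the observed entries of R (practical implementations usually add a regularization term, which this slide does not show):

\min_{A,B} \sum_{(i,j)\ \mathrm{observed}} \bigl( R_{ij} - a_i \cdot b_j \bigr)^2

where a_i is row i of A (user i's feature vector) and b_j is row j of B (movie j's). Fixing B makes this an ordinary least-squares problem in each a_i, and vice versa, hence the alternating updates on the next slide.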
Serial ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))   // (0 until U), (0 until M) are Range objects
  B = (0 until M).map(i => updateMovie(i, A, R))
}
Naïve Spark ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R)).collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R)).collect()
}

Problem: R is re-sent to all nodes in each iteration
Efficient Spark ALS

var R = spark.broadcast(readRatingsMatrix(...))
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value)).collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value)).collect()
}

Solution: mark R as a broadcast variable
Result: 3× performance improvement
Scaling Up Broadcast [Charts: iteration time (s), split into communication and computation, vs. number of machines (10, 30, 60, 90) for the initial HDFS-based broadcast and for Cornet P2P broadcast] [Chowdhury et al, SIGCOMM 2011]
Accumulators Apart from broadcast, another common sharing pattern is aggregation» Add up multiple statistics about data» Count various events for debugging Spark's reduce operation does aggregation, but accumulators are another nice way to express it
Usage

val badRecords = sc.accumulator(0)      // Accumulator[Int]
val badBytes = sc.accumulator(0.0)      // Accumulator[Double]

records.filter(r => {
  if (isBad(r)) {
    badRecords += 1
    badBytes += r.size
    false
  } else {
    true
  }
}).save(...)

printf("Total bad records: %d, avg size: %f\n",
       badRecords.value, badBytes.value / badRecords.value)
Accumulator Rules Create with SparkContext.accumulator(initialVal) Add to the value with += inside tasks» Each task's effect only counted once Access with .value, but only on the master» Exception if you try it on workers Retains efficiency and fault tolerance!
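A minimal sketch of these rules in action, again assuming sc is a SparkContext; the data and the divisibility test are arbitrary:

val multiplesOfSeven = sc.accumulator(0)      // created on the driver
sc.parallelize(1 to 100).foreach { x =>
  if (x % 7 == 0) multiplesOfSeven += 1       // tasks may only add to it
}
println(multiplesOfSeven.value)               // read .value on the driver only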
Outline The big data problem MapReduce Spark How it works Users' experience
Components Driver program connects to cluster and schedules tasks Workers run tasks, report results and variable updates Data shared through Hadoop file system (HDFS) [Diagram: the driver sends user code and broadcast vars to the master, exchanges tasks and results with workers, each worker keeps a local cache, and data is shared through HDFS]
Execution Process [Pipeline diagram:] RDD Objects build the operator DAG (e.g. rdd1.join(rdd2).groupBy(...).filter(...)) → DAG Scheduler splits the graph into stages of tasks and submits each stage as ready → Task Scheduler launches tasks (TaskSets) via the cluster manager and retries failed or straggling tasks → Workers execute tasks in threads; the block manager stores and serves blocks
Job Scheduler Captures RDD dependency graph Pipelines functions into stages Cache-aware for data reuse & locality Partitioning-aware to avoid shuffles [Diagram: a DAG of RDDs A-G split into Stage 1 (groupBy), Stage 2 (map, union) and Stage 3 (join); shaded boxes = cached partitions]
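As a rough sketch of what pipelining into stages means for user code (the dataset and the word-count logic are made up; sc is assumed to be a SparkContext):

val words = sc.textFile("hdfs://.../docs")
  .flatMap(_.split(" "))          // narrow dependency: pipelined with the read
  .filter(_.nonEmpty)             // still no shuffle: same stage

val counts = words
  .map(w => (w, 1))
  .reduceByKey(_ + _)             // wide dependency: shuffle, so a new stage boundary

counts.collect()                  // action: the scheduler submits the stages in order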
Language Integration No changes to Scala! Scala closures are Serializable Java objects» Contain references to outer variables as fields» Serialize on driver, load & call on workers

val str = "hi"
rdd.map(x => str + x)

is compiled to (roughly):

class Closure1 {
  String str;                                     // str: "hi"
  String call(String x) { return str + x; }
}
Shared Variables Play tricks with Java serialization!

class Broadcast<T> {
  private transient T value;   // will not be serialized
  private int id;

  Broadcast(T value) {
    this.value = value;
    this.id = BroadcastServer.newId();
    BroadcastServer.register(id, value);
  }

  // Called by Java when object is deserialized
  private void readObject(InputStream s) {
    s.defaultReadObject();
    value = BroadcastFetcher.getOrFetch(id);   // local registry
  }
}
Shared Variables Play tricks with Java serialization!

class Accumulator<T> {
  private transient T value;   // will not be serialized
  private int id = Accumulator.newId();
  ...

  // Called by Java when object is deserialized
  private void readObject(InputStream s) {
    s.defaultReadObject();
    value = zero();                       // initialize local value to 0 of type T
    AccumulatorManager.register(this);    // worker will send our value back to master
  }

  void add(T t) { value.add(t); }
}
Interactive Shell Scala interpreter compiles each line as, essentially, a separate source file (!) Modifications to allow use with Spark:» Altered code generation so that each line typed captures references to the objects it depends on (as fields)» Added a server to ship generated classes to workers
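Conceptually (this is an illustration, not the actual generated code), two shell lines such as "val x = 5" and a map over a dataset get wrapped in classes where the second holds a reference to the first, so serializing the closure brings along exactly the objects it needs; a plain Seq stands in for an RDD here:

class Line1 { val x = 5 }
class Line2(line1: Line1) {
  val result = Seq(1, 2, 3).map(_ + line1.x)   // the closure references line1 as a field
}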
Outline The big data problem MapReduce Spark How it works Users' experience
Some Users 500-user meetup, 12 companies contributing code
User Applications Crowdsourced traffic estimation (Mobile Millennium) Video analytics & anomaly detection (Conviva) Ad-hoc queries from web app (Quantifind) Twitter spam classification (Monarch) DNA sequence analysis (SNAP)
Conviva GeoReport [Chart: time (hours): Hive 20, Spark 0.5] SQL aggregations on many keys with the same filter; 40× gain over Hive from avoiding repeated I/O, deserialization and filtering
What do Users Say? Matei, we are doing predictive analytics on unstructured data and have been a Cascading/Pig shop [Hadoop]. We are a pretty small startup with 12 people and our first client is using our system to do predictions for [ ]. Recently someone tried to compute the same predictions in straight Java and it turned out that a single box could outperform Pig running on 4 machines by several orders of magnitude. Then we started trying Spark and it was even better (the Java code was just a prototype and not optimized). Spark took the 60min Pig job running on 4 servers down to 12s running on 1 server. So needless to say, we are porting our Pig code to Spark (which also shrinks it down to a much smaller size) and will be running our next product version on it.
What do Users Say? For simple queries, writing the query in Spark is harder than writing it in Hive [SQL over Hadoop]. However, it is much easier to write complicated queries in Spark than in Hive. Most real-world queries tend to be quite complex; hence the benefit of Spark. We can leverage the full power of the Scala programming language, rather than relying on the limited syntax offered by SQL. For example, the Hive expression IF(os=1, 'Windows', IF(os=2, 'OSX', IF(os=3, 'Linux', 'Unknown'))) can be replaced by a simple match clause in Scala. You can also use any Java/Scala library to transform the data.
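The match clause that quote alludes to might look like the following; the field name os and the integer codes are taken from the Hive expression above, everything else is an assumption:

def osName(os: Int): String = os match {
  case 1 => "Windows"
  case 2 => "OSX"
  case 3 => "Linux"
  case _ => "Unknown"
}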
What do Users Say? Spark jobs are amazingly easy to test Writing a test in Spark is as easy as:

class SparkTest {
  @Test def test() {
    // this is real code...
    val sc = new SparkContext("local", "MyUnitTest")
    // and now some pseudo code...
    val output = runYourCodeThatUsesSpark(sc)
    assertAgainst(output)
  }
}

As a technical aside, this local mode starts up an in-process Spark instance, backed by a thread pool, and actually opens up a few ports and temp directories, because it's a real, live Spark instance. Granted, this is usually more work than you want done in a unit test (which ideally would not hit any file or network I/O), but the redeeming quality is that it's fast. Tests run in ~2 seconds.
Interesting Takeaways Users are as excited about the ease of use as about the performance» Even seasoned distributed programmers Ability to use a full programming language (classes, functions, etc) is appreciated over SQL» Hard to see in small examples, but matters in big apps Embedded nature of DSL helps w/ software eng.» Call Spark in unit tests, call into existing Scala code» Important in any real software eng. setting!
Spark in Java and Python To further expand Spark's usability, we've now invested substantial effort to add 2 languages Both support all the Scala features (RDDs, accumulators, broadcast vars)
Spark in Java

lines.filter(_.contains("error")).count()        // Scala

JavaRDD<String> lines = sc.textFile(...);        // Java
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
Spark in Python

lines = sc.textFile(sys.argv[1])
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y)

Usable interactively from Python shell Coming out this month
Conclusion Spark makes parallel programs faster to write & run with a model based on distributed collections» User API resembles working with local collections» Caching & lineage-based recovery = fast data sharing Gets nice syntax while staying software-engineering friendly Might be fun to build DSLs on top of! www.spark-project.org