Spark and the Big Data Library
Reza Zadeh
Thanks to Matei Zaharia
Problem
Data growing faster than processing speeds
Only solution is to parallelize on large clusters
» Wide use in both enterprises and web industry
How do we program these things?
Outline
Data flow vs. traditional network programming
Limitations of MapReduce
Spark computing engine
Machine Learning Example
Current State of Spark Ecosystem
Built-in Libraries
Data flow vs. traditional network programming
Traditional Network Programming
Message-passing between nodes (e.g. MPI)
Very difficult to do at scale:
» How to split the problem across nodes? Must consider network & data locality
» How to deal with failures? (inevitable at scale)
» Even worse: stragglers (node not failed, but slow)
» Ethernet networking not fast
» Have to write programs for each machine
Rarely used in commodity datacenters
Data Flow Models
Restrict the programming interface so that the system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks and where to run each task
» Runs parts twice for fault recovery
Biggest example: MapReduce
[Diagram: Map tasks feeding into Reduce tasks]
Example MapReduce Algorithms
Matrix-vector multiplication (sketched below)
Power iteration (e.g. PageRank)
Gradient descent methods
Stochastic SVD
Tall-and-skinny QR
Many others!
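As a concrete illustration of the first item, here is a minimal sketch of matrix-vector multiplication in the map/reduce style, written against Spark's Scala API (not code from the original slides). It assumes the matrix is an RDD of (row, column, value) triples and the vector is small enough to broadcast; the data is a toy example.

import org.apache.spark.SparkContext._   // pair-RDD operations (needed on older Spark versions)

// y = A * x: emit one partial product per nonzero entry, then sum per row
val A = sc.parallelize(Seq((0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)))  // (row, col, value)
val x = sc.broadcast(Array(1.0, 1.0))
val y = A.map { case (i, j, v) => (i, v * x.value(j)) }   // map: partial products keyed by row
         .reduceByKey(_ + _)                              // reduce: sum within each row
y.collect()   // Array((0, 3.0), (1, 3.0))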
Why Use a Data Flow Engine?
Ease of programming
» High-level functions instead of message passing
Wide deployment
» More common than MPI, especially near data
Scalability to very largest clusters
» Even the HPC world is now concerned about resilience
Examples: Pig, Hive, Scalding, Storm
Limitations of MapReduce
Limitations of MapReduce
MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms
No efficient primitives for data sharing
» State between steps goes to distributed file system
» Slow due to replication & disk storage
Example: Iterative Apps
[Diagram: each iteration (iter. 1, iter. 2, ...) and each query (query 1, 2, 3, ...) reads its input from the file system and writes its result back, so every step is bracketed by file-system reads and writes]
Commonly spend 90% of time doing I/O
Example: PageRank
Repeatedly multiply sparse matrix and vector
Requires repeatedly hashing together page adjacency lists and rank vector
[Diagram: in iterations 1, 2, 3, ... the Neighbors (id, edges) file is grouped with the Ranks (id, rank) vector over and over, even though it is the same file each time]
Result
While MapReduce is simple, it can require asymptotically more communication or I/O
Spark computing engine
Spark Computing Engine
Extends a programming language with a distributed collection data-structure
» Resilient distributed datasets (RDD)
Open source at Apache
» Most active community in big data, with 50+ companies contributing
Clean APIs in Java, Scala, Python, R
Resilient Distributed Datasets (RDDs)
Main idea: Resilient Distributed Datasets
» Immutable collections of objects, spread across a cluster
» Statically typed: RDD[T] has objects of type T

val sc = new SparkContext()
val lines = sc.textFile("log.txt") // RDD[String]
// Transform using standard collection operations (lazily evaluated)
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")(2))
messages.saveAsTextFile("errors.txt") // the action kicks off the computation
Key Idea
Resilient Distributed Datasets (RDDs)
» Collections of objects across a cluster with user-controlled partitioning & storage (memory, disk, ...)
» Built via parallel transformations (map, filter, ...)
» The world only lets you make RDDs such that they can be: automatically rebuilt on failure
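To connect this back to the earlier PageRank slide: because the adjacency-list RDD can be cached in memory, it is not re-read and re-hashed every iteration. A condensed sketch following the well-known Spark PageRank example, with a toy three-page graph standing in for real data:

// links: (page, list of outgoing neighbors), cached so it is not re-read each iteration
val links = sc.parallelize(Seq(
  ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a")))).cache()
var ranks = links.mapValues(_ => 1.0)

for (i <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (neighbors, rank) => neighbors.map(n => (n, rank / neighbors.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.collect()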
Python, Java, Scala, R

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Fault Tolerance
RDDs track lineage info to rebuild lost data

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)

[Diagram: lineage graph from the input file through map, reduce, and filter]
Partitioning
RDDs know their partitioning functions

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)

[Diagram: the reduceByKey output is known to be hash-partitioned, and the filter output is also known to keep that partitioning]
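A minimal Scala sketch of the same idea (the data and partition count are illustrative, not from the slides): reduceByKey can be given an explicit HashPartitioner, Spark records it in the RDD's partitioner field, and filter preserves it.

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("error", 1), ("warn", 1), ("error", 1)))
val counts = pairs.reduceByKey(new HashPartitioner(8), _ + _)  // known to be hash-partitioned
counts.partitioner                      // Some(HashPartitioner) -- Spark remembers this
counts.filter(_._2 > 1).partitioner     // filter keeps the parent's partitioning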
Machine Learning example
Logistic Regression

from math import exp
import numpy

# spark is the SparkContext; readPoint parses one text line into a point
# with fields x (numpy feature vector) and y (label)
data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(d)  # d = number of features
for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

print "Final w: %s" % w
Logistic Regression Results
[Chart: running time (s) vs. number of iterations (1-30), Hadoop vs. Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
100 GB of data on 50 m1.xlarge EC2 machines
Behavior with Less RAM
[Chart: iteration time (s) vs. % of working set in memory: 68.8 s at 0%, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s at 100%]
Benefit for Users
Same engine performs data extraction, model training and interactive queries
[Diagram: with separate engines, each step is bracketed by DFS reads and writes (read, parse, write; read, train, write; read, query, write); with Spark, one engine runs parse, train and query over data read from DFS once]
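A minimal sketch of that single-engine pipeline in Scala; the file name, record layout and choice of k-means are illustrative assumptions, the point being that parsing, training and querying share one cached RDD with no intermediate DFS writes.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse once and cache; both the training step and the ad-hoc query reuse it
val records = sc.textFile("events.csv").map(_.split(",")).cache()

// Train: columns 1 and 2 assumed to be numeric features (illustrative layout)
val model = KMeans.train(records.map(r => Vectors.dense(r(1).toDouble, r(2).toDouble)), 10, 20)

// Query: column 3 assumed to be a status field
records.filter(_(3) == "ERROR").map(r => (r(0), 1)).reduceByKey(_ + _).take(5)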
State of the Spark ecosystem
Spark Community
Most active open source community in big data
200+ developers, 50+ companies contributing
[Chart: contributors in past year, Spark vs. Giraph and Storm]
Project Activity
[Charts: commits and lines of code changed over the past 6 months, Spark vs. MapReduce, YARN, HDFS and Storm]
Continuing Growth
[Chart: contributors per month to Spark; source: ohloh.net]
Built-in libraries
Standard Library for Big Data
Big data apps lack libraries of common algorithms
Spark's generality + support for multiple languages (Python, Scala, Java, R) make it suitable to offer this
[Diagram: SQL, ML and graph libraries layered on the Spark core]
Much of future activity will be in these libraries
A General Platform
Standard libraries included with Spark:
» Spark SQL (structured data)
» Spark Streaming (real-time)
» GraphX (graph)
» MLlib (machine learning)
all built on Spark Core
Machine Learning Library (MLlib)

points = context.sql("select latitude, longitude from tweets")
model = KMeans.train(points, 10)

40 contributors in past year
MLlib algorithms
classification: logistic regression, linear SVM, naïve Bayes, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
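For example, the hand-rolled gradient-descent loop shown earlier can be replaced by MLlib's built-in logistic regression. A minimal Scala sketch, assuming an input file whose lines look like "label f1 f2 ..." (file name and format are illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Each line: "label f1 f2 ..." (illustrative format)
val training = sc.textFile("points.txt").map { line =>
  val parts = line.split(' ').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}.cache()

val model = LogisticRegressionWithSGD.train(training, 100)  // 100 SGD iterations
model.predict(Vectors.dense(1.0, 0.5))  // feature dimension must match the training data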
GraphX
GraphX
General graph processing library
Build graph using RDDs of nodes and edges
Large library of graph algorithms with composable steps
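A minimal sketch of that in GraphX's Scala API, with toy vertex and edge data (illustrative, not from the slides): build a Graph from an RDD of vertices and an RDD of edges, then run one of the built-in algorithms.

import org.apache.spark.graphx.{Edge, Graph}

// Build a graph from RDDs of vertices and edges (toy data)
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(vertices, edges)

// Run a built-in algorithm, e.g. PageRank to a tolerance of 0.001
val ranks = graph.pageRank(0.001).vertices
ranks.join(vertices).collect()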
GraphX Algorithms
Collaborative Filtering
» Alternating Least Squares
» Stochastic Gradient Descent
» Tensor Factorization
Community Detection
» Triangle-Counting
» K-core Decomposition
» K-Truss
Structured Prediction
» Loopy Belief Propagation
» Max-Product Linear Programs
» Gibbs Sampling
Semi-supervised ML
» Graph SSL
» CoEM
Graph Analytics
» PageRank
» Personalized PageRank
» Shortest Path
» Graph Coloring
Classification
» Neural Networks
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
» Chop up the live stream into batches of X seconds
» Spark treats each batch of data as RDDs and processes them using RDD operations
» Finally, the processed results of the RDD operations are returned in batches
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
» Batch sizes as low as ½ second, latency ~1 second
» Potential for combining batch processing and streaming processing in the same system
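A minimal sketch of that micro-batch model in Scala, assuming a 1-second batch interval and a socket text source on localhost:9999 (both illustrative): each batch becomes an RDD and is processed with ordinary RDD-style operations.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations on older versions

val ssc = new StreamingContext(sc, Seconds(1))          // 1-second batches
val lines = ssc.socketTextStream("localhost", 9999)     // illustrative source
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()                                          // emit each batch's result

ssc.start()
ssc.awaitTermination()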
Spark SQL

// Run SQL statements
val teenagers = context.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are RDDs of Row objects
val names = teenagers.map(t => "Name: " + t(0)).collect()
Spark SQL
Enables loading & querying structured data in Spark

From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON:
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")

tweets.json:
{"text": "hi", "user": {"name": "matei", "id": 123}}
Conclusions
Spark and Research
Spark has its roots in research, so we hope to keep incorporating new ideas!
Conclusion
Data flow engines are becoming an important platform for numerical algorithms
While early models like MapReduce were inefficient, new ones like Spark close this gap
More info: spark.apache.org
Class Schedule
Schedule
Today and tomorrow: hands-on exercises; download course materials and slides: http://stanford.edu/~rezab/sparkclass/
Friday: advanced talks on Spark libraries and uses