Advanced Data Science on Spark

Advanced Data Science on Spark
Reza Zadeh @Reza_Zadeh
http://reza-zadeh.com

Data Science Problem
Data growing faster than processing speeds
Only solution is to parallelize on large clusters
» Wide use in both enterprises and web industry
How do we program these things?

Use a Cluster
Convex Optimization
Matrix Factorization
Machine Learning
Numerical Linear Algebra
Large Graph analysis
Streaming and online algorithms
Following lectures on http://stanford.edu/~rezab/dao
Slides at http://stanford.edu/~rezab/slides/sparksummit2015

Outline
Data Flow Engines and Spark
The Three Dimensions of Machine Learning
Built-in Libraries: MLlib + {Streaming, GraphX, SQL}
Future of MLlib

Traditional Network Programming
Message-passing between nodes (e.g. MPI)
Very difficult to do at scale:
» How to split problem across nodes? Must consider network & data locality
» How to deal with failures? (inevitable at scale)
» Even worse: stragglers (node not failed, but slow)
» Ethernet networking not fast
» Have to write programs for each machine
Rarely used in commodity datacenters

Disk vs Memory
L1 cache reference:      0.5 ns
L2 cache reference:      7 ns
Mutex lock/unlock:       100 ns
Main memory reference:   100 ns
Disk seek:               10,000,000 ns

Network vs Local
Send 2K bytes over 1 Gbps network:   20,000 ns
Read 1 MB sequentially from memory:  250,000 ns
Round trip within same datacenter:   500,000 ns
Read 1 MB sequentially from network: 10,000,000 ns
Read 1 MB sequentially from disk:    30,000,000 ns
Send packet CA->Netherlands->CA:     150,000,000 ns

Data Flow Models
Restrict the programming interface so that the system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks and where to run each task
» Run parts twice for fault recovery
Biggest example: MapReduce
[diagram: three Map tasks feeding two Reduce tasks]
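To ground the operator-graph idea, here is a minimal word count sketch in Spark's Scala API (introduced later in the talk); it is not from the slides, and the SparkContext `sc` and the paths are assumptions:

// a job expressed as a graph of high-level operators; the system
// decides how to split each operator into tasks and where to run them
val counts = sc.textFile("hdfs://namenode/input.txt") // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://namenode/counts") // hypothetical path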

Example: Iterative Apps
[diagram: input -> iter. 1 -> iter. 2 -> ... with a file system read and write around each step; query 1, query 2, query 3 each re-read the input from the file system to produce result 1, 2, 3]
Commonly spend 90% of time doing I/O

MapReduce evolved
MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms
No efficient primitives for data sharing
» State between steps goes to distributed file system
» Slow due to replication & disk storage

Verdict
MapReduce algorithms research doesn't go to waste; it just gets sped up and easier to use
Still useful to study as an algorithmic framework, but silly to use directly

Spark Computing Engine
Extends a programming language with a distributed collection data-structure
» Resilient distributed datasets (RDD)
Open source at Apache
» Most active community in big data, with 50+ companies contributing
Clean APIs in Java, Scala, Python
Community: SparkR, being released in 1.4!

Key Idea
Resilient Distributed Datasets (RDDs)
» Collections of objects across a cluster with user-controlled partitioning & storage (memory, disk, ...)
» Built via parallel transformations (map, filter, ...)
» The world only lets you make RDDs such that they can be: Automatically rebuilt on failure

Resilient Distributed Datasets (RDDs)
Main idea: Resilient Distributed Datasets
» Immutable collections of objects, spread across cluster
» Statically typed: RDD[T] has objects of type T

val sc = new SparkContext()
val lines = sc.textFile("log.txt") // RDD[String]

// Transform using standard collection operations (lazily evaluated)
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")(2))

// kicks off a computation
messages.saveAsTextFile("errors.txt")

Fault Tolerance
RDDs track lineage info to rebuild lost data

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)

[diagram: lineage graph Input file -> map -> reduce -> filter; lost partitions are recomputed from this graph]

Partitioning
RDDs know their partitioning functions

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)

[diagram: the reduceByKey output is known to be hash-partitioned; the filter output's partitioning is also known]

MLlib: Available algorithms
classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS

The Three Dimensions

ML Objectives
Almost all machine learning objectives are optimized using an iterative update of the form w ← w − α · ∇f(w)
w is a vector of dimension d; we're trying to find the best w via optimization

Scaling
1) Data size
2) Number of models
3) Model size

Logistic Regression
Goal: find best line separating two sets of points
[figure: two point clouds with a random initial line converging to the target separating line]

Data Scaling

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(d)
for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient
print "Final w: %s" % w

Separable Updates
Can be generalized for
» Unconstrained optimization
» Smooth or non-smooth
» LBFGS, Conjugate Gradient, Accelerated Gradient methods, ...

Logistic Regression Results
[chart: running time (s) vs number of iterations (1 to 30) for Hadoop and Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
100 GB of data on 50 m1.xlarge EC2 machines

Behavior with Less RAM
[chart: iteration time (s) vs % of working set in memory: 68.8 at 0%, 58.1 at 25%, 40.7 at 50%, 29.7 at 75%, 11.5 at 100%]

Lots of little models
Training many small models is embarrassingly parallel
Most of the work should be handled by the data flow paradigm
ML pipelines do this

Hyper-parameter Tuning
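A minimal sketch of why hyper-parameter search is embarrassingly parallel: broadcast a small dataset once, then train one model per parameter value as ordinary data-flow tasks. The toy data and the closed-form one-dimensional ridge trainer below are assumptions, not from the talk; `sc` is an existing SparkContext.

// toy 1-D dataset of (x, y) pairs
val data = Seq((1.0, 2.0), (2.0, 4.1), (3.0, 5.9))

// hypothetical local trainer: closed-form 1-D ridge fit of y = w * x
def trainLocally(points: Seq[(Double, Double)], lambda: Double): Double =
  points.map { case (x, y) => x * y }.sum /
    (points.map { case (x, _) => x * x }.sum + lambda)

val dataB = sc.broadcast(data) // ship the (small) dataset once
val results = sc.parallelize(Seq(0.0, 0.01, 0.1, 1.0)) // parameter grid
  .map(lambda => (lambda, trainLocally(dataB.value, lambda))) // one model per task
  .collect()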

Model Scaling
Linear models only need to compute the dot product of each example with the model
Use a BlockMatrix to store the data, and use joins to compute the dot products
Coming in 1.5

Model Scaling
Data joined with model (weight):
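A hedged sketch of the join-based dot product the slide refers to, with toy sparse examples keyed by feature index; all data and names here are assumptions, and `sc` is an existing SparkContext.

// sparse examples as (featureIndex, (exampleId, value)) pairs -- toy data
val examples = sc.parallelize(Seq(
  (0, (1L, 2.0)), (3, (1L, 1.0)),  // example 1
  (0, (2L, 0.5)), (1, (2L, 4.0)))) // example 2

// model weights as (featureIndex, weight) pairs
val weights = sc.parallelize(Seq((0, 0.1), (1, -0.2), (3, 0.7)))

// join on feature index, multiply, and sum per example
val dots = examples.join(weights)
  .map { case (_, ((id, x), w)) => (id, x * w) }
  .reduceByKey(_ + _)

dots.collect() // roughly (1, 0.9) and (2, -0.75)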

Built-in libraries

A General Platform
Standard libraries included with Spark:
» Spark SQL (structured data)
» Spark Streaming (real-time)
» GraphX (graph)
» MLlib (machine learning)
all built on Spark Core

Benefit for Users
Same engine performs data extraction, model training and interactive queries
[diagram: separate engines need a DFS read and DFS write around each of parse, train and query; Spark does one DFS read, runs parse, train and query in memory, then one DFS write]

Machine Learning Library (MLlib)

points = context.sql("select latitude, longitude from tweets")
model = KMeans.train(points, 10)

70+ contributors in past year

MLlib algorithms
classification: logistic regression, linear SVM, naïve Bayes, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS

GraphX

GraphX
General graph processing library
Build graph using RDDs of nodes and edges
Run standard algorithms such as PageRank
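A minimal GraphX sketch of that workflow (not from the slides), assuming a spark-shell with SparkContext `sc`; the three-node toy graph is an assumption.

import org.apache.spark.graphx.{Edge, Graph}

// build a graph from RDDs of vertices and edges (toy 3-node cycle)
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph(vertices, edges)

// run PageRank to a convergence tolerance of 0.001
val ranks = graph.pageRank(0.001).vertices // (vertexId, rank) pairs
ranks.collect().foreach(println)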

Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
Chop up the live stream into batches of X seconds
Spark treats each batch of data as RDDs and processes them using RDD operations
Finally, the processed results of the RDD operations are returned in batches
[diagram: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results]

Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
Batch sizes as low as ½ second, latency ~ 1 second
Potential for combining batch processing and streaming processing in the same system
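To make the micro-batch model concrete, a minimal sketch (not from the slides) of a word count over 1-second batches from a socket stream; the source host/port are assumptions and `sc` is an existing SparkContext.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// 1-second micro-batches; each batch is processed as an RDD
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print() // processed results are emitted once per batch

ssc.start()
ssc.awaitTermination()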

Spark SQL

// Run SQL statements
val teenagers = context.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are RDDs of Row objects
val names = teenagers.map(t => "Name: " + t(0)).collect()

MLlib + {Streaming, GraphX, SQL}

MLlib + Streaming
As of Spark 1.1, you can train linear models in a streaming fashion; k-means as of 1.2
Model weights are updated via SGD, thus amenable to streaming
More work needed for decision trees
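A hedged sketch of that streaming training loop using MLlib's StreamingLinearRegressionWithSGD (available since Spark 1.1); the input path, feature count, and StreamingContext setup are assumptions.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
// each line parsed as a labeled point, e.g. "(1.0,[0.5,0.2,0.9])"
val training = ssc.textFileStream("hdfs://namenode/train").map(LabeledPoint.parse)

val numFeatures = 3 // hypothetical
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(training) // weights updated via SGD on every batch
ssc.start()
ssc.awaitTermination()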

MLlib + SQL

df = context.sql("select latitude, longitude from tweets")
model = pipeline.fit(df)

DataFrames in Spark 1.3 (March 2015)
Powerful when coupled with the new pipeline API
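For illustration, a hedged Scala sketch of the spark.ml pipeline API referenced here (Spark 1.3+); the stages and the trainingDF DataFrame (with "text" and "label" columns) are assumptions.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// tokenize text, hash tokens into feature vectors, then fit a classifier
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF) // trainingDF: hypothetical DataFrame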

MLlib + GraphX

Future of MLlib

Goals
Tighter integration with DataFrame and spark.ml API
Accelerated gradient methods & Optimization interface
Model export: PMML (current export exists in Spark 1.3, but not PMML, which lacks distributed models)
Scaling: Model scaling