MLlib: Scalable Machine Learning on Spark




MLlib: Scalable Machine Learning on Spark

Xiangrui Meng

Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi, Joseph Gonzalez, Michael Franklin, Michael I. Jordan, Tim Kraska, et al.

What is MLlib?

What is MLlib?

MLlib is a Spark subproject providing machine learning primitives:
- initial contribution from AMPLab, UC Berkeley
- shipped with Spark since version 0.8
- 33 contributors

What is MLlib?

Algorithms:
- classification: logistic regression, linear support vector machine (SVM), naive Bayes
- regression: generalized linear regression (GLM)
- collaborative filtering: alternating least squares (ALS)
- clustering: k-means
- decomposition: singular value decomposition (SVD), principal component analysis (PCA)

Why MLlib?

scikit-learn?

Algorithms:
- classification: SVM, nearest neighbors, random forest, ...
- regression: support vector regression (SVR), ridge regression, Lasso, logistic regression, ...
- clustering: k-means, spectral clustering, ...
- decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), ...

Mahout?

Algorithms:
- classification: logistic regression, naive Bayes, random forest, ...
- collaborative filtering: ALS, ...
- clustering: k-means, fuzzy k-means, ...
- decomposition: SVD, randomized SVD, ...

LIBLINEAR? Mahout? H2O? Vowpal Wabbit? MATLAB? R? scikit-learn? Weka?

Why MLlib?

Why MLlib?

It is built on Apache Spark, a fast and general engine for large-scale data processing.
- Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
- Write applications quickly in Java, Scala, or Python.

Gradient descent

w \leftarrow w - \alpha \sum_{i=1}^{n} g(w; x_i, y_i)

val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= alpha * gradient
}
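The snippet assumes a parsePoint helper and a Vector type that the slide does not show. A minimal sketch under stated assumptions: each input line holds space-separated numbers with the label last, Point and parsePoint are hypothetical names, and Vector is Spark 0.x's spark.util.Vector, which supplies zeros, dot, and the element-wise arithmetic used above:

// Hypothetical label/feature container; not part of the slide.
case class Point(x: Vector, y: Double)

// Assumes one example per line: d feature values followed by the label.
def parsePoint(line: String): Point = {
  val nums = line.split(' ').map(_.toDouble)
  Point(new Vector(nums.init), nums.last)
}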

k-means (Scala)

// Load and parse the data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble)).cache()

// Cluster the data into two classes using KMeans.
val clusters = KMeans.train(parsedData, 2, numIterations = 20)

// Compute the sum of squared errors.
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)

k-means (Python)

# Load and parse the data
from numpy import array, sqrt
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(
    lambda line: array([float(x) for x in line.split(' ')])).cache()

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations = 10,
                        runs = 1, initializationMode = "k-means||")

# Evaluate clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

cost = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))

Dimension reduction + k-means

// compute principal components
val points: RDD[Vector] = ...
val mat = new RowMatrix(points)
val pc = mat.computePrincipalComponents(20)

// project points to a low-dimensional space
val projected = mat.multiply(pc).rows

// train a k-means model on the projected data
val model = KMeans.train(projected, 10)

Collaborative filtering

// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val model = ALS.train(ratings, 1, 20, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
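The slide stops at predict. A natural continuation, following the pattern in the standard MLlib documentation (not shown on the slide), joins the predictions against the observed ratings and computes the mean squared error:

// Key both the observed and predicted ratings by (user, product).
val predictionPairs = predictions.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictionPairs)

// Mean squared error over all rated (user, product) pairs.
val MSE = ratesAndPreds.map { case (_, (actual, predicted)) =>
  val err = actual - predicted
  err * err
}.mean()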

Why MLlib?

It ships with Spark as a standard component.

Out for dinner?
- Search for a restaurant and make a reservation.
- Start navigation.
- Food looks good? Take a photo and share.

Why smartphone?

Out for dinner?
- Search for a restaurant and make a reservation. (Yellow Pages?)
- Start navigation. (GPS?)
- Food looks good? Take a photo and share. (Camera?)

Why MLlib?

A special-purpose device may be better at one aspect than a general-purpose device. But the cost of context switching is high:
- different languages or APIs
- different data formats
- different tuning tricks

Spark SQL + MLlib

// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib.
val training = trainingTable.map { row =>
  val features = Vectors.dense(row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}

val model = SVMWithSGD.train(training)

Streaming + MLlib

// collect tweets using streaming

// train a k-means model
val model: KMeansModel = ...

// apply model to filter tweets
val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
val statuses = tweets.map(_.getText)
val filteredTweets = statuses.filter(t =>
  model.predict(featurize(t)) == clusterNumber)

// print tweets within this particular cluster
filteredTweets.print()

GraphX + MLlib

// assemble link graph
val graph = Graph(pages, links)
val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices

// load page labels (spam or not) and content features
val labelAndFeatures: RDD[(Long, (Double, Seq[(Int, Double)]))] = ...
val training: RDD[LabeledPoint] =
  labelAndFeatures.join(pageRank).map {
    case (id, ((label, features), pageRank)) =>
      // 1000 content features, with PageRank appended at index 1000
      LabeledPoint(label, Vectors.sparse(1001, features :+ (1000, pageRank)))
  }

// train a spam detector using logistic regression
val model = LogisticRegressionWithSGD.train(training)

Why MLlib?

Spark is a general-purpose big data platform.
- Runs in standalone mode, on YARN, EC2, and Mesos, also on Hadoop v1 with SIMR.
- Reads from HDFS, S3, HBase, and any Hadoop data source.

MLlib is a standard component of Spark providing machine learning primitives on top of Spark. MLlib is also comparable to or even better than other libraries specialized in large-scale machine learning.


Why MLlib?
- Scalability
- Performance
- User-friendly APIs
- Integration with Spark and its other components

Logistic regression

Logistic regression - weak scaling

[Fig. 5: Walltime for weak scaling for logistic regression (MLbase/MLlib, VW, MATLAB; datasets from n=6k to n=200k, d=160k). Fig. 6: Weak scaling for logistic regression (relative walltime vs. number of machines, 1 to 32, for MLbase/MLlib, VW, and Ideal).]

- Full dataset: 200K images, 160K dense features.
- Similar weak scaling.
- MLlib is within a factor of 2 of VW's wall-clock time.

Logistic regression - strong scaling

[Fig. 7: Walltime for strong scaling for logistic regression (MLbase/MLlib, VW, MATLAB). Speedup vs. number of machines, 1 to 32, for MLlib, MLbase, VW, and Ideal.]

- Fixed dataset: 50K images, 160K dense features.
- MLlib exhibits better scaling properties.
- MLlib is faster than VW with 16 and 32 machines.

Collaborative filtering

Collaborative filtering

Recover a rating matrix from a subset of its entries.

[Diagram: partially observed rating matrix with unknown entries.]

ALS - wall-clock time

System     Wall-clock time (seconds)
MATLAB     15443
Mahout     4206
GraphLab   291
MLlib      481

- Dataset: scaled version of Netflix data (9X in size).
- Cluster: 9 machines.
- MLlib is an order of magnitude faster than Mahout.
- MLlib is within a factor of 2 of GraphLab.

Implementation of k-means

Initialization:
- random
- k-means++
- k-means||
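The initialization mode is selectable through MLlib's KMeans builder API. A brief illustration, assuming parsedData is an RDD of MLlib vectors; the parameter values here are arbitrary:

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(2)
  .setMaxIterations(20)
  .setInitializationMode("k-means||") // or "random"
  .run(parsedData)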

Implementation of k-means

Iterations:
- For each point, find its closest center:
  l_i = \arg\min_j \|x_i - c_j\|_2^2
- Update the cluster centers:
  c_j = \frac{\sum_{i: l_i = j} x_i}{\sum_{i: l_i = j} 1}
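A minimal sketch of one such iteration over an RDD, assuming dense points represented as Array[Double]; this illustrates the two steps above, not MLlib's actual implementation:

import org.apache.spark.rdd.RDD

def squaredDistance(x: Array[Double], c: Array[Double]): Double = {
  var s = 0.0
  var i = 0
  while (i < x.length) { val d = x(i) - c(i); s += d * d; i += 1 }
  s
}

def lloydStep(points: RDD[Array[Double]],
              centers: Array[Array[Double]]): Array[Array[Double]] = {
  points
    // assignment step: label each point with its closest center
    .map { x =>
      val l = centers.indices.minBy(j => squaredDistance(x, centers(j)))
      (l, (x, 1L))
    }
    // update step: sum points and counts per center
    .reduceByKey { case ((s1, n1), (s2, n2)) =>
      (s1.zip(s2).map { case (a, b) => a + b }, n1 + n2)
    }
    .collect()
    .sortBy(_._1)
    // new center = sum of assigned points / count
    // (centers that received no points are simply dropped in this sketch)
    .map { case (_, (sum, n)) => sum.map(_ / n) }
}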

Implementation of k-means

The points are usually sparse, but the centers are most likely dense. Computing a distance takes O(d) time, so the time complexity is O(n d k) per iteration, taking no advantage of sparsity. However, we have

\|x - c\|_2^2 = \|x\|_2^2 + \|c\|_2^2 - 2 \langle x, c \rangle

Computing the inner product only needs the non-zero elements. So we can cache the norms of the points and of the centers, and then we only need the inner products to obtain the distances. This reduces the running time to O(nnz k + d k) per iteration.

However, is it accurate?
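A minimal sketch of the norm-caching trick, using a toy sparse-vector type rather than MLlib's internals:

// Toy sparse vector: parallel index/value arrays, squared norm cached once.
case class SparseVec(indices: Array[Int], values: Array[Double]) {
  lazy val normSq: Double = values.map(v => v * v).sum
}

// Inner product against a dense center touches only the non-zeros: O(nnz).
def dot(x: SparseVec, c: Array[Double]): Double = {
  var s = 0.0
  var i = 0
  while (i < x.indices.length) {
    s += x.values(i) * c(x.indices(i))
    i += 1
  }
  s
}

// ||x - c||^2 = ||x||^2 + ||c||^2 - 2<x, c>, with both norms precomputed.
// Note: this form can lose precision to cancellation when x and c are
// close, which is exactly the accuracy concern raised above.
def fastSquaredDistance(x: SparseVec, c: Array[Double], cNormSq: Double): Double =
  x.normSq + cNormSq - 2.0 * dot(x, c)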

Implementation of ALS
- broadcast everything
- data parallel
- fully parallel

Alternating least squares (ALS)
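The slide itself is a diagram. For reference, here is the standard ALS formulation (not transcribed from the slide): approximate the rating matrix R by U V^T, minimizing the regularized squared error over the set \Omega of observed entries,

\min_{U,V} \sum_{(i,j) \in \Omega} \left( r_{ij} - u_i^\top v_j \right)^2 + \lambda \left( \sum_i \|u_i\|_2^2 + \sum_j \|v_j\|_2^2 \right)

With V fixed, each u_i has a closed-form least-squares solution (and symmetrically for each v_j with U fixed), which is what makes the alternating scheme easy to parallelize.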

Broadcast everything

[Diagram: master holds the ratings, movie factors, and user factors, and broadcasts them to the workers.]

- Master loads (small) data file and initializes models.
- Master broadcasts data and initial models.
- At each iteration, updated models are broadcast again.
- Works OK for small data.
- Lots of communication overhead - doesn't scale well.

Data parallel

[Diagram: each worker holds a partition of the ratings; master holds the movie and user factors.]

- Workers load data.
- Master broadcasts initial models.
- At each iteration, updated models are broadcast again.
- Much better scaling.
- Works on large datasets.
- Works well for smaller models (low K).

Fully parallel

[Diagram: each worker holds a partition of the ratings together with the corresponding movie and user factors.]

- Workers load data.
- Models are instantiated at workers.
- At each iteration, models are shared via join between workers.
- Much better scalability.
- Works on large datasets.

Implementation of ALS
- broadcast everything
- data parallel
- fully parallel
- block-wise parallel: users/products are partitioned into blocks and the join is based on blocks instead of individual users/products (see the sketch below).
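A minimal sketch of the block-wise idea, not MLlib's actual implementation; Rating and ratings come from the collaborative filtering example earlier, and numBlocks is an arbitrary choice:

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

val numBlocks = 8
def userBlock(user: Int): Int = math.abs(user) % numBlocks
def productBlock(product: Int): Int = math.abs(product) % numBlocks

// Group ratings by (user block, product block): when factors are exchanged
// during the join, each pair of blocks exchanges one message instead of one
// message per individual rating, which cuts shuffle and communication cost.
val blockedRatings: RDD[((Int, Int), Iterable[Rating])] =
  ratings
    .map(r => ((userBlock(r.user), productBlock(r.product)), r))
    .groupByKey()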

New features for v1.x
- Sparse data (see the sketch below)
- Classification and regression tree (CART)
- SVD and PCA
- L-BFGS
- Model evaluation
- Discretization
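As a quick illustration of the sparse data support, MLlib describes a sparse vector by its size plus the indices and values of its non-zero entries:

import org.apache.spark.mllib.linalg.Vectors

// The vector (1.0, 0.0, 3.0, 0.0): size 4, non-zeros at indices 0 and 2.
val sv = Vectors.sparse(4, Array(0, 2), Array(1.0, 3.0))

// The equivalent dense representation, for comparison.
val dv = Vectors.dense(1.0, 0.0, 3.0, 0.0)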