Apache Mahout's new DSL for Distributed Machine Learning. Sebastian Schelter GOTO Berlin 11/06/2014



Apache Mahout's new DSL for Distributed Machine Learning Sebastian Schelter GOTO Berlin 11/06/2014

Overview. Apache Mahout: Past & Future; A DSL for Machine Learning; Example; Under the covers; Distributed computation of XᵀX


Apache Mahout: History. A library for scalable machine learning (ML); started six years ago as ML on MapReduce. Focus on popular ML problems and algorithms: Collaborative Filtering (find interesting items for users based on past behavior), Classification (learn to categorize objects), Clustering (find groups of similar objects), Dimensionality Reduction (find a low-dimensional representation of the data). Large user base (e.g. Adobe, AOL, Accenture, Foursquare, Mendeley, ResearchGate, Twitter).

Background: MapReduce. A simple paradigm for distributed processing (proposed by Google): the user implements two functions, map and reduce; the system executes the program in parallel and scales to clusters with thousands of machines. Popular open source implementation: Apache Hadoop.
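To make the programming model concrete, here is a minimal, purely local word-count sketch (illustration only, with invented names such as mapFn and reduceFn; it is neither Hadoop nor Mahout code): the user supplies the two functions, and the framework takes care of partitioning, shuffling and fault tolerance.

object WordCountSketch {
  // map: one input record -> a list of (key, value) pairs
  def mapFn(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // reduce: one key plus all values emitted for it -> an aggregated result
  def reduceFn(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val input = Seq("the quick brown fox", "the lazy dog jumps over the fox")
    val counts = input
      .flatMap(mapFn)                // "map" phase
      .groupBy(_._1)                 // local stand-in for the shuffle
      .map { case (word, pairs) => reduceFn(word, pairs.map(_._2)) }
    counts.foreach(println)
  }
}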

Background: MapReduce

Apache Mahout: Problems. MapReduce is not well suited for ML: slow execution, especially for iterations; the constrained programming model makes code hard to write, read and adjust; lack of declarativity; lots of hand-coded joins necessary. Abandonment of MapReduce: the project will reject new MapReduce implementations, while widely used legacy implementations will be maintained. Reboot with a new DSL.

Overview. Apache Mahout: Past & Future; A DSL for Machine Learning; Example; Under the covers; Distributed computation of XᵀX

Requirements for an ideal ML environment. 1. R/Matlab-like semantics: a type system that covers linear algebra and statistics. 2. Modern programming language qualities: functional programming, object-oriented programming, scriptable and interactive. 3. Scalability: automatic distribution and parallelization with sensible performance.

Scala DSL. Scala as programming/scripting environment. R-like DSL: the expression G = BBᵀ - C - Cᵀ + (ξᵀξ) · (s_q s_qᵀ) is written as
val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)
Declarativity! An algebraic expression optimizer for distributed linear algebra provides a translation layer to distributed engines; currently supports Apache Spark only, might support Apache Flink in the future.

Data Types. Scalar real values; in-memory vectors (dense, 2 types of sparse); in-memory matrices (sparse and dense, plus a number of specialized matrices); Distributed Row Matrices (DRM): a huge matrix, partitioned by rows, that lives in the main memory of the cluster, provides a small set of parallelized operations, and is lazily evaluated.
val x = 2.367
val v = dvec(1, 0, 5)
val w = svec((0 -> 1) :: (2 -> 5) :: Nil)
val A = dense((1, 0, 5), (2, 1, 4), (4, 3, 1))
val drmA = drmFromHDFS(...)

Features (1). Matrix, vector, scalar operators (in-memory, out-of-core): drmA %*% drmB; A %*% x; A.t %*% drmB; A * B. Slicing operators: A(5 until 20, 3 until 40); A(5, ::); A(5, 5); x(a to b). Assignments (in-memory only): A(5, ::) := x; A *= B; A -=: B; 1 /:= x. Vector-specific: x dot y; x cross y. Summaries: A.nrow; x.length; A.colSums; B.rowMeans; x.sum; A.norm

Features (2). Solving linear systems: val x = solve(A, b). In-memory decompositions:
val (inMemQ, inMemR) = qr(inMemM)
val ch = chol(inMemM)
val (inMemV, d) = eigen(inMemM)
val (inMemU, inMemV, s) = svd(inMemM)
Out-of-core decompositions:
val (drmQ, inMemR) = thinQR(drmA)
val (drmU, drmV, s) = dssvd(drmA, k = 50, q = 1)
Caching of DRMs:
val drmA_cached = drmA.checkpoint()
drmA_cached.uncache()

Overview. Apache Mahout: Past & Future; A DSL for Machine Learning; Example; Under the covers; Distributed computation of XᵀX

Cereals
Name                     protein  fat  carbo  sugars  rating
Apple Cinnamon Cheerios     2      2    10.5    10    29.509541
Cap'n'Crunch                1      2    12      12    18.042851
Cocoa Puffs                 1      1    12      13    22.736446
Froot Loops                 2      1    11      13    32.207582
Honey Graham Ohs            1      2    12      11    21.871292
Wheaties Honey Gold         2      1    16       8    36.187559
Cheerios                    6      2    17       1    50.764999
Clusters                    3      2    13       7    40.400208
Great Grains Pecan          3      3    13       4    45.811716
http://lib.stat.cmu.edu/dasl/datafiles/cereals.html

Linear Regression. Assumption: the target variable y is generated by a linear combination of the feature matrix X with a parameter vector β, plus noise ε: y = Xβ + ε. Goal: find an estimate of the parameter vector β that explains the data well. Cereals example: X = weights of ingredients, y = customer rating.

Data Ingestion. Usually: load the dataset as a DRM from a distributed filesystem: val drmData = drmFromHDFS(...). Mimic a large dataset for our example:
val drmData = drmParallelize(dense(
  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
  (2, 1, 11,   13, 32.207582),  // Froot Loops
  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
  (2, 1, 16,    8, 36.187559),  // Wheaties Honey Gold
  (6, 2, 17,    1, 50.764999),  // Cheerios
  (3, 2, 13,    7, 40.400208),  // Clusters
  (3, 3, 13,    4, 45.811716)), // Great Grains Pecan
  numPartitions = 2)

Data Preparation. Cereals example: the target variable y is the customer rating, the weights of the ingredients are the features. Extract X as a DRM by slicing, fetch y as an in-core vector:
val drmX = drmData(::, 0 until 4)
val y = drmData.collect(::, 4)
drmX is the 9 x 4 matrix of ingredient columns (protein, fat, carbo, sugars); y = (29.509541, 18.042851, 22.736446, 32.207582, 21.871292, 36.187559, 50.764999, 40.400208, 45.811716).

Estimating β. Ordinary Least Squares: minimizes the sum of squared residuals between the true target variable and the predicted target variable. Closed-form expression for the estimate: β̂ = (XᵀX)⁻¹ Xᵀy. Computing XᵀX and Xᵀy is as simple as typing the formulas:
val drmXtX = drmX.t %*% drmX
val drmXty = drmX.t %*% y
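For completeness, the closed form follows from minimizing the residual sum of squares; this standard derivation is added here and was not on the slides:

\min_\beta \|y - X\beta\|_2^2 = (y - X\beta)^\top (y - X\beta)
\nabla_\beta \, (y - X\beta)^\top (y - X\beta) = -2 X^\top y + 2 X^\top X \beta = 0
\;\Rightarrow\; X^\top X \hat{\beta} = X^\top y \;\Rightarrow\; \hat{\beta} = (X^\top X)^{-1} X^\top y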

Estimating β. Solve the following linear system to get the least-squares estimate of β: XᵀX β̂ = Xᵀy. Fetch XᵀX and Xᵀy onto the driver and use an in-memory solver: assumes XᵀX fits into memory; uses an analog of R's solve() function.
val XtX = drmXtX.collect
val Xty = drmXty.collect(::, 0)
val betaHat = solve(XtX, Xty)
We have implemented distributed linear regression! (A real implementation would also need to add a bias term.)
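As a small follow-up (not on the slides), the fit can be checked by predicting the ratings with the estimated parameters and comparing them to the observed ones; this assumes the Mahout Spark shell with the DSL preloaded, as in the tutorial linked at the end, and reuses only operations introduced above:

// predict ratings (distributed), fetch the result in-core, compute the sum of squared errors
val fittedY = (drmX %*% betaHat).collect(::, 0)
val errors = y - fittedY
val sse = errors dot errors
println(s"sum of squared errors: $sse")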

Overview. Apache Mahout: Past & Future; A DSL for Machine Learning; Example; Under the covers; Distributed computation of XᵀX

Underlying systems. Currently: a prototype on Apache Spark, a fast and expressive cluster computing system with general computation graphs, in-memory primitives, a rich API and an interactive shell. Future: add Apache Flink, a database-inspired distributed processing engine that emerged from research by TU Berlin, HU Berlin and HPI; functionality similar to Apache Spark, adds data flow optimization and efficient out-of-core execution.

Runtime & Optimization. Execution is deferred: the user composes logical operators, and computational actions implicitly trigger optimization (= selection of a physical plan) and execution, e.g.
val C = X.t %*% X
I.writeDrm(path)
val inMemV = (U %*% M).collect
Optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths.
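To make the deferred execution concrete, here is a minimal sketch continuing the regression example (assuming the Mahout Spark shell): composing operators only builds a logical plan; the cluster does no work until an action such as collect is called.

val drmXtX = drmX.t %*% drmX        // builds a logical plan, nothing is executed yet
val drmCached = drmXtX.checkpoint() // marks the intermediate result for reuse
val XtX = drmCached.collect         // action: the optimizer picks a physical plan and runs it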

Optimization Example. Computation of XᵀX in the example: val drmXtX = drmX.t %*% drmX. Naïve execution: a 1st pass transposes X (requires repartitioning of X), a 2nd pass multiplies the result with X (expensive, potentially requires repartitioning again). Logical optimization: the optimizer rewrites the plan, replacing the Transpose and MatrixMult operators with a single specialized logical operator for Transpose-Times-Self matrix multiplication.

Transpose-Times-Self. Mahout computes XᵀX via a row-outer-product formulation: XᵀX = Σ_i x_i x_iᵀ, the sum of the outer products of the m rows x_i of X. This executes in a single pass over the row-partitioned X. (The original slides animate the accumulation x_0 x_0ᵀ + x_1 x_1ᵀ + x_2 x_2ᵀ + x_3 x_3ᵀ + x_4 x_4ᵀ row by row.)
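As a plain-Scala illustration of the row-outer-product idea (a local toy, not Mahout's distributed operator), summing the outer products of the rows yields the same matrix as forming XᵀX directly:

object RowOuterProductSketch {
  type Matrix = Array[Array[Double]]

  // outer product x * x^T of a single row
  def outer(x: Array[Double]): Matrix =
    Array.tabulate(x.length, x.length)((i, j) => x(i) * x(j))

  // elementwise sum of two equally sized matrices
  def add(a: Matrix, b: Matrix): Matrix =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (u, v) => u + v } }

  // single pass over the rows, accumulating the outer products
  def gram(rows: Seq[Array[Double]]): Matrix =
    rows.map(outer).reduce(add)

  def main(args: Array[String]): Unit = {
    val X = Seq(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
    gram(X).foreach(row => println(row.mkString(" ")))  // prints 35 44 / 44 56
  }
}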

Overview. Apache Mahout: Past & Future; A DSL for Machine Learning; Example; Under the covers; Distributed computation of XᵀX

Physical operators for Transpose-Times-Self. Two physical operators (concrete implementations) are available for the Transpose-Times-Self operation: the standard operator AtA, and AtA_slim, a specialized implementation for tall & skinny matrices (many rows, few columns). The optimizer must choose between them; currently this depends on a user-defined threshold for the number of columns; ideally it would be a cost-based decision, dependent on estimates of intermediate result sizes.
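A caricature of that rule-based choice (the names and the threshold below are invented for illustration; the real decision is made inside Mahout's Spark bindings):

sealed trait AtAOperator
case object StandardAtA extends AtAOperator  // general operator
case object AtASlim extends AtAOperator      // for tall & skinny matrices

// pick the physical operator purely from the column count, as the current rule does
def chooseOperator(numCols: Int, slimThreshold: Int = 200): AtAOperator =
  if (numCols <= slimThreshold) AtASlim else StandardAtA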

Physical operator AtA. (Figure: X is split into row partitions held by worker 1 and worker 2. Each worker scans its rows and computes partial results for the 1st and the 2nd partition of the output; the partial results are then shuffled to worker 3 and worker 4, which sum them into the final blocks of XᵀX.)

Physical operator AtA_slim. (Figure: worker 1 and worker 2 each compute the partial product of their row partitions in memory; the driver then sums these small partial matrices to obtain XᵀX.)
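A local sketch of the AtA_slim idea (plain Scala standing in for the cluster; the real operator works on Mahout's matrix types and Spark partitions, and assumes the result fits into driver memory):

object AtASlimSketch {
  type Matrix = Array[Array[Double]]

  // a "worker" computes the Gram matrix of its row partition
  def localGram(rows: Seq[Array[Double]]): Matrix = {
    val n = rows.head.length
    val g = Array.fill(n, n)(0.0)
    for (x <- rows; i <- 0 until n; j <- 0 until n) g(i)(j) += x(i) * x(j)
    g
  }

  // elementwise sum of two equally sized matrices
  def add(a: Matrix, b: Matrix): Matrix =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (u, v) => u + v } }

  def main(args: Array[String]): Unit = {
    val partition1 = Seq(Array(1.0, 2.0), Array(3.0, 4.0))  // held by "worker 1"
    val partition2 = Seq(Array(5.0, 6.0))                   // held by "worker 2"
    val XtX = Seq(partition1, partition2).map(localGram).reduce(add)  // the "driver" sums
    XtX.foreach(row => println(row.mkString(" ")))          // same result as a single pass
  }
}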

Summary. MapReduce is outdated as an abstraction for distributed machine learning. An R/Matlab-like DSL allows a declarative implementation of algorithms. Programs written in this DSL are automatically compiled, optimized and parallelized. Execution on novel distributed engines like Apache Spark and Apache Flink.

Thank you. Questions? Tutorial for playing with the new Mahout DSL: http://mahout.apache.org/users/sparkbindings/play-with-shell.html Apache Flink Meetup in Berlin: http://www.meetup.com/apache-flink-meetup/ Follow me on Twitter: @sscdotopen