Map-Reduce for Machine Learning on Multicore



Map-Reduce for Machine Learning on Multicore
Chu et al.

Problem
The world is going multicore
New computers: dual-core to 12+-core
Shift to more concurrent programming paradigms and languages: Erlang, Scala, Occam, Go, among others
- The paper discusses other languages, but 1) I don't know any of them, and 2) these languages are more up to date and usable

Machine Learning Frameworks on Multicore
How to parallelize algorithms?
Highly specialized, non-obvious solutions
Often applicable to very few algorithms
Example 1: Caragea et al. - restricted to decision trees
Example 2: Jin and Agrawal - shared-memory machines only

Goal
Develop general techniques for parallelizing many machine learning algorithms
The goal is "throw more cores at it" optimization
Algorithms must fit the Statistical Query Model
- Rather than the traditional approach of algorithm-specific tweaks

Statistical Query Model (Kearns, 1998)
Permits the learning algorithm to access the learning problem only through a statistical oracle
The oracle returns the estimated expectation of f(x, y) over all data instances
- The expectation is averaged over the training/test distribution

Any algorithm that computes sufficient statistics can be expressed as a summation over all data points. This trait lends itself to batch summations over all data points.
Why do we care? Sufficient statistics: a statistic is sufficient for a family of probability distributions if the sample from which it is calculated gives no additional information, beyond the statistic itself, as to which of those probability distributions is that of the population from which the sample was taken:
$\Pr(X = x \mid T(X) = t, \Theta) = \Pr(X = x \mid T(X) = t)$
i.e. the conditional probability does not depend on the underlying parameter $\Theta$. (Source: Wikipedia article on sufficient statistics)

Multicoring Algorithms
Algorithms that sum over data points allow for partitioning:
Chunk the data into subgroups
Process each subgroup and aggregate the results

Example: Linear Least Squares
Fit the model $y = \theta^T x$.
Solve $\theta^* = \min_\theta \sum_{i=1}^{m} (\theta^T x_i - y_i)^2$.
Letting $X \in \mathbb{R}^{m \times n}$ be the instances and $\vec{y} = [y_1, \ldots, y_m]$ the target values, solve $\theta^* = (X^T X)^{-1} X^T \vec{y}$.

To summation form: a two-phase algorithm
1. Compute the sufficient statistics
2. Aggregate the statistics and solve
Solve to get $\theta^* = A^{-1} b$
- Compute the sufficient statistics by summing over the data

$A = X^T X$ and $b = X^T \vec{y}$, i.e.
$A = \sum_{i=1}^{m} \vec{x}_i \vec{x}_i^T \qquad b = \sum_{i=1}^{m} \vec{x}_i y_i$
Divide and conquer using the multiple cores.
What programming architecture can help us with this task?
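To make the two-phase summation form concrete, here is a minimal single-machine sketch (plain NumPy; not code from the paper, and the function names are illustrative): each "core" computes partial sums of A and b over its chunk, and the aggregated statistics yield theta.

import numpy as np

def partial_stats(X_chunk, y_chunk):
    # Sufficient statistics for this chunk: A_k = sum x_i x_i^T, b_k = sum x_i y_i
    A_k = X_chunk.T @ X_chunk
    b_k = X_chunk.T @ y_chunk
    return A_k, b_k

def lls_mapreduce(X, y, n_workers=4):
    chunks = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    partials = [partial_stats(Xc, yc) for Xc, yc in chunks]  # "map" phase
    A = sum(p[0] for p in partials)                          # "reduce" phase: aggregate
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)                             # theta = A^{-1} b

# Example: recover theta from noisy linear data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=1000)
print(lls_mapreduce(X, y))  # approximately [2, -1, 0.5]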

Map-Reduce
Google-inspired software framework
Specialized for big data on clusters
Based on functional programming
Map: give sub-problems to worker nodes
Reduce: combine all sub-problem results to obtain the answer
- The Map and Reduce functions are not the same as map and reduce in functional programming
- Nodes are cores in a single-machine implementation, and both machines and cores in a cluster setup

If a mapper/reducer requires scalar information, it can access it from the algorithm (steps 1.1.1.1 and 1.1.3.2)
You can have multiple reducers as well (load balancing and fault tolerance): ideal in a cluster setting, but you incur increased communication overhead
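A toy single-machine sketch of the pattern just described, using Python's multiprocessing pool to stand in for worker cores (illustrative only; real frameworks like Hadoop additionally handle distribution, shuffling, and fault tolerance):

from multiprocessing import Pool
from functools import reduce

def map_reduce(map_fn, reduce_fn, chunks, n_workers=4):
    with Pool(n_workers) as pool:
        partials = pool.map(map_fn, chunks)  # map: one sub-problem per worker
    return reduce(reduce_fn, partials)       # reduce: combine partial results

# Example: total word count across chunks of text
def count_words(chunk):
    return len(chunk.split())

if __name__ == "__main__":
    chunks = ["the quick brown fox", "jumps over", "the lazy dog"]
    print(map_reduce(count_words, lambda a, b: a + b, chunks))  # 9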

Example: k-means
Given a set of data points:
Compute initial k centroids
Cluster the data points to these centroids
Calculate the new centroids
Repeat the process until convergence
We will use Euclidean distance (easy to think about)

Left image: the dataset we want to run k-means on
Right image: clustering and calculating the new centroids (old centroids: squares; new centroids: triangles)

Example: k-means
Mappers:
Cluster: compute the distance between points and centroids
Centroid: split up the data, compute partial sums of the vectors
Reducers:
Cluster: choose the closest centroid and assign it to the data point
Centroid: aggregate the partial sums and calculate the new centroids
Iterate the clustering process... eventually it settles and we have our clusters
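A minimal sketch of one k-means iteration in the two-phase style described above (plain NumPy; function names are illustrative, not from the paper): mappers assign points to the nearest centroid and emit partial sums and counts; the reducer aggregates them into new centroids.

import numpy as np

def kmeans_map(points_chunk, centroids):
    k, d = centroids.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k, dtype=int)
    # Cluster phase: nearest centroid per point (Euclidean distance)
    dists = np.linalg.norm(points_chunk[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Centroid phase: partial sums per cluster
    for j in range(k):
        mask = assign == j
        sums[j] = points_chunk[mask].sum(axis=0)
        counts[j] = mask.sum()
    return sums, counts

def kmeans_reduce(partials, old_centroids):
    total_sums = sum(p[0] for p in partials)
    total_counts = sum(p[1] for p in partials)
    new = old_centroids.copy()
    nonempty = total_counts > 0  # keep old centroid if its cluster is empty
    new[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new

# One iteration over 4 chunks
rng = np.random.default_rng(1)
points = rng.normal(size=(400, 2))
centroids = points[:3].copy()  # initial k=3 centroids
partials = [kmeans_map(c, centroids) for c in np.array_split(points, 4)]
centroids = kmeans_reduce(partials, centroids)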

Can anyone think of a different way to run this mapping job?
Example:
Data point | Centroid | Distance
D1         | 1        | 0.023
D1         | 2        | 0.459
D1         | 3        | 0.724
Each image here represents a different map job
The map job is responsible for calculating the distance between each data point and the centroid specified for that mapper
Each point will have k different distances, one per centroid

Reduce Step (panels I-III)
The reduce step can take the distance values for each point and select the minimum

Map-Reduce (panels I-IV)
Different map jobs for calculating the new centroids
- Calculate the partial sum for each centroid
Reduce jobs aggregate and average these partial sums to form the new centroid points

Example: PCA
Computing the covariance matrix:
$\Sigma = \frac{1}{m}\left(\sum_{i=1}^{m} x_i x_i^T\right) - \mu \mu^T$, with $\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$
How would we map this? Reduce?
Ans: map each sum to one of the cores. How could we map it further?
Reduce: perform the final calculation (subtract $\mu \mu^T$)
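A minimal sketch of the covariance computation in summation form (plain NumPy; illustrative only): mappers emit partial sums of $x_i x_i^T$ and of $x_i$ over their chunks; the reducer aggregates both and performs the final subtraction of $\mu \mu^T$.

import numpy as np

def pca_map(X_chunk):
    # Partial sums: outer products and the mean accumulator
    return X_chunk.T @ X_chunk, X_chunk.sum(axis=0)

def pca_reduce(partials, m):
    S = sum(p[0] for p in partials)       # sum of x_i x_i^T
    mu = sum(p[1] for p in partials) / m  # mean vector
    return S / m - np.outer(mu, mu)       # Sigma = (1/m) sum x_i x_i^T - mu mu^T

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
partials = [pca_map(c) for c in np.array_split(X, 4)]
Sigma = pca_reduce(partials, len(X))
# Matches the direct computation (with the biased 1/m convention):
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))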

Results
Theoretical time complexity: a speedup of P on P cores
Experiments used two versions of each algorithm (sequential and parallelized)
For dual-core processors: at least a 1.9x speedup for the parallelized algorithms
Linear speedup with the number of cores; the measured speedup approximately equals P
A viable method for parallelizing machine learning algorithms
Many different datasets were used
The original algorithms don't utilize all CPU cycles
Sub-1 slope for the speedup curve - increasing communication overhead

Frameworks
Influenced a variety of researchers and developers:
Drost, Ingersoll - Apache Mahout
Ghoting et al. - IBM SystemML
Zaharia et al. - Berkeley Spark

Frameworks: Mahout
Build scalable machine learning libraries
Four use cases:
Recommendation - given a set of features for a user, find items that the user may like
Clustering - given a set of data, group them based on some set of features
Classification - assign unlabeled data based on patterns learned from training data
Frequent itemset mining - given items in a set, find the items most frequently associated with elements of that itemset
- Developed with Hadoop in mind, though

Still new: current release 0.5/0.6
Many algorithms proposed in the paper are yet to be implemented
The ones implemented scale well
Compatibility:
Works with or without Apache Hadoop
Works in single- or multi-node setups
Many clustering techniques available, some classification techniques, plus SVD, similarity, and recommenders, with others either open or awaiting integration

Benefits
Algorithms available: plug in data and play
(Relatively) easy to use with Apache Lucene for text analysis
Apache Lucene - full-text search framework
Term vectors created in Lucene can be passed to Mahout for processing (e.g. running an LDA job with the term vectors for topic modeling, or calculating pairwise document similarity with cosine similarity)

Compatibility: available to use right now, and open source
Detractions
Setting up Mahout is not always easy
What you want might not be implemented
Very young, and the code/usability shows it

Discussion
What are some of the drawbacks to Map-Reduce?

References:
http://en.wikipedia.org/wiki/mapreduce
http://en.wikipedia.org/wiki/naive_bayes_classifier
http://en.wikipedia.org/wiki/expectation-maximization_algorithm
http://en.wikipedia.org/wiki/sufficient_statistic
http://en.wikipedia.org/wiki/k-means_clustering
http://hadoop.apache.org/
http://mahout.apache.org/
https://researcher.ibm.com/researcher/files/us-ytian/systemml.pdf
http://www.spark-project.org/
http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf
K-means images generated in Matlab