Large-scale Data Mining: MapReduce and Beyond
Part 2: Algorithms
Spiros Papadimitriou, IBM Research
Jimeng Sun, IBM Research
Rong Yan, Facebook
Part 2: Mining using MapReduce
Mining algorithms using MapReduce:
- Information retrieval
- Graph algorithms: PageRank
- Clustering: Canopy clustering, KMeans
- Classification: kNN, Naïve Bayes
MapReduce Mining Summary
MapReduce Interface and Data Flow
- Map: (k1, v1) → list(k2, v2)
- Combine: (k2, list(v2)) → list(k2, v2)
- Partition: (k2, v2) → reducer_id
- Reduce: (k2, list(v2)) → list(k3, v3)
[Figure: data flow across four hosts for an inverted-index example: (id, doc) → Map → list(w, id) → Combine → list(unique_w, id) → Partition → Reduce → (w, list(unique_id)).]
Information retrieval using MapReduce
IR: Distributed Grep
Find the doc_id and line# of every line matching a pattern.
- Map: (id, doc) → list(id, line#)
- Reduce: none (map-only job)
[Figure: grepping for "data mining" over docs 1-6; Map1 emits <1, 123>, Map2 emits <3, 717> and <5, 1231>, Map3 emits <6, 1012>.]
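A minimal Hadoop Streaming-style mapper in Python (a sketch, not from the original slides; the tab-separated doc_id/line#/text record layout and the fixed pattern are assumptions for illustration):

    import sys
    import re

    PATTERN = re.compile(r"data mining")  # pattern to grep for (assumed fixed here)

    # Mapper: for each input record "doc_id<TAB>line_no<TAB>text",
    # emit "doc_id<TAB>line_no" when the text matches the pattern.
    # No reducer is needed; the matches are the final output.
    for record in sys.stdin:
        doc_id, line_no, text = record.rstrip("\n").split("\t", 2)
        if PATTERN.search(text):
            print(f"{doc_id}\t{line_no}")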
IR: URL Access Frequency
- Map: (null, log) → list(URL, 1)
- Reduce: (URL, list(1)) → (URL, total_count)
[Figure: three mappers emit <u1,1>, <u1,1>, <u2,1>, <u3,1>, <u3,1>; the reducer outputs <u1,2>, <u2,1>, <u3,2>.]
Also described in Part 1.
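A minimal Python sketch of this word-count-style job (not from the slides; the assumption that the URL is the first whitespace-separated field of each log line is for illustration only):

    from itertools import groupby

    def mapper(log_lines):
        # Emit (URL, 1) for each request in the log.
        for line in log_lines:
            url = line.split()[0]
            yield (url, 1)

    def reducer(sorted_pairs):
        # The shuffle delivers pairs sorted by URL, so consecutive
        # records with the same URL can be summed with groupby.
        for url, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
            yield (url, sum(count for _, count in group))

    # Usage (sorting simulates the shuffle):
    logs = ["u1 ...", "u1 ...", "u2 ...", "u3 ...", "u3 ..."]
    print(list(reducer(sorted(mapper(logs)))))   # [('u1', 2), ('u2', 1), ('u3', 2)]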
IR: Reverse Web-Link Graph
- Map: (null, page) → list(target, source)
- Reduce: (target, list(source)) → (target, list(source))
[Figure: mappers emit <t1,s2>, <t2,s3>, <t2,s5>, <t3,s5>; the reducer outputs <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>.]
Note: this is the same operation as transposing the link matrix.
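A minimal in-memory simulation in Python (a sketch; it assumes pages arrive pre-parsed as (source, list-of-targets) pairs rather than raw HTML):

    from collections import defaultdict

    def mapper(pages):
        # pages: iterable of (source, [targets]); emit (target, source).
        for source, targets in pages:
            for target in targets:
                yield (target, source)

    def reducer(pairs):
        # Group the shuffled (target, source) pairs into (target, [sources]),
        # i.e., a transpose of the link matrix, as the slide notes.
        result = defaultdict(list)
        for target, source in pairs:
            result[target].append(source)
        return dict(result)

    # Usage, matching the slide's example:
    pages = [("s2", ["t1"]), ("s3", ["t2"]), ("s5", ["t2", "t3"])]
    print(reducer(mapper(pages)))   # {'t1': ['s2'], 't2': ['s3', 's5'], 't3': ['s5']}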
IR: Inverted Index
- Map: (id, doc) → list(word, id)
- Reduce: (word, list(id)) → (word, list(id))
[Figure: mappers emit <w1,1>, <w2,2>, <w3,3>, <w1,5>; the reducer outputs <w1,[1,5]>, <w2,[2]>, <w3,[3]>.]
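A minimal Python sketch of the inverted-index job (not from the slides; whitespace tokenization is an assumption):

    from itertools import groupby

    def mapper(docs):
        # docs: iterable of (doc_id, text); emit (word, doc_id),
        # once per document (set() deduplicates repeated words).
        for doc_id, text in docs:
            for word in set(text.split()):
                yield (word, doc_id)

    def reducer(sorted_pairs):
        # After the shuffle, pairs arrive sorted by word; collect doc-id lists.
        for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
            yield (word, sorted(doc_id for _, doc_id in group))

    # Usage (sorting simulates the shuffle):
    docs = [(1, "w1"), (2, "w2"), (3, "w3"), (5, "w1")]
    print(list(reducer(sorted(mapper(docs)))))   # [('w1', [1, 5]), ('w2', [2]), ('w3', [3])]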
Graph mining using MapReduce
PageRank
PageRank indicates the importance of a page. The PageRank vector q is defined as
  q = c A^T q + ((1 - c)/N) e
where the first term models browsing along links and the second models teleporting to a random page:
- A is the source-by-destination adjacency matrix
- e is the all-ones vector
- N is the number of nodes
- c is a weight between 0 and 1 (e.g., 0.85)
Algorithm: iterative powering to find the first eigenvector.
[Figure: example graph on nodes 1-4 with adjacency matrix A = (0 1 1 1; 0 0 1 1; 0 0 0 1; 0 0 1 0).]
MapReduce: PageRank
PageRank Map()  // distribute PageRank q_i
  Input: key = page x, value = (PageRank q_x, out-links [y_1 ... y_m])
  Output: key = page, value = partial PageRank contribution
  1. Emit(x, 0)  // guarantee all pages will be emitted
  2. For each outgoing link y_i: Emit(y_i, q_x / m)
PageRank Reduce()  // update new PageRank
  Input: key = page x, values = list of partial contributions
  Output: key = page x, value = new PageRank q_x
  1. q_x = 0
  2. For each partial value d in the list: q_x += d
  3. q_x = c * q_x + (1 - c)/N
  4. Emit(x, q_x)
Check out Kang et al., ICDM '09.
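A minimal single-machine simulation of this iteration in Python (a sketch, not the slides' implementation; the 4-node graph is the example reconstructed above, and c = 0.85 follows the earlier slide):

    from collections import defaultdict

    C, N = 0.85, 4                       # damping factor and number of pages
    links = {1: [2, 3, 4], 2: [3, 4], 3: [4], 4: [3]}   # example adjacency
    q = {x: 1.0 / N for x in links}      # initial PageRank vector

    def pagerank_iteration(links, q):
        partials = defaultdict(float)
        # Map: each page x distributes q_x evenly over its m out-links,
        # and "emits" (x, 0) so pages with no in-links still appear.
        for x, outs in links.items():
            partials[x] += 0.0
            for y in outs:
                partials[y] += q[x] / len(outs)
        # Reduce: sum the partial contributions and apply teleporting.
        return {x: C * s + (1 - C) / N for x, s in partials.items()}

    for _ in range(30):                  # iterate until (approximately) converged
        q = pagerank_iteration(links, q)
    print(q)

In the real system each iteration is one MapReduce job, with the new vector written back to the filesystem between jobs.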
Clustering using MapReduce
Canopy: single-pass clustering
Canopy creation constructs overlapping clusters (canopies), while making sure that no two canopies overlap too much. Key idea: no two canopy centers may be too close to each other, enforced with two distance thresholds T1 > T2.
[Figure: canopies C1-C4 with thresholds T1 and T2; overlapping clusters are allowed, but too much overlap is not.]
McCallum, Nigam and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching," KDD '00
Canopy creation (a runnable sketch follows below)
Input: (1) points; (2) thresholds T1, T2 with T1 > T2
Output: cluster centroids
  Put all points into a queue Q
  While Q is not empty:
    p = dequeue(Q)
    strongbound = false
    For each canopy c:
      if dist(p, c) < T1: c.add(p)
      if dist(p, c) < T2: strongbound = true
    If not strongbound: create a new canopy at p
  For each canopy c:
    Set its centroid to the mean of all points in c
McCallum, Nigam and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching," KDD '00
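A minimal Python sketch of this sequential pass (not from the slides; Euclidean distance and the toy data are assumptions). Distances are measured to the point at which each canopy was created:

    import math

    def canopy_centers(points, t1, t2):
        # Single pass: a point within T2 of an existing canopy is "strongly
        # bound" and spawns no new canopy; within T1 it joins the canopy.
        canopies = []                      # each canopy: [creation point, members]
        for p in points:
            strongly_bound = False
            for c in canopies:
                d = math.dist(p, c[0])
                if d < t1:
                    c[1].append(p)
                if d < t2:
                    strongly_bound = True
            if not strongly_bound:
                canopies.append([p, [p]])
        # Final centroids: coordinate-wise mean of each canopy's members.
        return [tuple(sum(x) / len(m) for x in zip(*m)) for _, m in canopies]

    # Usage:
    pts = [(0, 0), (0.5, 0.2), (0.1, 0.3), (5, 5), (5.2, 4.9), (9, 9)]
    print(canopy_centers(pts, t1=3.0, t2=1.0))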
MapReduce - Canopy Map()
Canopy creation Map()
Input: a set of points P; thresholds T1, T2
Output: key = null; value = a list of local canopies as (total, count)
  Map(P):
    For each p in P:
      strongbound = false
      For each canopy c:
        if dist(p, c) < T1: c.total += p; c.count++
        if dist(p, c) < T2: strongbound = true
      If not strongbound: create a new canopy at p
  Close():
    For each canopy c: Emit(null, (c.total, c.count))
McCallum, Nigam and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching," KDD '00
MapReduce - Canopy Reduce()
Input: key = null; values = local canopies as (total, count)
Output: key = null; value = cluster centroids
  Reduce(values):
    For each intermediate value (total, count):
      p = total / count
      strongbound = false
      For each canopy c:
        if dist(p, c) < T1: c.total += p; c.count++
        if dist(p, c) < T2: strongbound = true
      If not strongbound: create a new canopy at p
  Close():
    For each canopy c: Emit(null, c.total / c.count)
Remark: for simplicity, this assumes a single reducer.
McCallum, Nigam and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching," KDD '00
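A minimal end-to-end simulation in Python (a sketch under assumptions: it reuses the canopy_centers helper from the earlier sketch, and it passes local centroids directly instead of (total, count) pairs, which is equivalent since the reducer only uses total/count):

    def canopy_map(points_split, t1, t2):
        # Each mapper runs canopy creation on its own split and, in Close(),
        # emits its local canopy centroids.
        return canopy_centers(points_split, t1, t2)

    def canopy_reduce(all_local_centroids, t1, t2):
        # The single reducer re-clusters the local centroids into global ones.
        return canopy_centers(all_local_centroids, t1, t2)

    # Usage: two "mappers", one "reducer".
    split1 = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0)]
    split2 = [(0.1, 0.3), (5.2, 4.9), (9.0, 9.0)]
    local = canopy_map(split1, 3.0, 1.0) + canopy_map(split2, 3.0, 1.0)
    print(canopy_reduce(local, 3.0, 1.0))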
Cluster Assignment
For each point p: assign p to the closest canopy center.
McCallum, Nigam and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching," KDD '00
MapReduce: Cluster Assignment
Cluster assignment Map()
Input: point p; cluster centroids
Output: key = cluster id; value = point id
  currentdist = inf
  For each cluster centroid c:
    If dist(p, c) < currentdist: bestCluster = c; currentdist = dist(p, c)
  Emit(bestCluster, p)
Results can be written directly back to HDFS without a reducer, or an identity reducer can be applied to get the output sorted by cluster id.
McCallum, Nigam and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching," KDD '00
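A minimal Python sketch of this map-only job (not from the slides; Euclidean distance is an assumption, and centroid ids stand in for centers as keys):

    import math

    def assign_clusters(points, centroids):
        # For each point, emit (closest centroid id, point). With no reducer
        # the pairs go straight to HDFS; an identity reducer would additionally
        # sort the output by cluster id.
        for p in points:
            best = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            yield (best, p)

    # Usage:
    print(list(assign_clusters([(0, 1), (8, 8)], [(0, 0), (9, 9)])))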
KMeans: Multi-pass clustering
Traditional Kmeans():
  While not converged:
    AssignCluster()
    UpdateCentroids()
AssignCluster(): for each point p, assign p to the closest centroid c.
UpdateCentroids(): for each cluster, update the cluster center to the mean of its points.
MapReduce KMeans
KmeansIter(): one MapReduce job per iteration, starting from the initial centroids.
Map(p)  // assign each point p to the closest centroid
  For each centroid c in clusters:
    If dist(p, c) < mindist: minC = c; mindist = dist(p, c)
  Emit(minC.id, (p, 1))
Reduce()  // update each centroid to its new location
  For all values (p, n) in the list: total += p; count += n
  Emit(key, (total, count))  // new centroid = total / count
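A minimal single-machine simulation of one KMeans iteration in Python (a sketch, not the slides' implementation; Euclidean distance and the toy data are assumptions):

    import math
    from collections import defaultdict

    def kmeans_iteration(points, centroids):
        # Map: assign each point to the closest centroid, emitting (id, (p, 1));
        # the per-cluster (total, count) sums below play the role of the
        # shuffle plus an optional combiner.
        sums = defaultdict(lambda: (None, 0))
        for p in points:
            i = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            total, count = sums[i]
            total = p if total is None else tuple(a + b for a, b in zip(total, p))
            sums[i] = (total, count + 1)
        # Reduce: each new centroid is total / count (empty clusters vanish).
        return [tuple(x / n for x in t) for t, n in (sums[i] for i in sorted(sums))]

    # Driver: run one "job" per pass until converged (fixed count here).
    points = [(0, 0), (0, 1), (9, 9), (8, 9)]
    centroids = [(0, 0), (9, 9)]
    for _ in range(10):
        centroids = kmeans_iteration(points, centroids)
    print(centroids)   # approximately [(0.0, 0.5), (8.5, 9.0)]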
Classification using MapReduce
MapReduce kNN
Map()
  Input: all local points; query point p
  Output: the k nearest neighbors (local)
  Emit the k closest local points to p
Reduce()
  Input: key = null; values = local neighbors; query point p
  Output: the k nearest neighbors (global)
  Emit the k closest points to p among all local neighbors
[Figure: k = 3 example with two map splits of labeled points; each mapper proposes its 3 closest points to the query, and a single reducer keeps the global top 3.]
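A minimal Python sketch of this job (not from the slides; Euclidean distance and the toy splits are assumptions):

    import heapq
    import math

    def knn_map(local_points, query, k):
        # Each mapper emits the k points in its split closest to the query.
        return heapq.nsmallest(k, local_points, key=lambda p: math.dist(p, query))

    def knn_reduce(candidate_lists, query, k):
        # Single reducer: merge all local candidates and keep the global top k.
        merged = [p for cand in candidate_lists for p in cand]
        return heapq.nsmallest(k, merged, key=lambda p: math.dist(p, query))

    # Usage with two splits and k = 3:
    q = (0.0, 0.0)
    split1 = [(1, 1), (4, 4), (0, 2)]
    split2 = [(0.5, 0.5), (3, 0), (6, 6)]
    print(knn_reduce([knn_map(split1, q, 3), knn_map(split2, q, 3)], q, 3))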
Naïve Bayes
Formulation: P(c|d) ∝ P(c) ∏_{w∈d} P(w|c), where c is a class label, d is a doc, and w is a word.
Parameter estimation:
- Class prior: P̂(c) = N_c / N, where N_c is the number of docs in class c and N is the total number of docs
- Conditional probability: P̂(w|c) = T_cw / Σ_{w'} T_cw', where T_cw is the number of occurrences of w in class c
Goals:
1. total number of docs: N
2. number of docs in class c: N_c
3. word count histogram in class c: T_cw
4. total word count in class c: Σ_w T_cw
[Figure: docs d1-d7 as term vectors, with c1: N_1 = 3, c2: N_2 = 4 and per-class counts T_1w, T_2w.]
MapReduce: Naïve Bayes
Naïve Bayes can be implemented as MapReduce jobs that compute histograms:
ClassPrior()
  Map(doc): Emit(class_id, (doc_id, doc_length))
  Combine()/Reduce():
    N_c = 0; total_length = 0
    For each (doc_id, doc_length): N_c++; total_length += doc_length
    Emit(c, N_c)  // total_length gives the total word count in c
ConditionalProbability()
  Map(doc):
    For each word w in doc: Emit(pair(c, w), 1)
  Combine()/Reduce():
    T_cw = 0
    For each value v: T_cw += v
    Emit(pair(c, w), T_cw)
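A minimal Python sketch of these counting jobs (not from the slides; the pre-tokenized (class, words) document format is an assumption):

    from collections import Counter

    def class_prior_map(docs):
        # Emit (class, 1) per labeled doc; docs: iterable of (class, [words]).
        for c, _ in docs:
            yield (c, 1)

    def cond_prob_map(docs):
        # Emit ((class, word), 1) for every word occurrence.
        for c, words in docs:
            for w in words:
                yield ((c, w), 1)

    def histogram_reduce(pairs):
        # Shared combiner/reducer: sum the counts for each key.
        counts = Counter()
        for key, v in pairs:
            counts[key] += v
        return counts

    # Usage: the two histogram jobs yield N_c and T_cw; the priors
    # P(c) = N_c / N and conditionals P(w|c) = T_cw / sum_w' T_cw' follow.
    docs = [("c1", ["a", "b"]), ("c1", ["a"]), ("c2", ["b", "b"])]
    print(histogram_reduce(class_prior_map(docs)))
    print(histogram_reduce(cond_prob_map(docs)))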
MapReduce Mining Summary
Taxonomy of MapReduce algorithms

                   One iteration      Multiple iterations   Not good for MapReduce
  Clustering       Canopy             KMeans
  Classification   Naïve Bayes, kNN   Gaussian Mixture      SVM
  Graphs / IR      Inverted Index     PageRank              Topic modeling (PLSI, LDA)

- One-iteration algorithms are perfect fits.
- Multiple-iteration algorithms are OK fits, but small amounts of shared state have to be synchronized across iterations (typically through the filesystem).
- Some algorithms are not a good fit for the MapReduce framework: they typically require a large amount of shared state with a lot of synchronization, and traditional parallel frameworks like MPI are better suited for them.
MapReduce for machine learning algorithms
The key is to convert the algorithm into summation form (the Statistical Query model [Kearns 94]):
  y = Σ_i f(x_i)
where computing f(x_i) corresponds to map() and the summation Σ corresponds to reduce().
- Naïve Bayes: one MR job for P(c), one MR job for P(w|c)
- KMeans: one MR job that splits the data into subgroups and computes partial sums in Map(), then sums them up in Reduce()
Map-Reduce for Machine Learning on Multicore [NIPS '06]
Machine learning algorithms using MapReduce
Linear regression: θ̂ = (X^T X)^{-1} X^T y, where X ∈ R^{m×n} and y ∈ R^m
- MR job 1: A = X^T X = Σ_i x_i x_i^T
- MR job 2: b = X^T y = Σ_i x_i y_i
- Finally, solve A θ̂ = b
Locally weighted linear regression:
- MR job 1: A = Σ_i w_i x_i x_i^T
- MR job 2: b = Σ_i w_i x_i y_i
Map-Reduce for Machine Learning on Multicore [NIPS '06]
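A minimal Python sketch of the two summation jobs plus the final solve (not from the slides; NumPy and the toy dataset are assumptions):

    import numpy as np

    def linreg_map(rows):
        # Each mapper computes partial sums of A = X^T X and b = X^T y
        # over its split; rows: iterable of (x_i, y_i) with x_i a 1-D array.
        A, b = 0, 0
        for x, y in rows:
            A = A + np.outer(x, x)
            b = b + y * x
        return A, b

    def linreg_reduce(partials):
        # Reducer adds the partial sums and solves A theta = b.
        A = sum(p[0] for p in partials)
        b = sum(p[1] for p in partials)
        return np.linalg.solve(A, b)

    # Usage with two splits of a toy dataset where y = 2x + 1 exactly:
    rows = [(np.array([1.0, x]), 2 * x + 1) for x in range(8)]
    theta = linreg_reduce([linreg_map(rows[:4]), linreg_map(rows[4:])])
    print(theta)   # approximately [1, 2]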
Machine learning algorithms using MapReduce (cont.)
- Logistic Regression
- Neural Networks
- PCA
- ICA
- EM for Gaussian Mixture Model
Map-Reduce for Machine Learning on Multicore [NIPS '06]
MapReduce Mining Resources
Mahout: Hadoop data mining library
Mahout: http://lucene.apache.org/mahout/
Scalable data mining libraries, mostly implemented on Hadoop, with data structures for vectors and matrices.
Vectors
- Dense vectors as double[]
- Sparse vectors as HashMap<Integer, Double>
- Operations: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross
Matrices
- Dense matrix as double[][]
- SparseRowMatrix or SparseColumnMatrix as Vector[], holding the rows or columns of the matrix as SparseVectors
- SparseMatrix as HashMap<Integer, Vector>
- Operations: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum
MapReduce Mining Papers
- [Chu et al., NIPS '06] Map-Reduce for Machine Learning on Multicore: a general framework under MapReduce
- [Papadimitriou et al., ICDM '08] DisCo: Distributed Co-clustering with Map-Reduce: co-clustering
- [Kang et al., ICDM '09] PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations: graph algorithms
- [Das et al., WWW '07] Google News Personalization: Scalable Online Collaborative Filtering: PLSI EM
- [Grossman and Gu, KDD '08] Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere: an alternative to Hadoop that supports wide-area data collection and distribution
Summary: algorithms
- Best for MapReduce: single pass; keys are uniformly distributed
- OK for MapReduce: multiple passes; intermediate state is small
- Bad for MapReduce: key distribution is skewed, or fine-grained synchronization is required