Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook
2 Part 2: Mining using MapReduce
Mining algorithms using MapReduce:
- Information retrieval
- Graph algorithms: PageRank
- Clustering: canopy clustering, KMeans
- Classification: kNN, Naïve Bayes
MapReduce mining summary
3 MapReduce Interface and Data Flow
- Map: (k1, v1) -> list(k2, v2)
- Combine: (k2, list(v2)) -> list(k2, v2)
- Partition: (k2, v2) -> reducer_id
- Reduce: (k2, list(v2)) -> list(k3, v3)
[Figure: data flow across four hosts for an inverted-index job. Each host maps (id, doc) to list(w, id) and combines to list(unique_w, id); the partitioner routes each word to a reducer, which outputs (w, list(unique_id)).]
4 Information retrieval using MapReduce
5 IR: Distributed Grep
Find the doc_id and line# of every match of a pattern.
- Map: (id, doc) -> list(id, line#)
- Reduce: none
[Figure: grep for "data mining" over docs 1-6; Map1 outputs <1, 123>, Map2 outputs <3, 717> and <5, 1231>, Map3 outputs <6, 1012>.]
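As a minimal sketch (not from the slides), the map-only grep job can be simulated in plain Python; the function name `grep_map` and the sample document are illustrative:

```python
import re

def grep_map(doc_id, doc, pattern):
    # Map: (id, doc) -> list(id, line#) for every line matching the pattern.
    # There is no Reduce step; the map output is the final output.
    for line_no, line in enumerate(doc.splitlines(), 1):
        if re.search(pattern, line):
            yield (doc_id, line_no)

doc = "intro\nlarge-scale data mining\nconclusion"
matches = list(grep_map(1, doc, "data mining"))
# matches == [(1, 2)]
```

In a real Hadoop job each mapper would receive its own split of the input; no shuffle or reduce phase is needed here.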
6 IR: URL Access Frequency
- Map: (null, log) -> (URL, 1)
- Reduce: (URL, list(1)) -> (URL, total_count)
[Figure: Map1/Map2/Map3 emit <u1,1>, <u1,1>, <u2,1>, <u3,1>, <u3,1>; the reducer outputs <u1,2>, <u2,1>, <u3,2>.]
Also described in Part 1.
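The map/shuffle/reduce flow for this counting job can be sketched in-memory (an illustration, not the Hadoop implementation; `run_job` and the log format are assumptions for the example):

```python
from collections import defaultdict

def url_map(log_line):
    # Map: (null, log) -> (URL, 1); assume the URL is the first field of the line
    url = log_line.split()[0]
    yield (url, 1)

def url_reduce(url, counts):
    # Reduce: (URL, list(1)) -> (URL, total_count)
    yield (url, sum(counts))

def run_job(records, mapper, reducer):
    # Simulate the shuffle: group map output by key, then reduce each group
    groups = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):
            groups[k].append(v)
    out = {}
    for k, vs in groups.items():
        for rk, rv in reducer(k, vs):
            out[rk] = rv
    return out

logs = ["u1 GET /a", "u1 GET /b", "u2 GET /a", "u3 GET /c", "u3 GET /d"]
counts = run_job(logs, url_map, url_reduce)
# counts == {"u1": 2, "u2": 1, "u3": 2}
```

The same reducer could also serve as a combiner, since summing counts is associative and commutative.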
7 IR: Reverse Web-Link Graph
- Map: (null, page) -> (target, source)
- Reduce: (target, list(source)) -> (target, list(source))
[Figure: Map1 emits <t1,s2>; Map2 emits <t2,s3>, <t2,s5>; Map3 emits <t3,s5>; the reducer outputs <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>.]
It is the same as a matrix transpose.
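A compact sketch of this transpose, with the shuffle done by a dictionary (the page representation `{source: [targets]}` is an assumption for the example):

```python
from collections import defaultdict

def reverse_links(pages):
    # Map emits (target, source) for each outgoing link;
    # the grouped map output per target is already the Reduce output.
    out = defaultdict(list)
    for source, targets in pages.items():
        for target in targets:
            out[target].append(source)
    return {t: sorted(s) for t, s in out.items()}

pages = {"s2": ["t1"], "s3": ["t2"], "s5": ["t2", "t3"]}
links = reverse_links(pages)
# links == {"t1": ["s2"], "t2": ["s3", "s5"], "t3": ["s5"]}
```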
8 IR: Inverted Index
- Map: (id, doc) -> list(word, id)
- Reduce: (word, list(id)) -> (word, list(id))
[Figure: Map1 emits <w1,1>; Map2 emits <w2,2>, <w3,3>; Map3 emits <w1,5>; the reducer outputs <w1,[1,5]>, <w2,[2]>, <w3,[3]>.]
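A minimal in-memory version of the inverted-index job (whitespace tokenization and the sample docs are assumptions for illustration):

```python
from collections import defaultdict

def index_map(doc_id, doc):
    # Map: (id, doc) -> list(word, id); emit each unique word once per doc
    for word in set(doc.split()):
        yield (word, doc_id)

def build_inverted_index(docs):
    # Shuffle groups postings by word; Reduce just sorts each posting list
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in index_map(doc_id, text):
            postings[word].append(d)
    return {w: sorted(ids) for w, ids in postings.items()}

docs = {1: "big data mining", 2: "data clustering", 5: "big clusters"}
index = build_inverted_index(docs)
# index["data"] == [1, 2]; index["big"] == [1, 5]
```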
9 Graph mining using MapReduce
10 PageRank
The PageRank vector q is defined as
    q = c A^T q + ((1 - c) / N) e
where
- A is the source-by-destination adjacency matrix (row-normalized),
- e is the all-ones vector,
- N is the number of nodes,
- c is a weight between 0 and 1 (e.g. 0.85).
The first term models browsing along links; the second models teleporting to a random page. PageRank indicates the importance of a page.
Algorithm: iterative powering for finding the first eigenvector.
11 MapReduce: PageRank
PageRank Map()
- Input: key = page x, value = (PageRank q_x, outgoing links [y_1 ... y_m])
- Output: key = page, value = partial PageRank
1. Emit(x, 0)  // guarantee all pages will be emitted
2. For each outgoing link y_i: Emit(y_i, q_x / m)
PageRank Reduce()
- Input: key = page x, values = the list of partial values
- Output: key = page x, value = PageRank q_x
1. q_x = 0
2. For each partial value d in the list: q_x += d
3. q_x = c * q_x + (1 - c) / N
4. Emit(x, q_x)
Check out Kang et al., ICDM 09.
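One iteration of this job can be simulated in-memory as follows (a sketch under assumptions: the graph is a dict of outgoing links, every page has at least one link, and the driver reruns the job until convergence):

```python
from collections import defaultdict

def pagerank_map(page, rank, links):
    yield (page, 0.0)               # guarantee every page is emitted
    for target in links:
        yield (target, rank / len(links))

def pagerank_iteration(graph, ranks, c=0.85):
    n = len(graph)
    partials = defaultdict(float)
    for page, links in graph.items():
        for target, share in pagerank_map(page, ranks[page], links):
            partials[target] += share          # Reduce: sum partial values
    # q_x = c * sum(partials) + (1 - c) / N
    return {p: c * partials[p] + (1 - c) / n for p in graph}

graph = {1: [2, 3], 2: [3], 3: [1]}
ranks = {p: 1.0 / len(graph) for p in graph}
for _ in range(50):                  # driver loop: one MapReduce job per pass
    ranks = pagerank_iteration(graph, ranks)
# the ranks sum to ~1, and page 3 (linked from both 1 and 2) ranks highest
```

Each pass corresponds to one MapReduce job; the drawback, noted later in the taxonomy slide, is that the full rank vector must be rewritten to the filesystem between iterations.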
12 Clustering using MapReduce
13 Canopy: single-pass clustering
Canopy creation constructs overlapping clusters (canopies), using two thresholds T1 > T2:
- Make sure no two canopies overlap too much
- Key: no two canopy centers are too close to each other
McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD 00
14 Canopy creation
Input: 1) points, 2) thresholds T1, T2 where T1 > T2
Output: cluster centroids
  Put all points into a queue Q
  While Q is not empty:
    p = dequeue(Q)
    For each canopy c:
      if dist(p, c) < T1: c.add(p)
      if dist(p, c) < T2: strongBound = true
    If not strongBound: create a canopy at p
  For each canopy c:
    Set its centroid to the mean of all points in c
15-18 [Figures: the canopy-creation pseudocode above repeated over four animation slides, showing strongly marked points (within T2), the other points in each cluster (within T1), and the resulting canopy centers C1-C4.]
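The sequential canopy-creation pseudocode above can be sketched directly in Python (an illustration: 1-D points with absolute-difference distance, and each canopy's center fixed at the point that created it):

```python
def canopy_centers(points, t1, t2):
    # Sequential canopy creation; requires T1 > T2
    assert t1 > t2
    canopies = []                # each canopy is the list of its member points
    for p in points:
        strong_bound = False
        for c in canopies:
            center = c[0]        # the point that created this canopy
            if abs(p - center) < t1:
                c.append(p)      # p joins every canopy within T1 (overlap allowed)
            if abs(p - center) < t2:
                strong_bound = True
        if not strong_bound:
            canopies.append([p]) # create a new canopy at p
    # centroid of each canopy = mean of all its points
    return [sum(c) / len(c) for c in canopies]

centers = canopy_centers([1.0, 1.2, 5.0, 5.3, 9.0], t1=2.0, t2=0.5)
# three canopies form, around 1.1, 5.15 and 9.0
```

Because a point within T2 of an existing center never spawns a new canopy, no two centers end up closer than T2, which is the key property the slide states.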
19 MapReduce - Canopy Map()
Canopy creation Map()
- Input: a set of points P; thresholds T1, T2
- Output: key = null, value = a list of local canopies (total, count)
  For each p in P:
    For each canopy c:
      if dist(p, c) < T1: c.total += p; c.count++
      if dist(p, c) < T2: strongBound = true
    If not strongBound: create a canopy at p
  Close():
    For each canopy c: Emit(null, (c.total, c.count))
20 MapReduce - Canopy Reduce()
Reduce()
- Input: key = null, values = local canopies (total, count)
- Output: key = null, value = cluster centroids
  For each intermediate value:
    p = total / count
    For each canopy c:
      if dist(p, c) < T1: c.total += p; c.count++
      if dist(p, c) < T2: strongBound = true
    If not strongBound: create a canopy at p
  Close():
    For each canopy c: Emit(null, c.total / c.count)
Remark: for simplicity, this assumes only one reducer.
22 Clustering Assignment
Clustering assignment:
  For each point p:
    Assign p to the closest canopy center.
23 MapReduce: Cluster Assignment
Cluster assignment Map()
- Input: point p; cluster centroids
- Output: key = cluster id, value = point id
  currentDist = inf
  For each cluster centroid c:
    If dist(p, c) < currentDist: bestCluster = c; currentDist = dist(p, c)
  Emit(bestCluster, p)
Results can be written directly back to HDFS without a reducer, or an identity reducer can be applied to have the output sorted on cluster id.
24 KMeans: multi-pass clustering
Kmeans():
  While not converged:
    AssignCluster()
    UpdateCentroids()
AssignCluster(): for each point p, assign p to the closest centroid c
UpdateCentroids(): for each cluster, update the cluster center
25 MapReduce KMeans
KmeansIter():
- Map: assign each point p to the closest centroid
- Reduce: update each centroid with its new location (total, count)
Map(p)  // assign cluster
  For c in clusters:
    If dist(p, c) < minDist: minC = c; minDist = dist(p, c)
  Emit(minC.id, (p, 1))
Reduce()  // update centroids
  For all values (p, c):
    total += p; count += c
  Emit(key, (total, count))
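One KmeansIter() pass can be sketched in-memory as follows (an illustration with 1-D points and absolute-difference distance; the driver loop rerunning the job, and keeping a centroid that lost all its points, are assumptions):

```python
def kmeans_iter(points, centroids):
    # Map: assign each point p to its closest centroid, emitting (id, (p, 1));
    # Reduce (fused here): accumulate (total, count) per centroid id.
    sums = {i: 0.0 for i in range(len(centroids))}
    counts = {i: 0 for i in range(len(centroids))}
    for p in points:
        best = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        sums[best] += p
        counts[best] += 1
    # new centroid = total / count; keep the old centroid if it has no points
    return [sums[i] / counts[i] if counts[i] else centroids[i]
            for i in range(len(centroids))]

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids = [0.0, 10.0]
for _ in range(5):                    # driver: one MapReduce job per iteration
    centroids = kmeans_iter(points, centroids)
# the centroids converge to the two cluster means, 1.5 and 8.5
```

As with PageRank, only the small centroid list has to be shared across iterations, which is why KMeans is an acceptable (if not perfect) fit for MapReduce.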
27 Classification using MapReduce
28 MapReduce kNN
Map()
- Input: all local points; query point p
- Output: the k nearest neighbors (local)
  Emit the k closest points to p
Reduce()
- Input: key = null, values = local neighbors; query point p
- Output: the k nearest neighbors (global)
  Emit the k closest points to p among all local neighbors
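A minimal sketch of this two-stage selection (1-D points and absolute-difference distance are assumptions for illustration; each "split" plays the role of one mapper's input):

```python
import heapq

def knn_map(points, query, k):
    # Map: emit the k closest local points to the query
    return heapq.nsmallest(k, points, key=lambda p: abs(p - query))

def knn_reduce(local_lists, query, k):
    # Reduce: merge the local candidate lists, keep the global k closest
    merged = [p for lst in local_lists for p in lst]
    return heapq.nsmallest(k, merged, key=lambda p: abs(p - query))

split1, split2 = [1.0, 4.0, 9.0], [5.0, 6.0, 20.0]
query, k = 5.5, 2
local_candidates = [knn_map(s, query, k) for s in (split1, split2)]
neighbors = knn_reduce(local_candidates, query, k)
# the two nearest neighbors of 5.5 are 5.0 and 6.0
```

The correctness argument: any global nearest neighbor is also a local nearest neighbor on its split, so emitting only k candidates per mapper loses nothing.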
29 Naïve Bayes
Formulation:
    P(c|d) ∝ P(c) * Π_{w ∈ d} P(w|c)
where c is a class label, d is a doc, w is a word.
Parameter estimation:
- Class prior: P̂(c) = N_c / N, where N_c is the number of docs in class c and N is the total number of docs
- Conditional probability: P̂(w|c) = T_cw / Σ_w' T_cw', where T_cw is the number of occurrences of w in class c
[Figure: docs d1-d7 as term vectors, split into classes c1 (N_1 = 3) and c2 (N_2 = 4), with word-count histograms T_1w and T_2w.]
Goals:
1. total number of docs: N
2. number of docs in class c: N_c
3. word count histogram in c: T_cw
4. total word count in c: Σ_w T_cw
30 MapReduce: Naïve Bayes
Naïve Bayes can be implemented using MapReduce jobs of histogram computation.
ClassPrior()
  Map(doc): Emit(class_id, (doc_id, doc_length))
  Combine()/Reduce():
    N_c = 0; sT_cw = 0
    For each doc_id: N_c++; sT_cw += doc_length
    Emit(c, N_c)
ConditionalProbability()
  Map(doc):
    For each word w in doc: Emit(pair(c, w), 1)
  Combine()/Reduce():
    T_cw = 0
    For each value v: T_cw += v
    Emit(pair(c, w), T_cw)
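Both histogram jobs can be sketched in a few lines in-memory (an illustration: whitespace tokenization, no smoothing, and the `(class, text)` training format are assumptions):

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    # docs: list of (class_label, text); two counting "jobs":
    # ClassPrior counts docs per class, ConditionalProbability counts (c, w) pairs
    n_docs = len(docs)
    n_c = Counter(c for c, _ in docs)                          # N_c
    t_cw = Counter((c, w) for c, text in docs for w in text.split())  # T_cw
    total_c = defaultdict(int)                                 # sum_w' T_cw'
    for (c, _), cnt in t_cw.items():
        total_c[c] += cnt
    prior = {c: n_c[c] / n_docs for c in n_c}                  # P(c) = N_c / N
    cond = {cw: cnt / total_c[cw[0]] for cw, cnt in t_cw.items()}  # P(w|c)
    return prior, cond

docs = [("spam", "buy pills"), ("spam", "buy now"), ("ham", "meeting now")]
prior, cond = train_naive_bayes(docs)
# prior["spam"] == 2/3; cond[("spam", "buy")] == 2/4
```

In the distributed version each `Counter` corresponds to a shuffle-and-sum over emitted key/value pairs, exactly the histogram jobs on the slide.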
31 MapReduce Mining Summary
32 Taxonomy of MapReduce algorithms
                One iteration      Multiple iterations   Not good for MapReduce
Clustering      Canopy             KMeans
Classification  Naïve Bayes, kNN                         Gaussian mixture, SVM
Graphs                             PageRank
IR              Inverted index                           Topic modeling (PLSI, LDA)
- One-iteration algorithms are perfect fits.
- Multiple-iteration algorithms are OK fits, but small shared state has to be synchronized across iterations (typically through the filesystem).
- Some algorithms are not a good fit for the MapReduce framework: they typically require large shared state with a lot of synchronization. Traditional parallel frameworks like MPI are better suited for those.
33 MapReduce for machine learning algorithms
The key is to convert the algorithm into summation form (the Statistical Query Model [Kearns 94]):
    y = Σ_i f(x_i)
where f(x_i) corresponds to map() and Σ corresponds to reduce().
- Naïve Bayes: one MR job for P(c), one MR job for P(w|c)
- Kmeans: split data into subgroups and compute partial sums in Map(), then sum them up in Reduce()
Map-Reduce for Machine Learning on Multicore [NIPS 06]
34 Machine learning algorithms using MapReduce
Linear regression: ŷ = (X^T X)^{-1} X^T y, where X ∈ R^{m×n} and y ∈ R^m
- MR job 1: A = X^T X = Σ_i x_i x_i^T
- MR job 2: b = X^T y = Σ_i x_i y_i
- Finally, solve A ŷ = b
Locally weighted linear regression:
- MR job 1: A = Σ_i w_i x_i x_i^T
- MR job 2: b = Σ_i w_i x_i y_i
Map-Reduce for Machine Learning on Multicore [NIPS 06]
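The summation form is easy to see in code (a minimal sketch with a single scalar feature, so A and b are scalars and no linear solver is needed; the splits stand in for mapper inputs):

```python
def partial_sums(rows):
    # Map: one split emits its partial A = sum x_i*x_i and b = sum x_i*y_i
    a = sum(x * x for x, _ in rows)
    b = sum(x * y for x, y in rows)
    return a, b

def linreg(splits):
    # Reduce: add the partial sums, then solve A * theta = b (scalar case)
    a = sum(p[0] for p in (partial_sums(s) for s in splits))
    # recompute per split only once in practice; kept simple here
    b = sum(partial_sums(s)[1] for s in splits)
    return b / a

splits = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
theta = linreg(splits)
# for y = 2x the fitted coefficient is exactly 2.0
```

With vector features, each mapper would instead emit a partial d×d matrix and a partial d-vector, and the final solve would use the normal equations.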
35 Machine learning algorithms using MapReduce (cont.)
- Logistic regression
- Neural networks
- PCA
- ICA
- EM for Gaussian mixture models
Map-Reduce for Machine Learning on Multicore [NIPS 06]
36 MapReduce Mining Resources
37 Mahout: Hadoop data mining library
Mahout: scalable data mining libraries, mostly implemented on Hadoop.
Data structures for vectors and matrices:
Vectors
- Dense vectors as double[]
- Sparse vectors as HashMap<Integer, Double>
- Operations: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross
Matrices
- Dense matrix as a double[][]
- SparseRowMatrix or SparseColumnMatrix as a Vector[], holding the rows or columns of the matrix as SparseVectors
- SparseMatrix as a HashMap<Integer, Vector>
- Operations: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum
38 MapReduce Mining Papers
- [Chu et al., NIPS 06] Map-Reduce for Machine Learning on Multicore: a general framework under MapReduce
- [Papadimitriou et al., ICDM 08] DisCo: Distributed Co-clustering with Map-Reduce: co-clustering
- [Kang et al., ICDM 09] PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations: graph algorithms
- [Das et al., WWW 07] Google News Personalization: Scalable Online Collaborative Filtering: PLSI EM
- [Grossman and Gu, KDD 08] Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere: an alternative to Hadoop which supports wide-area data collection and distribution
39 Summary: algorithms
Best for MapReduce:
- Single pass; keys are uniformly distributed
OK for MapReduce:
- Multiple passes; intermediate state is small
Bad for MapReduce:
- Key distribution is skewed
- Fine-grained synchronization is required
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
Advanced Data Management Technologies
ADMT 2015/16 Unit 15 J. Gamper 1/53 Advanced Data Management Technologies Unit 15 MapReduce J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements: Much of the information
What s Cooking in KNIME
What s Cooking in KNIME Thomas Gabriel Copyright 2015 KNIME.com AG Agenda Querying NoSQL Databases Database Improvements & Big Data Copyright 2015 KNIME.com AG 2 Querying NoSQL Databases MongoDB & CouchDB
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science
Data Intensive Computing CSE 486/586 Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING Masters in Computer Science University at Buffalo Website: http://www.acsu.buffalo.edu/~mjalimin/
Big Data. Lyle Ungar, University of Pennsylvania
Big Data Big data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. McKinsey Data Scientist: The Sexiest Job of the 21st Century -
MapReduce for Statistical NLP/Machine Learning
MapReduce for Statistical NLP/Machine Learning Sameer Maskey Week 7, October 17, 2012 1 Topics for Today MapReduce Running MapReduce jobs in Amazon Hadoop MapReduce for Machine Learning Naïve Bayes, K-Means,
Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
DATA CLUSTERING USING MAPREDUCE
DATA CLUSTERING USING MAPREDUCE by Makho Ngazimbi A project submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science Boise State University March 2009
CloudRank-D:A Benchmark Suite for Private Cloud Systems
CloudRank-D:A Benchmark Suite for Private Cloud Systems Jing Quan Institute of Computing Technology, Chinese Academy of Sciences and University of Science and Technology of China HVC tutorial in conjunction
Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data
Client Based Power Iteration Clustering Algorithm to Reduce Dimensionalit in Big Data Jaalatchum. D 1, Thambidurai. P 1, Department of CSE, PKIET, Karaikal, India Abstract - Clustering is a group of objects
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
