Actionable Mining of Multi-relational Data using Localized Predictive Models

Size: px
Start display at page:

Download "Actionable Mining of Multi-relational Data using Localized Predictive Models"

Transcription

1 Actionable Mining of Multi-relational Data using Localized Predictive Models Joydeep Ghosh Schlumberger Centennial Chaired Professor The University of Texas at Austin joint work with Meghana Deodhar and Aayush Sharma Joydeep Ghosh UT-ECE 1

2 A Single (Affinity) Relation {Users} x {Movies} à ratings DYADIC data Limited side information Missing not at random Time Star Wars Heat Titani c Batm an Alice Bob 5 3 Carol Dave 5 1 5? Top solutions: Ensembles of matrix factorizations (global) + collaborative filtering (local) (scalable? Interpretable?) Netflix 2 would have had richer side-information. 2

3 Multi-Relational From Arindam Banerjee, Sugato Basu, Srujana Merugu, Multi-way Clustering on Relation Graphs, SDM 2007 and richer set of side information: Location, product hierarchies; social networks, 3

4 Overview Predicting Affinities between two (or more) sets of entities Leveraging a variety of extra information Dealing with predictive heterogeneity in large, multirelational data Multiple localized models Side benefits Hard vs. Soft Decompositions 4

5 Simultaneous Co-clustering and Learning (SCOAL) [Deodhar and Ghosh, TKDD 10] Product attributes Simultaneous partitioning and prediction Customer attributes Predictive Model 1 1? ? -1 1? exploits heterogeneity but has chicken and egg problem. 5

6 c Regression Example p Data matrix typically very sparse 6

7 Regression Example After rearranging rows and cols. p c

8 Regression Example c + 2p c p

9 Regression Example c + 2p p c + p c c + 3p 5c + p 9

10 Reconstruction Errors Reconstructed with a single linear model z = c + 1.5p MSE = 21.8 Reconstructed with simultaneous co-clustering and regression MSE =

11 Linear Regression (Global) j e.g., products/ads i e.g., customers/ queries z ij covariates x ij = {c i, p j, a ij } e.g., ratings, CTR customer features product features joint features Binary w uv associated with each matrix entry (0=missing) Model parameters β T = [β 0, β ct, β p T, β c,pt ] Global model zij = β T x ij 11

12 Linear Regressions (Local) ρ: Mapping from m rows to k row clusters γ: Mapping from n columns to l column clusters Total of k * l regression models. Find co-clustering (ρ, γ) and models (β s) that minimize the total (regularized) squared error 2 wuv( zuv zuv) + penalty u, v zˆ uv = ˆ T β ρ ( u) γ ( v) x Clustering Based on the prediction cost, not cluster homogeneity! uv Classification? Change loss function; predictive model. 12

13 SCOAL Meta-Algorithm Input: Data: Z, Weights: W, Attributes: C, P Output: Co-clustering (ρ, γ), models {βs}β} Initialize ρ, γ Iterate until convergence Re-estimate model for each co-cluster Re-estimate the co-clusters Update row clusters: Assign each row to closest row cluster Update col clusters: Assign each col to closest col cluster Return ρ, γ, {βs}β} Guaranteed to converge to locally optimal solution 13

14 Preventing Overfits Regularize Local Models Smooth/share parameters across same row cluster Ditto for columns Shrinkage between global model and local models Model Selection: find the right # of local models e.g. use greedy bisecting algorithm (M-SCOAL) Gives better local minima and faster convergence! 14

15 Results (ERIM) Algorithm Test Error (MSE) Global Model (k=1,l=1) 4.24 (0.06) CC (k=4,l=4) (0.056) Co-Cluster then predict (k=4, l=4) (0.034) Smoothened SCOAL (k=4,l=4) (0.052) M-SCOAL (k=1-2,l=4-6) (0.035) Also handily beats MLP, M5, 15

16 Market Segmentation and Structure (ERIM Data) Coefficients of global model and sample co-cluster models Attribute Global Cust Seg 3, Prod Seg 3 Cust Seg 4, Prod Seg 4 Cust Seg 1, Prod Seg 2 intercept income 0.00 (1.00) (0.00) (0.00) (0.31) (0.00) (0.00) 0.15 (0.00) (0.42) # members 0.03 (0.00) 0.03 (0.74) 0.04 (0.00) (0.06) male head emp (0.42) (0.42) 0.00 (0.87) 0.05 (0.04) female head emp (0.62) (0.31) 0.00 (0.45) 0.02 (0.46) # visits 0.02 (0.00) (0.05) 0.01 (0.06) 0.11 (0.00) total spent 0.10 (0.00) 0.48 (0.00) 0.03 (0.00) 0.09 (0.00) price (0.00) (0.00) (0.00) 0.42 (0.00) market share 0.17 (0.00) 0.43 (0.00) 0.09 (0.00) 0.16 (0.00) # times advertised 0.10 (0.00) 0.48 (0.00) 0.04 (0.00) 0.04 (0.06) Cheapest, most popular products Low market share High income, large # visits 16

17 Leveraging multiple prediction models Interpretability Mining for the most certain predictions (KDD) Robust versions: dropping outlier rows/cols (ICML) Active learning Parallel implementation 17

18 Active Learning with SCOAL [Deodhar, Ghosh and Saar-Tsechansky, Information Systems Research 11] Idea: Acquire labels in regions with poor model fits Incrementally add models as more data available. Global vs. local techniques # Local models learnt by Hierarchical BlockRank 18

19 Issues with SCOAL Hard partitioning double-edged Mixed memberships may be more natural Cold-start problem (new user/product) Need to couple covariates with latent variables Need uniform way of dealing with other side information and missing data Social network Free text Concept hierarchies Solution: Probabilistic Generative models! 19

20 Latent Dirichlet Affinity Aware Bayesian Estimation customers models products z1, z2 indicate cluster memberships (latent variables) 20

21 Generative Model Exponential family Generalized linear model 21

22 Incorporating a Social Network Neighborhood yields an MRF prior on the cluster memberships 22

23 Affinities Not Missing at Random W P(rated ; w=1) à Rating, y à W = binary variable (present/absent) 23

24 Key Properties of Inferencing Direct EM formulation intractable Variational Methods available, providing mean-field approximations Iterative; linear complexity per iteration Dynamic versions smoothen over time (Kalman Filtering etc). Non-parametric Bayesian versions possible, but need sampling. 24

25 Concluding Remarks Simultaneous Decomposition and Prediction (SDaP) Philosophy Simple, fast, interpretable, scalable Useful for many large-scale applications involving multi-relational data Predicting affinities Soft versions: Generative models capture richer info/constraints Still efficient but more opaque A lot more to be done! Explore-exploit strategies; sequential learning Multi-objective optimization and variety 25

26 References D. Agarwal, Recommender Problems for Content Optimization, Talk at MMDS, June 10. Meghana Deodhar and Joydeep Ghosh SCOAL: A Framework for Simultaneous Co-clustering and Learning from Complex Data ACM Transactions on Knowledge Discovery from Data, Oct 2010 Meghana Deodhar and Joydeep Ghosh Mining for the Most Certain Predictions from Dyadic Data. KDD 09 26

27 Backups 27

28 ERIM Marketing Dataset Household panel data collected by A.C. Nielsen 1714 customers, 121 products from 6 product categories (ketchup, sugar, etc.) Customer-product matrix cell values = # units purchased Household Attributes income, # residents, male head employed, female head employed, total visits, total expense Product Attributes market share, price, # times product was advertised Properties Fairly Sparse 74.86% values are 0 Very skewed 99.12% values < 20, rest very high (outliers) 28

29 Simultaneous Co-clustering and Classification Elements of Z are class labels (2 class problem) Logistic regression model relating attributes to class label Log odds modeled as a linear combination of the attributes (= β T x ij ) Find (ρ, γ) and (β s) that minimize the total log loss u, v w uv ln( 1+ exp( z uv β T ρ ( u) γ ( v) xuv)) 29

30 Simultaneous grouping and topics (X. Wang 09)

31 Related Approaches for Modeling Dyadic Data Predictive discrete latent factor modeling [Agarwal and Merugu 07] Response variable modeled as a sum of 1. Function of covariates (global structure) 2. Co-cluster specific constant (local structure) Matrix factorization with side information Regression based latent factor modeling [Agarwal and Chen 09] Cold-start problem Regression based priors on latent user/item factors Spatio-temporal Kalman filtering [Lu et al. 09] MRF prior to capture spatial structure Temporal changes modeled by Kalman filtering 31

32 PDLF Model p( z ij x ij ) k l = I = 1 J = 1 T ij π fφ ( z ; g( µ ) = x β + δ IJ ij ij IJ ) Constrained mixture model k * l components, π IJ : mixture prior of IJ th component Each component is a generalized linear model f φ : exponential family, g: link function Global trends x ijt β shared across the components Each co-cluster/latent factor has an additional offset: δ IJ 32

33 PDLF vs. SCOAL PDLF Single global model + co-cluster constants Robust even when data is limited SCOAL T m n p( zij xij, ρ, γ ) = fφ ( zij; β xij + δ ρ ( i), γ ( j) ),[ i] 1,[ j] 1 k x l co-cluster models Works well when large amount of data is available p ( T m n zij xij, ρ, γ ) = fφ ( zij; β ρ xij),[ i] j i j 1,[ ] ( ), γ ( ) 1 Complementary approaches 33

34 Active Learning Learner selects instances to be labeled such that the generalization accuracy is improved the most Example: Large scale recommender systems Require large number of customers to rate many products Obtaining ratings is expensive Solution: Select those customers and products to query whose feedback improves the model the most 34

35 Hierarchical BlockRank Learning multiple local models involves tuning a lot of parameters May not have enough data Single global model may do better when training data is limited Solution: Increase model complexity (# local models) as more labeled data is acquired Begin with a single co-cluster Perform model selection step every N acquisition steps Increase # co-clusters if validation set error reduces Use greedy, bisecting step to add one row/column cluster 35

36 Evaluation of the BlockRank Policy ERIM Marketing Dataset 36

37 Parallel SCOAL [Joint work with Clinton Jones] Expected to show maximum value on large datasets Large heterogeneity Sufficient data available for learning parameters Distributed implementation using Map-Reduce framework Developed on open source Hadoop platform Experiments on Amazon EC2 clusters SCOAL row/column cluster updates Parallel distance computation for each matrix entry Co-cluster model updates: Can be done in parallel 37

38 Parallel SCOAL Pseudo Code 1. Learn co-cluster models 2. Update row clusters 3. Update col clusters Map Reduce Map Reduce Map Reduce <id, tuple> <cc id, tuple> <cc id, <tuple>> learn cc model <id, tuple> <id, tuple> compute distance <row id, tuple> <row id, <tuple>> agg. dist. and assign <id, tuple> <id, tuple> compute distance <col id, tuple> <col id, <tuple>> agg. dist. and assign <id, tuple> iterate 38

39 Results MovieLens 100K MovieLens 1M Timing results with 10 Map, 1 Reduce job 39

40 Decoupled SCOAL Grid Partitioning with SCOAL Partitioning with D-SCOAL Global objective function requires synchronization at the end of each iteration 40

41 Future Directions Exploiting additional structural information Hierarchy/taxonomy on items Social network on users Spatial structure Sparse co-cluster models and feature selection Interaction between # models and model complexity Robust error functions Better at handling outliers Soft selection of data subset to be modeled 41

42 Row and Column Cluster Updates Objective function is a sum of row/column errors Assign each row to a row cluster that minimizes the row error Row cluster assignment for row u ρ new ( u) = arg min g n v= 1 w uv ( z uv zˆ uv 2 ) u β 11 β 12 β 21 β 22 β 31 β 32 e 1 (u) e 2 (u) e 3 (u) 42

43 Background Difficult classification/regression problems may involve a heterogeneous population Divide and conquer approach Partition the population and model each partition separately E.g., electric load forecasting Advantages Learning models on more homogeneous data Improves accuracy Simpler, more interpretable models 43

44 Motivation Traditionally partitioning is a priori Domain knowledge Clustering algorithm but, a priori partitioning may be suboptimal Solution: Interleaving partitioning and construction of prediction models Hard vs. soft partitioning solutions 44

45 Approaches to Interactive Decomposition Hard partitioning of input space for regression CART [Brieman et al. 84] Model trees (M5') [Wang and Witten 97] Soft partitioning of input space for regression Mixture of Experts [Jacobs and Jordan 91] Hierarchical versions Fuzzy clustering and modeling [Wedel and Steenkamp 89] Output space partitioning for modeling large number of classes Binary Hierarchical Classifier [Kumar et al. 01] 45

46 Incremental Model Selection (M-SCOAL) Top down, bisecting, greedy algorithm Run SCOAL with k=1, l=1 Repeat 1. Split row/column cluster with highest error 2. Initialize SCOAL with current partitioning 3. Accept split if validation error is reduced until no change in k and l products customers Gives better local minimum Fast convergence 46

47 Modeling Ordered Data [Deodhar and Ghosh, Data Mining for Design and Mkt. Workshop, ICDM 08] Dyadic data with implicit ordering along a mode E.g., temporal marketing data Segment the time axis Dynamic programming (quadratic) or greedy (linear) While clustering other entities 47

48 Results on Tensor ERIM Dataset Dollars spent by households per week at 5 stores 1240 households X 50 weeks X 5 stores Algorithm Train R-sq. Test MSE Test R-sq. Global (1.27) 0.44 Cluster Models SCOAL (1.36) (0.99) CoSeg (0.88) 0.55 Prediction error averaged over 10 random % training-test data splits 48

49 Mining for the Most Reliable Predictions [Deodhar and Ghosh, KDD 09] Problem: Rank predictions by estimate of their accuracy E.g., options market player strategy SCOAL based ranking Row-Col Ranking: Rank by estimated mean row error + col error Block Ranking: Rank by co-cluster error 49

50 Comparing Ranking Techniques: MovieLens 50

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Response prediction using collaborative filtering with hierarchies and side-information

Response prediction using collaborative filtering with hierarchies and side-information Response prediction using collaborative filtering with hierarchies and side-information Aditya Krishna Menon 1 Krishna-Prasad Chitrapura 2 Sachin Garg 2 Deepak Agarwal 3 Nagaraj Kota 2 1 UC San Diego 2

More information

Collaborative Filtering. Radek Pelánek

Collaborative Filtering. Radek Pelánek Collaborative Filtering Radek Pelánek 2015 Collaborative Filtering assumption: users with similar taste in past will have similar taste in future requires only matrix of ratings applicable in many domains

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

The Need for Training in Big Data: Experiences and Case Studies

The Need for Training in Big Data: Experiences and Case Studies The Need for Training in Big Data: Experiences and Case Studies Guy Lebanon Amazon Background and Disclaimer All opinions are mine; other perspectives are legitimate. Based on my experience as a professor

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook Part 2:Mining using MapReduce Mining algorithms using MapReduce

More information

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014 Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

More information

L3: Statistical Modeling with Hadoop

L3: Statistical Modeling with Hadoop L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Predictive Discrete Latent Factor Models for Large Scale. Dyadic data

Predictive Discrete Latent Factor Models for Large Scale. Dyadic data Predictive Discrete Latent Factor Models for Large Scale Dyadic Data Deepak Agarwal, Srujana Merugu Yahoo! Research Sunnyvale,CA,USA {dagarwal,srujana}@yahoo-inc.com ABSTRACT We propose a novel statistical

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12 MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

Multi-way Clustering on Relation Graphs

Multi-way Clustering on Relation Graphs Multi-way Clustering on Relation Graphs Arindam Banerjee Sugato Basu Srujana Merugu Abstract A number of real-world domains such as social networks and e-commerce involve heterogeneous data that describes

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Bayesian Factorization Machines

Bayesian Factorization Machines Bayesian Factorization Machines Christoph Freudenthaler, Lars Schmidt-Thieme Information Systems & Machine Learning Lab University of Hildesheim 31141 Hildesheim {freudenthaler, schmidt-thieme}@ismll.de

More information

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014 Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

More information

MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

More information

How To Perform An Ensemble Analysis

How To Perform An Ensemble Analysis Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

A Comparative Study of the Pickup Method and its Variations Using a Simulated Hotel Reservation Data

A Comparative Study of the Pickup Method and its Variations Using a Simulated Hotel Reservation Data A Comparative Study of the Pickup Method and its Variations Using a Simulated Hotel Reservation Data Athanasius Zakhary, Neamat El Gayar Faculty of Computers and Information Cairo University, Giza, Egypt

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of

More information

Collective Behavior Prediction in Social Media. Lei Tang Data Mining & Machine Learning Group Arizona State University

Collective Behavior Prediction in Social Media. Lei Tang Data Mining & Machine Learning Group Arizona State University Collective Behavior Prediction in Social Media Lei Tang Data Mining & Machine Learning Group Arizona State University Social Media Landscape Social Network Content Sharing Social Media Blogs Wiki Forum

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

A Data Generator for Multi-Stream Data

A Data Generator for Multi-Stream Data A Data Generator for Multi-Stream Data Zaigham Faraz Siddiqui, Myra Spiliopoulou, Panagiotis Symeonidis, and Eleftherios Tiakas University of Magdeburg ; University of Thessaloniki. [siddiqui,myra]@iti.cs.uni-magdeburg.de;

More information

Recommendations in Mobile Environments. Professor Hui Xiong Rutgers Business School Rutgers University. Rutgers, the State University of New Jersey

Recommendations in Mobile Environments. Professor Hui Xiong Rutgers Business School Rutgers University. Rutgers, the State University of New Jersey 1 Recommendations in Mobile Environments Professor Hui Xiong Rutgers Business School Rutgers University ADMA-2014 Rutgers, the State University of New Jersey Big Data 3 Big Data Application Requirements

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry Hadoop Operations Management for Big Data Clusters in Telecommunication Industry N. Kamalraj Asst. Prof., Department of Computer Technology Dr. SNS Rajalakshmi College of Arts and Science Coimbatore-49

More information

Statistical machine learning, high dimension and big data

Statistical machine learning, high dimension and big data Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,

More information

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges A Basic Guide to Modeling Techniques for All Direct Marketing Challenges Allison Cornia Database Marketing Manager Microsoft Corporation C. Olivia Rud Executive Vice President Data Square, LLC Overview

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d. EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

More information

CS570 Data Mining Classification: Ensemble Methods

CS570 Data Mining Classification: Ensemble Methods CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Data mining techniques: decision trees

Data mining techniques: decision trees Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Predicting Flight Delays

Predicting Flight Delays Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the

More information

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

More information

Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

The primary goal of this thesis was to understand how the spatial dependence of

The primary goal of this thesis was to understand how the spatial dependence of 5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Big Data Techniques Applied to Very Short-term Wind Power Forecasting

Big Data Techniques Applied to Very Short-term Wind Power Forecasting Big Data Techniques Applied to Very Short-term Wind Power Forecasting Ricardo Bessa Senior Researcher (ricardo.j.bessa@inesctec.pt) Center for Power and Energy Systems, INESC TEC, Portugal Joint work with

More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

More information

Why do statisticians "hate" us?

Why do statisticians hate us? Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop SNS. renren.com. Saturday, December 3, 11 Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December

More information

Probabilistic Latent Semantic Analysis (plsa)

Probabilistic Latent Semantic Analysis (plsa) Probabilistic Latent Semantic Analysis (plsa) SS 2008 Bayesian Networks Multimedia Computing, Universität Augsburg Rainer.Lienhart@informatik.uni-augsburg.de www.multimedia-computing.{de,org} References

More information

Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher

Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher Outline 2 1. Retrospection 2. Stratosphere Plans 3. Comparison with Hadoop 4. Evaluation 5. Outlook Retrospection

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Inductive Learning in Less Than One Sequential Data Scan

Inductive Learning in Less Than One Sequential Data Scan Inductive Learning in Less Than One Sequential Data Scan Wei Fan, Haixun Wang, and Philip S. Yu IBM T.J.Watson Research Hawthorne, NY 10532 {weifan,haixun,psyu}@us.ibm.com Shaw-Hwa Lo Statistics Department,

More information

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

More information

Joint models for classification and comparison of mortality in different countries.

Joint models for classification and comparison of mortality in different countries. Joint models for classification and comparison of mortality in different countries. Viani D. Biatat 1 and Iain D. Currie 1 1 Department of Actuarial Mathematics and Statistics, and the Maxwell Institute

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

L25: Ensemble learning

L25: Ensemble learning L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information