Actionable Mining of Multi-relational Data using Localized Predictive Models

1 Actionable Mining of Multi-relational Data using Localized Predictive Models. Joydeep Ghosh, Schlumberger Centennial Chaired Professor, The University of Texas at Austin (UT-ECE). Joint work with Meghana Deodhar and Aayush Sharma.

2 A Single (Affinity) Relation {Users} x {Movies} → ratings. DYADIC data. Limited side information. Missing not at random. [Figure: ratings matrix over time; rows Alice, Bob, Carol, Dave; columns Star Wars, Heat, Titanic, Batman; a few observed ratings (5, 3, 5, 1) and one entry to predict (5?)] Top solutions: ensembles of matrix factorizations (global) + collaborative filtering (local) (Scalable? Interpretable?) Netflix 2 would have had richer side-information.

3 Multi-Relational From Arindam Banerjee, Sugato Basu, Srujana Merugu, Multi-way Clustering on Relation Graphs, SDM 2007 and richer set of side information: Location, product hierarchies; social networks, 3

4 Overview Predicting Affinities between two (or more) sets of entities Leveraging a variety of extra information Dealing with predictive heterogeneity in large, multirelational data Multiple localized models Side benefits Hard vs. Soft Decompositions 4

5 Simultaneous Co-clustering and Learning (SCOAL) [Deodhar and Ghosh, TKDD 10] Simultaneous partitioning and prediction. [Figure: customer attributes and product attributes feed a predictive model over a matrix of known (1, -1) and unknown (?) entries] Exploits heterogeneity but has a chicken-and-egg problem.

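6 Regression Example [Figure: data matrix with axes c and p] Data matrix typically very sparse

7 Regression Example After rearranging rows and cols. [Figure: rearranged data matrix with axes c and p]

8 Regression Example [Figure: data matrix with axes c and p; one block labeled c + 2p]

9 Regression Example [Figure: data matrix with axes c and p; blocks labeled c + 2p, c + p, c + 3p, 5c + p]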

10 Reconstruction Errors Reconstructed with a single linear model z = c + 1.5p MSE = 21.8 Reconstructed with simultaneous co-clustering and regression MSE =

11 Linear Regression (Global) Rows i: e.g., customers/queries. Columns j: e.g., products/ads. Response $z_{ij}$: e.g., ratings, CTR. Covariates $x_{ij} = \{c_i, p_j, a_{ij}\}$: customer features, product features, joint features. Binary $w_{uv}$ associated with each matrix entry (0 = missing). Model parameters $\beta^T = [\beta_0, \beta_c^T, \beta_p^T, \beta_{c,p}^T]$. Global model: $z_{ij} = \beta^T x_{ij}$
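
The global model on this slide can be illustrated with a small weighted least-squares fit. This is only a sketch under stated assumptions: the function names `build_covariates` and `fit_global_model` are hypothetical, the covariates are limited to an intercept plus $c_i$ and $p_j$ (the joint features $a_{ij}$ are omitted), and a tiny ridge term is added purely for numerical stability.

```python
import numpy as np

def build_covariates(C, P):
    """Stack x_ij = [1, c_i, p_j] for every (customer i, product j) pair.

    C: (m, dc) customer features, P: (n, dp) product features.
    Returns an array of shape (m, n, 1 + dc + dp).
    """
    m, n = C.shape[0], P.shape[0]
    ones = np.ones((m, n, 1))
    Cb = np.broadcast_to(C[:, None, :], (m, n, C.shape[1]))
    Pb = np.broadcast_to(P[None, :, :], (m, n, P.shape[1]))
    return np.concatenate([ones, Cb, Pb], axis=-1)

def fit_global_model(X, Z, W, ridge=1e-6):
    """Weighted least squares over observed entries only (W == 0 means missing).

    The ridge value is an assumption of this sketch, not taken from the slides.
    """
    d = X.shape[-1]
    Xf = X.reshape(-1, d)
    wf = W.reshape(-1).astype(float)
    # Zero out placeholders in missing cells so they cannot leak into the fit.
    zf = np.where(wf > 0, Z.reshape(-1), 0.0)
    A = Xf.T @ (Xf * wf[:, None]) + ridge * np.eye(d)
    b = Xf.T @ (wf * zf)
    return np.linalg.solve(A, b)
```

With `X = build_covariates(C, P)`, a call such as `fit_global_model(X, Z, W)` returns a single coefficient vector shared by every cell of the matrix.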

12 Linear Regressions (Local) $\rho$: mapping from m rows to k row clusters. $\gamma$: mapping from n columns to l column clusters. Total of k * l regression models. Find the co-clustering $(\rho, \gamma)$ and models ($\beta$'s) that minimize the total (regularized) squared error $\sum_{u,v} w_{uv} (z_{uv} - \hat{z}_{uv})^2 + \text{penalty}$, where $\hat{z}_{uv} = \hat{\beta}_{\rho(u)\gamma(v)}^T x_{uv}$. Clustering is based on the prediction cost, not cluster homogeneity! Classification? Change the loss function and the predictive model.

13 SCOAL Meta-Algorithm
Input: Data: Z, Weights: W, Attributes: C, P
Output: Co-clustering (ρ, γ), models {β's}
Initialize ρ, γ
Iterate until convergence:
  Re-estimate the model for each co-cluster
  Re-estimate the co-clusters:
    Update row clusters: assign each row to the closest row cluster
    Update col clusters: assign each col to the closest col cluster
Return ρ, γ, {β's}
Guaranteed to converge to a locally optimal solution
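
The meta-algorithm can be sketched in NumPy for the squared-error case. This is a minimal illustration, not the authors' implementation. The names `scoal`, `total_error`, and `_ridge_fit`, the random initialization, the ridge penalty, and the iteration cap are all assumptions of this sketch. "Closest" is read here as the cluster whose models give the smallest weighted squared error for that row or column.

```python
import numpy as np

def _ridge_fit(Xs, zs, ridge):
    """Ridge-regularized least squares on the entries of one co-cluster."""
    d = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + ridge * np.eye(d), Xs.T @ zs)

def total_error(X, Z, W, rho, gamma, betas):
    """Weighted squared error of a co-clustering (rho, gamma) and its models."""
    pred = np.einsum("uvd,uvd->uv", X, betas[rho[:, None], gamma[None, :]])
    Zs = np.where(W > 0, Z, 0.0)
    return float((W * (Zs - pred) ** 2).sum())

def scoal(X, Z, W, k, l, n_iter=50, ridge=1e-3, seed=0):
    """X: (m, n, d) covariates, Z: (m, n) responses, W: (m, n) 0/1 weights."""
    rng = np.random.default_rng(seed)
    m, n, d = X.shape
    Zs = np.where(W > 0, Z, 0.0)            # neutralize missing-cell placeholders
    rho = rng.integers(k, size=m)           # row -> row cluster
    gamma = rng.integers(l, size=n)         # column -> column cluster
    betas = np.zeros((k, l, d))
    for _ in range(n_iter):
        # Re-estimate the model of each non-empty co-cluster.
        for g in range(k):
            for h in range(l):
                cells = (rho[:, None] == g) & (gamma[None, :] == h) & (W > 0)
                if cells.any():
                    betas[g, h] = _ridge_fit(X[cells], Zs[cells], ridge)
        # Row update: cost of placing row u in row cluster g, columns fixed.
        pred = np.einsum("uvd,gvd->uvg", X, betas[:, gamma, :])
        row_cost = (W[:, :, None] * (Zs[:, :, None] - pred) ** 2).sum(axis=1)
        new_rho = row_cost.argmin(axis=1)
        # Column update: cost of placing column v in column cluster h.
        pred = np.einsum("uvd,uhd->uvh", X, betas[new_rho, :, :])
        col_cost = (W[:, :, None] * (Zs[:, :, None] - pred) ** 2).sum(axis=0)
        new_gamma = col_cost.argmin(axis=1)
        converged = (np.array_equal(new_rho, rho)
                     and np.array_equal(new_gamma, gamma))
        rho, gamma = new_rho, new_gamma
        if converged:
            break
    return rho, gamma, betas
```

Calling `scoal(X, Z, W, 1, 1)` reduces to a single global model, which gives a convenient baseline for `total_error` when comparing against larger k and l.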

14 Preventing Overfits Regularize Local Models Smooth/share parameters across same row cluster Ditto for columns Shrinkage between global model and local models Model Selection: find the right # of local models e.g. use greedy bisecting algorithm (M-SCOAL) Gives better local minima and faster convergence! 14
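
One simple way to picture "shrinkage between global model and local models" is a convex blend whose weight depends on how much data a co-cluster has. The blending rule below, the function name `shrink_toward_global`, and the pseudo-count `tau` are assumptions made for illustration only; the slides do not specify the form of the shrinkage.

```python
import numpy as np

def shrink_toward_global(beta_local, beta_global, n_obs, tau=10.0):
    """Blend each co-cluster model with the global model.

    beta_local:  (k, l, d) local coefficients
    beta_global: (d,) global coefficients
    n_obs:       (k, l) number of observed entries per co-cluster
    tau:         hypothetical pseudo-count; larger values shrink harder
    """
    n_obs = np.asarray(n_obs, dtype=float)
    alpha = n_obs / (n_obs + tau)   # weight on the local model, in [0, 1)
    return alpha[:, :, None] * beta_local + (1.0 - alpha[:, :, None]) * beta_global
```

Under this rule a co-cluster with no observations falls back to the global model, while a data-rich co-cluster keeps nearly all of its local coefficients.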

15 Results (ERIM)
Algorithm | Test Error (MSE)
Global Model (k=1, l=1) | 4.24 (0.06)
CC (k=4, l=4) | (0.056)
Co-Cluster then predict (k=4, l=4) | (0.034)
Smoothened SCOAL (k=4, l=4) | (0.052)
M-SCOAL (k=1-2, l=4-6) | (0.035)
Also handily beats MLP, M5,

16 Market Segmentation and Structure (ERIM Data)
Coefficients of global model and sample co-cluster models
Attribute | Global | Cust Seg 3, Prod Seg 3 | Cust Seg 4, Prod Seg 4 | Cust Seg 1, Prod Seg 2
intercept | 0.00 (1.00) | (0.00) | (0.00) | (0.31)
income | (0.00) | (0.00) | 0.15 (0.00) | (0.42)
# members | 0.03 (0.00) | 0.03 (0.74) | 0.04 (0.00) | (0.06)
male head emp | (0.42) | (0.42) | 0.00 (0.87) | 0.05 (0.04)
female head emp | (0.62) | (0.31) | 0.00 (0.45) | 0.02 (0.46)
# visits | 0.02 (0.00) | (0.05) | 0.01 (0.06) | 0.11 (0.00)
total spent | 0.10 (0.00) | 0.48 (0.00) | 0.03 (0.00) | 0.09 (0.00)
price | (0.00) | (0.00) | (0.00) | 0.42 (0.00)
market share | 0.17 (0.00) | 0.43 (0.00) | 0.09 (0.00) | 0.16 (0.00)
# times advertised | 0.10 (0.00) | 0.48 (0.00) | 0.04 (0.00) | 0.04 (0.06)
Annotations: Cheapest, most popular products; Low market share; High income, large # visits

17 Leveraging multiple prediction models Interpretability Mining for the most certain predictions (KDD) Robust versions: dropping outlier rows/cols (ICML) Active learning Parallel implementation 17

18 Active Learning with SCOAL [Deodhar, Ghosh and Saar-Tsechansky, Information Systems Research 11] Idea: Acquire labels in regions with poor model fits Incrementally add models as more data available. Global vs. local techniques # Local models learnt by Hierarchical BlockRank 18

19 Issues with SCOAL Hard partitioning double-edged Mixed memberships may be more natural Cold-start problem (new user/product) Need to couple covariates with latent variables Need uniform way of dealing with other side information and missing data Social network Free text Concept hierarchies Solution: Probabilistic Generative models! 19

20 Latent Dirichlet Affinity Aware Bayesian Estimation [Figure: graphical model linking customers, models, and products] z1, z2 indicate cluster memberships (latent variables)

21 Generative Model Exponential family Generalized linear model 21

22 Incorporating a Social Network Neighborhood yields an MRF prior on the cluster memberships 22

23 Affinities Not Missing at Random W P(rated ; w=1) → Rating, y → W = binary variable (present/absent)

24 Key Properties of Inferencing Direct EM formulation intractable Variational Methods available, providing mean-field approximations Iterative; linear complexity per iteration Dynamic versions smoothen over time (Kalman Filtering etc). Non-parametric Bayesian versions possible, but need sampling. 24

25 Concluding Remarks Simultaneous Decomposition and Prediction (SDaP) Philosophy Simple, fast, interpretable, scalable Useful for many large-scale applications involving multi-relational data Predicting affinities Soft versions: Generative models capture richer info/constraints Still efficient but more opaque A lot more to be done! Explore-exploit strategies; sequential learning Multi-objective optimization and variety 25

26 References
D. Agarwal. Recommender Problems for Content Optimization. Talk at MMDS, June 10.
Meghana Deodhar and Joydeep Ghosh. SCOAL: A Framework for Simultaneous Co-clustering and Learning from Complex Data. ACM Transactions on Knowledge Discovery from Data, Oct 2010.
Meghana Deodhar and Joydeep Ghosh. Mining for the Most Certain Predictions from Dyadic Data. KDD 09.

27 Backups 27

28 ERIM Marketing Dataset Household panel data collected by A.C. Nielsen 1714 customers, 121 products from 6 product categories (ketchup, sugar, etc.) Customer-product matrix cell values = # units purchased Household Attributes income, # residents, male head employed, female head employed, total visits, total expense Product Attributes market share, price, # times product was advertised Properties Fairly Sparse 74.86% values are 0 Very skewed 99.12% values < 20, rest very high (outliers) 28

29 Simultaneous Co-clustering and Classification Elements of Z are class labels (2 class problem). Logistic regression model relating attributes to class label. Log odds modeled as a linear combination of the attributes ($= \beta^T x_{ij}$). Find $(\rho, \gamma)$ and ($\beta$'s) that minimize the total log loss $\sum_{u,v} w_{uv} \ln(1 + \exp(-z_{uv} \beta_{\rho(u)\gamma(v)}^T x_{uv}))$
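
For the classification variant, only the cost being minimized changes. The sketch below evaluates the total log loss for a given co-clustering. It assumes labels coded as -1/+1 and uses a hypothetical function name `total_log_loss`; fitting the per-co-cluster logistic models is not shown.

```python
import numpy as np

def total_log_loss(X, Z, W, rho, gamma, betas):
    """Sum over observed cells of ln(1 + exp(-z_uv * beta^T x_uv)).

    X: (m, n, d) covariates, Z: (m, n) labels in {-1, +1} (assumed coding),
    W: (m, n) 0/1 weights, betas: (k, l, d) one logistic model per co-cluster.
    """
    scores = np.einsum("uvd,uvd->uv", X, betas[rho[:, None], gamma[None, :]])
    margins = np.where(W > 0, Z, 0.0) * scores
    # logaddexp(0, -t) computes ln(1 + exp(-t)) without overflow.
    return float((W * np.logaddexp(0.0, -margins)).sum())
```

Row and column reassignment can then reuse the same per-row and per-column sums as in the regression case, with this loss in place of the squared error.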

30 Simultaneous grouping and topics (X. Wang 09)

31 Related Approaches for Modeling Dyadic Data Predictive discrete latent factor modeling [Agarwal and Merugu 07] Response variable modeled as a sum of 1. Function of covariates (global structure) 2. Co-cluster specific constant (local structure) Matrix factorization with side information Regression based latent factor modeling [Agarwal and Chen 09] Cold-start problem Regression based priors on latent user/item factors Spatio-temporal Kalman filtering [Lu et al. 09] MRF prior to capture spatial structure Temporal changes modeled by Kalman filtering 31

32 PDLF Model $p(z_{ij} \mid x_{ij}) = \sum_{I=1}^{k} \sum_{J=1}^{l} \pi_{IJ} f_\phi(z_{ij}; g(\mu_{ij}) = x_{ij}^T \beta + \delta_{IJ})$. Constrained mixture model: k * l components, $\pi_{IJ}$: mixture prior of the IJ-th component. Each component is a generalized linear model; $f_\phi$: exponential family, g: link function. Global trends $x_{ij}^T \beta$ shared across the components. Each co-cluster/latent factor has an additional offset: $\delta_{IJ}$

33 PDLF vs. SCOAL
PDLF: Single global model + co-cluster constants. Robust even when data is limited. $p(z_{ij} \mid x_{ij}, \rho, \gamma) = f_\phi(z_{ij}; \beta^T x_{ij} + \delta_{\rho(i),\gamma(j)}), \; [i]_1^m, [j]_1^n$
SCOAL: k x l co-cluster models. Works well when a large amount of data is available. $p(z_{ij} \mid x_{ij}, \rho, \gamma) = f_\phi(z_{ij}; \beta_{\rho(i),\gamma(j)}^T x_{ij}), \; [i]_1^m, [j]_1^n$
Complementary approaches

34 Active Learning Learner selects instances to be labeled such that the generalization accuracy is improved the most Example: Large scale recommender systems Require large number of customers to rate many products Obtaining ratings is expensive Solution: Select those customers and products to query whose feedback improves the model the most 34

35 Hierarchical BlockRank Learning multiple local models involves tuning a lot of parameters May not have enough data Single global model may do better when training data is limited Solution: Increase model complexity (# local models) as more labeled data is acquired Begin with a single co-cluster Perform model selection step every N acquisition steps Increase # co-clusters if validation set error reduces Use greedy, bisecting step to add one row/column cluster 35

36 Evaluation of the BlockRank Policy ERIM Marketing Dataset 36

37 Parallel SCOAL [Joint work with Clinton Jones] Expected to show maximum value on large datasets Large heterogeneity Sufficient data available for learning parameters Distributed implementation using Map-Reduce framework Developed on open source Hadoop platform Experiments on Amazon EC2 clusters SCOAL row/column cluster updates Parallel distance computation for each matrix entry Co-cluster model updates: Can be done in parallel 37

38 Parallel SCOAL Pseudo Code
1. Learn co-cluster models
   Map: <id, tuple> → <cc id, tuple>
   Reduce: <cc id, <tuple>> → learn cc model → <id, tuple>
2. Update row clusters
   Map: <id, tuple> → compute distance → <row id, tuple>
   Reduce: <row id, <tuple>> → agg. dist. and assign → <id, tuple>
3. Update col clusters
   Map: <id, tuple> → compute distance → <col id, tuple>
   Reduce: <col id, <tuple>> → agg. dist. and assign → <id, tuple>
Iterate
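
The row-cluster update step can be mimicked in plain Python to show the key/value flow. This is a single-process stand-in, not Hadoop code. The entry layout `(u, v, z, x)`, the dictionary of models keyed by `(row cluster, column cluster)`, and all function names are assumptions of this sketch.

```python
from collections import defaultdict

def map_row_distances(entry, gamma, betas, k):
    """Map: one observed matrix entry -> (row id, squared error per row cluster)."""
    u, v, z, x = entry
    errs = []
    for g in range(k):
        beta = betas[(g, gamma[v])]
        pred = sum(b * xi for b, xi in zip(beta, x))
        errs.append((z - pred) ** 2)
    return u, errs

def reduce_assign(row_id, err_lists):
    """Reduce: aggregate one row's errors and assign the cheapest row cluster."""
    totals = [sum(col) for col in zip(*err_lists)]
    best = min(range(len(totals)), key=totals.__getitem__)
    return row_id, best

def update_row_clusters(entries, gamma, betas, k):
    """Run map, group by row id (stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for entry in entries:
        u, errs = map_row_distances(entry, gamma, betas, k)
        grouped[u].append(errs)
    return dict(reduce_assign(u, e) for u, e in grouped.items())
```

The column update is symmetric: key the map output by column id and loop over candidate column clusters instead.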

39 Results MovieLens 100K MovieLens 1M Timing results with 10 Map, 1 Reduce job 39

40 Decoupled SCOAL Grid Partitioning with SCOAL Partitioning with D-SCOAL Global objective function requires synchronization at the end of each iteration 40

41 Future Directions Exploiting additional structural information Hierarchy/taxonomy on items Social network on users Spatial structure Sparse co-cluster models and feature selection Interaction between # models and model complexity Robust error functions Better at handling outliers Soft selection of data subset to be modeled 41

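42 Row and Column Cluster Updates Objective function is a sum of row/column errors. Assign each row to the row cluster that minimizes the row error. Row cluster assignment for row u: $\rho_{new}(u) = \arg\min_g \sum_{v=1}^{n} w_{uv} (z_{uv} - \hat{z}_{uv})^2$, where $\hat{z}_{uv} = \hat{\beta}_{g\gamma(v)}^T x_{uv}$ is the prediction when row u is placed in row cluster g. [Figure: row u scored against models β_11, β_12, β_21, β_22, β_31, β_32, giving row errors e_1(u), e_2(u), e_3(u)]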

43 Background Difficult classification/regression problems may involve a heterogeneous population Divide and conquer approach Partition the population and model each partition separately E.g., electric load forecasting Advantages Learning models on more homogeneous data Improves accuracy Simpler, more interpretable models 43

44 Motivation Traditionally partitioning is a priori Domain knowledge Clustering algorithm but, a priori partitioning may be suboptimal Solution: Interleaving partitioning and construction of prediction models Hard vs. soft partitioning solutions 44

45 Approaches to Interactive Decomposition Hard partitioning of input space for regression: CART [Breiman et al. 84], Model trees (M5') [Wang and Witten 97]. Soft partitioning of input space for regression: Mixture of Experts [Jacobs and Jordan 91], hierarchical versions, fuzzy clustering and modeling [Wedel and Steenkamp 89]. Output space partitioning for modeling a large number of classes: Binary Hierarchical Classifier [Kumar et al. 01]

46 Incremental Model Selection (M-SCOAL)
Top-down, bisecting, greedy algorithm
Run SCOAL with k=1, l=1
Repeat
  1. Split row/column cluster with highest error
  2. Initialize SCOAL with current partitioning
  3. Accept split if validation error is reduced
until no change in k and l
[Figure: co-cluster grid with axes products and customers]
Gives better local minimum. Fast convergence

47 Modeling Ordered Data [Deodhar and Ghosh, Data Mining for Design and Mkt. Workshop, ICDM 08] Dyadic data with implicit ordering along a mode E.g., temporal marketing data Segment the time axis Dynamic programming (quadratic) or greedy (linear) While clustering other entities 47
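
The dynamic-programming option for segmenting the time axis can be sketched as follows. To keep the example short, each segment is scored by the squared error around its own mean. That cost is an assumption of this sketch, since the slides fit predictive models per segment. The function name `segment_time_axis` is also hypothetical. For a fixed number of segments the run time is quadratic in the number of time steps.

```python
def segment_time_axis(values, n_segments):
    """Split values into n_segments contiguous pieces with minimal total SSE.

    Returns (boundaries, cost), where boundaries = [0, ..., len(values)] and
    segment s covers values[boundaries[s]:boundaries[s + 1]].
    """
    T = len(values)
    if not 1 <= n_segments <= T:
        raise ValueError("need 1 <= n_segments <= len(values)")
    # Prefix sums give the cost of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(a, b):
        """Squared error of values[a:b] around its mean (requires a < b)."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    inf = float("inf")
    best = [[inf] * (T + 1) for _ in range(n_segments + 1)]
    back = [[0] * (T + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for s in range(1, n_segments + 1):
        for t in range(s, T + 1):
            for a in range(s - 1, t):          # last segment is values[a:t]
                c = best[s - 1][a] + cost(a, t)
                if c < best[s][t]:
                    best[s][t], back[s][t] = c, a
    # Walk the back-pointers from the end to recover the boundaries.
    bounds, t = [T], T
    for s in range(n_segments, 0, -1):
        t = back[s][t]
        bounds.append(t)
    return bounds[::-1], best[n_segments][T]
```

A greedy alternative would place one boundary at a time, trading optimality for linear cost.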

48 Results on Tensor ERIM Dataset Dollars spent by households per week at 5 stores 1240 households X 50 weeks X 5 stores Algorithm Train R-sq. Test MSE Test R-sq. Global (1.27) 0.44 Cluster Models SCOAL (1.36) (0.99) CoSeg (0.88) 0.55 Prediction error averaged over 10 random % training-test data splits 48

49 Mining for the Most Reliable Predictions [Deodhar and Ghosh, KDD 09] Problem: Rank predictions by estimate of their accuracy E.g., options market player strategy SCOAL based ranking Row-Col Ranking: Rank by estimated mean row error + col error Block Ranking: Rank by co-cluster error 49
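
Both ranking schemes reduce to sorting the predicted cells by an error estimate. The sketch below assumes those estimates (per-co-cluster error, mean row error, mean column error) have already been computed, for example on training or validation data. The function names and argument layout are hypothetical.

```python
def block_ranking(cells, rho, gamma, block_error):
    """Rank (row, col) cells from most to least reliable by co-cluster error.

    block_error[g][h] is the estimated error of co-cluster (g, h).
    """
    return sorted(cells, key=lambda rc: block_error[rho[rc[0]]][gamma[rc[1]]])

def row_col_ranking(cells, row_error, col_error):
    """Rank cells by estimated mean row error plus mean column error."""
    return sorted(cells, key=lambda rc: row_error[rc[0]] + col_error[rc[1]])
```

Taking the first few cells of either ranking gives the predictions the model is estimated to be most certain about.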

50 Comparing Ranking Techniques: MovieLens 50