Big Data Analytics. Lucas Rego Drumond


Slide 1: Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science, University of Hildesheim, Germany
Going For Large Scale Application Scenario: Recommender Systems (1 / 44)
Slide 2: Outline

1. ADMM Continued
2. Recommender Systems
3. Traditional Recommendation Approaches
   3.1. Nearest Neighbor Approaches
   3.2. Factorization Models
Slide 3: Outline, section 1. ADMM Continued
Slide 4 (1. ADMM Continued): Applying ADMM to predictive problems

\min_{\beta^{(i)},\alpha} \sum_{i=1}^{N} \sum_{(x,y)\in D_i^{train}} l(y,\hat{y}(x;\beta^{(i)})) + R(\beta^{(i)})
subject to \beta^{(i)} - \alpha = 0

The ADMM algorithm iteratively performs the following steps:

\beta^{(i),t+1} \leftarrow \arg\min_{\beta} \sum_{(x,y)\in D_i^{train}} l(y,\hat{y}(x;\beta)) + R(\beta) + \nu^{(i),t\top}\beta + \frac{s}{2}\|\beta-\alpha^{t}\|_2^2

\alpha^{t+1} \leftarrow \arg\min_{\alpha} \sum_{i=1}^{N} \nu^{(i),t\top}(\beta^{(i),t+1}-\alpha) + \frac{s}{2}\|\beta^{(i),t+1}-\alpha\|_2^2

\nu^{(i),t+1} \leftarrow \nu^{(i),t} + s(\beta^{(i),t+1}-\alpha^{t+1})
Slide 5 (1. ADMM Continued): Solving Linear Regression with ADMM

Loss function:

\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 = \sum_{i=1}^{N} \|y^{(i)} - X^{(i)}\beta^{(i)}\|_2^2 + \lambda\|\beta^{(i)}\|_2^2

The ADMM algorithm iteratively performs the following steps:

\beta^{(i),t+1} \leftarrow \arg\min_{\beta} \|y^{(i)} - X^{(i)}\beta\|_2^2 + \lambda\|\beta\|_2^2 + \nu^{(i),t\top}\beta + \frac{s}{2}\|\beta-\alpha^{t}\|_2^2

\alpha^{t+1} \leftarrow \arg\min_{\alpha} \sum_{i=1}^{N} \nu^{(i),t\top}(\beta^{(i),t+1}-\alpha) + \frac{s}{2}\|\beta^{(i),t+1}-\alpha\|_2^2

\nu^{(i),t+1} \leftarrow \nu^{(i),t} + s(\beta^{(i),t+1}-\alpha^{t+1})
Slide 6 (1. ADMM Continued): Solving Linear Regression with ADMM

1st step:

\beta^{t+1} \leftarrow \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \nu^{t\top}\beta + \frac{s}{2}\|\beta-\alpha^{t}\|_2^2

Setting the gradient with respect to \beta to zero:

\nabla_{\beta}\left(\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \nu^{t\top}\beta + \frac{s}{2}\|\beta-\alpha^{t}\|_2^2\right) = 0
-2X^{\top}(y - X\hat{\beta}) + 2\lambda\hat{\beta} + \nu^{t} + s(\hat{\beta} - \alpha^{t}) = 0
2X^{\top}X\hat{\beta} - 2X^{\top}y + 2\lambda\hat{\beta} + \nu^{t} + s\hat{\beta} - s\alpha^{t} = 0
(2X^{\top}X + (2\lambda + s)I)\hat{\beta} = 2X^{\top}y - \nu^{t} + s\alpha^{t}
\hat{\beta} = (2X^{\top}X + (2\lambda + s)I)^{-1}(2X^{\top}y - \nu^{t} + s\alpha^{t})
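As a sanity check, the closed-form update above can be verified numerically. The snippet below (random data; the values of lam, s, nu, and alpha are arbitrary illustrations) confirms that the solved beta-hat zeroes the gradient of the local objective:

```python
import numpy as np

# Verify that the closed-form solution of the beta step zeroes the gradient
# of  ||y - X b||^2 + lam ||b||^2 + nu^T b + (s/2) ||b - alpha||^2.
rng = np.random.default_rng(0)
n, p = 20, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam, s = 0.1, 1.0
nu = rng.normal(size=p)
alpha = rng.normal(size=p)

# (2 X^T X + (2 lam + s) I) beta = 2 X^T y - nu + s alpha
A = 2 * X.T @ X + (2 * lam + s) * np.eye(p)
beta = np.linalg.solve(A, 2 * X.T @ y - nu + s * alpha)

# Gradient of the local objective at beta should vanish
grad = -2 * X.T @ (y - X @ beta) + 2 * lam * beta + nu + s * (beta - alpha)
print(bool(np.allclose(grad, 0)))  # True
```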
Slide 7 (1. ADMM Continued): Solving Linear Regression with ADMM

2nd step:

\alpha^{t+1} \leftarrow \arg\min_{\alpha} \sum_{i=1}^{N} \nu^{(i),t\top}(\beta^{(i),t+1}-\alpha) + \frac{s}{2}\|\beta^{(i),t+1}-\alpha\|_2^2

Setting the gradient with respect to \alpha to zero:

\nabla_{\alpha}\left(\sum_{i=1}^{N} \nu^{(i),t\top}(\beta^{(i),t+1}-\alpha) + \frac{s}{2}\|\beta^{(i),t+1}-\alpha\|_2^2\right) = 0
-\sum_{i=1}^{N} \nu^{(i),t} - \sum_{i=1}^{N} s(\beta^{(i),t+1}-\alpha) = 0
Ns\alpha = \sum_{i=1}^{N} \nu^{(i),t} + s\beta^{(i),t+1}
\alpha = \frac{\sum_{i=1}^{N} \nu^{(i),t} + s\beta^{(i),t+1}}{Ns}
Slide 8 (1. ADMM Continued): Solving Linear Regression with ADMM

Loss function:

\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 = \sum_{i=1}^{N} \|y^{(i)} - X^{(i)}\beta^{(i)}\|_2^2 + \lambda\|\beta^{(i)}\|_2^2

The ADMM algorithm iteratively performs the following steps:

\beta^{(i),t+1} \leftarrow (2X^{(i)\top}X^{(i)} + (2\lambda + s)I)^{-1}(2X^{(i)\top}y^{(i)} - \nu^{(i),t} + s\alpha^{t})

\alpha^{t+1} \leftarrow \frac{\sum_{i=1}^{N} \nu^{(i),t} + s\beta^{(i),t+1}}{Ns}

\nu^{(i),t+1} \leftarrow \nu^{(i),t} + s(\beta^{(i),t+1} - \alpha^{t+1})
Slide 9 (1. ADMM Continued): Solving Linear Regression with ADMM

Now assume we initialize \nu^{(i),0} = 0. Our first \alpha update step will be:

\alpha^{1} \leftarrow \frac{\sum_{i=1}^{N} \nu^{(i),0} + s\beta^{(i),1}}{Ns} = \frac{\sum_{i=1}^{N} 0 + s\beta^{(i),1}}{Ns} = \frac{s\sum_{i=1}^{N} \beta^{(i),1}}{Ns} = \frac{\sum_{i=1}^{N} \beta^{(i),1}}{N}
Slide 10 (1. ADMM Continued): Solving Linear Regression with ADMM

If we have \alpha^{1} = \frac{1}{N}\sum_{i=1}^{N}\beta^{(i),1}, then the \nu update will be:

\nu^{(i),1} \leftarrow \nu^{(i),0} + s(\beta^{(i),1} - \alpha^{1}) = s\left(\beta^{(i),1} - \frac{1}{N}\sum_{j=1}^{N}\beta^{(j),1}\right)
Slide 11 (1. ADMM Continued): Solving Linear Regression with ADMM

The next \alpha update step will be:

\alpha^{2} \leftarrow \frac{\sum_{i=1}^{N} \nu^{(i),1}}{Ns} + \frac{\sum_{i=1}^{N} s\beta^{(i),2}}{Ns}
= \frac{1}{Ns}\sum_{i=1}^{N} s\left(\beta^{(i),1} - \frac{1}{N}\sum_{j=1}^{N}\beta^{(j),1}\right) + \frac{1}{N}\sum_{i=1}^{N}\beta^{(i),2}
= \frac{1}{N}\sum_{i=1}^{N}\beta^{(i),1} - \frac{1}{N}\sum_{j=1}^{N}\beta^{(j),1} + \frac{1}{N}\sum_{i=1}^{N}\beta^{(i),2}
= \frac{1}{N}\sum_{i=1}^{N}\beta^{(i),2}
Slide 12 (1. ADMM Continued): Solving Linear Regression with ADMM

From this it follows that, given \nu^{(i),0} = 0, the algorithm can be further simplified to iteratively perform the following steps:

\beta^{(i),t+1} \leftarrow (2X^{(i)\top}X^{(i)} + (2\lambda + s)I)^{-1}(2X^{(i)\top}y^{(i)} - \nu^{(i),t} + s\alpha^{t})

\alpha^{t+1} \leftarrow \frac{1}{N}\sum_{i=1}^{N}\beta^{(i),t+1}

\nu^{(i),t+1} \leftarrow \nu^{(i),t} + s(\beta^{(i),t+1} - \alpha^{t+1})
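The simplified updates can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's implementation: the data, the number of workers N, and the constants lam, s, T are all made up.

```python
import numpy as np

# Consensus ADMM for ridge regression split across N workers,
# using the simplified updates (nu^(i),0 = 0, alpha = mean of betas).
rng = np.random.default_rng(1)
N, n, p = 4, 30, 5                      # workers, rows per worker, features
Xs = [rng.normal(size=(n, p)) for _ in range(N)]
ys = [rng.normal(size=n) for _ in range(N)]
lam, s, T = 0.1, 1.0, 500

alpha = np.zeros(p)
nus = [np.zeros(p) for _ in range(N)]   # nu^(i),0 = 0
betas = [np.zeros(p) for _ in range(N)]

for t in range(T):
    for i in range(N):                  # local beta updates (parallelizable)
        A = 2 * Xs[i].T @ Xs[i] + (2 * lam + s) * np.eye(p)
        betas[i] = np.linalg.solve(A, 2 * Xs[i].T @ ys[i] - nus[i] + s * alpha)
    alpha = np.mean(betas, axis=0)      # consensus step: average the betas
    for i in range(N):                  # dual updates
        nus[i] = nus[i] + s * (betas[i] - alpha)

# The sum of the local objectives equals ||y - X b||^2 + N*lam*||b||^2 on the
# pooled data, so alpha should approach that global ridge solution.
X, y = np.vstack(Xs), np.concatenate(ys)
beta_global = np.linalg.solve(X.T @ X + N * lam * np.eye(p), X.T @ y)
print(bool(np.allclose(alpha, beta_global, atol=1e-3)))
```

Note the regularizer of the pooled problem picks up a factor of N, since each of the N local objectives carries its own \lambda\|\beta^{(i)}\|_2^2 term.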
Slide 13 (1. ADMM Continued): Solving Linear Regression with ADMM

Finally, notice that the \alpha and \nu update steps do not depend directly on the loss function. This means the ADMM algorithm can be generalized with the simplified updates:

\beta^{(i),t+1} \leftarrow \arg\min_{\beta} \sum_{(x,y)\in D_i^{train}} l(y,\hat{y}(x;\beta)) + R(\beta) + \nu^{(i),t\top}\beta + \frac{s}{2}\|\beta - \alpha^{t}\|_2^2

\alpha^{t+1} \leftarrow \frac{1}{N}\sum_{i=1}^{N}\beta^{(i),t+1}

\nu^{(i),t+1} \leftarrow \nu^{(i),t} + s(\beta^{(i),t+1} - \alpha^{t+1})
Slide 14 (1. ADMM Continued): Year Prediction Data Set

A least squares problem: prediction of the release year of a song from audio features.
90 features. Experiments were done on a subset of 1000 instances of the data.
Slide 15 (1. ADMM Continued): ADMM on the Year Prediction Dataset

[Figure: runtime in seconds vs. number of workers, comparing ADMM against the closed-form solution]
Slide 16: Outline, section 2. Recommender Systems
Slide 17: 2. Recommender Systems
Slide 18 (2. Recommender Systems): Why Recommender Systems?

A powerful method for enabling users to filter large amounts of information.
Personalized recommendations can boost the revenue of an e-commerce system:
- Amazon's recommender systems
- Netflix challenge: 1 million dollars for improving their system by 10%
Different applications: human-computer interaction, e-commerce, education, ...
Slide 19 (2. Recommender Systems): Why Personalization? The Long Tail

[Figure: the long-tail distribution of item popularity; source link not preserved in the transcription]
Slide 20 (2. Recommender Systems): Rating Prediction

Given the previously rated items, how will the user evaluate other items?
Slide 21 (2. Recommender Systems): Item Prediction

Which items will be consumed next by a user?
Slide 22 (2. Recommender Systems): Formalization

U: set of users
I: set of items
Ratings data D \subseteq U \times I \times \mathbb{R}
Rating data D are typically represented as a sparse users \times items matrix R \in \mathbb{R}^{|U| \times |I|}.
Slide 23 (2. Recommender Systems): Example

            Titanic (t)  Matrix (m)  The Godfather (g)  Once (o)
Alice (a)        4                          2               5
Bob (b)                      4              3
John (j)                     4                              3

Users U := {Alice, Bob, John}
Items I := {Titanic, Matrix, The Godfather, Once}
Ratings data D := {(Alice, Titanic, 4), (Bob, Matrix, 4), ...}
Slide 24 (2. Recommender Systems): Some Definitions

Some useful definitions:
- N(u) is the set of all items rated by user u, e.g. N(Alice) := {Titanic, The Godfather, Once}
- N(i) is the set of all users that rated item i, e.g. N(Once) := {Alice, John}
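The neighbor sets N(u) and N(i) defined above are easy to compute directly from the ratings triples. A small sketch on the toy example (the helper names `items_of` and `users_of` are illustrative):

```python
# Ratings from the toy example, as (user, item, rating) triples.
D = [("Alice", "Titanic", 4), ("Alice", "The Godfather", 2), ("Alice", "Once", 5),
     ("Bob", "Matrix", 4), ("Bob", "The Godfather", 3),
     ("John", "Matrix", 4), ("John", "Once", 3)]

def items_of(u):            # N(u): items rated by user u
    return {i for (v, i, r) in D if v == u}

def users_of(i):            # N(i): users who rated item i
    return {u for (u, j, r) in D if j == i}

print(sorted(items_of("Alice")))  # ['Once', 'The Godfather', 'Titanic']
print(sorted(users_of("Once")))   # ['Alice', 'John']
```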
Slide 25 (2. Recommender Systems): Task

Given a set of users U, items I, and training data D^{train} \subseteq U \times I \times \mathbb{R}, find a function

\hat{r}: U \times I \to \mathbb{R}

such that some error is minimal:

error(\hat{r}, D^{test}) := \sum_{(u,i,r_{ui}) \in D^{test}} l(r_{ui}, \hat{r}(u,i))
Slide 26: Outline, section 3. Traditional Recommendation Approaches
Slide 27 (3. Traditional Recommendation Approaches): Recommender Systems Approaches

Most recommender system approaches can be classified into:
- Content-based filtering: recommends items similar to the items liked by a user, using textual similarity in metadata
- Collaborative filtering: recommends items liked by users with similar behavior
We will focus on collaborative filtering!
Slide 28 (3.1. Nearest Neighbor Approaches): Nearest Neighbor Approaches

Nearest neighbor approaches build on the concept of similarity between users and/or items:
- The neighborhood N_u of a user u is the set of the k users most similar to u.
- Analogously, the neighborhood N_i of an item i is the set of the k items most similar to i.
There are two main neighborhood-based approaches:
- User-based: the rating of an item by a user is computed based on how similar users have rated the same item.
- Item-based: the rating of an item by a user is computed based on how similar items have been rated by the same user.
Slide 29 (3.1. Nearest Neighbor Approaches): User-Based Recommender

A user u \in U is represented as a vector u \in \mathbb{R}^{|I|} containing the user's ratings (0 for unrated items).

Examples, from the ratings table of the running example:
a := [4, 0, 2, 5]    b := [0, 4, 3, 0]    j := [0, 4, 0, 3]
Slide 30 (3.1. Nearest Neighbor Approaches): User-Based Recommender, Prediction Function

\hat{r}(u,i) := \bar{r}_u + \frac{\sum_{v \in N_u} sim(u,v)\,(r_{vi} - \bar{r}_v)}{\sum_{v \in N_u} sim(u,v)}

where:
- \bar{r}_u is the average rating of user u
- sim is a similarity function, also used to compute the neighborhood N_u
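The prediction function above can be sketched on the toy ratings. This is an illustration only: cosine similarity over 0-filled rating vectors and the neighborhood size k are assumptions, not prescribed by the slide.

```python
import math

# Toy ratings from the running example.
R = {"Alice": {"Titanic": 4, "The Godfather": 2, "Once": 5},
     "Bob": {"Matrix": 4, "The Godfather": 3},
     "John": {"Matrix": 4, "Once": 3}}
items = ["Titanic", "Matrix", "The Godfather", "Once"]

def mean_rating(u):
    return sum(R[u].values()) / len(R[u])

def cos_sim(u, v):
    vu = [R[u].get(i, 0) for i in items]
    vv = [R[v].get(i, 0) for i in items]
    num = sum(a * b for a, b in zip(vu, vv))
    den = math.sqrt(sum(a * a for a in vu)) * math.sqrt(sum(b * b for b in vv))
    return num / den

def predict(u, i, k=2):
    # N_u: the k users most similar to u among those who rated i
    cands = [v for v in R if v != u and i in R[v]]
    nbrs = sorted(cands, key=lambda v: cos_sim(u, v), reverse=True)[:k]
    num = sum(cos_sim(u, v) * (R[v][i] - mean_rating(v)) for v in nbrs)
    den = sum(cos_sim(u, v) for v in nbrs)
    if den == 0:
        return mean_rating(u)
    return mean_rating(u) + num / den

print(round(predict("Alice", "Matrix"), 2))  # 4.17
```

Both neighbors rated Matrix 0.5 above their own mean, so the prediction is Alice's mean (11/3) plus 0.5.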
Slide 31 (3.1. Nearest Neighbor Approaches): Item-Based Recommender

An item i \in I is represented as a vector i \in \mathbb{R}^{|U|} containing information on how the item is rated by users (0 for unrated).

Examples, from the ratings table of the running example:
t := [4, 0, 0]    m := [0, 4, 4]    g := [2, 3, 0]    o := [5, 0, 3]
Slide 32 (3.1. Nearest Neighbor Approaches): Item-Based Recommender, Prediction Function

\hat{r}(u,i) := \bar{r}_i + \frac{\sum_{j \in N_i} sim(i,j)\,(r_{uj} - \bar{r}_j)}{\sum_{j \in N_i} sim(i,j)}

where:
- \bar{r}_i is the average rating of item i
- sim is a similarity function, also used to compute the neighborhood N_i
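Analogously to the user-based case, the item-based prediction can be sketched on the toy ratings. Cosine similarity over the item column vectors and the 0-means-unrated convention are assumptions of this example:

```python
import math

# Item column vectors (rows: Alice, Bob, John; 0 = unrated) and ratings.
cols = {"Titanic": [4, 0, 0], "Matrix": [0, 4, 4],
        "The Godfather": [2, 3, 0], "Once": [5, 0, 3]}
R = {"Alice": {"Titanic": 4, "The Godfather": 2, "Once": 5},
     "Bob": {"Matrix": 4, "The Godfather": 3},
     "John": {"Matrix": 4, "Once": 3}}

def item_mean(i):
    vals = [r for r in cols[i] if r != 0]   # average over observed ratings only
    return sum(vals) / len(vals)

def cos_sim(i, j):
    num = sum(a * b for a, b in zip(cols[i], cols[j]))
    den = (math.sqrt(sum(a * a for a in cols[i]))
           * math.sqrt(sum(b * b for b in cols[j])))
    return num / den

def predict(u, i, k=2):
    # N_i: the k items most similar to i among those u has rated
    nbrs = sorted((j for j in cols if j != i and j in R[u]),
                  key=lambda j: cos_sim(i, j), reverse=True)[:k]
    num = sum(cos_sim(i, j) * (R[u][j] - item_mean(j)) for j in nbrs)
    den = sum(cos_sim(i, j) for j in nbrs)
    if den == 0:
        return item_mean(i)
    return item_mean(i) + num / den

print(round(predict("Bob", "Once"), 2))  # 4.28
```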
Slide 33 (3.1. Nearest Neighbor Approaches): Similarity Measures

In both user-based and item-based recommenders, the similarity measure plays an important role:
- It is used to compute the neighborhoods of users and items (neighbors are the most similar ones).
- It is used during the prediction of the ratings.
Which similarity measure should we use?
Slide 34 (3.1. Nearest Neighbor Approaches): Similarity Measures

Commonly used similarity measures:

Cosine:

sim(u,v) = \cos(u,v) = \frac{u^{\top}v}{\|u\|_2 \|v\|_2}

Pearson correlation:

sim(u,v) = \frac{\sum_{i \in N(u) \cap N(v)} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in N(u) \cap N(v)} (r_{ui} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in N(u) \cap N(v)} (r_{vi} - \bar{r}_v)^2}}
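The Pearson formula above, which sums only over the co-rated items N(u) ∩ N(v), can be sketched directly on the toy ratings (the fallback value 0.0 for empty overlaps or zero deviations is an assumption of this example):

```python
import math

R = {"Alice": {"Titanic": 4, "The Godfather": 2, "Once": 5},
     "Bob": {"Matrix": 4, "The Godfather": 3},
     "John": {"Matrix": 4, "Once": 3}}

def pearson(u, v):
    common = set(R[u]) & set(R[v])          # N(u) ∩ N(v)
    if not common:
        return 0.0
    mu = sum(R[u].values()) / len(R[u])     # user means over all their ratings
    mv = sum(R[v].values()) / len(R[v])
    num = sum((R[u][i] - mu) * (R[v][i] - mv) for i in common)
    du = math.sqrt(sum((R[u][i] - mu) ** 2 for i in common))
    dv = math.sqrt(sum((R[v][i] - mv) ** 2 for i in common))
    return num / (du * dv) if du and dv else 0.0

print(round(pearson("Bob", "John"), 3))  # 1.0
```

Note that with a single co-rated item (as for Bob and John, who share only Matrix) the Pearson correlation degenerates to plus or minus 1, one reason small overlaps make neighborhood methods fragile.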
Slide 35 (3.2. Factorization Models): Why Factorization Models?

Neighborhood-based approaches have been shown to be effective, but computing and maintaining the neighborhoods is expensive.
In recent years, a number of models have been shown to outperform them.
One of the results of the Netflix Challenge was the power of factorization models when applied to recommender systems.
Slide 36: 3.2. Factorization Models
Slide 37 (3.2. Factorization Models): Partially Observed Matrices

The ratings matrix R is usually only partially observed:
- No user is able to rate all items.
- Most items are not rated by all users.
Can we estimate the factorization of a matrix from some observed entries in order to predict its unobserved part?
Slide 38 (3.2. Factorization Models): Factorization Models

Each item i \in I is associated with a latent feature vector q_i \in \mathbb{R}^k.
Each user u \in U is associated with a latent feature vector p_u \in \mathbb{R}^k.
Each entry of the original matrix can be estimated by

\hat{r}(u,i) = p_u^{\top} q_i = \sum_{f=1}^{k} p_{u,f}\, q_{i,f}
Slide 39 (3.2. Factorization Models): Example

[Figure: the ratings matrix R of the running example factorized as R \approx P Q^{\top}, with user factor matrix P and item factor matrix Q]
Slide 40 (3.2. Factorization Models): Latent Factors

Source: Yehuda Koren, Robert Bell, Chris Volinsky: Matrix Factorization Techniques for Recommender Systems, Computer, vol. 42, no. 8, pp. 30-37, August 2009
Slide 41 (3.2. Factorization Models): Learning a Factorization Model, Objective Function

Task:

\arg\min_{P,Q} \sum_{(u,i,r_{ui}) \in D^{train}} (r_{ui} - \hat{r}(u,i))^2 + \lambda(\|P\|^2 + \|Q\|^2)

where:
- \hat{r}(u,i) := p_u^{\top} q_i
- D^{train} is the training data
- \lambda is a regularization constant
Slide 42 (3.2. Factorization Models): Optimization Method

L := \sum_{(u,i,r_{ui}) \in D^{train}} (r_{ui} - \hat{r}(u,i))^2 + \lambda(\|P\|^2 + \|Q\|^2)

Stochastic gradient descent.
Conditions:
- The loss function should be decomposable into a sum of components.
- The loss function should be differentiable.
Procedure:
- Randomly draw one component of the sum.
- Update the parameters in the opposite direction of the gradient.
Slide 43 (3.2. Factorization Models): SGD Gradients

For a single component (u, i, r_{ui}) of

L := \sum_{(u,i,r_{ui}) \in D^{train}} (r_{ui} - \hat{r}(u,i))^2 + \lambda(\|P\|^2 + \|Q\|^2)

the gradients are:

\frac{\partial L}{\partial p_u} = -2(r_{ui} - \hat{r}(u,i))\, q_i + 2\lambda p_u

\frac{\partial L}{\partial q_i} = -2(r_{ui} - \hat{r}(u,i))\, p_u + 2\lambda q_i
Slide 44 (3.2. Factorization Models): Stochastic Gradient Descent

procedure LearnLatentFactors(D^{train}, \lambda, \alpha):
  draw (p_u)_{u \in U} ~ N(0, \sigma I)
  draw (q_i)_{i \in I} ~ N(0, \sigma I)
  repeat
    for (u, i, r_{ui}) \in D^{train}, in a random order:
      p_u \leftarrow p_u - \alpha(-2(r_{ui} - \hat{r}(u,i))\, q_i + 2\lambda p_u)
      q_i \leftarrow q_i - \alpha(-2(r_{ui} - \hat{r}(u,i))\, p_u + 2\lambda q_i)
  until convergence
  return P, Q
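The procedure above can be sketched as a short, runnable script on the toy ratings. The hyperparameters (k, alpha, lam, epochs) and the N(0, 0.1) initialization are illustrative choices, not the lecture's settings:

```python
import math
import random

random.seed(0)
D_train = [("Alice", "Titanic", 4), ("Alice", "The Godfather", 2),
           ("Alice", "Once", 5), ("Bob", "Matrix", 4),
           ("Bob", "The Godfather", 3), ("John", "Matrix", 4),
           ("John", "Once", 3)]
k, alpha, lam, epochs = 2, 0.05, 0.01, 500

# Initialize latent factors with small Gaussian noise.
P = {u: [random.gauss(0, 0.1) for _ in range(k)] for u, _, _ in D_train}
Q = {i: [random.gauss(0, 0.1) for _ in range(k)] for _, i, _ in D_train}

def rhat(u, i):
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

for _ in range(epochs):
    random.shuffle(D_train)                # draw components in random order
    for u, i, r in D_train:
        e = r - rhat(u, i)                 # prediction error for this rating
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] = pu - alpha * (-2 * e * qi + 2 * lam * pu)
            Q[i][f] = qi - alpha * (-2 * e * pu + 2 * lam * qi)

train_rmse = math.sqrt(sum((r - rhat(u, i)) ** 2
                           for u, i, r in D_train) / len(D_train))
print(round(train_rmse, 3))
```

With only 7 observed ratings and rank k = 2, the model has enough capacity to drive the training RMSE close to zero; the regularizer keeps the factors from growing without bound.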
Slide 45 (3.2. Factorization Models): Factorization Models in Practice

Dataset: MovieLens 1M (ML1M)
- Users: 6040
- Movies: 3703
- Ratings: from 1 (worst) to 5 (best); the observed ratings cover approx. 4.5% of all possible (user, movie) pairs
Slide 46 (3.2. Factorization Models): Evaluation

Evaluation protocol:
- 10-fold cross validation
- Leave-one-out
Measure: RMSE (root mean squared error)

RMSE = \sqrt{\frac{\sum_{(u,i,r_{ui}) \in D^{test}} (r_{ui} - \hat{r}(u,i))^2}{|D^{test}|}}
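The RMSE formula above is a one-liner in code. The test triples and the constant baseline predictor below are made-up illustrations:

```python
import math

def rmse(test_data, predict):
    """RMSE of a predictor over (user, item, rating) test triples."""
    se = sum((r - predict(u, i)) ** 2 for u, i, r in test_data)
    return math.sqrt(se / len(test_data))

test_data = [("Alice", "Matrix", 4), ("John", "Titanic", 3)]
baseline = lambda u, i: 3.5             # predict a constant rating
print(rmse(test_data, baseline))        # sqrt((0.5^2 + 0.5^2) / 2) = 0.5
```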
Slide 47 (3.2. Factorization Models): SGD for Factorization Models, Performance over Epochs

[Figure: RMSE on train and on test vs. epoch]
Slide 48 (3.2. Factorization Models): Impact of the Number of Latent Features

[Figure: RMSE on MovieLens 1M vs. number of latent features k]
Slide 49 (3.2. Factorization Models): Effect of Regularization

[Figure: RMSE vs. epoch for a regularized and an unregularized model]
More informationLinear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S
Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 33 Outline
More informationMLlib: Scalable Machine Learning on Spark
MLlib: Scalable Machine Learning on Spark Xiangrui Meng Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi, Joseph
More informationMammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1
More informationCS 207  Data Science and Visualization Spring 2016
CS 207  Data Science and Visualization Spring 2016 Professor: Sorelle Friedler sorelle@cs.haverford.edu An introduction to techniques for the automated and humanassisted analysis of data sets. These
More informationSimple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRALab,
More informationHow I won the Chess Ratings: Elo vs the rest of the world Competition
How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:
More informationDATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDDLAB ISTI CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
More informationANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING
ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANIKALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of
More informationScalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights
Seventh IEEE International Conference on Data Mining Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights Robert M. Bell and Yehuda Koren AT&T Labs Research 180 Park
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 21 Outline
More informationFace Recognition using SIFT Features
Face Recognition using SIFT Features Mohamed Aly CNS186 Term Project Winter 2006 Abstract Face recognition has many important practical applications, like surveillance and access control.
More informationLecture 8 February 4
ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationIntroduction to Logistic Regression
OpenStaxCNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStaxCNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction
More informationTopic 3 Correlation and Regression
Topic 3 Correlation and Regression Linear Regression I 1 / 15 Outline Principle of Least Squares Regression Equations Residuals 2 / 15 Introduction Covariance and correlation are measures of linear association.
More informationJournée Thématique Big Data 13/03/2015
Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets
More informationAdvances in Collaborative Filtering
Advances in Collaborative Filtering Yehuda Koren and Robert Bell Abstract The collaborative filtering (CF) approach to recommenders has recently enjoyed much interest and progress. The fact that it played
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationBig Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),
More informationProgramming Exercise 3: Multiclass Classification and Neural Networks
Programming Exercise 3: Multiclass Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement onevsall logistic regression and neural networks
More informationA Preliminary Study on a Recommender System for the Million Songs Dataset Challenge
A Preliminary Study on a Recommender System for the Million Songs Dataset Challenge Fabio Aiolli University of Padova, Italy, email: aiolli@math.unipd.it Abstract. In this paper, the preliminary study
More informationLogistic Regression for Spam Filtering
Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used
More informationCross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
More informationDistributed R for Big Data
Distributed R for Big Data Indrajit Roy, HP Labs November 2013 Team: Shivara m Erik Kyungyon g Alvin Rob Vanish A Big Data story Once upon a time, a customer in distress had. 2+ billion rows of financial
More informationGraphbased ModelSelection Framework for Large Ensembles
Graphbased ModelSelection Framework for Large Ensembles Krisztian Buza, Alexandros Nanopoulos and Lars SchmidtThieme Information Systems and Machine Learning Lab (ISMLL) Samelsonplatz 1, University
More informationClustering. Adrian Groza. Department of Computer Science Technical University of ClujNapoca
Clustering Adrian Groza Department of Computer Science Technical University of ClujNapoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 Kmeans 3 Hierarchical Clustering What is Datamining?
More informationBUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE
BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE Alex Lin Senior Architect Intelligent Mining alin@intelligentmining.com Outline Predictive modeling methodology knearest Neighbor
More informationNeural Networks. CAP5610 Machine Learning Instructor: GuoJun Qi
Neural Networks CAP5610 Machine Learning Instructor: GuoJun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector
More informationMaking Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research
More informationMachine Learning over Big Data
Machine Learning over Big Presented by Fuhao Zou fuhao@hust.edu.cn Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed
More informationMasters Courses Recommendation: Exploring Collaborative Filtering and Singular Value Decomposition with Student Profiling
Masters Courses Recommendation: Exploring Collaborative Filtering and Singular Value Decomposition with Student Profiling Fábio Carballo fabio.carballo@tecnico.ulisboa.pt Instituto Superior Técnico, Universidade
More informationCrossvalidation for detecting and preventing overfitting
Crossvalidation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.
More informationBig Data Analytics Using Neural networks
San José State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research 412014 Big Data Analytics Using Neural networks Follow this and additional works at: http://scholarworks.sjsu.edu/etd_projects
More information