Online Collaborative Filtering


Jacob Abernethy
Department of Computer Science, UC Berkeley, Berkeley, CA

Kevin Canini
Department of Computer Science, UC Berkeley, Berkeley, CA
kevin@cs.berkeley.edu

John Langford
Yahoo! Research New York, 45 W. 18th St., New York, NY

Alex Simma
Department of Computer Science, UC Berkeley, Berkeley, CA
asimma@cs.berkeley.edu

ABSTRACT

We present an algorithm for learning a rank-k matrix factorization online for collaborative filtering tasks. This algorithm has several nice properties: it is naturally designed to handle data attributes or features, it scales linearly with k and the number of ratings, it does not require that we hold all data in memory, and it can be easily parallelized across multiple cores or an entire cluster. The algorithm performs well in practice; in particular, when tested on the well-known dataset from the Netflix Prize, it achieves an impressive error rate very quickly.

Categories and Subject Descriptors

G.3 [Mathematics of Computing]: Probability and Statistics - Statistical Computing; I.2.6 [Artificial Intelligence]: Learning - Parameter Learning

Keywords

collaborative filtering, online learning, low-rank matrix factorization

1. INTRODUCTION

Problem. The problem of collaborative filtering is typically defined as the task of inferring consumer preferences: given an observed set of product preferences for a set of users, can we accurately predict the unobserved preferences? As we have all discovered, finding a good book to buy in a bookstore can be rather difficult, given the immense number of books to choose from. On the other hand, we expect that a user is more likely to enjoy reading a particular book if other users with similar taste have also enjoyed it.

This prediction problem has become popular recently, especially with the advent of recommendation systems employed by companies such as Amazon.com, Yahoo!, and Netflix. Accurately predicting a user's preferences is potentially very lucrative, since a user is likely to purchase more frequently if given good recommendations. As consumers' access to product choices grows, the ability to bring high-quality recommendations to their attention becomes even more important. As evidence of the value of such a system, Netflix has recently released a dataset with 100 million examples and offered a $1,000,000 prize for a substantial improvement upon their own recommendation algorithm.

Notation. We define a collaborative filtering problem as a distribution D over triples (a, b, r) ∈ A × B × R, where A and B are finite sets of size n and m respectively. We are given a set M of triples {(a, b, r)} and want to find a function f(·, ·) which minimizes the expected squared error E_{(a,b,r)∼D}[(f(a, b) − r)²]. Typically, we think of A as our set of users, B as our set of products, and r as user a's rating of product b. In most movie recommendation datasets, r is a number in {1, 2, 3, 4, 5}, as in the number of stars, although in other settings we may only be given r ∈ {0, 1}, as in liked/disliked.

Past Work. A common approach is to reduce the prediction problem to finding a low-rank approximation to the partially observed A × B matrix of observations: given a rank parameter k ≪ min(n, m), we infer the unobserved entries by finding the matrix X of rank k which best fits our observations. This natural linear factor model has been analyzed quite thoroughly [4, 5, 7, 8].
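As a concrete illustration of the squared-error objective defined in the Notation paragraph above, the following short Python sketch (not taken from the paper) estimates it on a sample of observed triples; the predictor and the example triples are hypothetical:

# A minimal sketch (not from the paper) of the empirical squared-error
# objective: the average of (f(a, b) - r)^2 over a sample of observed triples.
def mean_squared_error(f, triples):
    errors = [(f(a, b) - r) ** 2 for (a, b, r) in triples]
    return sum(errors) / len(errors)

# Hypothetical usage with a constant predictor and two made-up ratings.
sample = [("alice", "some_movie", 4.0), ("bob", "another_movie", 2.0)]
print(mean_squared_error(lambda a, b: 3.0, sample))  # -> 1.0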
If we assume that the space of user preferences is low-dimensional, then we expect that all rows (or columns) of our matrix can be described by linear combinations of just a few rows (or columns). In other words, we would expect that each user's preferences can be linearly compressed to just a few parameters. The low-rank approach has also recently been generalized into a kernel framework [1]. This method allows us to find a matrix factorization which incorporates user and/or product features. Prediction accuracy is improved when additional information, such as a user's age or a movie's genre, is available to the algorithm.

There are a host of other practical problems associated with these low-rank approximation techniques. Unfortunately, low-rank matrix factorization requires solving a non-convex optimization problem. A simple convex method, Maximum Margin Matrix Factorization [8], was proposed, which relies on a trace-norm regularization, yet it requires retaining the entire A × B matrix of variables and is thus quite slow. Various speedups [5, 7] have been suggested, yet again at the price of non-convexity. Furthermore, such optimizations are effectively not scalable: datasets are growing quickly while standard optimization techniques scale as O(N²). The training set of the Netflix Prize, for example, includes 100 million ratings from 500,000 users and 18,000 movies; it is challenging even to retain such a dataset in memory on most consumer PCs.

Perhaps the most serious drawback is that low-rank matrix factorization techniques are currently only useful in the batch prediction model. In batch learning problems we are given all labeled data ahead of time. However, in most practical applications of collaborative filtering, data is arriving constantly, new users are being added, and new products are being developed and offered. We want to incorporate such information online so we can make up-to-date predictions on the fly, without having to re-optimize from scratch with each new piece of data.

What we do. We present an algorithm for learning a low-rank matrix online. This algorithm has a number of attractive properties:

1. The algorithm need only observe a single data point at a time.
2. The running time scales linearly with the amount of data and the rank parameter k (although performance can be improved with several passes over the data). Each update has no dependence on n or m.
3. User/product features can be incorporated, using the model presented in [1].
4. The algorithm can trivially adapt to new users and new products.
5. It can be parallelized across multiple cores or an entire cluster for computational speedup.

This paper shows that several collaborative filtering methods can be realized in an online framework, and that such online algorithms perform as well as batch-style methods, are much faster, and are scalable. We provide a number of experiments on both real and artificial datasets, and we discuss details on how to do parallelization within the MapReduce framework. Furthermore, we tested our algorithm on various datasets, including that of the Netflix Prize, and we report performance that is nearly state-of-the-art.

Outline. The paper is organized as follows. In Section 2 we describe the basic low-rank matrix factorization problem, and we review how these methods can be generalized to a kernelized framework for incorporating features. In Section 3 we describe our online matrix factorization algorithm, both with and without features. In Section 4 we provide a number of experiments, and we conclude in Section 5.

2. FORMAL SETTING

In this section we describe the basic low-rank matrix factorization problem, as well as the generalization to the kernelized low-rank framework introduced by [1].

2.1 Low-Rank Matrix Factorization

Given a rank parameter k, we can define a low-rank matrix X as the product of an n × k matrix U = [u_al] and an m × k matrix V = [v_bl], so that X = UV^T. We may think of U as a compressed version of X into only k columns, where the matrix V^T represents the reconstruction function. Similarly, we may consider V^T as a compression of the rows of X.
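To make the storage argument concrete, here is a small numpy sketch, with illustrative sizes rather than the paper's, of a rank-k matrix represented by its two factors:

import numpy as np

# Sketch: a rank-k matrix stored as two factors U (n x k) and V (m x k).
n, m, k = 1000, 500, 5
rng = np.random.default_rng(0)
U = rng.normal(size=(n, k))
V = rng.normal(size=(m, k))

X = U @ V.T                         # full n x m matrix, rank at most k
assert np.linalg.matrix_rank(X) <= k

# A single entry X[a, b] only needs row U[a] and row V[b]:
a, b = 3, 7
assert np.isclose(X[a, b], U[a] @ V[b])
# Storage: (n + m) * k numbers for the factors vs. n * m for the full matrix.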
Given a fully observed matrix M, the optimal rank-k approximation to M is U*V*^T, where U* and V* are defined as

    (U*, V*) = argmin_{U ∈ R^{n×k}, V ∈ R^{m×k}} ||M − UV^T||_F²,    (1)

where ||X||_F is the Frobenius norm of the matrix X, i.e., sqrt(Σ_{a,b} x_ab²). This low-rank approximation problem is well studied and can be reduced to a Singular Value Decomposition (SVD) of the matrix M: roughly speaking, we simply need to find the k largest singular values and project the columns of M onto the corresponding singular vectors. Interestingly, while the optimization problem described above is not convex, there are polynomial-time algorithms for computing a matrix's SVD.

When a matrix is only partially observed, as is the case in the collaborative filtering problem, we do not have an obvious way to compute an SVD, since the notions of eigenvalue and eigenvector are no longer defined. However, we can still define an optimal low-rank approximation. If we let U_a be the a-th row of U, and V_b be the b-th row of V, then given our partially observed set of entries M = {(a, b, r)}, we can define the low-rank approximation to M as U*V*^T, where U* and V* are found by the following optimization:

    (U*, V*) = argmin_{U ∈ R^{n×k}, V ∈ R^{m×k}} Σ_{(a,b,r) ∈ M} (U_a V_b^T − r)².    (2)

In many cases, this optimization is modified by adding a regularization term, so as to avoid overfitting U and V to the observed ratings.
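In the fully observed case, equation (1) can be solved directly with an SVD; a minimal numpy sketch (random test matrix, illustrative dimensions):

import numpy as np

# Sketch: optimal rank-k approximation of a fully observed matrix M via SVD,
# as in equation (1). The test matrix and sizes are illustrative.
def best_rank_k(M, k):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]   # keep the k largest singular values

rng = np.random.default_rng(1)
M = rng.normal(size=(50, 30))
M_k = best_rank_k(M, k=5)
print(np.linalg.norm(M - M_k, "fro"))       # Frobenius approximation error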

The optimization in (2) is non-convex, so methods have been proposed to cast the low-rank approximation problem as a convex optimization problem. Notably, [8] points out that we can force a low-rank solution by regularizing with the trace norm. If an n × n matrix X has singular values λ_1, …, λ_n, then the trace norm of X is defined as ||X||_Σ = Σ_i λ_i. We can think of the rank of a matrix as the l_0 norm of its vector of singular values; the trace norm is the corresponding l_1 norm, and it can be shown to be the tightest convex relaxation of the rank on the unit ball of the spectral norm. We could then solve

    X̂ = argmin_{X ∈ R^{n×m}} Σ_{(a,b,r) ∈ M} (X_ab − r)² + λ||X||_Σ,    (3)

where λ is a regularization parameter. It can be shown that X̂ is low-rank, given an appropriate choice of λ.

A major obstacle with the above approach is that we are required to hold the entire solution X̂ in memory and optimize over nm variables. This is certainly infeasible when either n or m is large. This is a major advantage of the factorization formulation in (2): by solving for U and V, we need only maintain (n + m)k variables, and we can obtain a prediction for entry (i, j) easily by computing U_i V_j^T. As considered in [7], one can replace the trace-norm formulation (3) with a factored formulation that is much more efficient, both in terms of speed and memory, although it unfortunately does require solving a non-convex problem. Here, we maintain matrices U and V as before and we solve

    argmin_{U, V} Σ_{(a,b,r) ∈ M} (U_a V_b^T − r)² + λ(||U||_F² + ||V||_F²).    (4)

This simpler formulation is justified by the following alternative representation of the trace norm:

    ||X||_Σ = inf_{U, V : X = UV^T} ½ (||U||_F² + ||V||_F²).
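This identity is easy to check numerically; the following numpy sketch (with an arbitrary random matrix, not data from the paper) compares the sum of singular values with the Frobenius-norm bound attained by the SVD-based factorization:

import numpy as np

# Sketch: numerical check of ||X||_Sigma = min_{X = U V^T} (1/2)(||U||_F^2 + ||V||_F^2),
# using the SVD-based factorization that attains the minimum.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 25))

Us, s, Vt = np.linalg.svd(X, full_matrices=False)
trace_norm = s.sum()                    # sum of singular values

U = Us * np.sqrt(s)                     # U = U_s diag(sqrt(s))
V = Vt.T * np.sqrt(s)                   # V = V_s diag(sqrt(s)), so X = U V^T
assert np.allclose(U @ V.T, X)
frob_bound = 0.5 * (np.linalg.norm(U, "fro")**2 + np.linalg.norm(V, "fro")**2)
print(trace_norm, frob_bound)           # the two values agree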
2.2 Matrix Factorization with Attributes

We now show how the above low-rank factorization techniques can be generalized to handle user and/or product features, such as user demographics or product attributes. The problem of finding a low-rank approximation can be cast as that of finding a low-rank function f(a, b) in a tensor product of two reproducing kernel Hilbert spaces. Considering the problem in this way, we can work with general classes of functions f that operate on both a user's identity and his known features. The matrix factorization problem is then just a special case of this more general framework.

We now briefly review the method of [1]. We would like to estimate a function f : A × B → R having observed a finite set of triples {(a, b, r)} ⊂ A × B × R. We assume we are given two positive semi-definite kernel functions, κ : A × A → R and γ : B × B → R. A kernel can be thought of as a comparison function, i.e., some kind of feature-similarity metric between data points. As an example, without access to any side information about the elements of A, a trivial kernel might be κ(a, a') = 1 when a = a' and 0 otherwise. This is often referred to as the Dirac kernel. We let K and G be the reproducing kernel Hilbert spaces (RKHS) of the kernels κ and γ, respectively.

Let us now consider the tensor product K ⊗ G. This product can be defined as the closure of the set of finite sums of products of functions in K and G,

    K ⊗ G := cl { f(a, b) = Σ_l u_l(a) v_l(b) : u_l ∈ K, v_l ∈ G }.

We call each pair u(a)v(b) an atomic term. Given a function f ∈ K ⊗ G, if f can be written with k atomic terms, f(a, b) = Σ_{l=1}^k u_l(a) v_l(b), and k is minimal, then we say that f has rank k. Notice that when the spaces A and B are finite, a function f(a, b) = Σ_{l=1}^k u_l(a) v_l(b) in K ⊗ G is effectively a matrix X, where we define X_ab := f(a, b). More importantly, when the function f(a, b) has rank k, this corresponds exactly to X having rank k. Conveniently, we can write the factorization out explicitly: X = UV^T, where U_al := u_l(a) and V_bl := v_l(b). In particular, when κ and γ are Dirac kernels, K and G contain arbitrary functions on A and B, so any u_l ∈ K can be written as (u_l(a_1), …, u_l(a_n)) = (u_1l, …, u_nl), and similarly for any v_l ∈ G. Thus, for these simple choices of κ and γ, the set of rank-k functions f ∈ K ⊗ G is exactly the set of matrices {UV^T : U ∈ R^{n×k}, V ∈ R^{m×k}}.

There is a nice alternative representation of K ⊗ G. Define the tensor product kernel κ⊗ : (A × B) × (A × B) → R as

    κ⊗((a, b), (a', b')) = κ(a, a') γ(b, b').    (5)

This is known to be a positive semi-definite kernel [3, p. 70], and so we denote by H the associated RKHS of κ⊗. A classical result of Aronszajn [2] states that H is identically K ⊗ G. The space H is equipped with a norm such that ||u ⊗ v|| = ||u||_K ||v||_G, and thus ||Σ_l u_l ⊗ v_l||² = Σ_{l,p} ⟨u_l, u_p⟩_K ⟨v_l, v_p⟩_G. Let H_k be the set of functions in H of rank at most k. Given that we want to estimate an f ∈ H_k, we can write a general optimization problem

    min_{f ∈ H_k} E_{(a,b,r)} [(f(a, b) − r)²] + λ||f||²,    (6)

where the expectation is uniform over the set of observed preferences {(a, b, r)}. Assuming A and B to be of finite size n and m respectively, now let K and G be the n × n and m × m kernel matrices of κ and γ. It is proven in [1] that the above optimization has the following equivalent formulation:

    min_{α ∈ R^{n×k}, β ∈ R^{m×k}} E_{(a,b,r)} [ ( Σ_{l=1}^k (Kα_l)_a (Gβ_l)_b − r )² ] + λ Σ_{i=1}^k Σ_{j=1}^k (α_i^T K α_j)(β_i^T G β_j).    (7)

This representer theorem tells us that our optimal solution is of the form X = Kα(Gβ)^T for an n × k matrix α and an m × k matrix β. In particular, when κ and γ are Dirac kernels, K and G are identity matrices, and we are simply solving (2) with U = α and V = β, plus a regularization term. Using the Dirac kernel corresponds to representing a user or product simply by its identity.

Of course, when additional information about our users and products is available, it may be beneficial to instead use this information to make our prediction. We achieve this simply by an appropriate choice of the kernels κ and γ. For example, assume that we are given a C-dimensional feature vector f_a for each user a; then a potential kernel could be a linear kernel κ(a, a') = f_a · f_a', or alternatively an RBF kernel κ(a, a') = exp(−||f_a − f_a'||² / (2σ²)). As shown in [1], it is better to include both identity information and feature information. This can be achieved simply by choosing a mixed kernel

    κ(a, a') = η δ(a, a') + (1 − η) κ̂(a, a'),    (8)

where δ is the Dirac kernel and κ̂ is some kernel based on user features.

3. ONLINE MATRIX FACTORIZATION

There is a long history of converting batch learning methods into online gradient descent procedures. For example, a perceptron with a margin update rule can be thought of as the online gradient descent version of a support vector machine. In the next two subsections we show how to derive an efficient online gradient descent update rule for the matrix factorization problem.

3.1 Online Low-Rank Approximation Without Features

Assume we are now working with the following model: our prediction matrix is X = αβ^T, where α ∈ R^{n×k} and β ∈ R^{m×k}. Given a single observation (a, b, r), the loss on this observation is l(X, (a, b, r)). For convenience, we use the square loss, although other convex loss functions can be implemented quite easily. Thus,

    l(X, (a, b, r)) = (r − X_ab)² = (r − Σ_{l=1}^k α_al β_bl)².

When we differentiate with respect to α and β, we see that the only nonzero partial derivatives are those with respect to α_al and β_bl for l = 1, …, k. This is computationally convenient, since an update then depends only on k and not on n or m, which can be quite large. Differentiating gives

    ∂l/∂α_al = −2 β_bl (r − Σ_{l'=1}^k α_al' β_bl'),
    ∂l/∂β_bl = −2 α_al (r − Σ_{l'=1}^k α_al' β_bl').

With these derivatives in mind, we immediately obtain the following algorithm.

Algorithm 1 Online Low-Rank Approximation
1: Parameters: n users, m products, desired rank k, step size τ.
2: Input: Observations {(a, b, r)}
3: Initialize α ∈ R^{n×k} and β ∈ R^{m×k} randomly.
4: for each (a, b, r) do
5:   Compute r̂ = Σ_{l=1}^k α_al β_bl.
6:   for l = 1 to k do
7:     α_al ← α_al + 2τ β_bl (r − r̂)
8:     β_bl ← β_bl + 2τ α_al (r − r̂)
9:   end for
10: end for
11: Output: α and β.

This algorithm can be easily adjusted for different loss functions. A very common metric in collaborative filtering is Mean Absolute Error (MAE), and several results are reported with this performance measure. The online update would then become

    α_al ← α_al + 2τ β_bl sign(r − r̂),
    β_bl ← β_bl + 2τ α_al sign(r − r̂).
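For concreteness, the following Python/numpy sketch implements the squared-loss updates of Algorithm 1; the initialization scale, step size, and toy ratings are placeholder assumptions rather than the paper's settings:

import numpy as np

def online_low_rank(ratings, n, m, k=10, tau=0.01, seed=0):
    """Sketch of Algorithm 1: one SGD pass over (a, b, r) triples.

    a in {0, ..., n-1} indexes users, b in {0, ..., m-1} indexes products.
    """
    rng = np.random.default_rng(seed)
    alpha = 0.1 * rng.normal(size=(n, k))
    beta = 0.1 * rng.normal(size=(m, k))
    for a, b, r in ratings:
        r_hat = alpha[a] @ beta[b]          # predicted rating
        err = r - r_hat
        alpha_a = alpha[a].copy()           # keep the pre-update user row so
        alpha[a] += 2 * tau * err * beta[b] # both updates use old values
        beta[b] += 2 * tau * err * alpha_a
    return alpha, beta

# Toy usage: two users, two items, made-up ratings on a 1-5 scale.
alpha, beta = online_low_rank([(0, 0, 5.0), (1, 1, 2.0)], n=2, m=2, k=3)
print(alpha @ beta.T)                       # current prediction matrix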
3.2 Online Low-Rank Approximation With Features

Let us now apply the same approach described above to the kernelized matrix factorization methods of Section 2.2. Assume we are given kernel matrices K and G and parameter matrices α and β, so that our current prediction matrix is X = Kα(Gβ)^T. We are now given a data point (a, b, r) and we need to compute the gradient of the loss

    l(X, (a, b, r)) = (r − (Kα)_a (Gβ)_b^T)².

Unfortunately, if K and G are nontrivial matrices, this gradient no longer depends solely on the parameters α_al and β_bl, l = 1, …, k. This suggests that a gradient step could require an update on as many as (n + m)k variables! Ideally, we want our update to be independent of n and m, as these might grow very large. Fortunately, this can still be achieved if we assume that our kernels κ and γ are linear, so that we can work with the primal formulation rather than the dual described in (7).

Recall that our model is to learn a factorization in terms of functions, which we can write as

    f(a, b) = u_1(a) v_1(b) + u_2(a) v_2(b) + … + u_k(a) v_k(b).

In the dual formulation, we would define

    u_l(a') = Σ_{a=1}^n α_al κ(a, a'),    v_l(b') = Σ_{b=1}^m β_bl γ(b, b').

So far, we have assumed nothing about the representation of a and b. To get an efficient update, we assume that a consists of an identifier and a set of features, a = (1_a, x_a1, …, x_aC), where 1_a is the a-th standard basis vector in R^n and (x_a1, …, x_aC) is a C-dimensional feature vector for user a. Since our functions are linear, we can write u_l(a) = ū_l(1_a) + û_l(x_a1, …, x_aC), where ū_l and û_l are linear functions. We now define ū_l in terms of coefficients α_1l, …, α_nl and û_l in terms of coefficients µ_l1, …, µ_lC, and thus

    u_l(a) = Σ_{a'=1}^n α_a'l (1_a)_a' + Σ_{c=1}^C µ_lc x_ac = α_al + Σ_{c=1}^C µ_lc x_ac.

Following the same analysis, we can describe v_l(b) similarly.

Assume that product b has D features y_b1, …, y_bD; then we can describe v_l(b) in terms of parameters β_bl and ν_ld for d = 1, …, D. That is,

    v_l(b) = β_bl + Σ_{d=1}^D ν_ld y_bd.

Thus our prediction is of the form

    f(a, b) = Σ_{l=1}^k (α_al + µ_l: · x_a:)(β_bl + ν_l: · y_b:).

If we let r̂ := f(a, b) as above and use the square loss l(f, r) = (r̂ − r)², then we have the following derivatives:

    ∂l/∂α_al = −2 (β_bl + ν_l: · y_b:)(r − r̂),
    ∂l/∂β_bl = −2 (α_al + µ_l: · x_a:)(r − r̂),
    ∂l/∂µ_lc = −2 x_ac (β_bl + ν_l: · y_b:)(r − r̂),
    ∂l/∂ν_ld = −2 y_bd (α_al + µ_l: · x_a:)(r − r̂).

Given these derivatives, we obtain the following online algorithm.

Algorithm 2 Online Low-Rank Approximation with Features
1: Parameters: n users, m products, desired rank k, step size τ.
2: Input: Observations {(a, b, r)}
3: Input: User feature matrix [x_ac] ∈ R^{n×C}, product feature matrix [y_bd] ∈ R^{m×D}.
4: Initialize α ∈ R^{n×k}, β ∈ R^{m×k}, µ ∈ R^{k×C}, ν ∈ R^{k×D} randomly.
5: for each (a, b, r) do
6:   Compute r̂ = Σ_{l=1}^k (α_al + µ_l: · x_a:)(β_bl + ν_l: · y_b:).
7:   for l = 1 to k do
8:     α_al ← α_al + 2τ (β_bl + ν_l: · y_b:)(r − r̂)
9:     β_bl ← β_bl + 2τ (α_al + µ_l: · x_a:)(r − r̂)
10:    for c = 1 to C do
11:      µ_lc ← µ_lc + 2τ x_ac (β_bl + ν_l: · y_b:)(r − r̂)
12:    end for
13:    for d = 1 to D do
14:      ν_ld ← ν_ld + 2τ y_bd (α_al + µ_l: · x_a:)(r − r̂)
15:    end for
16:  end for
17: end for
18: Output: α, β, µ, and ν.

Let us take a step back for a moment and consider intuitively what the above algorithms are doing. In Algorithm 1, where we do not include features, we want to learn k parameters α_a1, …, α_ak for each user and k parameters β_b1, …, β_bk for each movie so that the rating of user a on movie b is predicted as Σ_l α_al β_bl. When we receive a rating (a, b, r), we determine how much our prediction differs from this observation, and we adjust our parameters slightly to account for this error. We hope that, after we have seen all of our data, we have learned good parameters, in the sense that they have accurate predictive power.

In Algorithm 2, the terms α_al and β_bl are replaced by α_al + µ_l: · x_a: and β_bl + ν_l: · y_b:. Recall that µ is a matrix of parameters that are user-independent, while the parameter α_al is particular to user a (and similarly for ν and β_bl). We can now think of a user as being described by k attributes, where attribute l is the sum of a user-specific parameter α_al and the value µ_l: · x_a:, which depends on this user's features. If we assume that the features x_a: are highly informative, i.e., they could accurately predict the user's ratings, then we would expect to accurately learn µ_l: · x_a:, and the α_al's would not be necessary. On the other hand, with uninformative features, we would hope that the α_al's would be learned appropriately and µ would be roughly 0.
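A corresponding Python/numpy sketch of Algorithm 2 is given below; the feature matrices, initialization scale, and step size are illustrative assumptions:

import numpy as np

def online_low_rank_features(ratings, X_feat, Y_feat, k=10, tau=0.01, seed=0):
    """Sketch of Algorithm 2: online updates with user/product features.

    X_feat is an (n x C) user-feature matrix, Y_feat an (m x D) product-feature
    matrix; ratings is an iterable of (a, b, r) triples.
    """
    n, C = X_feat.shape
    m, D = Y_feat.shape
    rng = np.random.default_rng(seed)
    alpha = 0.1 * rng.normal(size=(n, k))
    beta = 0.1 * rng.normal(size=(m, k))
    mu = np.zeros((k, C))
    nu = np.zeros((k, D))
    for a, b, r in ratings:
        u = alpha[a] + mu @ X_feat[a]   # k-vector: alpha_al + mu_l: . x_a:
        v = beta[b] + nu @ Y_feat[b]    # k-vector: beta_bl + nu_l: . y_b:
        err = r - u @ v
        alpha[a] += 2 * tau * err * v
        beta[b] += 2 * tau * err * u
        mu += 2 * tau * err * np.outer(v, X_feat[a])
        nu += 2 * tau * err * np.outer(u, Y_feat[b])
    return alpha, beta, mu, nu

# Toy usage with random features (purely illustrative).
rng = np.random.default_rng(1)
Xf, Yf = rng.normal(size=(4, 3)), rng.normal(size=(5, 2))
params = online_low_rank_features([(0, 1, 4.0), (2, 3, 1.0)], Xf, Yf, k=2)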
A note on regularization. In both of the above algorithms, it is often beneficial to include a regularization term in this online optimization, especially when k is large relative to the number of observed entries. A natural choice of regularization penalty is the one derived in (7),

    ||f||² = Σ_{l=1}^k Σ_{p=1}^k ⟨u_l, u_p⟩ ⟨v_l, v_p⟩.

In the primal form this regularization becomes

    ||f||² = Σ_{l=1}^k Σ_{p=1}^k (α_:l · α_:p + µ_l: · µ_p:)(β_:l · β_:p + ν_l: · ν_p:).

When we omit features, the terms with µ and ν simply disappear. It is straightforward to differentiate the above norm and thus include this penalization in the update step. However, naively computing the derivative requires O(k(n + m)) running time. This can be improved by maintaining the two k × k matrices [α_:l · α_:p + µ_l: · µ_p:]_{lp} and [β_:l · β_:p + ν_l: · ν_p:]_{lp}. On each example (a, b, r) we adjust µ, ν, α_al, and β_bl for each l = 1, …, k, which requires updating the entries of each of the above matrices, costing a total of O(k²(C + D)) per update. This is mildly expensive, but since ||f||² does not change substantially on each update, it is roughly sufficient to compute it only once for each pass through the data.

Without features, an alternative regularization is described in (4), in which we control the trace norm of our solution matrix X. As shown, this can be represented as a sum of Frobenius norms, which in terms of α and β is exactly

    R(α, β) = Σ_{i=1}^n Σ_{l=1}^k α_il² + Σ_{j=1}^m Σ_{l=1}^k β_jl².

The gradient of this is very simple to compute:

    ∂R/∂α_il = 2α_il,    ∂R/∂β_jl = 2β_jl.

In our experiments without features we preferred to employ this regularization, for simplicity as well as computational speedup.
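As a small illustration, the Frobenius-norm penalty turns the online update into a weight-decay step; in this sketch the regularization weight lam and step size tau are placeholder values:

import numpy as np

# Sketch: the gradient terms 2*alpha_il and 2*beta_jl from the penalty above
# act as simple weight decay on the rows touched by the current rating.
def regularized_step(alpha_a, beta_b, r, tau=0.01, lam=1e-5):
    err = r - alpha_a @ beta_b
    new_alpha_a = alpha_a + 2 * tau * (err * beta_b - lam * alpha_a)
    new_beta_b = beta_b + 2 * tau * (err * alpha_a - lam * beta_b)
    return new_alpha_a, new_beta_b

a_row, b_row = np.array([0.1, 0.2]), np.array([0.3, -0.1])
print(regularized_step(a_row, b_row, r=4.0))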

4. IMPLEMENTATION AND EXPERIMENTS

We tested our algorithms on a number of datasets, and we now provide a discussion of several implementation issues as well as empirical results.

4.1 Datasets

A common collaborative filtering dataset is the MovieLens One Million Ratings dataset. This data was compiled by the GroupLens Research Project, who created a movie recommendation system, MovieLens, where users can input movie ratings and receive movie suggestions. The data consists of 1,000,209 ratings from 6,040 users of 3,900 movies. Each rating is an integer from 1 to 5. In addition to ratings, additional information is also provided about each user and each movie. User data consists of gender, one of several occupations (such as "scientist" or "farmer"), and an age attribute taking one of the values Under 18, 18-24, 25-34, 35-44, 45-49, 50-55, or 56+. Each movie is labelled with one of several genres including, for example, Action, Drama, and Thriller.

We also include results from the Netflix Prize data. In October 2006, Netflix released a real movie rating dataset, generated by their customers, and offered a $1,000,000 prize for a 10% improvement over their current movie recommendation algorithm. The data includes 100,000,000 ratings from roughly 480,000 users and 17,700 movies. This dataset is immensely larger than what had previously been publicly available and has presented an interesting challenge to researchers in machine learning and data mining. Netflix did not release user features for this data, while it did reveal movie titles and genre information.

The final dataset we use is a toy dataset that we generated. We created 10,000 users and 5,000 movies, and we defined each user a by a k-length feature vector x_a and each movie b by a k-length feature vector y_b. The coordinates of these vectors were generated from a Gaussian distribution, and we chose k = 10. We chose a random sample of ratings for each user, and the ratings were generated as r_ab = x_a^T A y_b + ε for a randomly chosen k × k matrix A and a normally distributed noise term ε.
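The following numpy sketch reproduces this synthetic-data recipe at a reduced scale; the sizes, number of ratings per user, and noise standard deviation are illustrative choices, not the paper's:

import numpy as np

# Sketch of the synthetic-data recipe above: ratings follow r_ab = x_a^T A y_b + eps.
rng = np.random.default_rng(0)
n_users, n_movies, k, ratings_per_user = 100, 50, 10, 20

X = rng.normal(size=(n_users, k))       # user feature vectors x_a
Y = rng.normal(size=(n_movies, k))      # movie feature vectors y_b
A = rng.normal(size=(k, k))             # random k x k mixing matrix

triples = []
for a in range(n_users):
    for b in rng.choice(n_movies, size=ratings_per_user, replace=False):
        noise = rng.normal(scale=0.1)   # noise scale is an assumption
        triples.append((a, int(b), X[a] @ A @ Y[b] + noise))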
The Master's thesis of Ben Marlin [6] provides a comprehensive survey of a number of collaborative filtering algorithms, their running times, and a large performance comparison. This work provides a convenient set of benchmarks to take advantage of. The two major datasets used in that comparison are MovieLens, as described above, and the EachMovie dataset of HP/Compaq Research. Unfortunately, the EachMovie dataset is no longer publicly available. In general, we try to report results using the same methods and performance measures used in [6].

4.2 Implementation Issues

4.2.1 Learning Rate and Stopping

With insufficient regularization, running the algorithm until convergence does not yield good accuracy, due to overfitting. As can be seen in Figure 1, the impact of overtraining is very significant. Furthermore, while increasing regularization shrinks the generalization error, the choice of λ required to avoid overtraining tends to lead to insufficiently flexible models. Therefore, we apply early stopping by monitoring the error on a validation set and decreasing the step size, or terminating the training procedure, when the validation error increases for several iterations. Empirically, we observe that the final test error of the early-stopped predictor is fairly insensitive to the choice of λ across a wide range.

[Figure 1: Normalized MAE as a function of the number of passes through the data, for λ = 0, 1e-6, 3e-6, 1e-5, and 3e-5. The numbers, in normalized MAE, differ from those we report elsewhere, since a large test set was withheld.]

One potential concern with early stopping is that since some users and products have many more observed ratings than others, the method may overtrain one type of user while undertraining another. However, experiments in which accuracy statistics were calculated for different strata of users indicate that the performance on all users and products, regardless of the number of ratings, peaks around the same time. Additionally, this validation-set-based step size control greatly speeds up convergence, allowing fast descent toward the neighborhood of the solution and then slowing down for fine refinement. The effect is similar to that of simulated annealing: the choice of direction based on one rating (or a few, if momentum is used) randomizes the search, and the decrease in step size is analogous to decreasing the temperature.

4.2.2 Momentum

The algorithm as described so far uses the gradient at a single rating as an estimate of the global gradient. An obvious alternative is using multiple training points to estimate the gradient; standard gradient descent uses the complete training set. However, our empirical results suggest that standard gradient descent is an order of magnitude slower for the model presented in this paper. A good compromise is to use momentum, which estimates the gradient as a moving average of the gradients at the recently considered ratings. Specifically, we replace updates of the form

    α_al ← α_al + 2τ β_bl (r − r̂)

with

    α^m_al ← γ α^m_al + (1 − γ)(2τ β_bl (r − r̂)),
    α_al ← α_al + α^m_al,

where γ ∈ [0, 1) is the momentum parameter. This smooths the updates and allows the use of a higher step size τ, leading to faster convergence.
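In code, the momentum variant of a single update step looks roughly as follows; the values of tau and gamma are placeholders:

import numpy as np

# Sketch of the momentum update above: the raw per-rating step is replaced by
# an exponential moving average with parameter gamma in [0, 1).
def momentum_step(alpha_a, beta_b, m_alpha, m_beta, r, tau=0.01, gamma=0.9):
    err = r - alpha_a @ beta_b
    m_alpha = gamma * m_alpha + (1 - gamma) * (2 * tau * err * beta_b)
    m_beta = gamma * m_beta + (1 - gamma) * (2 * tau * err * alpha_a)
    return alpha_a + m_alpha, beta_b + m_beta, m_alpha, m_beta

a_row, b_row = np.array([0.1, 0.2]), np.array([0.3, -0.1])
state = momentum_step(a_row, b_row, np.zeros(2), np.zeros(2), r=5.0)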

4.2.3 Randomizing the Order

Iterating through the ratings in a fixed order, particularly one where all the ratings for a single user or product are adjacent, leads to the parameters oscillating in a periodic pattern and impedes convergence to an optimum. Therefore, it is better to cycle through the ratings in a random order.

4.3 Performance Results

4.3.1 Comparison with Full Gradient Descent

While this algorithm was designed to perform a kind of online gradient descent, a surprising result is that we actually outperform full-fledged gradient descent, that is, where each descent step is computed from all the data. This improvement is in terms of both speed and accuracy. The improved accuracy is possibly due to the fact that single-rating gradient descent is less prone to local minima, or even less prone to overfitting. On the other hand, the improvement in speed is probably due to the increased number of updates per parameter for each pass through the data.

Using the MovieLens dataset, we performed several runs and tried various step-size and regularization parameters for both our online descent method and full gradient descent. The RMSE performance we plot for each run is the optimal test performance obtained before overfitting began to occur. In Figure 2, we show the comparison of optimal test performance against the number of seconds needed to reach the optimal test error. In Figure 3, we plot the test error vs. time for the best-performing parameter choices for each algorithm.

[Figure 2: Performance/time trade-off of our algorithm (online gradient descent) versus full gradient descent, plotted as minimum RMSE against runtime (sec). Each point is obtained from one run with a particular fixed step size and regularization parameter.]

[Figure 3: Test error (RMSE) as a function of training time (sec). Online descent converges much faster and to a better minimum than full gradient descent.]

4.3.2 Standard Matrix Factorization

For the MovieLens dataset, we evaluate performance with the Normalized Mean Absolute Error (NMAE) metric, in order for our results to be comparable to those reported in [6]. We estimate the accuracy using leave-one-out cross-validation, training on a dataset of 5000 users and using a validation set to determine the optimal value of the regularization parameter λ. The resulting model achieves an NMAE of 0.433 with k = 5, which is comparable to Attitude, whose NMAE is the best reported in [6].

We also performed a preliminary test of our algorithm on the Netflix dataset. With k = 20 and no features included, our algorithm achieved an RMSE roughly 3% better than Netflix's own algorithm, with the model taking only an hour to train.

4.3.3 Improvements with Features

We report preliminary results of our algorithm on our toy dataset, where we have access to both user and product features. Here, for a user a and movie b, the true ratings are computed as x_a^T A y_b + ε given feature vectors x_a and y_b. We tried providing the algorithm with subvectors of x_a and y_b of various sizes, and we consider the resulting test performance. In Figure 4, we see the performance in RMSE as more and more user and movie features are provided to the algorithm.

[Figure 4: Effect of features on RMSE for the toy data: test RMSE as a function of the number of features used.]

It is interesting to note that the biggest improvement is achieved with the addition of just one feature.

4.4 Parallelization

Here we discuss a very simple approach for implementing a parallel version of our algorithm. For simplicity, we focus primarily on the simple matrix factorization model that does not use features; with some work, this can be generalized to the version that does use features.

The online nature of the low-rank approximation algorithm without features lends itself well to a simple parallelization scheme. The key observation is that within the outer loop of Algorithm 1 (line 4), the only variables that are modified are α_al and β_bl, for l ∈ {1, …, k}. This property allows the loop to be executed independently for a collection of ratings {(a, b, r)} in which each a value and each b value appear at most once. We can generalize this notion to collections of rating subsets, rather than collections of single ratings. If we wish to parallelize the algorithm across P processors, the entire observation set M is partitioned into P² blocks along a-value and b-value boundaries:

    M_11  M_12  …  M_1P
    M_21  M_22  …  M_2P
     ⋮     ⋮         ⋮
    M_P1  M_P2  …  M_PP

With this partition of the observation matrix, given a collection of blocks {M_ij : 1 ≤ i, j ≤ P} in which each i value and each j value appears at most once, an execution of the gradient-descent step on a particular rating alters only α and β parameters that are relevant to other ratings within the same block. So we choose P blocks {M_ij} such that each i and j value appears exactly once and assign each processor to one of these chosen blocks. Each processor then performs the online gradient descent step for each of the ratings in its assigned block, disseminates its updated parameter values, and the process is repeated on a new set of P blocks, as sketched below. Initial testing suggests that the low-rank approximation algorithm without features can achieve a speedup of roughly 2.5x when distributed across four processors within a single machine.
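The following Python sketch illustrates the scheduling only; the block assignment by integer division and the toy ratings are assumptions, not the paper's implementation:

# Sketch of the blocking scheme: partition ratings into P x P blocks by user
# and product index, then yield P rounds of P mutually disjoint blocks.
def blocked_schedule(ratings, n, m, P):
    blocks = {(i, j): [] for i in range(P) for j in range(P)}
    for a, b, r in ratings:
        blocks[(a * P // n, b * P // m)].append((a, b, r))
    for shift in range(P):
        # Blocks in one round touch disjoint rows of alpha and disjoint rows
        # of beta, so each could be handed to a separate process or machine.
        yield [blocks[(i, (i + shift) % P)] for i in range(P)]

# Usage: each round's P block-lists can be processed in parallel with the
# online update of Algorithm 1, after which updated parameters are exchanged.
for round_blocks in blocked_schedule([(0, 0, 5.0), (3, 2, 1.0)], n=4, m=4, P=2):
    pass  # dispatch round_blocks[0..P-1] to P workers here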
5. CONCLUSION

We have presented an online version of a common collaborative filtering approach, low-rank matrix approximation. Among other desirable features, this algorithm is fast, scales linearly with the dataset size, and can be parallelized. More importantly, these benefits do not come at the expense of performance: our method performs nearly as well as several state-of-the-art techniques for collaborative prediction.

6. ADDITIONAL AUTHORS

7. REFERENCES

[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical report, 2006.
[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.
[3] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
[4] D. Billsus and M. J. Pazzani. Learning collaborative information filters. In ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, pages 46-54, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[5] D. DeCoste. Collaborative prediction using ensembles of maximum margin matrix factorizations. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, 2006. ACM Press.
[6] B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.
[7] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. ICML, 2005.

[8] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, 2005.


More information

Predicting User Preference for Movies using NetFlix database

Predicting User Preference for Movies using NetFlix database Predicting User Preference for Movies using NetFlix database Dhiraj Goel and Dhruv Batra Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA 15213 {dgoel,dbatra}@ece.cmu.edu

More information

The QOOL Algorithm for fast Online Optimization of Multiple Degree of Freedom Robot Locomotion

The QOOL Algorithm for fast Online Optimization of Multiple Degree of Freedom Robot Locomotion The QOOL Algorithm for fast Online Optimization of Multiple Degree of Freedom Robot Locomotion Daniel Marbach January 31th, 2005 Swiss Federal Institute of Technology at Lausanne Daniel.Marbach@epfl.ch

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

Chapter 7. Feature Selection. 7.1 Introduction

Chapter 7. Feature Selection. 7.1 Introduction Chapter 7 Feature Selection Feature selection is not used in the system classification experiments, which will be discussed in Chapter 8 and 9. However, as an autonomous system, OMEGA includes feature

More information

called Restricted Boltzmann Machines for Collaborative Filtering

called Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Ruslan Salakhutdinov rsalakhu@cs.toronto.edu Andriy Mnih amnih@cs.toronto.edu Geoffrey Hinton hinton@cs.toronto.edu University of Toronto, 6 King

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore Steven C.H. Hoi School of Computer Engineering Nanyang Technological University Singapore Acknowledgments: Peilin Zhao, Jialei Wang, Hao Xia, Jing Lu, Rong Jin, Pengcheng Wu, Dayong Wang, etc. 2 Agenda

More information

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

More information

University of Lille I PC first year list of exercises n 7. Review

University of Lille I PC first year list of exercises n 7. Review University of Lille I PC first year list of exercises n 7 Review Exercise Solve the following systems in 4 different ways (by substitution, by the Gauss method, by inverting the matrix of coefficients

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

How High a Degree is High Enough for High Order Finite Elements?

How High a Degree is High Enough for High Order Finite Elements? This space is reserved for the Procedia header, do not use it How High a Degree is High Enough for High Order Finite Elements? William F. National Institute of Standards and Technology, Gaithersburg, Maryland,

More information

Solving Three-objective Optimization Problems Using Evolutionary Dynamic Weighted Aggregation: Results and Analysis

Solving Three-objective Optimization Problems Using Evolutionary Dynamic Weighted Aggregation: Results and Analysis Solving Three-objective Optimization Problems Using Evolutionary Dynamic Weighted Aggregation: Results and Analysis Abstract. In this paper, evolutionary dynamic weighted aggregation methods are generalized

More information

Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

6 Scalar, Stochastic, Discrete Dynamic Systems

6 Scalar, Stochastic, Discrete Dynamic Systems 47 6 Scalar, Stochastic, Discrete Dynamic Systems Consider modeling a population of sand-hill cranes in year n by the first-order, deterministic recurrence equation y(n + 1) = Ry(n) where R = 1 + r = 1

More information

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE Alex Lin Senior Architect Intelligent Mining alin@intelligentmining.com Outline Predictive modeling methodology k-nearest Neighbor

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

On the Interaction and Competition among Internet Service Providers

On the Interaction and Competition among Internet Service Providers On the Interaction and Competition among Internet Service Providers Sam C.M. Lee John C.S. Lui + Abstract The current Internet architecture comprises of different privately owned Internet service providers

More information

1 Introduction to Matrices

1 Introduction to Matrices 1 Introduction to Matrices In this section, important definitions and results from matrix algebra that are useful in regression analysis are introduced. While all statements below regarding the columns

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information