Online Collaborative Filtering


Jacob Abernethy
Department of Computer Science, UC Berkeley, Berkeley, CA

Kevin Canini
Department of Computer Science, UC Berkeley, Berkeley, CA
kevin@cs.berkeley.edu

John Langford
Yahoo! Research New York, 45 W. 18th St., New York, NY

Alex Simma
Department of Computer Science, UC Berkeley, Berkeley, CA
asimma@cs.berkeley.edu

ABSTRACT

We present an algorithm for learning a rank-k matrix factorization online for collaborative filtering tasks. This algorithm has several nice properties: it is naturally designed to handle data attributes or features, it scales linearly with k and the number of ratings, it does not require that we hold all data in memory, and it can be easily parallelized across multiple cores or an entire cluster. The algorithm performs well in practice; in particular, when tested on the well-known dataset from the Netflix Prize, it achieves an impressive error rate very quickly.

Categories and Subject Descriptors

G.3 [Mathematics of Computing]: Probability and Statistics - Statistical Computing; I.2.6 [Artificial Intelligence]: Learning - Parameter Learning

Keywords

collaborative filtering, online learning, low-rank matrix factorization

1. INTRODUCTION

Problem. The problem of collaborative filtering is typically defined as the task of inferring consumer preferences: given an observed set of product preferences for a set of users, can we accurately predict the unobserved preferences? As we have all discovered, finding a good book to buy in a bookstore can be rather difficult, given the immense number of books to choose from. On the other hand, we expect that a user is more likely to enjoy reading a particular book if other users with similar taste have also enjoyed it.

This prediction problem has become popular recently, especially with the advent of recommendation systems employed by companies such as Amazon.com, Yahoo!, and Netflix. Accurately predicting a user's preferences is potentially very lucrative, since a user is likely to purchase more frequently if given good recommendations. As consumers' access to product choices grows, the ability to bring high-quality recommendations to their attention becomes even more important. As evidence of the value of such a system, Netflix has recently released a dataset with 100 million examples and offered a $1,000,000 prize for a substantial improvement upon their own recommendation algorithm.

Notation. We define a collaborative filtering problem as a distribution D over triples (a, b, r) ∈ A × B × R, where A and B are finite sets of size n and m respectively. We are given a set M of triples {(a, b, r)} and want to find a function f(·, ·) which minimizes the expected squared error E_{(a,b,r)∼D}[(f(a, b) − r)²]. Typically, we think of A as our set of users, B as our set of products, and r as user a's rating of product b. In most movie recommendation datasets, r is a number in {1, 2, 3, 4, 5}, as in the number of stars, although in other settings we may only be given r ∈ {0, 1}, as in liked/disliked.

Past Work. A common approach is to reduce the prediction problem to finding a low-rank approximation to the partially observed A × B matrix of observations: given a rank parameter k ≪ min(n, m), we infer the unobserved entries by finding the matrix X of rank k which best fits our observations. This natural linear factor model has been analyzed quite thoroughly [4, 5, 7, 8].
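As a concrete illustration of the squared-error objective defined in the Notation paragraph above, the following short Python sketch (not taken from the paper) estimates it on a sample of observed triples; the predictor and the example triples are hypothetical:

# A minimal sketch (not from the paper) of the empirical squared-error
# objective: the average of (f(a, b) - r)^2 over a sample of observed triples.
def mean_squared_error(f, triples):
    errors = [(f(a, b) - r) ** 2 for (a, b, r) in triples]
    return sum(errors) / len(errors)

# Hypothetical usage with a constant predictor and two made-up ratings.
sample = [("alice", "some_movie", 4.0), ("bob", "another_movie", 2.0)]
print(mean_squared_error(lambda a, b: 3.0, sample))  # -> 1.0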
If we assume that the space of user preferences is low-dimensional, then we expect that all rows (or columns) of our matrix can be described by linear combinations of just a few rows (or columns). In other words, we would expect that each user's preferences can be linearly compressed to just a few parameters. The low-rank approach has also recently been generalized into a kernel framework [1]. This method allows us to find a matrix factorization which incorporates user and/or product features. Prediction accuracy is improved when additional information, such as a user's age or a movie's genre, is available to the algorithm.

There are a host of other practical problems associated with these low-rank approximation techniques. Unfortunately, low-rank matrix factorization requires solving a non-convex optimization problem. A simple convex method, Maximum Margin Matrix Factorization [8], was proposed, which relies on a trace-norm regularization, yet it requires retaining the entire A × B matrix of variables and is thus quite slow. Various speedups [5, 7] have been suggested, yet again at the price of non-convexity. Furthermore, such optimizations are effectively not scalable: datasets are growing quickly while standard optimization techniques scale as O(N²). The training set of the Netflix Prize, for example, includes 100 million ratings from 500,000 users and 18,000 movies; it is challenging even to retain such a dataset in memory on most consumer PCs.

Perhaps the most serious drawback is that low-rank matrix factorization techniques are currently only useful in the batch prediction model. In batch learning problems we are given all labeled data ahead of time. However, in most practical applications of collaborative filtering, data is arriving constantly, new users are being added, and new products are being developed and offered. We want to incorporate such information online so we can make up-to-date predictions on the fly, without having to re-optimize from scratch with each new piece of data.

What we do. We present an algorithm for learning a low-rank matrix online. This algorithm has a number of attractive properties:

1. The algorithm need only observe a single data point at a time.
2. The running time scales linearly with the amount of data and the rank parameter k (although performance can be improved with several passes over the data). Each update has no dependence on n or m.
3. User/product features can be incorporated, using the model presented in [1].
4. The algorithm can trivially adapt to new users and new products.
5. It can be parallelized across multiple cores or an entire cluster for computational speedup.

This paper shows that several collaborative filtering methods can be realized in an online framework, and that such online algorithms perform as well as batch-style methods, are much faster, and are scalable. We provide a number of experiments on both real and artificial datasets, and we discuss details on how to do parallelization within the MapReduce framework. Furthermore, we tested our algorithm on various datasets, including that of the Netflix Prize, and we report performance that is nearly state-of-the-art.

Outline. The paper is organized as follows. In Section 2 we describe the basic low-rank matrix factorization problem, and we review how these methods can be generalized to a kernelized framework for incorporating features. In Section 3 we describe our online matrix factorization algorithm, both with and without features. In Section 4 we provide a number of experiments, and we conclude in Section 5.

2. FORMAL SETTING

In this section we describe the basic low-rank matrix factorization problem, as well as the generalization to the kernelized low-rank framework introduced by [1].

2.1 Low-Rank Matrix Factorization

Given a rank parameter k, we can define a low-rank matrix X as the product of an n × k matrix U = [u_al] and an m × k matrix V = [v_bl], so that X = UV^T. We may think of U as a compressed version of X into only k columns, where the matrix V^T represents the reconstruction function. Similarly, we may consider V^T as a compression of the rows of X.
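To make the storage argument concrete, here is a small numpy sketch, with illustrative sizes rather than the paper's, of a rank-k matrix represented by its two factors:

import numpy as np

# Sketch: a rank-k matrix stored as two factors U (n x k) and V (m x k).
n, m, k = 1000, 500, 5
rng = np.random.default_rng(0)
U = rng.normal(size=(n, k))
V = rng.normal(size=(m, k))

X = U @ V.T                         # full n x m matrix, rank at most k
assert np.linalg.matrix_rank(X) <= k

# A single entry X[a, b] only needs row U[a] and row V[b]:
a, b = 3, 7
assert np.isclose(X[a, b], U[a] @ V[b])
# Storage: (n + m) * k numbers for the factors vs. n * m for the full matrix.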
Given a fully observed matrix M, the optimal rank-k approximation to M is U*V*^T, where U* and V* are defined as

    (U*, V*) = argmin_{U ∈ R^{n×k}, V ∈ R^{m×k}} ||M − UV^T||_F²,    (1)

where ||X||_F is the Frobenius norm of the matrix X, i.e., sqrt(Σ_{a,b} x_ab²). This low-rank approximation problem is well studied and can be reduced to a Singular Value Decomposition (SVD) of the matrix M: roughly speaking, we simply need to find the k largest singular values and project the columns of M onto the corresponding singular vectors. Interestingly, while the optimization problem described above is not convex, there are polynomial-time algorithms for computing a matrix's SVD.

When a matrix is only partially observed, as is the case in the collaborative filtering problem, we do not have an obvious way to compute an SVD, since the notions of eigenvalue and eigenvector are no longer defined. However, we can still define an optimal low-rank approximation. If we let U_a be the a-th row of U, and V_b be the b-th row of V, then given our partially observed set of entries M = {(a, b, r)}, we can define the low-rank approximation to M as U*V*^T, where U* and V* are found by the following optimization:

    (U*, V*) = argmin_{U ∈ R^{n×k}, V ∈ R^{m×k}} Σ_{(a,b,r) ∈ M} (U_a V_b^T − r)².    (2)

In many cases, this optimization is modified by adding a regularization term, so as to avoid overfitting U and V to the observed ratings.
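In the fully observed case, equation (1) can be solved directly with an SVD; a minimal numpy sketch (random test matrix, illustrative dimensions):

import numpy as np

# Sketch: optimal rank-k approximation of a fully observed matrix M via SVD,
# as in equation (1). The test matrix and sizes are illustrative.
def best_rank_k(M, k):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]   # keep the k largest singular values

rng = np.random.default_rng(1)
M = rng.normal(size=(50, 30))
M_k = best_rank_k(M, k=5)
print(np.linalg.norm(M - M_k, "fro"))       # Frobenius approximation error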

The optimization in (2) is non-convex, so methods have been proposed to cast the low-rank approximation problem as a convex optimization problem. Notably, [8] points out that we can force a low-rank solution by regularizing with the trace norm. If an n × n matrix X has singular values λ_1, …, λ_n, then the trace norm of X is defined as ||X||_Σ = Σ_i λ_i. We can think of the rank of a matrix as the l_0 norm of its vector of singular values; the trace norm is the corresponding l_1 norm, and it can be shown to be the tightest convex relaxation of the rank on the unit ball of the spectral norm. We could then solve

    X̂ = argmin_{X ∈ R^{n×m}} Σ_{(a,b,r) ∈ M} (X_ab − r)² + λ||X||_Σ,    (3)

where λ is a regularization parameter. It can be shown that X̂ is low-rank, given an appropriate choice of λ.

A major obstacle with the above approach is that we are required to hold the entire solution X̂ in memory and optimize over nm variables. This is certainly infeasible when either n or m is large. This is a major advantage of the factorization formulation in (2): by solving for U and V, we need only maintain (n + m)k variables, and we can obtain a prediction for entry (i, j) easily by computing U_i V_j^T. As considered in [7], one can replace the trace-norm formulation (3) with a factored formulation that is much more efficient, both in terms of speed and memory, although it unfortunately does require solving a non-convex problem. Here, we maintain matrices U and V as before and we solve

    argmin_{U, V} Σ_{(a,b,r) ∈ M} (U_a V_b^T − r)² + λ(||U||_F² + ||V||_F²).    (4)

This simpler formulation is justified by the following alternative representation of the trace norm:

    ||X||_Σ = inf_{U, V : X = UV^T} ½ (||U||_F² + ||V||_F²).
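This identity is easy to check numerically; the following numpy sketch (with an arbitrary random matrix, not data from the paper) compares the sum of singular values with the Frobenius-norm bound attained by the SVD-based factorization:

import numpy as np

# Sketch: numerical check of ||X||_Sigma = min_{X = U V^T} (1/2)(||U||_F^2 + ||V||_F^2),
# using the SVD-based factorization that attains the minimum.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 25))

Us, s, Vt = np.linalg.svd(X, full_matrices=False)
trace_norm = s.sum()                    # sum of singular values

U = Us * np.sqrt(s)                     # U = U_s diag(sqrt(s))
V = Vt.T * np.sqrt(s)                   # V = V_s diag(sqrt(s)), so X = U V^T
assert np.allclose(U @ V.T, X)
frob_bound = 0.5 * (np.linalg.norm(U, "fro")**2 + np.linalg.norm(V, "fro")**2)
print(trace_norm, frob_bound)           # the two values agree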
2.2 Matrix Factorization with Attributes

We now show how the above low-rank factorization techniques can be generalized to handle user and/or product features, such as user demographics or product attributes. The problem of finding a low-rank approximation can be cast as that of finding a low-rank function f(a, b) in a tensor product of two reproducing kernel Hilbert spaces. Considering the problem in this way, we can work with general classes of functions f that operate on both a user's identity and his known features. The matrix factorization problem is then just a special case of this more general framework.

We now briefly review the method of [1]. We would like to estimate a function f : A × B → R having observed a finite set of triples {(a, b, r)} ⊂ A × B × R. We assume we are given two positive semi-definite kernel functions, κ : A × A → R and γ : B × B → R. A kernel can be thought of as a comparison function, i.e., some kind of feature-similarity metric between data points. As an example, without access to any side information about the elements of A, a trivial kernel might be κ(a, a') = 1 when a = a' and 0 otherwise. This is often referred to as the Dirac kernel. We let K and G be the reproducing kernel Hilbert spaces (RKHS) of the kernels κ and γ, respectively.

Let us now consider the tensor product K ⊗ G. This product can be defined as the closure of the set of finite sums of products of functions in K and G,

    K ⊗ G := cl { f(a, b) = Σ_l u_l(a) v_l(b) : u_l ∈ K, v_l ∈ G }.

We call each pair u(a)v(b) an atomic term. Given a function f ∈ K ⊗ G, if f can be written with k atomic terms, f(a, b) = Σ_{l=1}^k u_l(a) v_l(b), and k is minimal, then we say that f has rank k. Notice that when the spaces A and B are finite, a function f(a, b) = Σ_{l=1}^k u_l(a) v_l(b) in K ⊗ G is effectively a matrix X, where we define X_ab := f(a, b). More importantly, when the function f(a, b) has rank k, this corresponds exactly to X having rank k. Conveniently, we can write the factorization out explicitly: X = UV^T, where U_al := u_l(a) and V_bl := v_l(b). In particular, when κ and γ are Dirac kernels, K and G contain arbitrary functions on A and B, so any u_l ∈ K can be written as (u_l(a_1), …, u_l(a_n)) = (u_1l, …, u_nl), and similarly for any v_l ∈ G. Thus, for these simple choices of κ and γ, the set of rank-k functions f ∈ K ⊗ G is exactly the set of matrices {UV^T : U ∈ R^{n×k}, V ∈ R^{m×k}}.

There is a nice alternative representation of K ⊗ G. Define the tensor product kernel κ⊗ : (A × B) × (A × B) → R as

    κ⊗((a, b), (a', b')) = κ(a, a') γ(b, b').    (5)

This is known to be a positive semi-definite kernel [3, p. 70], and so we denote by H the associated RKHS of κ⊗. A classical result of Aronszajn [2] states that H is identically K ⊗ G. The space H is equipped with a norm such that ||u ⊗ v|| = ||u||_K ||v||_G, and thus ||Σ_l u_l ⊗ v_l||² = Σ_{l,p} ⟨u_l, u_p⟩_K ⟨v_l, v_p⟩_G. Let H_k be the set of functions in H of rank at most k. Given that we want to estimate an f ∈ H_k, we can write a general optimization problem

    min_{f ∈ H_k} E_{(a,b,r)} [(f(a, b) − r)²] + λ||f||²,    (6)

where the expectation is uniform over the set of observed preferences {(a, b, r)}. Assuming A and B to be of finite size n and m respectively, now let K and G be the n × n and m × m kernel matrices of κ and γ. It is proven in [1] that the above optimization has the following equivalent formulation:

    min_{α ∈ R^{n×k}, β ∈ R^{m×k}} E_{(a,b,r)} [ ( Σ_{l=1}^k (Kα_l)_a (Gβ_l)_b − r )² ] + λ Σ_{i=1}^k Σ_{j=1}^k (α_i^T K α_j)(β_i^T G β_j).    (7)

This representer theorem tells us that our optimal solution is of the form X = Kα(Gβ)^T for an n × k matrix α and an m × k matrix β. In particular, when κ and γ are Dirac kernels, K and G are identity matrices, and we are simply solving (2) with U = α and V = β, plus a regularization term. Using the Dirac kernel corresponds to representing a user or product simply by its identity.

Of course, when additional information about our users and products is available, it may be beneficial to instead use this information to make our prediction. We achieve this simply by an appropriate choice of the kernels κ and γ. For example, assume that we are given a C-dimensional feature vector f_a for each user a; then a potential kernel could be a linear kernel κ(a, a') = f_a · f_a', or alternatively an RBF kernel κ(a, a') = exp(−||f_a − f_a'||² / (2σ²)). As shown in [1], it is better to include both identity information and feature information. This can be achieved simply by choosing a mixed kernel

    κ(a, a') = η δ(a, a') + (1 − η) κ̂(a, a'),    (8)

where δ is the Dirac kernel and κ̂ is some kernel based on user features.

3. ONLINE MATRIX FACTORIZATION

There is a long history of converting batch learning methods into online gradient descent procedures. For example, a perceptron with a margin update rule can be thought of as the online gradient descent version of a support vector machine. In the next two subsections we show how to derive an efficient online gradient descent update rule for the matrix factorization problem.

3.1 Online Low-Rank Approximation Without Features

Assume we are now working with the following model: our prediction matrix is X = αβ^T, where α ∈ R^{n×k} and β ∈ R^{m×k}. Given a single observation (a, b, r), the loss on this observation is l(X, (a, b, r)). For convenience, we use the square loss, although other convex loss functions can be implemented quite easily. Thus,

    l(X, (a, b, r)) = (r − X_ab)² = (r − Σ_{l=1}^k α_al β_bl)².

When we differentiate with respect to α and β, we see that the only nonzero partial derivatives are those with respect to α_al and β_bl for l = 1, …, k. This is computationally convenient, since an update then depends only on k and not on n or m, which can be quite large. Differentiating gives

    ∂l/∂α_al = −2 β_bl (r − Σ_{l'=1}^k α_al' β_bl'),
    ∂l/∂β_bl = −2 α_al (r − Σ_{l'=1}^k α_al' β_bl').

With these derivatives in mind, we immediately obtain the following algorithm.

Algorithm 1 Online Low-Rank Approximation
1: Parameters: n users, m products, desired rank k, step size τ.
2: Input: Observations {(a, b, r)}
3: Initialize α ∈ R^{n×k} and β ∈ R^{m×k} randomly.
4: for each (a, b, r) do
5:   Compute r̂ = Σ_{l=1}^k α_al β_bl.
6:   for l = 1 to k do
7:     α_al ← α_al + 2τ β_bl (r − r̂)
8:     β_bl ← β_bl + 2τ α_al (r − r̂)
9:   end for
10: end for
11: Output: α and β.

This algorithm can be easily adjusted for different loss functions. A very common metric in collaborative filtering is Mean Absolute Error (MAE), and several results are reported with this performance measure. The online update would then become

    α_al ← α_al + 2τ β_bl sign(r − r̂),
    β_bl ← β_bl + 2τ α_al sign(r − r̂).
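For concreteness, the following Python/numpy sketch implements the squared-loss updates of Algorithm 1; the initialization scale, step size, and toy ratings are placeholder assumptions rather than the paper's settings:

import numpy as np

def online_low_rank(ratings, n, m, k=10, tau=0.01, seed=0):
    """Sketch of Algorithm 1: one SGD pass over (a, b, r) triples.

    a in {0, ..., n-1} indexes users, b in {0, ..., m-1} indexes products.
    """
    rng = np.random.default_rng(seed)
    alpha = 0.1 * rng.normal(size=(n, k))
    beta = 0.1 * rng.normal(size=(m, k))
    for a, b, r in ratings:
        r_hat = alpha[a] @ beta[b]          # predicted rating
        err = r - r_hat
        alpha_a = alpha[a].copy()           # keep the pre-update user row so
        alpha[a] += 2 * tau * err * beta[b] # both updates use old values
        beta[b] += 2 * tau * err * alpha_a
    return alpha, beta

# Toy usage: two users, two items, made-up ratings on a 1-5 scale.
alpha, beta = online_low_rank([(0, 0, 5.0), (1, 1, 2.0)], n=2, m=2, k=3)
print(alpha @ beta.T)                       # current prediction matrix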
3.2 Online Low-Rank Approximation With Features

Let us now apply the same approach described above to the kernelized matrix factorization methods of Section 2.2. Assume we are given kernel matrices K and G and parameter matrices α and β, so that our current prediction matrix is X = Kα(Gβ)^T. We are now given a data point (a, b, r) and we need to compute the gradient of the loss

    l(X, (a, b, r)) = (r − (Kα)_a (Gβ)_b^T)².

Unfortunately, if K and G are nontrivial matrices, this gradient no longer depends solely on the parameters α_al and β_bl, l = 1, …, k. This suggests that a gradient step could require an update on as many as (n + m)k variables! Ideally, we want our update to be independent of n and m, as these might grow very large. Fortunately, this can still be achieved if we assume that our kernels κ and γ are linear, so that we can work with the primal formulation rather than the dual described in (7).

Recall that our model is to learn a factorization in terms of functions, which we can write as

    f(a, b) = u_1(a) v_1(b) + u_2(a) v_2(b) + … + u_k(a) v_k(b).

In the dual formulation, we would define

    u_l(a') = Σ_{a=1}^n α_al κ(a, a'),    v_l(b') = Σ_{b=1}^m β_bl γ(b, b').

So far, we have assumed nothing about the representation of a and b. To get an efficient update, we assume that a consists of an identifier and a set of features, a = (1_a, x_a1, …, x_aC), where 1_a is the a-th standard basis vector in R^n and (x_a1, …, x_aC) is a C-dimensional feature vector for user a. Since our functions are linear, we can write u_l(a) = ū_l(1_a) + û_l(x_a1, …, x_aC), where ū_l and û_l are linear functions. We now define ū_l in terms of coefficients α_1l, …, α_nl and û_l in terms of coefficients µ_l1, …, µ_lC, and thus

    u_l(a) = Σ_{a'=1}^n α_a'l (1_a)_a' + Σ_{c=1}^C µ_lc x_ac = α_al + Σ_{c=1}^C µ_lc x_ac.

Following the same analysis, we can describe v_l(b) similarly.

Assume that product b has D features y_b1, …, y_bD; then we can describe v_l(b) in terms of parameters β_bl and ν_ld for d = 1, …, D. That is,

    v_l(b) = β_bl + Σ_{d=1}^D ν_ld y_bd.

Thus our prediction is of the form

    f(a, b) = Σ_{l=1}^k (α_al + µ_l: · x_a:)(β_bl + ν_l: · y_b:).

If we let r̂ := f(a, b) as above and use the square loss l(f, r) = (r̂ − r)², then we have the following derivatives:

    ∂l/∂α_al = −2 (β_bl + ν_l: · y_b:)(r − r̂),
    ∂l/∂β_bl = −2 (α_al + µ_l: · x_a:)(r − r̂),
    ∂l/∂µ_lc = −2 x_ac (β_bl + ν_l: · y_b:)(r − r̂),
    ∂l/∂ν_ld = −2 y_bd (α_al + µ_l: · x_a:)(r − r̂).

Given these derivatives, we obtain the following online algorithm.

Algorithm 2 Online Low-Rank Approximation with Features
1: Parameters: n users, m products, desired rank k, step size τ.
2: Input: Observations {(a, b, r)}
3: Input: User feature matrix [x_ac] ∈ R^{n×C}, product feature matrix [y_bd] ∈ R^{m×D}.
4: Initialize α ∈ R^{n×k}, β ∈ R^{m×k}, µ ∈ R^{k×C}, ν ∈ R^{k×D} randomly.
5: for each (a, b, r) do
6:   Compute r̂ = Σ_{l=1}^k (α_al + µ_l: · x_a:)(β_bl + ν_l: · y_b:).
7:   for l = 1 to k do
8:     α_al ← α_al + 2τ (β_bl + ν_l: · y_b:)(r − r̂)
9:     β_bl ← β_bl + 2τ (α_al + µ_l: · x_a:)(r − r̂)
10:    for c = 1 to C do
11:      µ_lc ← µ_lc + 2τ x_ac (β_bl + ν_l: · y_b:)(r − r̂)
12:    end for
13:    for d = 1 to D do
14:      ν_ld ← ν_ld + 2τ y_bd (α_al + µ_l: · x_a:)(r − r̂)
15:    end for
16:  end for
17: end for
18: Output: α, β, µ, and ν.

Let us take a step back for a moment and consider intuitively what the above algorithms are doing. In Algorithm 1, where we do not include features, we want to learn k parameters α_a1, …, α_ak for each user and k parameters β_b1, …, β_bk for each movie so that the rating of user a on movie b is predicted as Σ_l α_al β_bl. When we receive a rating (a, b, r), we determine how much our prediction differs from this observation, and we adjust our parameters slightly to account for this error. We hope that, after we have seen all of our data, we have learned good parameters, in the sense that they have accurate predictive power.

In Algorithm 2, the terms α_al and β_bl are replaced by α_al + µ_l: · x_a: and β_bl + ν_l: · y_b:. Recall that µ is a matrix of parameters that are user-independent, while the parameter α_al is particular to user a (and similarly for ν and β_bl). We can now think of a user as being described by k attributes, where attribute l is the sum of a user-specific parameter α_al and the value µ_l: · x_a:, which depends on this user's features. If we assume that the features x_a: are highly informative, i.e., they could accurately predict the user's ratings, then we would expect to accurately learn µ_l: · x_a:, and the α_al's would not be necessary. On the other hand, with uninformative features, we would hope that the α_al's would be learned appropriately and µ would be roughly 0.
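A corresponding Python/numpy sketch of Algorithm 2 is given below; the feature matrices, initialization scale, and step size are illustrative assumptions:

import numpy as np

def online_low_rank_features(ratings, X_feat, Y_feat, k=10, tau=0.01, seed=0):
    """Sketch of Algorithm 2: online updates with user/product features.

    X_feat is an (n x C) user-feature matrix, Y_feat an (m x D) product-feature
    matrix; ratings is an iterable of (a, b, r) triples.
    """
    n, C = X_feat.shape
    m, D = Y_feat.shape
    rng = np.random.default_rng(seed)
    alpha = 0.1 * rng.normal(size=(n, k))
    beta = 0.1 * rng.normal(size=(m, k))
    mu = np.zeros((k, C))
    nu = np.zeros((k, D))
    for a, b, r in ratings:
        u = alpha[a] + mu @ X_feat[a]   # k-vector: alpha_al + mu_l: . x_a:
        v = beta[b] + nu @ Y_feat[b]    # k-vector: beta_bl + nu_l: . y_b:
        err = r - u @ v
        alpha[a] += 2 * tau * err * v
        beta[b] += 2 * tau * err * u
        mu += 2 * tau * err * np.outer(v, X_feat[a])
        nu += 2 * tau * err * np.outer(u, Y_feat[b])
    return alpha, beta, mu, nu

# Toy usage with random features (purely illustrative).
rng = np.random.default_rng(1)
Xf, Yf = rng.normal(size=(4, 3)), rng.normal(size=(5, 2))
params = online_low_rank_features([(0, 1, 4.0), (2, 3, 1.0)], Xf, Yf, k=2)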
A note on regularization. In both of the above algorithms, it is often beneficial to include a regularization term in this online optimization, especially when k is large relative to the number of observed entries. A natural choice of regularization penalty is the one derived in (7),

    ||f||² = Σ_{l=1}^k Σ_{p=1}^k ⟨u_l, u_p⟩ ⟨v_l, v_p⟩.

In the primal form this regularization becomes

    ||f||² = Σ_{l=1}^k Σ_{p=1}^k (α_:l · α_:p + µ_l: · µ_p:)(β_:l · β_:p + ν_l: · ν_p:).

When we omit features, the terms with µ and ν simply disappear. It is straightforward to differentiate the above norm and thus include this penalization in the update step. However, naively computing the derivative requires O(k(n + m)) running time. This can be improved by maintaining the two k × k matrices [α_:l · α_:p + µ_l: · µ_p:]_{lp} and [β_:l · β_:p + ν_l: · ν_p:]_{lp}. On each example (a, b, r) we adjust µ, ν, α_al, and β_bl for each l = 1, …, k, which requires updating the entries of each of the above matrices, costing a total of O(k²(C + D)) per update. This is mildly expensive, but since ||f||² does not change substantially on each update, it is roughly sufficient to compute it only once for each pass through the data.

Without features, an alternative regularization is described in (4), in which we control the trace norm of our solution matrix X. As shown, this can be represented as a sum of Frobenius norms, which in terms of α and β is exactly

    R(α, β) = Σ_{i=1}^n Σ_{l=1}^k α_il² + Σ_{j=1}^m Σ_{l=1}^k β_jl².

The gradient of this is very simple to compute:

    ∂R/∂α_il = 2α_il,    ∂R/∂β_jl = 2β_jl.

In our experiments without features we preferred to employ this regularization, for simplicity as well as computational speedup.
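As a small illustration, the Frobenius-norm penalty turns the online update into a weight-decay step; in this sketch the regularization weight lam and step size tau are placeholder values:

import numpy as np

# Sketch: the gradient terms 2*alpha_il and 2*beta_jl from the penalty above
# act as simple weight decay on the rows touched by the current rating.
def regularized_step(alpha_a, beta_b, r, tau=0.01, lam=1e-5):
    err = r - alpha_a @ beta_b
    new_alpha_a = alpha_a + 2 * tau * (err * beta_b - lam * alpha_a)
    new_beta_b = beta_b + 2 * tau * (err * alpha_a - lam * beta_b)
    return new_alpha_a, new_beta_b

a_row, b_row = np.array([0.1, 0.2]), np.array([0.3, -0.1])
print(regularized_step(a_row, b_row, r=4.0))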

4. IMPLEMENTATION AND EXPERIMENTS

We tested our algorithms on a number of datasets, and we now provide a discussion of several implementation issues as well as empirical results.

4.1 Datasets

A common collaborative filtering dataset is the MovieLens One Million Ratings dataset. This data was compiled by the GroupLens Research Project, who created a movie recommendation system, MovieLens, where users can input movie ratings and receive movie suggestions. The data consists of 1,000,209 ratings from 6,040 users of 3,900 movies. Each rating is an integer from 1 to 5. In addition to ratings, additional information is also provided about each user and each movie. User data consists of gender, one of several occupations (such as "scientist" or "farmer"), and an age attribute taking one of the values Under 18, 18-24, 25-34, 35-44, 45-49, 50-55, or 56+. Each movie is labelled with one of several genres including, for example, Action, Drama, and Thriller.

We also include results from the Netflix Prize data. In October 2006, Netflix released a real movie rating dataset, generated by their customers, and offered a $1,000,000 prize for a 10% improvement over their current movie recommendation algorithm. The data includes 100,000,000 ratings from roughly 480,000 users and 17,700 movies. This dataset is immensely larger than what had previously been publicly available and has presented an interesting challenge to researchers in machine learning and data mining. Netflix did not release user features for this data, while it did reveal movie titles and genre information.

The final dataset we use is a toy dataset that we generated. We created 10,000 users and 5,000 movies, and we defined each user a by a k-length feature vector x_a and each movie b by a k-length feature vector y_b. The coordinates of these vectors were generated from a Gaussian distribution, and we chose k = 10. We chose a random sample of ratings for each user, and the ratings were generated as r_ab = x_a^T A y_b + ε for a randomly chosen k × k matrix A and a normally distributed noise term ε.
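The following numpy sketch reproduces this synthetic-data recipe at a reduced scale; the sizes, number of ratings per user, and noise standard deviation are illustrative choices, not the paper's:

import numpy as np

# Sketch of the synthetic-data recipe above: ratings follow r_ab = x_a^T A y_b + eps.
rng = np.random.default_rng(0)
n_users, n_movies, k, ratings_per_user = 100, 50, 10, 20

X = rng.normal(size=(n_users, k))       # user feature vectors x_a
Y = rng.normal(size=(n_movies, k))      # movie feature vectors y_b
A = rng.normal(size=(k, k))             # random k x k mixing matrix

triples = []
for a in range(n_users):
    for b in rng.choice(n_movies, size=ratings_per_user, replace=False):
        noise = rng.normal(scale=0.1)   # noise scale is an assumption
        triples.append((a, int(b), X[a] @ A @ Y[b] + noise))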
The Master's thesis of Ben Marlin [6] provides a comprehensive survey of a number of collaborative filtering algorithms, their running times, and a large performance comparison. This work provides a convenient set of benchmarks to take advantage of. The two major datasets used in that comparison are MovieLens, as described above, and the EachMovie dataset of HP/Compaq Research. Unfortunately, the EachMovie dataset is no longer publicly available. In general, we try to report results using the same methods and performance measures used in [6].

4.2 Implementation Issues

4.2.1 Learning Rate and Stopping

With insufficient regularization, running the algorithm until convergence does not yield good accuracy, due to overfitting. As can be seen in Figure 1, the impact of overtraining is very significant. Furthermore, while increasing regularization shrinks the generalization error, the choice of λ required to avoid overtraining tends to lead to insufficiently flexible models. Therefore, we apply early stopping by monitoring the error on a validation set and decreasing the step size, or terminating the training procedure, when the validation error increases for several iterations. Empirically, we observe that the final test error of the early-stopped predictor is fairly insensitive to the choice of λ across a wide range.

[Figure 1: Normalized MAE as a function of the number of passes through the data, for λ = 0, 1e-6, 3e-6, 1e-5, and 3e-5. The numbers, in normalized MAE, differ from those we report elsewhere, since a large test set was withheld.]

One potential concern with early stopping is that since some users and products have many more observed ratings than others, the method may overtrain one type of user while undertraining another. However, experiments in which accuracy statistics were calculated for different strata of users indicate that the performance on all users and products, regardless of the number of ratings, peaks around the same time. Additionally, this validation-set-based step size control greatly speeds up convergence, allowing fast descent toward the neighborhood of the solution and then slowing down for fine refinement. The effect is similar to that of simulated annealing: the choice of direction based on one rating (or a few, if momentum is used) randomizes the search, and the decrease in step size is analogous to decreasing the temperature.

4.2.2 Momentum

The algorithm as described so far uses the gradient at a single rating as an estimate of the global gradient. An obvious alternative is using multiple training points to estimate the gradient; standard gradient descent uses the complete training set. However, our empirical results suggest that standard gradient descent is an order of magnitude slower for the model presented in this paper. A good compromise is to use momentum, which estimates the gradient as a moving average of the gradients at the recently considered ratings. Specifically, we replace updates of the form

    α_al ← α_al + 2τ β_bl (r − r̂)

with

    α^m_al ← γ α^m_al + (1 − γ)(2τ β_bl (r − r̂)),
    α_al ← α_al + α^m_al,

where γ ∈ [0, 1) is the momentum parameter. This smooths the updates and allows the use of a higher step size τ, leading to faster convergence.
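In code, the momentum variant of a single update step looks roughly as follows; the values of tau and gamma are placeholders:

import numpy as np

# Sketch of the momentum update above: the raw per-rating step is replaced by
# an exponential moving average with parameter gamma in [0, 1).
def momentum_step(alpha_a, beta_b, m_alpha, m_beta, r, tau=0.01, gamma=0.9):
    err = r - alpha_a @ beta_b
    m_alpha = gamma * m_alpha + (1 - gamma) * (2 * tau * err * beta_b)
    m_beta = gamma * m_beta + (1 - gamma) * (2 * tau * err * alpha_a)
    return alpha_a + m_alpha, beta_b + m_beta, m_alpha, m_beta

a_row, b_row = np.array([0.1, 0.2]), np.array([0.3, -0.1])
state = momentum_step(a_row, b_row, np.zeros(2), np.zeros(2), r=5.0)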

4.2.3 Randomizing the Order

Iterating through the ratings in a fixed order, particularly one where all the ratings for a single user or product are adjacent, leads to the parameters oscillating in a periodic pattern and impedes convergence to an optimum. Therefore, it is better to cycle through the ratings in a random order.

4.3 Performance Results

4.3.1 Comparison with Full Gradient Descent

While this algorithm was designed to perform a kind of online gradient descent, a surprising result is that we actually outperform full-fledged gradient descent, that is, where each descent step is computed from all the data. This improvement is in terms of both speed and accuracy. The improved accuracy is possibly due to the fact that single-rating gradient descent is less prone to local minima, or even less prone to overfitting. On the other hand, the improvement in speed is probably due to the increased number of updates per parameter for each pass through the data.

Using the MovieLens dataset, we performed several runs and tried various step-size and regularization parameters for both our online descent method and full gradient descent. The RMSE performance we plot for each run is the optimal test performance obtained before overfitting began to occur. In Figure 2, we show the comparison of optimal test performance against the number of seconds needed to reach the optimal test error. In Figure 3, we plot the test error vs. time for the best-performing parameter choices for each algorithm.

[Figure 2: Performance/time trade-off of our algorithm (online gradient descent) versus full gradient descent, plotted as minimum RMSE against runtime (sec). Each point is obtained from one run with a particular fixed step size and regularization parameter.]

[Figure 3: Test error (RMSE) as a function of training time (sec). Online descent converges much faster and to a better minimum than full gradient descent.]

4.3.2 Standard Matrix Factorization

For the MovieLens dataset, we evaluate performance with the Normalized Mean Absolute Error (NMAE) metric, in order for our results to be comparable to those reported in [6]. We estimate the accuracy using leave-one-out cross-validation, training on a dataset of 5000 users and using a validation set to determine the optimal value of the regularization parameter λ. The resulting model achieves an NMAE of 0.433 with k = 5, which is comparable to Attitude, whose NMAE is the best reported in [6].

We also performed a preliminary test of our algorithm on the Netflix dataset. With k = 20 and no features included, our algorithm achieved an RMSE roughly 3% better than Netflix's own algorithm, with the model taking only an hour to train.

4.3.3 Improvements with Features

We report preliminary results of our algorithm on our toy dataset, where we have access to both user and product features. Here, for a user a and movie b, the true ratings are computed as x_a^T A y_b + ε given feature vectors x_a and y_b. We tried providing the algorithm with subvectors of x_a and y_b of various sizes, and we consider the resulting test performance. In Figure 4, we see the performance in RMSE as more and more user and movie features are provided to the algorithm.

[Figure 4: Effect of features on RMSE for the toy data: test RMSE as a function of the number of features used.]

It is interesting to note that the biggest improvement is achieved with the addition of just one feature.

4.4 Parallelization

Here we discuss a very simple approach for implementing a parallel version of our algorithm. For simplicity, we focus primarily on the simple matrix factorization model that does not use features; with some work, this can be generalized to the version that does use features.

The online nature of the low-rank approximation algorithm without features lends itself well to a simple parallelization scheme. The key observation is that within the outer loop of Algorithm 1 (line 4), the only variables that are modified are α_al and β_bl, for l ∈ {1, …, k}. This property allows the loop to be executed independently for a collection of ratings {(a, b, r)} in which each a value and each b value appear at most once. We can generalize this notion to collections of rating subsets, rather than collections of single ratings. If we wish to parallelize the algorithm across P processors, the entire observation set M is partitioned into P² blocks along a-value and b-value boundaries:

    M_11  M_12  …  M_1P
    M_21  M_22  …  M_2P
     ⋮     ⋮         ⋮
    M_P1  M_P2  …  M_PP

With this partition of the observation matrix, given a collection of blocks {M_ij : 1 ≤ i, j ≤ P} in which each i value and each j value appears at most once, an execution of the gradient-descent step on a particular rating alters only α and β parameters that are relevant to other ratings within the same block. So we choose P blocks {M_ij} such that each i and j value appears exactly once and assign each processor to one of these chosen blocks. Each processor then performs the online gradient descent step for each of the ratings in its assigned block, disseminates its updated parameter values, and the process is repeated on a new set of P blocks, as sketched below. Initial testing suggests that the low-rank approximation algorithm without features can achieve a speedup of roughly 2.5x when distributed across four processors within a single machine.
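The following Python sketch illustrates the scheduling only; the block assignment by integer division and the toy ratings are assumptions, not the paper's implementation:

# Sketch of the blocking scheme: partition ratings into P x P blocks by user
# and product index, then yield P rounds of P mutually disjoint blocks.
def blocked_schedule(ratings, n, m, P):
    blocks = {(i, j): [] for i in range(P) for j in range(P)}
    for a, b, r in ratings:
        blocks[(a * P // n, b * P // m)].append((a, b, r))
    for shift in range(P):
        # Blocks in one round touch disjoint rows of alpha and disjoint rows
        # of beta, so each could be handed to a separate process or machine.
        yield [blocks[(i, (i + shift) % P)] for i in range(P)]

# Usage: each round's P block-lists can be processed in parallel with the
# online update of Algorithm 1, after which updated parameters are exchanged.
for round_blocks in blocked_schedule([(0, 0, 5.0), (3, 2, 1.0)], n=4, m=4, P=2):
    pass  # dispatch round_blocks[0..P-1] to P workers here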
5. CONCLUSION

We have presented an online version of a common collaborative filtering approach, low-rank matrix approximation. Among other desirable features, this algorithm is fast, scales linearly with the dataset size, and can be parallelized. More importantly, these benefits do not come at the expense of performance: our method performs nearly as well as several state-of-the-art techniques for collaborative prediction.

6. ADDITIONAL AUTHORS

7. REFERENCES

[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical report, 2006.
[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.
[3] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
[4] D. Billsus and M. J. Pazzani. Learning collaborative information filters. In ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, pages 46-54, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[5] D. DeCoste. Collaborative prediction using ensembles of maximum margin matrix factorizations. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, 2006. ACM Press.
[6] B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.
[7] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. ICML, 2005.

[8] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, 2005.


More information

Predicting User Preference for Movies using NetFlix database

Predicting User Preference for Movies using NetFlix database Predicting User Preference for Movies using NetFlix database Dhiraj Goel and Dhruv Batra Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA 15213 {dgoel,dbatra}@ece.cmu.edu

More information

The QOOL Algorithm for fast Online Optimization of Multiple Degree of Freedom Robot Locomotion

The QOOL Algorithm for fast Online Optimization of Multiple Degree of Freedom Robot Locomotion The QOOL Algorithm for fast Online Optimization of Multiple Degree of Freedom Robot Locomotion Daniel Marbach January 31th, 2005 Swiss Federal Institute of Technology at Lausanne Daniel.Marbach@epfl.ch

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

Chapter 7. Feature Selection. 7.1 Introduction

Chapter 7. Feature Selection. 7.1 Introduction Chapter 7 Feature Selection Feature selection is not used in the system classification experiments, which will be discussed in Chapter 8 and 9. However, as an autonomous system, OMEGA includes feature

More information

called Restricted Boltzmann Machines for Collaborative Filtering

called Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Ruslan Salakhutdinov rsalakhu@cs.toronto.edu Andriy Mnih amnih@cs.toronto.edu Geoffrey Hinton hinton@cs.toronto.edu University of Toronto, 6 King

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore Steven C.H. Hoi School of Computer Engineering Nanyang Technological University Singapore Acknowledgments: Peilin Zhao, Jialei Wang, Hao Xia, Jing Lu, Rong Jin, Pengcheng Wu, Dayong Wang, etc. 2 Agenda

More information

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

More information

University of Lille I PC first year list of exercises n 7. Review

University of Lille I PC first year list of exercises n 7. Review University of Lille I PC first year list of exercises n 7 Review Exercise Solve the following systems in 4 different ways (by substitution, by the Gauss method, by inverting the matrix of coefficients

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

How High a Degree is High Enough for High Order Finite Elements?

How High a Degree is High Enough for High Order Finite Elements? This space is reserved for the Procedia header, do not use it How High a Degree is High Enough for High Order Finite Elements? William F. National Institute of Standards and Technology, Gaithersburg, Maryland,

More information

Solving Three-objective Optimization Problems Using Evolutionary Dynamic Weighted Aggregation: Results and Analysis

Solving Three-objective Optimization Problems Using Evolutionary Dynamic Weighted Aggregation: Results and Analysis Solving Three-objective Optimization Problems Using Evolutionary Dynamic Weighted Aggregation: Results and Analysis Abstract. In this paper, evolutionary dynamic weighted aggregation methods are generalized

More information

Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

6 Scalar, Stochastic, Discrete Dynamic Systems

6 Scalar, Stochastic, Discrete Dynamic Systems 47 6 Scalar, Stochastic, Discrete Dynamic Systems Consider modeling a population of sand-hill cranes in year n by the first-order, deterministic recurrence equation y(n + 1) = Ry(n) where R = 1 + r = 1

More information

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE Alex Lin Senior Architect Intelligent Mining alin@intelligentmining.com Outline Predictive modeling methodology k-nearest Neighbor

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

On the Interaction and Competition among Internet Service Providers

On the Interaction and Competition among Internet Service Providers On the Interaction and Competition among Internet Service Providers Sam C.M. Lee John C.S. Lui + Abstract The current Internet architecture comprises of different privately owned Internet service providers

More information

1 Introduction to Matrices

1 Introduction to Matrices 1 Introduction to Matrices In this section, important definitions and results from matrix algebra that are useful in regression analysis are introduced. While all statements below regarding the columns

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information