Primal-Dual methods for sparse constrained matrix completion
Yu Xin, MIT CSAIL        Tommi Jaakkola, MIT CSAIL

Abstract

We develop scalable algorithms for regular and non-negative matrix completion. In particular, we base the methods on trace-norm regularization, which induces a low-rank predicted matrix. The regularization problem is solved via a constraint-generation method that explicitly maintains a sparse dual and the corresponding low-rank primal solution. We provide a new dual block coordinate descent algorithm for solving the dual problem with a few spectral constraints. Empirical results illustrate the effectiveness of our method in comparison to recently proposed alternatives.

1 Introduction

Matrix completion lies at the heart of collaborative filtering. Given a sparsely observed rating matrix of users and items, the goal is to reconstruct the remaining entries so as to be able to recommend additional items to users. By placing constraints on the underlying rating matrix, and assuming favorable conditions for the selection of observed entries (cf. [12]), we may be able to recover the matrix from sparse [6] and potentially also noisy observations. The most commonly used constraint on the underlying matrix is that user preferences vary only across a few prominent underlying dimensions. This assumption can be made explicit, so that the matrix has low rank, or introduced as a regularizer (a convex relaxation of rank) [7, 14]. An explicit low-rank assumption about the underlying matrix results in a non-convex optimization problem. For this reason, recent work has focused on optimization algorithms (e.g., [9, 16, 15, 13]) and statistical recovery questions (e.g., [6, 2]) pertaining to convex relaxations of rank, or on alternative convex formulations (e.g., [1]).

Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume 22 of JMLR: W&CP 22.
Copyright 2012 by the authors.

Our focus in this paper is on optimization algorithms for trace-norm regularized matrix completion. The trace norm (a.k.a. nuclear norm) is a 1-norm penalty on the singular values of the matrix and thus leads to a low-rank solution with sufficient regularization. One key difficulty with this approach is that while the resulting optimization problem is convex, it is not differentiable. A number of approaches have been suggested to deal with this problem (e.g., [4]). In particular, many variants of proximal gradient methods (e.g., [9]) are effective, as they fold the non-smooth regularization penalty into a simple proximal update that makes use of the singular value decomposition. An alternative strategy is to cast the trace norm itself as a minimization problem over weighted Frobenius norms that are both convex and smooth.

Another key difficulty arises from the sheer size of the full rating matrix, even if the observations are sparse. This is a problem for all convex optimization approaches (e.g., proximal gradient methods) that explicitly maintain the predicted rating matrix (a rank constraint would destroy convexity). The scaling problem can be remedied by switching to the dual regularization problem, where dual variables are associated with the few observed entries [14]. The standard dual approach would, however, require us to solve an additional reconstruction problem (using complementary slackness) to realize the actual rating matrix. We introduce here a new primal-dual approach that avoids both of these problems, leading to a sparse dual optimization problem while also iteratively generating a low-rank estimated matrix. The approach scales well to large and especially sparse matrices that are unsuitable for methods that maintain the full matrix or require singular value decompositions. We also extend the approach to non-negative matrix completion.
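For concreteness, the SVD-based proximal step that such methods apply to the full matrix in each iteration is singular value soft-thresholding. A minimal sketch (our illustration, not the implementation of [9]; the function name is ours):

```python
import numpy as np

def prox_trace_norm(W, tau):
    """Proximal operator of tau * trace norm: soft-threshold the singular
    values. The SVD of the full matrix here is the per-iteration cost that
    the dual approach of this paper is designed to avoid."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Singular values (3, 1) shrink to (1, 0): the smaller one is zeroed out,
# which is how sufficient regularization yields a low-rank matrix.
P = prox_trace_norm(np.diag([3.0, 1.0]), tau=2.0)
```

The update illustrates why these methods produce low rank but scale poorly: every iteration touches all n x m entries.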
2 Trace norm regularization for matrix completion

We consider the well-known sparse matrix completion problem [6]. The goal is to predict the missing entries of an n x m (n >= m) real-valued target matrix Y based on a small subset of observed entries. We index the observed entries as Y_{u,i}, (u,i) ∈ D. A typical approach would constrain the predicted matrix W to have low rank: W = U V^T, where the smaller dimension of U and V is (substantially) less than m. A convex relaxation of the corresponding estimation problem is obtained via trace norm regularization (e.g., [7, 14]):

  J(W) = Σ_{(u,i)∈D} Loss(Y_{u,i}, W_{u,i}) + λ ||W||_*   (1)

where Loss(Y_{u,i}, W_{u,i}) is a convex loss function of W_{u,i}, such as the squared loss. To simplify the ensuing notation, we assume that the loss function depends only on the difference between W and Y, so that Loss(Y_{u,i}, W_{u,i}) = Loss(Y_{u,i} − W_{u,i}). The trace norm ||W||_* is a 1-norm penalty on the singular values of the matrix, i.e., ||W||_* = Σ_{j=1}^m σ_j(W), where σ_j(W) ≥ 0 is the j-th singular value of W. For large enough λ, some of the singular values are set exactly to zero, resulting in a low-rank predicted matrix W.

One key optimization challenge comes from the regularization penalty ||W||_*, which is convex but not differentiable. Effective algorithms based on (accelerated) proximal gradient updates have been developed to overcome this issue (e.g., [9, 2]). However, such methods update the full matrix W in each iteration and are thus unsuitable for large problems.

3 A primal-dual algorithm

We consider here primal-dual algorithms that operate only over the observed entries, thus leading to a sparse estimation problem. In our formulation, the dual variables arise from Legendre conjugate transformations of the loss functions:

  Loss(z) = max_q { q z − Loss*(q) }   (2)

where the conjugate function Loss*(q) is also convex. For example, for the squared loss, Loss*(q) = q^2 / 2. The resulting dual variables Q_{u,i}, (u,i) ∈ D, one for each observed entry, can be viewed as a sparse n x m matrix Q with the remaining entries set to zero. The trace norm regularization in the primal formulation translates into a semi-definite constraint in the dual (cf. [7, 14]):

  maximize   tr(Q^T Y) − Σ_{(u,i)∈D} Loss*(Q_{u,i})   (3)
  subject to  Q^T Q ⪯ λ^2 I

We derive the dual problem in the next section, as the derivation will be reused with slight modification in the context of non-negative matrix factorization. Note that the constraint Q^T Q ⪯ λ^2 I can be equivalently written as ||Q b||^2 ≤ λ^2 for all b such that ||b|| = 1. In other words, it is a constraint on the spectral norm of Q. We will solve the dual by iteratively adding spectral constraints that are violated.

3.1 Derivation of the dual

The trace norm of a matrix W with factorization W = U V^T is given by (e.g., [14])

  ||W||_* = min_{W = U V^T} (1/2) ( ||U||_F^2 + ||V||_F^2 )   (4)

where U and V need not be low rank, so the factorization always exists. Consider then the extended symmetric matrix

  X = [ U U^T   U V^T
        V U^T   V V^T ]   (5)

whose upper-right and lower-left blocks equal W and W^T, respectively. As a result, ||W||_* = min_{X: X_UR = W} tr(X)/2, where X_UR is the upper-right block of X. By construction, X is symmetric and positive semi-definite. We can also expand the observations into a symmetric matrix

  Z = [ 0     Y
        Y^T   0 ]   (6)

and use Ω as the index set identifying observed entries in the upper-right and lower-left blocks of Z. Formally, Ω = {(i, j) : 1 ≤ i, j ≤ m + n, (i, j − n) ∈ D or (i − n, j) ∈ D}. With these definitions, the primal trace norm regularization problem is equivalent to the extended problem

  min_{X ∈ S}  Σ_{(u,i)∈Ω} Loss(Z_{u,i} − X_{u,i}) + λ tr(X)   (7)

where S is the cone of positive semi-definite matrices in R^{(m+n)×(m+n)}. We introduce dual variables for each observed entry of Z via Legendre conjugate transformations of the loss functions. The Lagrangian involving both primal and dual variables is then

  L(A, X, Z) = Σ_{(u,i)∈Ω} [ A_{u,i} (Z_{u,i} − X_{u,i}) − Loss*(A_{u,i}) ] + λ tr(X)   (8)

where A can be written as

  A = [ 0     Q
        Q^T   0 ]   (9)
and Q is sparse such that Q_{u,i} = 0 if (u,i) ∉ D. To introduce dual variables for the positive semi-definite constraint, we consider the dual cone of S, defined as

  S* = { E ∈ R^{(m+n)×(m+n)} : tr(E^T M) ≥ 0 for all M ∈ S }   (10)

S is self-dual, so that S* = S. The Lagrangian is then

  L(A, E, X, Z) = Σ_{(u,i)∈Ω} [ A_{u,i} (Z_{u,i} − X_{u,i}) − Loss*(A_{u,i}) ] + λ tr(X) − tr(E^T X)   (11)

To solve for X, we set ∂L(A, E, X, Z)/∂X = 0 and get

  λ I − A = E ∈ S   (12)

Inserting the solution back into the Lagrangian, we obtain

  L(A) = Σ_{(u,i)∈Ω} [ A_{u,i} Z_{u,i} − Loss*(A_{u,i}) ]   (13)

Since E no longer appears in the Lagrangian, we can replace equation (12) with the constraint λ I − A ∈ S. The formulation can be simplified by considering the original Q and Y, which correspond to the upper-right blocks of A and Z. The dual problem in these variables is then

  maximize   tr(Q^T Y) − Σ_{(u,i)∈D} Loss*(Q_{u,i})   (14)
  subject to  λ^2 I − Q^T Q ∈ S

3.2 Solving the dual

There are three challenges with the dual. The first is the separation problem, i.e., finding vectors b, ||b|| = 1, such that ||Q b||^2 > λ^2, where Q refers to the current solution. Each such b can be found efficiently precisely because Q is sparse. The second challenge is efficiently solving the dual under a few spectral constraints. For this, we derive a new block coordinate descent approach (cf. [16]). The third challenge concerns reconstructing the primal matrix W from the dual solution. By including only a few spectral constraints in the dual, we obtain a low-rank primal solution. We can thus explicitly maintain a primal-dual pair for the relaxed problem (fewer constraints) throughout the optimization.

The separation problem. We iteratively add constraints represented by b. The separation problem we must solve is then: given the current solution Q, find b for which ||Q b||^2 > λ^2. This is solved by finding the eigenvector of Q^T Q with the largest eigenvalue. For example, the power method, initialized with b = randn(m, 1) and iterating

  b ← Q^T Q b,   b ← b / ||b||   (15)

is particularly effective with sparse matrices. If ||Q b||^2 > λ^2 for the resulting b, then we add the single constraint ||Q b||^2 ≤ λ^2 to the dual. Note that b does not have to be solved for exactly; any b provides a valid, albeit not necessarily the tightest, constraint. An existing constraint can also easily be tightened later with a few iterations of the power method, starting from the current b. We can fold this tightening together with the block coordinate optimization discussed below.

Primal-dual block coordinate descent. The second problem is to solve the dual subject to ||Q b^l||^2 ≤ λ^2, l = 1, …, k, instead of the full set of constraints Q^T Q ⪯ λ^2 I. This partially constrained dual problem can be written as

  max_Q  tr(Q^T Y) − Σ_{(u,i)∈D} Loss*(Q_{u,i}) − Σ_{l=1}^k h( ||Q b^l||^2 − λ^2 )   (16)

where h(z) = ∞ if z > 0 and h(z) = 0 otherwise. The advantage of this form is that each term h(||Q b||^2 − λ^2) is a convex function of Q (a convex non-decreasing function of a convex quadratic function of Q, therefore itself convex). We can thus obtain its conjugate dual:

  h( ||Q b||^2 − λ^2 ) = sup_{ξ ≥ 0} { ξ ( ||Q b||^2 − λ^2 ) / 2 }
                      = sup_{ξ ≥ 0, v} { v^T Q b − ||v||^2/(2ξ) − ξ λ^2 / 2 }

where the latter form is jointly concave in (ξ, v) for fixed b. This step lies at the core of our algorithm: by relaxing the supremum over (ξ, v), we obtain a linear, not quadratic, function of Q. The new Lagrangian is given by

  L(Q, V, ξ) = tr(Q^T Y) − Σ_{(u,i)∈D} Loss*(Q_{u,i}) − Σ_{l=1}^k [ (v^l)^T Q b^l − ||v^l||^2/(2ξ^l) − ξ^l λ^2 / 2 ]
             = tr( Q^T ( Y − Σ_l v^l (b^l)^T ) ) − Σ_{(u,i)∈D} Loss*(Q_{u,i}) + Σ_{l=1}^k [ ||v^l||^2/(2ξ^l) + ξ^l λ^2 / 2 ]

which can be maximized with respect to Q for fixed (ξ^l, v^l), l = 1, …, k. Indeed, our primal-dual algorithm iteratively minimizes L(V, ξ) = max_Q L(Q, V, ξ) while explicitly maintaining Q = Q(V, ξ). Note also that by maximizing over Q, we reconstitute the loss terms from tr(Q^T (Y − W)) − Σ_{(u,i)∈D} Loss*(Q_{u,i}). The predicted rank-k matrix W is therefore obtained explicitly: W = Σ_l v^l (b^l)^T.
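The separation step only needs sparse matrix-vector products with Q. A minimal Python sketch (the helper name, iteration count, and toy matrix are our illustrative assumptions, not the authors' code):

```python
import numpy as np
import scipy.sparse as sp

def find_violating_direction(Q, lam, n_iter=50, seed=0):
    """Power method on Q^T Q for the sparse dual matrix Q. Returns
    (b, ||Q b||^2); if ||Q b||^2 > lam**2, the spectral constraint
    ||Q b||^2 <= lam**2 for this b is violated and can be added."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(Q.shape[1])    # b = randn(m, 1)
    b /= np.linalg.norm(b)
    for _ in range(n_iter):
        b = Q.T @ (Q @ b)                  # two sparse mat-vec products
        b /= np.linalg.norm(b)
    return b, float(np.linalg.norm(Q @ b) ** 2)

# Toy sparse 4x3 dual matrix whose top singular value is 3.
Q = sp.csr_matrix([[3.0, 0, 0], [0, 1.0, 0], [0, 0, 0.5], [0, 0, 0]])
b, sq = find_violating_direction(Q, lam=2.0)
# sq ≈ 9 > lam^2 = 4, so b defines a violated constraint.
```

Each iteration costs O(|D|), which is what makes the separation step cheap for sparse observation sets.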
By allowing k constraints in the dual, we search over rank-k predictions. The iterative algorithm proceeds by selecting one l, fixing (ξ^j, v^j), j ≠ l, and optimizing L(V, ξ) with respect to (ξ^l, v^l). As a result, L(V, ξ) is monotonically decreasing. Let W̄ = Σ_{j≠l} v^j (b^j)^T, where only the observed entries need to be evaluated. We consider two variants for minimizing L(V, ξ) with respect to (ξ^l, v^l):

Method 1: When the loss function is not strictly convex, we first solve for ξ^l as a function of v^l, yielding ξ^l = ||v^l|| / λ. Recall that L(V, ξ) involves a maximization over Q that reconstitutes the loss terms with the predicted matrix W = W̄ + v^l (b^l)^T. Dropping terms not depending on v^l, the relevant part of min_{ξ^l} L(V, ξ) is

  Σ_{(u,i)∈D} Loss( Y_{u,i} − W̄_{u,i} − v^l_u b^l_i ) + λ ||v^l||   (17)

which is a simple primal group-Lasso minimization problem over v^l. Since b^l is fixed, the problem is convex and can be solved by standard methods.

Method 2: When the loss function is strictly convex, we can first solve v^l = ξ^l Q b^l and explicitly maintain the maximizing Q = Q(ξ^l). The minimization problem over ξ^l ≥ 0 is

  min_{ξ^l ≥ 0} max_Q { tr( Q^T (Y − W̄) ) − ξ^l ( ||Q b^l||^2 − λ^2 ) / 2 − Σ_{(u,i)∈D} Loss*(Q_{u,i}) }   (18)

where we have dropped all terms that remain constant during the iterative step. For the squared loss, for example, Q(ξ^l) is obtained in closed form:

  Q_{u, I_u}(ξ^l) = ( Y_{u, I_u} − W̄_{u, I_u} ) ( I − ξ^l b^l_{I_u} (b^l_{I_u})^T / ( 1 + ξ^l ||b^l_{I_u}||^2 ) )

where I_u = { i : (u,i) ∈ D } is the index set of observed entries in row (user) u. In general, an iterative solution is required. The optimal value ξ^l ≥ 0 is subsequently set as follows. If ||Q(0) b^l||^2 ≤ λ^2, then ξ^l = 0. Otherwise, since ||Q(ξ^l) b^l||^2 is monotonically decreasing as a function of ξ^l, we find (e.g., via bracketing) ξ^l > 0 such that ||Q(ξ^l) b^l||^2 = λ^2.

4 Constraints and convergence

The algorithm described above monotonically decreases L(V, ξ) for any fixed set of constraints corresponding to b^l, l = 1, …, k.
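For the squared loss, the Method 2 row-wise closed form and the bracketing search for ξ^l can be sketched as follows (illustrative Python; the function names, the per-row residual representation, and the toy instance are our assumptions):

```python
import numpy as np

def Q_of_xi(R_rows, b, obs, xi):
    """Row-wise closed-form maximizer Q(xi) for the squared loss.
    R_rows[u] holds the residuals (Y minus the current partial
    prediction) of row u over its observed index set obs[u];
    b is the fixed constraint direction."""
    out = []
    for r, I in zip(R_rows, obs):
        bI = b[I]
        coef = xi / (1.0 + xi * float(bI @ bI))
        out.append(r - coef * float(r @ bI) * bI)
    return out

def constraint_value(R_rows, b, obs, xi):
    """||Q(xi) b||^2, computed over observed entries only."""
    rows = Q_of_xi(R_rows, b, obs, xi)
    return sum(float(q @ b[I]) ** 2 for q, I in zip(rows, obs))

def solve_xi(R_rows, b, obs, lam, tol=1e-10):
    """Pick xi >= 0: zero if the constraint already holds at xi = 0;
    otherwise bracket and bisect for ||Q(xi) b||^2 = lam^2, using the
    fact that ||Q(xi) b||^2 is monotonically decreasing in xi."""
    if constraint_value(R_rows, b, obs, 0.0) <= lam ** 2:
        return 0.0
    hi = 1.0
    while constraint_value(R_rows, b, obs, hi) > lam ** 2:
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if constraint_value(R_rows, b, obs, mid) > lam ** 2:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# One user with residuals [3, 1] on observed columns {0, 1}, b = e_1:
# ||Q(xi) b||^2 = (3 / (1 + xi))^2, so lam = 2 gives xi = 0.5.
xi = solve_xi([np.array([3.0, 1.0])], np.array([1.0, 0.0]), [[0, 1]], lam=2.0)
```

All operations touch only the observed entries, so each coordinate step is linear in |D|.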
Any additional constraint further decreases this function. We consider here how quickly the solution approaches the dual optimum as new constraints are added. To this end, we write our algorithm more generally as minimizing

  F(B) = max_Q { tr(Q^T Y) − Σ_{(u,i)∈D} Loss*(Q_{u,i}) + (1/2) tr( (λ^2 I − Q^T Q) B ) }   (19)

where B is a positive semi-definite m x m matrix. For any fixed set of constraints, our algorithm minimizes F(B) over the cone B = Σ_{l=1}^k ξ^l b^l (b^l)^T that corresponds to the constraints tr( (λ^2 I − Q^T Q) b^l (b^l)^T ) ≥ 0 for all l. F(B) is clearly convex as a point-wise supremum of linear functions of B. For simplicity, we assume that the loss function Loss(z) is strictly convex with a Lipschitz continuous derivative (e.g., the squared loss). In this case, F(B) is also differentiable with gradient dF(B) = (1/2)(λ^2 I − Q^T Q), where Q = Q(B) is unique. Moreover, the B that minimizes F(B) remains bounded, as does Q(B). Under these assumptions, F(B) has a Lipschitz continuous derivative but need not itself be strictly convex.

Theorem 4.1. Under the assumptions discussed above, F(B^r) − F(B*) = O(1/r).

Proof. Consider the sequence (B^0, Q^0), (B^1, Q^1), … generated by the primal-dual algorithm. Since dF(B) is Lipschitz continuous with some constant L, F(B) has a quadratic upper bound:

  F(B) ≤ F(B^{r−1}) + ⟨ dF(B^{r−1}), B − B^{r−1} ⟩ + (L/2) ||B − B^{r−1}||_F^2   (20)

In each iteration, we add a constraint (b^r)^T Q^T Q b^r ≤ λ^2 and fully optimize Q and the ξ^i so as to satisfy all the constraints added so far. Prior to including the r-th constraint, complementary slackness gives, for each i < r, ξ^i (b^i)^T (λ^2 I − Q^T Q) b^i = 0, where Q is optimized based on the first r − 1 constraints. This means that tr( (λ^2 I − Q^T Q) Σ_{i<r} ξ^i b^i (b^i)^T ) = tr( (λ^2 I − Q^T Q) B^{r−1} ) = 0, i.e., ⟨ dF(B^{r−1}), B^{r−1} ⟩ = 0. Let B^r = B^{r−1} + ξ^r b^r (b^r)^T, where b^r = argmax_{||b||=1} b^T Q^T Q b attains the largest eigenvalue λ_r^2 (λ_r > λ). Here λ_r is the largest eigenvalue prior to including b^r.
As a result, the upper bound evaluated at B^r yields

  F(B^r) ≤ F(B^{r−1}) + ξ^r (λ^2 − λ_r^2) + 2L (ξ^r)^2

Setting ξ^r = (λ_r^2 − λ^2) / (4L) (non-negative), we get

  F(B^r) ≤ F(B^{r−1}) − (λ_r^2 − λ^2)^2 / (8L)

The optimal value of ξ^r may be different. Moreover, at each iteration B^r, including all of its previous constraints, is optimized. Thus the actual decrease may be somewhat larger. From the convexity of F(B) and complementary slackness, we have

  F(B^{r−1}) − F(B*) ≤ ⟨ dF(B^{r−1}), B^{r−1} − B* ⟩ = −⟨ dF(B^{r−1}), B* ⟩

where B* is the optimum. Since B* is a positive semi-definite matrix,

  −⟨ dF(B^{r−1}), B* ⟩ ≤ −⟨ Proj_neg( dF(B^{r−1}) ), B* ⟩ ≤ (λ_r^2 − λ^2) tr(B*) = (λ_r^2 − λ^2) C

where Proj_neg(·) is the projection onto negative semi-definite matrices, and we used the fact that the minimum eigenvalue of dF(B^{r−1}) = (1/2)(λ^2 I − Q^T Q) is (λ^2 − λ_r^2)/2. Combining this with the sufficient decrease above, we get

  F(B^{r−1}) − F(B^r) ≥ ( F(B^{r−1}) − F(B*) )^2 / (8 L C^2)

The above inequality implies

  1/( F(B^r) − F(B*) ) − 1/( F(B^{r−1}) − F(B*) ) ≥ ( F(B^{r−1}) − F(B*) ) / ( 8 L C^2 ( F(B^r) − F(B*) ) ) ≥ 1/(8 L C^2)

Summing over all r, we get

  F(B^r) − F(B*) ≤ 8 L C^2 ( F(B^0) − F(B*) ) / ( r ( F(B^0) − F(B*) ) + 8 L C^2 ) = O(1/r)

5 Convex formulation of non-negative matrix factorization

The non-negative matrix factorization (NMF) problem is typically written as

  min_{U,V}  Σ_{(u,i)∈D} ( Y_{u,i} − (U V^T)_{u,i} )^2   (21)

where U ∈ R^{n×k} and V ∈ R^{m×k} are both matrices with non-negative entries. The optimization problem defined in this way is non-convex and NP-hard in general [20]. As before, we look for convex relaxations of this problem via trace norm regularization. This step requires some care, however. The set of matrices U V^T, where U and V are non-negative and of rank k, is not the same as the set of rank-k matrices with non-negative elements. The difference is well known, even if not fully characterized.

Let us first introduce some basic concepts, namely completely positive and co-positive matrices. A matrix W is completely positive if there exists a non-negative matrix B such that W = B B^T. It follows from the definition that every completely positive matrix is also positive semi-definite. A matrix C is co-positive if v^T C v ≥ 0 for any v ≥ 0, i.e., any vector with non-negative entries. Unlike completely positive matrices, co-positive matrices may be indefinite. We denote the set of completely positive matrices and the set of co-positive matrices by S_+ and C_+, respectively. C_+ is the dual cone of S_+, i.e., (S_+)* = C_+ (the dimensions of these cones are clear from context).

Following the derivation of the dual in Section 3, we consider expanded symmetric matrices X and Z in R^{(m+n)×(m+n)}, defined as before. Instead of requiring X to be positive semi-definite, however, for NMF we constrain X to be a completely positive matrix, i.e., X ∈ S_+. The primal optimization problem can be given as

  min_{X ∈ S_+}  Σ_{(u,i)∈Ω} Loss(X_{u,i} − Z_{u,i}) + λ tr(X)   (22)

Notice that the primal problem involves infinitely many constraints corresponding to X ∈ S_+ and is difficult to solve directly. We start with the Lagrangian involving primal and dual variables:

  L(A, C, X, Z) = Σ_{(u,i)∈Ω} [ A_{u,i} (Z_{u,i} − X_{u,i}) − Loss*(A_{u,i}) ] + λ tr(X) − tr(C^T X)   (23)

where C ∈ C_+ and A = [0, Q; Q^T, 0] as before. Setting ∂L(A, C, X, Z)/∂X = 0, we get

  λ I − A = C ∈ C_+   (24)

Substituting the result back into the Lagrangian, the dual problem becomes

  maximize   tr(A^T Z) − Σ_{(u,i)∈Ω} Loss*(A_{u,i})   (25)
  subject to  λ I − A ∈ C_+

where A is a sparse matrix in the sense that A_{u,i} = 0 if (u,i) ∉ Ω. The co-positivity constraint is equivalent to v^T (λ I − A) v ≥ 0 for all v ≥ 0. Analogously to the primal-dual algorithm, in each iteration v^l = [a^l; b^l] is selected as the vector that violates the constraint the most. This vector can be found by solving

  max_{v ≥ 0, ||v|| ≤ 1} v^T A v = max_{a, b ≥ 0, ||a||^2 + ||b||^2 ≤ 1} 2 a^T Q b   (26)

where Q is the matrix of dual variables (before expansion), a ∈ R^n, and b ∈ R^m; the factor 2 comes from the off-diagonal block structure of A. While the sub-problem of finding the most violating constraint is NP-hard in general, it is unnecessary to find the global optimum: any non-negative v that violates the constraint v^T (λ I − A) v ≥ 0 can be used to improve the current solution. At the optimum, ||a||^2 = ||b||^2 = 1/2, and thus it is equivalent to solve

  max_{a, b ≥ 0, ||a||^2 = ||b||^2 = 1/2}  a^T Q b   (27)
A local optimum can be found by alternately optimizing a and b according to

  a ← h(Q b) / ( √2 ||h(Q b)|| ),   b ← h(Q^T a) / ( √2 ||h(Q^T a)|| )   (28)

where h(x) = max(0, x) is the element-wise non-negative projection. The running time of the two operations together is directly proportional to the number of observed entries. A stationary point v_s = [a_s; b_s] satisfies ||v_s|| = 1 and is a fixed point of the updates (28), i.e., h(A v_s) is proportional to v_s within each block; this is a necessary (but not sufficient) optimality condition for (26). As a result, many elements of a and b are exactly zero, so that the decomposition is not only non-negative but sparse.

Given v^l, l = 1, …, k, the Lagrangian is then

  L(A, ξ) = tr(A^T Z) − Σ_{(u,i)∈Ω} Loss*(A_{u,i}) − Σ_l ξ^l ( (v^l)^T A v^l − λ )

where ξ^l ≥ 0 and Σ_{l=1}^k ξ^l v^l (v^l)^T ∈ S_+. To update ξ^l, we fix ξ^j (j ≠ l) and optimize the objective with respect to A and ξ^l:

  max_{A, ξ^l ≥ 0}  tr( A^T ( Z − Σ_{j≠l} ξ^j v^j (v^j)^T ) ) − Σ_{(u,i)∈Ω} Loss*(A_{u,i}) − ξ^l ( (v^l)^T A v^l − λ )

It can be seen from this formulation that the primal solution without the l-th constraint can be reconstructed from the dual as X̄ = Σ_{j≠l} ξ^j v^j (v^j)^T ∈ S_+. Only a small fraction of the entries of this matrix (those corresponding to observations) need to be evaluated. If the loss function is the squared loss, then the variables A and ξ^l have closed-form expressions. Specifically, for fixed ξ^l,

  A(ξ^l)_{u,i} = Z_{u,i} − X̄_{u,i} − ξ^l v^l_u v^l_i   (29)

If (v^l)^T A(0) v^l ≤ λ, then the optimum is ξ̂^l = 0. Otherwise,

  ξ̂^l = ( Σ_{(u,i)∈Ω} ( Z_{u,i} − X̄_{u,i} ) v^l_u v^l_i − λ ) / Σ_{(u,i)∈Ω} ( v^l_u v^l_i )^2   (30)

The update of A and ξ^l takes time linear in the number of observed elements. If we update all ξ^l in each iteration, the running time is k times longer and matches the per-iteration cost of multiplicative update methods ([22]) that aim to solve the optimization problem (21) directly. The multiplicative update method revises the decomposition into U and V iteratively while keeping U and V non-negative. Because the objective is non-convex, the algorithm gets frequently trapped in locally optimal solutions. It is also likely to converge slowly, and often requires a proper initialization scheme [23]. In contrast, the primal-dual method discussed here solves a convex optimization problem. With a large enough λ, only a few constraints are expected to be necessary, resulting in a low-rank reconstruction. The key difficulty in our approach is identifying a violating constraint: the simple iterative method may fail even though a violating constraint exists. This is where randomization is helpful. We initialize a and b randomly and run the separation algorithm several times, so as to get a reasonable guarantee of finding a violated constraint when one exists.

6 Experiments

There are many variations of our primal-dual method, pertaining to how the constraints are enforced and when they are updated. We introduced the method as one that fully optimizes all ξ^l corresponding to the available constraints prior to searching for additional constraints. Another variant, analogous to gradient-based methods, updates only the last ξ^k associated with the new constraint, without ever going back to iteratively optimize any of the previous ξ^l (l < k) or the constraints themselves. This variant adds constraints faster, as each ξ^l is updated only once, when introduced. Yet another variant updates all previous ξ^l together with their constraint vectors b^l (through the power method) before introducing a new constraint. By updating b^l as well, this approach replaces the previous constraints with tighter ones. This protocol can reduce the number of constraints that are added and is thus useful in cases where a very low-rank solution is desirable.

We compare here the two variants of our method to recent approaches proposed in the literature. We begin with the state-of-the-art method proposed in [18], which we refer to as JS. Similarly to our method, JS formulates a trace norm regularization problem using extended matrices X, Z ∈ R^{(m+n)×(m+n)} derived from the predictions and the observed data. It solves the following optimization problem:

  min_{X ⪰ 0, tr(X) = t}  Loss(X, Z)   (31)

Instead of penalizing the trace of X, JS fixes tr(X) to a constant t. In each iteration, JS finds the maximum eigenvector of the gradient, v^k, and updates X according to X ← (1 − t_k) X + t_k v^k (v^k)^T, where t_k is an optimized step size. During the optimization process, the trace norm constraint is always satisfied. Though JS and our method optimize different objective functions, if we set t = tr(X*), where X* is the optimum
matrix from our regularized objective (1), then JS will converge to the same solution in the limit.

The comparison is on the MovieLens 10M dataset, which contains 10^7 ratings of users on movies. Following the setup in [18], we use the partition r_b provided with the dataset, which has 10 test ratings per user. The regularization parameter λ in our method is set to 50. For the JS method, the corresponding value of the constant is t = tr(X*), where X* is the optimum from the variant of our method that fully optimizes all ξ^l before searching for additional constraints. In order to ensure a fair comparison with JS, rather than updating ξ^l for all l = 1, 2, …, k at iteration k, only ξ^k is updated (the first variant introduced above). In other words, each ξ^l is updated only once, when the corresponding constraint is incorporated.

Figure 1-a compares the test RMSE of JS and our method as a function of time. The test error of our method decreases much faster than that of JS. Figure 1-b compares the test error as a function of rank, i.e., as a function of iteration rather than time. Our method performs significantly better than JS when the rank is small at the beginning, owing to the better optimization step. In Figure 2-a, we compare the regularized primal objective values J(W) = L(W) + λ||W||_* as a function of time. Note that the objectives used by the methods are different, and thus JS does not directly optimize this function; however, it will converge to the same limiting value at the optimum, given how the parameters t and λ are set. From the figure, the JS objective J(W) does not converge to the optimum over the training set within the time shown, although the limiting values should agree. This is also evident in Figure 2-b, which shows the training error as a function of time: our method is consistently lower than JS.

We also compare our method to the recently proposed GECO algorithm (see [19]).
GECO seeks to solve the rank-constrained optimization problem

  min_{rank(W) ≤ r}  Loss(W, Y)   (32)

It maintains a decomposition of the form W = U V^T and, at each iteration, increases the rank of U and V by concatenating vectors u and v that correspond to the largest singular values of the gradient of Loss(W, Y). Moreover, it optimizes the expanded U and V by rotating and scaling the corresponding vectors. This consolidation step is computationally expensive but nevertheless improves the fixed-rank result. In order to compare fairly with GECO, we use the second variant of our method: before adding a new constraint in iteration k, we update all ξ^l, including b^l, for l = 1, 2, …, k a fixed number of times (q times) matched to their algorithm. Note that the variant used here differs from the one used in the comparison to JS.

Figure 1: a) test RMSE comparison of JS and our method as a function of training time. b) test RMSE comparison of JS and our method as a function of rank.

Due to the complexity of GECO, we used a smaller dataset, MovieLens 1M, which contains 10^6 ratings of 6040 users on 3706 movies. Following the setup in [19], 50% of the ratings are used for training and the rest for testing. Figure 3 compares the test RMSE of GECO and our method. GECO becomes very slow as the rank increases (the complexity can be O(r^6) as a function of the rank r). It already took 17 hours to obtain rank-9 solutions from GECO, so we did not attempt to use it on bigger datasets. Figure 3-b compares the test RMSE as a function of rank. Our method is substantially faster and needs fewer constraints to perform well.

7 Discussion

Our primal-dual algorithm is monotone and heavily exploits the sparsity of observations for efficient computation. Constraints are added to the dual problem in a sequential cutting-plane fashion. The method also maintains a low-rank primal solution corresponding to the partially constrained dual. We believe our primal-dual method is particularly suitable for sparse large-scale problems, and the empirical results confirm that the stronger systematic optimization of each constraint is beneficial in comparison to other recent approaches. There are a number of possible extensions of the basic approach. We have already shown how non-negative matrix factorization problems can be solved within a similar framework, with the caveat that the separation problem is substantially harder in that case. Other variations, such as non-negative factorization with sparse factors, are also possible.

Figure 2: a) objective J(W) as a function of time. b) training RMSE as a function of time.

Figure 3: a) test RMSE comparison of GECO and our method as a function of training time. b) test RMSE comparison of GECO and our method as a function of rank.

References

[1] J. Abernethy, F. Bach, T. Evgeniou, and J. Vert, A new approach to collaborative filtering: operator estimation with spectral regularization, Journal of Machine Learning Research, Volume 10.
[2] A. Agarwal, S. N. Negahban, and M. J. Wainwright, Fast global convergence of gradient methods for high-dimensional statistical recovery, NIPS.
[4] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences.
[5] D. S. Bernstein, Matrix Mathematics: Theory, Facts, and Formulas with Application to Linear System Theory, Princeton University Press.
[6] E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics.
[7] M. Fazel, H. Hindi, and S. Boyd, A rank minimization heuristic with application to minimum order system approximation, Proceedings of the American Control Conference, volume 6.
[8] A. Goldberg, X. Zhu, B. Recht, J. Sui, and R. Nowak, Transduction with matrix completion: three birds with one stone, NIPS.
[9] S. Ji and J. Ye, An accelerated gradient method for trace norm minimization, ICML.
[3] A. Argyriou, C. A. Micchelli, and M.
Pontil, On spectral learning, Journal of Machine Learning Research 11 (2010).
[10] A. S. Lewis, The convex analysis of unitarily invariant matrix functions, Journal of Convex Analysis, Volume 2 (1995), No. 1/2.
[11] S. Ma, D. Goldfarb, and L. Chen, Fixed point and Bregman iterative methods for matrix rank minimization, Technical Report, Department of IEOR, Columbia University.
[12] B. M. Marlin, R. S. Zemel, S. Roweis, and M. Slaney, Collaborative filtering and the missing at random assumption, UAI.
[13] T. K. Pong, P. Tseng, S. Ji, and J. Ye, Trace norm regularization: reformulations, algorithms, and multi-task learning, SIAM Journal on Optimization.
[14] N. Srebro, J. Rennie, and T. Jaakkola, Maximum-margin matrix factorization, Advances in Neural Information Processing Systems 17.
[15] K.-C. Toh and S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems, preprint, Department of Mathematics, National University of Singapore.
[16] P. Tseng, Dual coordinate ascent methods for non-strictly convex minimization, Mathematical Programming, Vol. 59, 1993.
[17] G. A. Watson, Characterization of the subdifferential of some matrix norms, Linear Algebra and its Applications, Volume 170, June 1992.
[18] M. Jaggi and M. Sulovský, A simple algorithm for nuclear norm regularized problems, ICML.
[19] S. Shalev-Shwartz, A. Gonen, and O. Shamir, Large-scale convex minimization with a low-rank constraint, ICML.
[20] S. Vavasis, On the complexity of nonnegative matrix factorization, arXiv.org.
[21] J.-B. Hiriart-Urruty and A. Seeger, A variational approach to copositive matrices, SIAM Review 52 (2010), No. 4.
[22] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS.
[23] C. Boutsidis and E. Gallopoulos, SVD based initialization: a head start for nonnegative matrix factorization, Pattern Recognition.