# When Is There a Representer Theorem? Vector Versus Matrix Regularizers


## Transcription

Journal of Machine Learning Research 10 (2009). Submitted 9/08; Revised 3/09; Published 11/09.

Andreas Argyriou, Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK

Charles A. Micchelli, Department of Mathematics and Statistics, State University of New York, The University at Albany, Albany, New York 12222, USA

Massimiliano Pontil, Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK

Editor: Ralph Herbrich

© 2009 Andreas Argyriou, Charles A. Micchelli, and Massimiliano Pontil.

### Abstract

We consider a general class of regularization methods which learn a vector of parameters on the basis of linear measurements. It is well known that if the regularizer is a nondecreasing function of the $L_2$ norm, then the learned vector is a linear combination of the input data. This result, known as the representer theorem, lies at the basis of kernel-based methods in machine learning. In this paper, we prove the necessity of the above condition, in the case of differentiable regularizers. We further extend our analysis to regularization methods which learn a matrix, a problem which is motivated by the application to multi-task learning. In this context, we study a more general representer theorem, which holds for a larger class of regularizers. We provide a necessary and sufficient condition characterizing this class of matrix regularizers and we highlight some concrete examples of practical importance. Our analysis uses basic principles from matrix theory, especially the useful notion of matrix nondecreasing functions.

Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, regularization

### 1. Introduction

Regularization in Hilbert spaces is an important methodology for learning from examples and has a long history in a variety of fields. It has been studied, from different perspectives, in statistics (see Wahba, 1990, and references therein), in optimal estimation (Micchelli and Rivlin, 1985) and recently has been a focus of attention in machine learning theory, see, for example (Cucker and Smale, 2001; De Vito et al., 2004; Micchelli and Pontil, 2005a; Shawe-Taylor and Cristianini, 2004; Vapnik, 2000) and references therein. Regularization is formulated as an optimization problem involving an error term and a regularizer. The regularizer plays an important role, in that it favors solutions with certain desirable properties. It has long been observed that certain regularizers exhibit an appealing property, called the representer theorem, which states that there exists a solution of the regularization problem that is a linear combination of the data. This property has important computational implications in the context of regularization with positive semidefinite kernels, because it transforms high or infinite-dimensional problems of this type into finite dimensional problems of the size of the number of available data (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004).

The topic of interest in this paper will be to determine the conditions under which representer theorems hold. In the first half of the paper, we describe a property which a regularizer should satisfy in order to give rise to a representer theorem. It turns out that this property has a simple geometric interpretation and that the regularizer can be equivalently expressed as a nondecreasing function of the Hilbert space norm. Thus, we show that this condition, which has already been known to be sufficient for representer theorems, is also necessary. In the second half of the paper, we depart from the context of Hilbert spaces and focus on a class of problems in which matrix structure plays an important role. For such problems, which have recently appeared in several machine learning applications, we show a modified version of the representer theorem that holds for a class of regularizers significantly larger than in the former context. As we shall see, these matrix regularizers are important in the context of multi-task learning: the matrix columns are the parameters of different regression tasks and the regularizer encourages certain dependences across the tasks.

In general, we consider problems in the framework of Tikhonov regularization (Tikhonov and Arsenin, 1977). This approach finds, on the basis of a set of input/output data $(x_1,y_1),\dots,(x_m,y_m) \in \mathcal{H}\times\mathcal{Y}$, a vector in $\mathcal{H}$ as the solution of an optimization problem. Here, $\mathcal{H}$ is a prescribed Hilbert space equipped with the inner product $\langle\cdot,\cdot\rangle$ and $\mathcal{Y}\subseteq\mathbb{R}$ a set of possible output values. The optimization problems encountered in regularization are of the type

$$\min\Big\{ E\big((\langle w,x_1\rangle,\dots,\langle w,x_m\rangle),(y_1,\dots,y_m)\big) + \gamma\,\Omega(w) : w\in\mathcal{H} \Big\}, \tag{1}$$

where $\gamma > 0$ is a regularization parameter. The function $E : \mathbb{R}^m\times\mathcal{Y}^m\to\mathbb{R}$ is called an error function and $\Omega : \mathcal{H}\to\mathbb{R}$ is called a regularizer. The error function measures the error on the data. Typically, it decomposes as a sum of univariate functions. For example, in regression, a common choice would be the sum of square errors, $\sum_{i=1}^m (\langle w,x_i\rangle - y_i)^2$. The function $\Omega$, called the regularizer, favors certain regularity properties of the vector $w$ (such as a small norm) and can be chosen based on available prior information about the target vector. In some Hilbert spaces such as Sobolev spaces the regularizer is a measure of smoothness: the smaller the norm the smoother the function.

This framework includes several well-studied learning algorithms, such as ridge regression (Hoerl and Kennard, 1970), support vector machines (Boser et al., 1992), and many more; see Schölkopf and Smola (2002) and Shawe-Taylor and Cristianini (2004) and references therein.

An important aspect of the practical success of this approach is the observation that, for certain choices of the regularizer, solving (1) reduces to identifying $m$ parameters and not $\dim(\mathcal{H})$. Specifically, when the regularizer is the square of the Hilbert space norm, the representer theorem holds: there exists a solution $\hat{w}$ of (1) which is a linear combination of the input vectors,

$$\hat{w} = \sum_{i=1}^m c_i x_i, \tag{2}$$

where $c_i$ are some real coefficients. This result is simple to prove and dates at least from the 1970s, see, for example, Kimeldorf and Wahba (1970). It is also known that it extends to any regularizer that is a nondecreasing function of the norm (Schölkopf et al., 2001). Several other variants

where $w_i, v_i$ are the $i$-th components of $w, v$ respectively. More generally, we will consider Hilbert spaces which we will denote by $\mathcal{H}$, equipped with an inner product $\langle\cdot,\cdot\rangle$. We also let $M_{d,n}$ be the linear space of $d\times n$ real matrices. If $W,Z\in M_{d,n}$ we define their Frobenius inner product as $\langle W,Z\rangle = \operatorname{tr}(W^\top Z)$, where $\operatorname{tr}$ denotes the trace of a matrix. With $S^d$ we denote the set of $d\times d$ real symmetric matrices and with $S^d_+$ ($S^d_{++}$) its subset of positive semidefinite (definite) ones. We use $\succ$ and $\succeq$ for the positive definite and positive semidefinite partial orderings, respectively. Finally, we let $O^d$ be the set of $d\times d$ orthogonal matrices.

### 2. Regularization Versus Interpolation

The line of attack which we shall follow in this paper will go through interpolation. That is, our main concern will be to obtain necessary and sufficient conditions for representer theorems that hold for interpolation problems. However, in practical applications one encounters regularization problems more frequently than interpolation problems.

First of all, the family of the former problems is more general than that of the latter ones. Indeed, an interpolation problem can be simply obtained in the limit as the regularization parameter goes to zero (Micchelli and Pinkus, 1994). More importantly, regularization enables one to trade off interpolation of the data against smoothness or simplicity of the model, whereas interpolation frequently suffers from overfitting.

Thus, frequently one considers problems of the form

$$\min\Big\{ E\big((\langle w,x_1\rangle,\dots,\langle w,x_m\rangle),(y_1,\dots,y_m)\big) + \gamma\,\Omega(w) : w\in\mathcal{H} \Big\}, \tag{6}$$

where $\gamma > 0$ is called the regularization parameter. This parameter is not known in advance but can be tuned with techniques like cross validation, see, for example, Wahba (1990). Here, the function $\Omega : \mathcal{H}\to\mathbb{R}$ is a regularizer, $E : \mathbb{R}^m\times\mathcal{Y}^m\to\mathbb{R}$ is an error function and $x_i\in\mathcal{H}$, $y_i\in\mathcal{Y}$, $\forall i\in\mathbb{N}_m$, are given input and output data. The set $\mathcal{Y}$ is a subset of $\mathbb{R}$ and varies depending on the context, so that it is typically assumed equal to $\mathbb{R}$ in the case of regression or equal to $\{-1,1\}$ in binary classification problems. One may also consider the associated interpolation problem, which is

$$\min\big\{ \Omega(w) : w\in\mathcal{H},\ \langle w,x_i\rangle = y_i,\ \forall i\in\mathbb{N}_m \big\}. \tag{7}$$

Under certain assumptions, the minima in problems (6) and (7) are attained (the latter whenever the constraints in (7) are satisfiable). Such assumptions could involve, for example, lower semicontinuity and boundedness of sublevel sets for $\Omega$ and boundedness from below for $E$. These issues will not concern us here, as we shall assume the following about the error function $E$ and the regularizer $\Omega$, from now on.

Assumption 1. The minimum (6) is attained for any $\gamma > 0$, any input and output data $\{x_i,y_i : i\in\mathbb{N}_m\}$ and any $m\in\mathbb{N}$. The minimum (7) is attained for any input and output data $\{x_i,y_i : i\in\mathbb{N}_m\}$ and any $m\in\mathbb{N}$, whenever the constraints in (7) are satisfiable.

The main objective of this paper is to obtain necessary and sufficient conditions on $\Omega$ so that the solution of problem (6) satisfies a linear representer theorem.

Definition 2. We say that a class of optimization problems such as (6) or (7) satisfies the linear representer theorem if, for any choice of data $\{x_i,y_i : i\in\mathbb{N}_m\}$ such that the problem has a solution, there exists a solution that belongs to $\operatorname{span}\{x_i : i\in\mathbb{N}_m\}$.
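Definition 2 can be checked numerically in the classical case of the square loss with $\Omega(w) = \langle w,w\rangle$. The following is a minimal sketch in plain Python, assuming a small hypothetical data set in $\mathbb{R}^3$; the helper names `dot` and `solve` are illustrative and do not come from the paper. It solves problem (6) once in the $d$ primal variables and once in the $m$ coefficients $c_i$, and the two solutions should coincide.

```python
# Sketch: problem (6) with square loss and Omega(w) = <w, w> (ridge regression).
# Data values and helper names are hypothetical, chosen only for illustration.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve the square system A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

xs = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # inputs x_1, x_2 in R^3 (m = 2, d = 3)
ys = [1.0, -2.0]
gamma = 0.5
m, d = len(xs), len(xs[0])

# Primal stationarity: (sum_i x_i x_i^T + gamma I) w = sum_i y_i x_i  (d unknowns)
A = [[sum(x[j] * x[k] for x in xs) + (gamma if j == k else 0.0)
      for k in range(d)] for j in range(d)]
b = [sum(y * x[j] for x, y in zip(xs, ys)) for j in range(d)]
w_primal = solve(A, b)

# Representer form: w = sum_i c_i x_i with (K + gamma I) c = y, K_ij = <x_i, x_j>
K = [[dot(xs[i], xs[j]) + (gamma if i == j else 0.0) for j in range(m)]
     for i in range(m)]
c = solve(K, ys)
w_span = [sum(c[i] * xs[i][j] for i in range(m)) for j in range(d)]

print(w_primal)
print(w_span)
```

The second system has only $m$ unknowns regardless of $d$, which is the computational point made in the introduction.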

In this section, we show that the existence of representer theorems for regularization problems is equivalent to the existence of representer theorems for interpolation problems, under a quite general condition that has a simple geometric interpretation.

We first recall a lemma from (Micchelli and Pontil, 2004, Sec. 2) which states that (linear or not) representer theorems for interpolation lead to representer theorems for regularization, under no conditions on the error function.

Lemma 3. Let $E : \mathbb{R}^m\times\mathcal{Y}^m\to\mathbb{R}$, $\Omega : \mathcal{H}\to\mathbb{R}$ satisfy Assumption 1. Then if the class of interpolation problems (7) satisfies the linear representer theorem, so does the class of regularization problems (6).

Proof. Consider a problem of the form (6) and let $\hat{w}$ be a solution. We construct an associated interpolation problem

$$\min\big\{ \Omega(w) : w\in\mathcal{H},\ \langle w,x_1\rangle = \langle\hat{w},x_1\rangle,\dots,\langle w,x_m\rangle = \langle\hat{w},x_m\rangle \big\}. \tag{8}$$

By hypothesis, there exists a solution $\tilde{w}$ of (8) that lies in $\operatorname{span}\{x_i : i\in\mathbb{N}_m\}$. But then $\Omega(\tilde{w})\le\Omega(\hat{w})$, while the error term takes the same value at $\tilde{w}$ and $\hat{w}$, and hence $\tilde{w}$ is a solution of (6). The result follows.

This lemma requires no special properties of the functions involved. Its converse, in contrast, requires assumptions about the analytical properties of the error function. We provide one such natural condition in the theorem below, but other conditions could conceivably work too. The main idea in the proof is, based on a single input, to construct a sequence of appropriate regularization problems for different values of the regularization parameter $\gamma$. Then, it suffices to show that letting $\gamma\to 0^+$ yields a limit of the minimizers that satisfies an interpolation constraint.

Theorem 4. Let $E : \mathbb{R}^m\times\mathcal{Y}^m\to\mathbb{R}$ and $\Omega : \mathcal{H}\to\mathbb{R}$. Assume that $E$, $\Omega$ are lower semicontinuous, that $\Omega$ has bounded sublevel sets and that $E$ is bounded from below. Assume also that, for some $v\in\mathbb{R}^m\setminus\{0\}$, $y\in\mathcal{Y}^m$, there exists a unique minimizer of $\min\{E(av,y) : a\in\mathbb{R}\}$ and that this minimizer does not equal zero. Then if the class of regularization problems (6) satisfies the linear representer theorem, so does the class of interpolation problems (7).

Proof. Fix an arbitrary $x\neq 0$ and let $a_0$ be the minimizer of $\min\{E(av,y) : a\in\mathbb{R}\}$. Consider the problems

$$\min\Big\{ E\Big(\frac{a_0}{\|x\|^2}\langle w,x\rangle\, v,\ y\Big) + \gamma\,\Omega(w) : w\in\mathcal{H} \Big\},$$

for every $\gamma > 0$. Each of these is a problem of type (6) whose inputs are all multiples of $x$, so we may let $w_\gamma$ be a solution in the span of $x$ (known to exist by hypothesis). We then obtain that

$$E(a_0 v,y) + \gamma\,\Omega(w_\gamma) \le E\Big(\frac{a_0}{\|x\|^2}\langle w_\gamma,x\rangle\, v,\ y\Big) + \gamma\,\Omega(w_\gamma) \le E(a_0 v,y) + \gamma\,\Omega(x). \tag{9}$$

Thus, $\Omega(w_\gamma)\le\Omega(x)$ and so, by the hypothesis on $\Omega$, the set $\{w_\gamma : \gamma > 0\}$ is bounded. Therefore, there exists a convergent subsequence $\{w_{\gamma_\ell} : \ell\in\mathbb{N}\}$, with $\gamma_\ell\to 0^+$, whose limit we call $\bar{w}$. By taking the limit inferior as $\ell\to\infty$ on the inequality on the right in (9), we obtain

$$E\Big(\frac{a_0}{\|x\|^2}\langle\bar{w},x\rangle\, v,\ y\Big) \le E(a_0 v,y).$$

Consequently, $\frac{a_0}{\|x\|^2}\langle\bar{w},x\rangle = a_0$ or, using the hypothesis that $a_0\neq 0$, $\langle\bar{w},x\rangle = \|x\|^2$. In addition, since $w_\gamma$ belongs to the span of $x$ for every $\gamma > 0$, so does $\bar{w}$. Thus, we obtain that $\bar{w} = x$. Moreover, from the definition of $w_\gamma$ we have that

$$E\Big(\frac{a_0}{\|x\|^2}\langle w_\gamma,x\rangle\, v,\ y\Big) + \gamma\,\Omega(w_\gamma) \le E(a_0 v,y) + \gamma\,\Omega(w) \qquad \forall w\in\mathcal{H} \text{ such that } \langle w,x\rangle = \|x\|^2$$

and, combining with the definition of $a_0$, we obtain that

$$\Omega(w_\gamma)\le\Omega(w) \qquad \forall w\in\mathcal{H} \text{ such that } \langle w,x\rangle = \|x\|^2.$$

Taking the limits inferior as $\ell\to\infty$, we conclude that $\bar{w} = x$ is a solution of the problem

$$\min\{\Omega(w) : w\in\mathcal{H},\ \langle w,x\rangle = \|x\|^2\}.$$

Moreover, this assertion holds even when $x = 0$, since the hypothesis implies that $0$ is a global minimizer of $\Omega$. Indeed, any regularization problem of the type (6) with zero inputs, $x_i = 0$, $\forall i\in\mathbb{N}_m$, admits a solution in their span. Thus, we have shown that $\Omega$ satisfies property (13) in the next section and the result follows immediately from Lemma 9 below.

We now comment on some commonly used error functions. The first is the square loss,

$$E(z,y) = \sum_{i\in\mathbb{N}_m} (z_i - y_i)^2,$$

for $z,y\in\mathbb{R}^m$. It is immediately apparent that Theorem 4 applies in this case. The second function is the hinge loss,

$$E(z,y) = \sum_{i\in\mathbb{N}_m} \max(1 - z_i y_i,\ 0),$$

where the outputs $y_i$ are assumed to belong to $\{-1,1\}$ for the purpose of classification. In this case, we may select $y_i = 1$, $\forall i\in\mathbb{N}_m$, and $v = (1,-2,0,\dots,0)$ for $m\ge 2$. Then the function $E(\cdot\, v,y)$ is the one shown in Figure 1. Finally, the logistic loss,

$$E(z,y) = \sum_{i\in\mathbb{N}_m} \log\big(1 + e^{-z_i y_i}\big),$$

is also used in classification problems. In this case, we may select $y_i = 1$, $\forall i\in\mathbb{N}_m$, and $v = (2,-1)$ for $m = 2$ or $v = (m-2,-1,\dots,-1)$ for $m > 2$. In the latter case, for example, setting to zero the derivative of $E(\cdot\, v,y)$ yields the equation $(m-1)e^{a(m-1)} + e^{a} - m + 2 = 0$, which can easily be seen to have a unique solution. Summarizing, we obtain the following corollary.
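As a numerical companion to the choices of $v$ discussed for the hinge and logistic losses, the sketch below evaluates $a\mapsto E(av,y)$ on a grid and locates its minimizer. It is only an illustration under stated assumptions: $m = 4$, a grid of step $0.001$ on $[-3,3]$, and the helper name `argmin_along` are all choices made here, not part of the paper. A grid search suggests but does not prove uniqueness of the minimizer.

```python
import math

def hinge(z, y):
    return sum(max(1.0 - zi * yi, 0.0) for zi, yi in zip(z, y))

def logistic(z, y):
    return sum(math.log(1.0 + math.exp(-zi * yi)) for zi, yi in zip(z, y))

def argmin_along(loss, v, y, grid):
    """Grid point a that minimizes a -> loss(a * v, y) (hypothetical helper)."""
    return min(grid, key=lambda a: loss([a * vi for vi in v], y))

m = 4                                   # assumed number of examples
y = [1.0] * m                           # y_i = 1 for all i, as in the text
grid = [k / 1000.0 for k in range(-3000, 3001)]

# Hinge loss along v = (1, -2, 0, ..., 0)
v_hinge = [1.0, -2.0] + [0.0] * (m - 2)
a_hinge = argmin_along(hinge, v_hinge, y, grid)

# Logistic loss along v = (m - 2, -1, ..., -1)
v_log = [float(m - 2)] + [-1.0] * (m - 1)
a_log = argmin_along(logistic, v_log, y, grid)

# Residual of the stationarity equation (m-1) e^{a(m-1)} + e^a - m + 2 = 0
residual = (m - 1) * math.exp(a_log * (m - 1)) + math.exp(a_log) - m + 2

print(a_hinge, a_log, residual)
```

In both cases the located minimizer is nonzero, which is the condition Theorem 4 places on the error function.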

Figure 1: Hinge loss along the direction $(1,-2,0,\dots,0)$.

Corollary 5. If $E : \mathbb{R}^m\times\mathcal{Y}^m\to\mathbb{R}$ is the square loss, the hinge loss (for $m\ge 2$) or the logistic loss (for $m\ge 2$) and $\Omega : \mathcal{H}\to\mathbb{R}$ is lower semicontinuous with bounded sublevel sets, then the class of problems (6) satisfies the linear representer theorem if and only if the class of problems (7) does.

Note also that the condition on $E$ in Theorem 4 is rather weak in that an error function $E$ may satisfy it without being convex. At the same time, an error function that is too flat, such as a constant loss, will not do.

We conclude with a remark about the situation in which the inputs $x_i$ are linearly independent. (Footnote 3: This occurs frequently in practice, especially when the dimensionality $d$ is high.) It has a brief and straightforward proof, which we do not present here.

Remark 6. Let $E$ be the hinge loss or the logistic loss and $\Omega : \mathcal{H}\to\mathbb{R}$ be of the form $\Omega(w) = h(\|w\|)$, where $h : \mathbb{R}_+\to\mathbb{R}$ is a lower semicontinuous function with bounded sublevel sets. Then the class of regularization problems (6) in which the inputs $x_i$, $i\in\mathbb{N}_m$, are linearly independent, satisfies the linear representer theorem.

### 3. Representer Theorems for Interpolation Problems

The results of the previous section allow us to focus on linear representer theorems for interpolation problems of the type (7). We are going to consider the case of a Hilbert space $\mathcal{H}$ as the domain of an interpolation problem. Interpolation constraints will be formed as inner products of the variable with the input data.

In this section, we consider the interpolation problem

$$\min\{\Omega(w) : w\in\mathcal{H},\ \langle w,x_i\rangle = y_i,\ i\in\mathbb{N}_m\}. \tag{10}$$

We coin the term admissible to denote the class of regularizers we are interested in.

Definition 7. We say that the function $\Omega : \mathcal{H}\to\mathbb{R}$ is admissible if, for every $m\in\mathbb{N}$ and any data set $\{(x_i,y_i) : i\in\mathbb{N}_m\}\subseteq\mathcal{H}\times\mathcal{Y}$ such that the interpolation constraints are satisfiable, problem (10) admits a solution $\hat{w}$ of the form

$$\hat{w} = \sum_{i\in\mathbb{N}_m} c_i x_i, \tag{11}$$

where $c_i$ are some real parameters.

We say that $\Omega : \mathcal{H}\to\mathbb{R}$ is differentiable if, for every $w\in\mathcal{H}$, there is a unique vector denoted by $\nabla\Omega(w)\in\mathcal{H}$, such that for all $p\in\mathcal{H}$,

$$\lim_{t\to 0}\frac{\Omega(w+tp) - \Omega(w)}{t} = \langle\nabla\Omega(w),p\rangle.$$

This notion corresponds to the usual notion of directional derivative on $\mathbb{R}^d$ and in that case $\nabla\Omega(w)$ is the gradient of $\Omega$ at $w$.

In the remainder of the section, we always assume that Assumption 1 holds for $\Omega$. The following theorem provides a necessary and sufficient condition for a regularizer to be admissible.

Theorem 8. Let $\Omega : \mathcal{H}\to\mathbb{R}$ be a differentiable function and $\dim(\mathcal{H})\ge 2$. Then $\Omega$ is admissible if and only if

$$\Omega(w) = h(\langle w,w\rangle) \qquad \forall w\in\mathcal{H}, \tag{12}$$

for some nondecreasing function $h : \mathbb{R}_+\to\mathbb{R}$.

It is well known that the above functional form is sufficient for a representer theorem to hold, see, for example, Schölkopf et al. (2001). Here we show that it is also necessary.

The route we follow to prove the above theorem is based on a geometric interpretation of representer theorems. This intuition can be formally expressed as condition (13) in the lemma below. Both condition (13) and functional form (12) express the property that the contours of $\Omega$ are spheres (or regions between spheres), which is apparent from Figure 2.

Lemma 9. A function $\Omega : \mathcal{H}\to\mathbb{R}$ is admissible if and only if it satisfies the property that

$$\Omega(w+p)\ge\Omega(w) \qquad \forall w,p\in\mathcal{H} \text{ such that } \langle w,p\rangle = 0. \tag{13}$$

Proof. Suppose that $\Omega$ satisfies property (13), consider arbitrary data $x_i,y_i$, $i\in\mathbb{N}_m$, and let $\hat{w}$ be a solution to problem (10). We can uniquely decompose $\hat{w}$ as $\hat{w} = \bar{w} + p$ where $\bar{w}\in L := \operatorname{span}\{x_i : i\in\mathbb{N}_m\}$ and $p\in L^\perp$. From (13) we obtain that $\Omega(\hat{w})\ge\Omega(\bar{w})$. Also $\bar{w}$ satisfies the interpolation constraints and hence we conclude that $\bar{w}$ is a solution to problem (10).

Conversely, if $\Omega$ is admissible choose any $w\in\mathcal{H}$ and consider the problem $\min\{\Omega(z) : z\in\mathcal{H},\ \langle z,w\rangle = \langle w,w\rangle\}$. By hypothesis, there exists a solution belonging in $\operatorname{span}\{w\}$ and hence $w$ is a solution to this problem. Thus, we have that $\Omega(w+p)\ge\Omega(w)$ for every $p$ such that $\langle w,p\rangle = 0$.

It remains to establish the equivalence of the geometric property (13) to condition (12), which says that $\Omega$ is a nondecreasing function of the $L_2$ norm.

Proof of Theorem 8. Assume first that (13) holds and $\dim(\mathcal{H}) < \infty$. In this case, we only need to consider the case that $\mathcal{H} = \mathbb{R}^d$ since (13) can always be rewritten as an equivalent condition on $\mathbb{R}^d$, using an orthonormal basis of $\mathcal{H}$.

Figure 2: Geometric interpretation of Theorem 8 (the figure marks the points $0$, $w$ and $w+p$). The function $\Omega$ should not decrease when moving to orthogonal directions. The contours of such a function should be spherical.

First we observe that, since $\Omega$ is differentiable, this property implies the condition that

$$\langle\nabla\Omega(w),p\rangle = 0, \tag{14}$$

for all $w,p\in\mathbb{R}^d$ such that $\langle w,p\rangle = 0$.

Now, fix any $w_0\in\mathbb{R}^d$ such that $\|w_0\| = 1$. Consider an arbitrary $w\in\mathbb{R}^d\setminus\{0\}$. Then there exists an orthogonal matrix $U\in O^d$ such that $w = \|w\|\,Uw_0$ and $\det(U) = 1$ (see Lemma 20 in the appendix). Moreover, we can write $U = e^{D}$ for some skew-symmetric matrix $D\in M_{d,d}$ (see Horn and Johnson, 1991). Consider now the path $z : [0,1]\to\mathbb{R}^d$ with

$$z(\lambda) = \|w\|\,e^{\lambda D}w_0 \qquad \forall\lambda\in[0,1].$$

We have that $z(0) = \|w\|\,w_0$ and $z(1) = w$. Moreover, since $\langle z(\lambda),z(\lambda)\rangle = \langle w,w\rangle$, we obtain that

$$\langle z'(\lambda),z(\lambda)\rangle = 0 \qquad \forall\lambda\in(0,1).$$

Applying (14) with $w = z(\lambda)$, $p = z'(\lambda)$, it follows that

$$\frac{d\,\Omega(z(\lambda))}{d\lambda} = \langle\nabla\Omega(z(\lambda)),z'(\lambda)\rangle = 0.$$

Consequently, $\Omega(z(\lambda))$ is constant and hence $\Omega(w) = \Omega(\|w\|\,w_0)$. Setting $h(\xi) = \Omega(\sqrt{\xi}\,w_0)$, $\forall\xi\in\mathbb{R}_+$, yields (12). In addition, $h$ must be nondecreasing in order for $\Omega$ to satisfy property (13).

For the case $\dim(\mathcal{H}) = \infty$ we can argue similarly using instead the path

$$z(\lambda) = \frac{(1-\lambda)w_0 + \lambda w}{\|(1-\lambda)w_0 + \lambda w\|}\,\|w\|,$$

which is differentiable on $(0,1)$ when $w\notin\operatorname{span}\{w_0\}$. We confirm equation (12) for vectors in $\operatorname{span}\{w_0\}$ by a limiting argument on vectors not in $\operatorname{span}\{w_0\}$ since $\Omega$ is continuous.

Conversely, if $\Omega(w) = h(\langle w,w\rangle)$ and $h$ is nondecreasing, property (13) follows immediately.

The assumption of differentiability is crucial for the above proof and we postpone the issue of removing it to future work. Nevertheless, property (13) follows immediately from functional form (12) without any assumptions.

Remark 10. Let $\Omega : \mathcal{H}\to\mathbb{R}$ be a function of the form $\Omega(w) = h(\langle w,w\rangle)$ $\forall w\in\mathcal{H}$, for some nondecreasing function $h : \mathbb{R}_+\to\mathbb{R}$. Then $\Omega$ is admissible.

We also note that we could modify Definition 7 by requiring that any solution of problem (10) be in the linear span of the input data. We call such regularizers strictly admissible. Then with minor modifications to Lemma 9 (namely, requiring that equality in (13) holds only if $p = 0$) and to the proof of Theorem 8 (namely, requiring $h$ to be strictly increasing) we have the following corollary.

Corollary 11. Let $\Omega : \mathcal{H}\to\mathbb{R}$ be a differentiable function and $\dim(\mathcal{H})\ge 2$. Then $\Omega$ is strictly admissible if and only if $\Omega(w) = h(\langle w,w\rangle)$, $\forall w\in\mathcal{H}$, where $h : \mathbb{R}_+\to\mathbb{R}$ is strictly increasing.

Theorem 8 can be used to verify whether the linear representer theorem can be obtained when using a regularizer $\Omega$. For example, the function $\|w\|_p^p = \sum_{i\in\mathbb{N}_d}|w_i|^p$ is not admissible for any $p > 1$, $p\neq 2$, because it cannot be expressed as a function of the Hilbert space norm. Indeed, if we choose any $a\in\mathbb{R}$ and let $w = (a\delta_{i1} : i\in\mathbb{N}_d)$, the requirement that $\|w\|_p^p = h(\langle w,w\rangle)$ would imply that $h(a^2) = |a|^p$, $\forall a\in\mathbb{R}$, and hence that $\|w\|_p = \|w\|$.

### 4. Matrix Learning Problems

In this section, we investigate how representer theorems and results like Theorem 8 can be extended in the context of optimization problems which involve matrices.
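Before turning to matrices, the $\ell_p$ example that closes Section 3 can be checked directly against the geometric property (13). The sketch below is an illustration only: the point $w = (2,1)$, the orthogonal direction $(-1,2)$ and the step size $0.1$ are arbitrary choices made here. It looks for an orthogonal move that decreases $\|w\|_p^p$. Finding one shows that (13) fails; finding none at this single point proves nothing by itself, although for $p = 2$ it is what Remark 10 predicts.

```python
def omega(w, exponent):
    """||w||_p^p for p = exponent."""
    return sum(abs(wi) ** exponent for wi in w)

w = [2.0, 1.0]
direction = [-1.0, 2.0]   # <w, direction> = 0
t = 0.1                   # assumed step size

def violates_13(exponent):
    """True if moving from w by +/- t * direction decreases omega."""
    base = omega(w, exponent)
    for sign in (1.0, -1.0):
        moved = [wi + sign * t * di for wi, di in zip(w, direction)]
        if omega(moved, exponent) < base - 1e-12:
            return True
    return False

results = {e: violates_13(e) for e in (1.5, 2.0, 3.0, 4.0)}
print(results)
```

For exponents above 2 the decrease occurs in the direction $(-1,2)$ and for exponents between 1 and 2 in the opposite direction, consistent with the sign of $\langle\nabla\Omega(w),p\rangle$ in condition (14).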
#### 4.1 Exploiting Matrix Structure

As we have already seen, our discussion in Section 3 applies to any Hilbert space. Thus, we may consider the finite-dimensional Hilbert space of $d\times n$ matrices $M_{d,n}$ equipped with the Frobenius inner product $\langle\cdot,\cdot\rangle$. As in Section 3, we could consider interpolation problems of the form

$$\min\{\Omega(W) : W\in M_{d,n},\ \langle W,X_i\rangle = y_i,\ \forall i\in\mathbb{N}_m\} \tag{15}$$

where $X_i\in M_{d,n}$ are prescribed input matrices and $y_i\in\mathcal{Y}$ are scalar outputs, for $i\in\mathbb{N}_m$. Then Theorem 8 states that such a problem admits a solution of the form

$$\hat{W} = \sum_{i\in\mathbb{N}_m} c_i X_i, \tag{16}$$

where $c_i$ are some real parameters, if and only if $\Omega$ can be written in the form

$$\Omega(W) = h(\langle W,W\rangle) \qquad \forall W\in M_{d,n}, \tag{17}$$

also satisfied constraints of the form $w_t^\top x_1 = y_{t1}$ for some $x_1\in\mathbb{R}^d$, $y_{t1}\in\mathcal{Y}$, then $x_1$ would have to be the zero vector. Therefore, the class of problems (18) is strictly larger than the class (21).

However, it turns out that with regard to representer theorems of the form (19) there is no distinction between these two classes of problems. In other words, the representer theorem holds for the same regularizers $\Omega$, independently of whether each task has its own specific inputs or the inputs are the same across the tasks. More importantly, we can connect the existence of representer theorems to a geometric property of the regularizer, in a way analogous to property (13) in Section 3. These facts are stated in the following proposition.

Proposition 13. The following statements are equivalent:

(a) Problem (21) admits a solution of the form (19), for every data set $\{(x_i,y_{ti}) : i\in\mathbb{N}_m, t\in\mathbb{N}_n\}\subseteq\mathbb{R}^d\times\mathcal{Y}$ and every $m\in\mathbb{N}$, such that the interpolation constraints are satisfiable.

(b) Problem (18) admits a solution of the form (19), for every data set $\{(x_{ti},y_{ti}) : i\in\mathbb{N}_{m_t}, t\in\mathbb{N}_n\}\subseteq\mathbb{R}^d\times\mathcal{Y}$ and every $\{m_t : t\in\mathbb{N}_n\}\subseteq\mathbb{N}$, such that the interpolation constraints are satisfiable.

(c) The function $\Omega$ satisfies the property

$$\Omega(W+P)\ge\Omega(W) \qquad \forall W,P\in M_{d,n} \text{ such that } W^\top P = 0. \tag{22}$$

Proof. We will show that (a) $\Rightarrow$ (c), (c) $\Rightarrow$ (b) and (b) $\Rightarrow$ (a).

[(a) $\Rightarrow$ (c)] Consider any $W\in M_{d,n}$. Choose $m = n$ and the input data to be the columns of $W$. In other words, consider the problem

$$\min\{\Omega(Z) : Z\in M_{d,n},\ Z^\top W = W^\top W\}.$$

By hypothesis, there exists a solution $\hat{Z} = WC$ for some $C\in M_{n,n}$. Since $(\hat{Z}-W)^\top W = 0$, all columns of $\hat{Z}-W$ have to belong to the null space of $W^\top$. But, at the same time, they have to lie in the range of $W$ and hence we obtain that $\hat{Z} = W$. Therefore, we obtain property (22) after the variable change $P = Z - W$.

[(c) $\Rightarrow$ (b)] Consider arbitrary $x_{ti}\in\mathbb{R}^d$, $y_{ti}\in\mathcal{Y}$, $\forall i\in\mathbb{N}_{m_t}$, $t\in\mathbb{N}_n$, and let $\hat{W}$ be a solution to problem (18). We can decompose the columns of $\hat{W}$ as $\hat{w}_t = \bar{w}_t + p_t$, where $\bar{w}_t\in L := \operatorname{span}\{x_{si} : i\in\mathbb{N}_{m_s}, s\in\mathbb{N}_n\}$ and $p_t\in L^\perp$, $\forall t\in\mathbb{N}_n$. By hypothesis $\Omega(\hat{W})\ge\Omega(\bar{W})$. Since $\hat{W}$ interpolates the data, so does $\bar{W}$ and therefore $\bar{W}$ is a solution to (18).

[(b) $\Rightarrow$ (a)] Trivial, since any problem of type (21) is also of type (18).

We remark in passing that, by a similar proof, property (22) is also equivalent to representer theorems (19) for the class of problems (15).

The above proposition provides us with a criterion for characterizing all regularizers satisfying representer theorems of the form (19), in the context of problems (15), (18) or (21). Our objective will be to obtain a functional form analogous to (12) that describes functions satisfying property (22). This property does not have a simple geometric interpretation, unlike (13) which describes functions with spherical contours. The reason is that matrix products are more difficult to tackle than inner products.
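To see concretely that property (22) is weaker than property (13) applied to $M_{d,n}$ with the Frobenius inner product, one can look at the trace norm, a regularizer the paper discusses later. The sketch below is an illustration under assumptions made here: the $2\times 2$ matrices `W` and `P` are hypothetical, and the helper `trace_norm_2x2` uses the closed form $(s_1+s_2)^2 = \operatorname{tr}(W^\top W) + 2|\det W|$, which is valid only for $2\times 2$ matrices. It exhibits a Frobenius-orthogonal perturbation that decreases the trace norm, so (13) fails, while (22) is not contradicted because $W^\top P\neq 0$.

```python
import math

def trace_norm_2x2(W):
    """Sum of the singular values s1 + s2 of a 2x2 matrix (closed form)."""
    (a, b), (c, d) = W
    sq = a * a + b * b + c * c + d * d   # s1^2 + s2^2 = tr(W^T W)
    det = abs(a * d - b * c)             # s1 * s2 = |det W|
    return math.sqrt(sq + 2.0 * det)

def frobenius_inner(W, Z):
    return sum(W[i][j] * Z[i][j] for i in range(2) for j in range(2))

W = [[2.0, 0.0], [0.0, 1.0]]             # hypothetical example
P = [[0.2, 0.0], [0.0, -0.4]]            # chosen so that <W, P> = 0
W_plus_P = [[W[i][j] + P[i][j] for j in range(2)] for i in range(2)]

# Matrix product W^T P, the quantity that property (22) requires to vanish
WtP = [[sum(W[k][i] * P[k][j] for k in range(2)) for j in range(2)]
       for i in range(2)]

print(frobenius_inner(W, P))                       # orthogonal in the sense of (13)
print(trace_norm_2x2(W), trace_norm_2x2(W_plus_P)) # trace norm decreases
print(WtP)                                         # but W^T P is not zero
```

The constraint $W^\top P = 0$ is more restrictive than $\langle W,P\rangle = 0$, which is why more regularizers can satisfy (22) than (13).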

Similar to the Hilbert space setting (12), where we required $h$ to be a nondecreasing real function, the functional description of the regularizer now involves the notion of a matrix nondecreasing function.

Definition 14. We say that the function $h : S^n_+\to\mathbb{R}$ is nondecreasing in the order of matrices if $h(A)\le h(B)$ for all $A,B\in S^n_+$ such that $A\preceq B$.

Theorem 15. Let $d,n\in\mathbb{N}$ with $d\ge 2n$. The differentiable function $\Omega : M_{d,n}\to\mathbb{R}$ satisfies property (22) if and only if there exists a matrix nondecreasing function $h : S^n_+\to\mathbb{R}$ such that

$$\Omega(W) = h(W^\top W) \qquad \forall W\in M_{d,n}. \tag{23}$$

Proof. We first assume that $\Omega$ satisfies property (22). From this property it follows that, for all $W,P\in M_{d,n}$ with $W^\top P = 0$,

$$\langle\nabla\Omega(W),P\rangle = 0. \tag{24}$$

To see this, observe that if the matrix $W^\top P$ is zero then, for all $\varepsilon > 0$, we have that

$$\frac{\Omega(W+\varepsilon P) - \Omega(W)}{\varepsilon}\ge 0.$$

Taking the limit as $\varepsilon\to 0^+$ we obtain that $\langle\nabla\Omega(W),P\rangle\ge 0$. Similarly, choosing $\varepsilon < 0$ we obtain that $\langle\nabla\Omega(W),P\rangle\le 0$ and equation (24) follows.

Now, consider any matrix $W\in M_{d,n}$. Let $r = \operatorname{rank}(W)$ and let us write $W$ in a singular value decomposition as follows

$$W = \sum_{i\in\mathbb{N}_r}\sigma_i u_i v_i^\top,$$

where $\sigma_1\ge\sigma_2\ge\dots\ge\sigma_r > 0$ are the singular values and $u_i\in\mathbb{R}^d$, $v_i\in\mathbb{R}^n$, $i\in\mathbb{N}_r$, are sets of singular vectors, so that $u_i^\top u_j = v_i^\top v_j = \delta_{ij}$, $\forall i,j\in\mathbb{N}_r$. Also, let $u_{r+1},\dots,u_d\in\mathbb{R}^d$ be vectors that together with $u_1,\dots,u_r$ form an orthonormal basis of $\mathbb{R}^d$. Without loss of generality, let us pick $u_1$ and consider any unit vector $z$ orthogonal to the vectors $u_2,\dots,u_r$. Let $k = d-r+1$ and $q\in\mathbb{R}^k$ be the unit vector such that

$$z = Rq,$$

where $R = (u_1,u_{r+1},\dots,u_d)$. We can complete $q$ by adding $d-r$ columns to its right in order to form an orthogonal matrix $Q\in O^k$ and, since $d > n$, we may select these columns so that $\det(Q) = 1$. Furthermore, we can write this matrix as $Q = e^{D}$ with $D\in M_{k,k}$ a skew-symmetric matrix (see Horn and Johnson, 1991).

We also define the path $Z : [0,1]\to M_{d,n}$ as

$$Z(\lambda) = \sigma_1 R\,e^{\lambda D}e_1 v_1^\top + \sum_{i=2}^{r}\sigma_i u_i v_i^\top \qquad \forall\lambda\in[0,1],$$

where $e_1$ denotes the vector $(1,0,\dots,0)^\top\in\mathbb{R}^k$. In other words, we fix the singular values, the right singular vectors and the $r-1$ left singular vectors $u_2,\dots,u_r$ and only allow the first left singular vector to vary. This path has the properties that $Z(0) = W$ and $Z(1) = \sigma_1 z v_1^\top + \sum_{i=2}^{r}\sigma_i u_i v_i^\top$.

By construction of the path, it holds that

$$Z'(\lambda) = \sigma_1 R\,e^{\lambda D}De_1 v_1^\top$$

and hence

$$Z(\lambda)^\top Z'(\lambda) = \big(\sigma_1 R\,e^{\lambda D}e_1 v_1^\top\big)^\top \sigma_1 R\,e^{\lambda D}De_1 v_1^\top = \sigma_1^2\, v_1 e_1^\top De_1 v_1^\top = 0,$$

for every $\lambda\in(0,1)$, because $D_{11} = 0$. (The terms of $Z(\lambda)$ with $i\ge 2$ do not contribute, since $u_2,\dots,u_r$ are orthogonal to the columns of $R$.) Hence, using equation (24), we have that

$$\langle\nabla\Omega(Z(\lambda)),Z'(\lambda)\rangle = 0$$

and, since $\frac{d\,\Omega(Z(\lambda))}{d\lambda} = \langle\nabla\Omega(Z(\lambda)),Z'(\lambda)\rangle$, we conclude that $\Omega(Z(\lambda))$ equals a constant independent of $\lambda$. In particular, $\Omega(Z(0)) = \Omega(Z(1))$, that is,

$$\Omega(W) = \Omega\Big(\sigma_1 z v_1^\top + \sum_{i=2}^{r}\sigma_i u_i v_i^\top\Big).$$

In other words, if we fix the singular values of $W$, the right singular vectors and all the left singular vectors but one, $\Omega$ does not depend on the remaining left singular vector (because the choice of $z$ is independent of $u_1$).

In fact, this readily implies that $\Omega$ does not depend on the left singular vectors at all. Indeed, fix an arbitrary $Y\in M_{d,n}$ such that $Y^\top Y = I$. Consider the matrix $Y(W^\top W)^{1/2}$, which can be written using the same singular values and right singular vectors as $W$. That is,

$$Y(W^\top W)^{1/2} = \sum_{i\in\mathbb{N}_r}\sigma_i\tau_i v_i^\top,$$

where $\tau_i = Yv_i$, $\forall i\in\mathbb{N}_r$. Now, we select unit vectors $z_1,\dots,z_r\in\mathbb{R}^d$ as follows:

$$\begin{aligned} z_1 &= u_1,\\ z_2 &\perp z_1,u_3,\dots,u_r,\tau_1,\\ &\;\;\vdots\\ z_r &\perp z_1,\dots,z_{r-1},\tau_1,\dots,\tau_{r-1}. \end{aligned}$$

This construction is possible since $d\ge 2n$. Replacing successively $u_i$ with $z_i$ and then $z_i$ with $\tau_i$, $\forall i\in\mathbb{N}_r$, and applying the invariance property, we obtain that

$$\begin{aligned} \Omega(W) &= \Omega\Big(\sum_{i\in\mathbb{N}_r}\sigma_i u_i v_i^\top\Big)\\ &= \Omega\Big(\sigma_1 z_1 v_1^\top + \sum_{i=2}^{r}\sigma_i u_i v_i^\top\Big)\\ &= \Omega\Big(\sigma_1 z_1 v_1^\top + \sigma_2 z_2 v_2^\top + \sum_{i=3}^{r}\sigma_i u_i v_i^\top\Big)\\ &\;\;\vdots\\ &= \Omega\Big(\sum_{i\in\mathbb{N}_r}\sigma_i z_i v_i^\top\Big)\\ &= \Omega\Big(\sigma_1\tau_1 v_1^\top + \sum_{i=2}^{r}\sigma_i z_i v_i^\top\Big)\\ &\;\;\vdots\\ &= \Omega\Big(\sum_{i\in\mathbb{N}_r}\sigma_i\tau_i v_i^\top\Big) = \Omega\big(Y(W^\top W)^{1/2}\big). \end{aligned}$$

Therefore, defining the function $h : S^n_+\to\mathbb{R}$ as $h(A) = \Omega(YA^{1/2})$, we deduce that $\Omega(W) = h(W^\top W)$.

Finally, we show that $h$ is matrix nondecreasing, that is, $h(A)\le h(B)$ if $0\preceq A\preceq B$. For any such $A,B$ and since $d\ge 2n$, we may define

$$W = \begin{pmatrix} A^{1/2}\\ 0\\ 0 \end{pmatrix},\qquad P = \begin{pmatrix} 0\\ (B-A)^{1/2}\\ 0 \end{pmatrix} \in M_{d,n}.$$

Then $W^\top P = 0$, $A = W^\top W$, $B = (W+P)^\top(W+P)$ and thus, by hypothesis,

$$h(B) = \Omega(W+P)\ge\Omega(W) = h(A).$$

This completes the proof in one direction of the theorem.

To show the converse, assume that $\Omega(W) = h(W^\top W)$, where the function $h$ is matrix nondecreasing. Then for any $W,P\in M_{d,n}$ with $W^\top P = 0$, we have that $(W+P)^\top(W+P) = W^\top W + P^\top P\succeq W^\top W$ and, so, $\Omega(W+P)\ge\Omega(W)$, as required.

As in Section 3, differentiability is not required for sufficiency, which follows from the last lines of the above proof. Moreover, the assumption on $d$ and $n$ is not required either.

Remark 16. Let the function $\Omega : M_{d,n}\to\mathbb{R}$ be of the form $\Omega(W) = h(W^\top W)$ $\forall W\in M_{d,n}$, for some matrix nondecreasing function $h : S^n_+\to\mathbb{R}$. Then $\Omega$ satisfies property (22).

We conclude this section by providing a necessary and sufficient first-order condition for the function $h$ to be matrix nondecreasing.

Proposition 17. Let $h : S^n_+\to\mathbb{R}$ be a differentiable function. The following properties are equivalent:

(a) $h$ is matrix nondecreasing;

(b) the matrix $\nabla h(A) := \big(\frac{\partial h}{\partial a_{ij}} : i,j\in\mathbb{N}_n\big)$ is positive semidefinite, for every $A\in S^n_+$.

Proof. If (a) holds, we choose any $x\in\mathbb{R}^n$, $t > 0$, $A\in S^n_+$ and note that

$$\frac{h(A+txx^\top) - h(A)}{t}\ge 0.$$

Letting $t$ go to zero gives that $x^\top\nabla h(A)x\ge 0$.

Conversely, if (b) is true, we have, for every $x\in\mathbb{R}^n$, $M\in S^n_+$, that $x^\top\nabla h(M)x = \langle\nabla h(M),xx^\top\rangle\ge 0$ and, so, $\langle\nabla h(M),C\rangle\ge 0$ for all $C\in S^n_+$. For any $A,B\in S^n_+$ such that $A\preceq B$, consider the univariate function $g : [0,1]\to\mathbb{R}$, $g(t) = h(A+t(B-A))$. By the chain rule it is easy to verify that $g$ is nondecreasing. Therefore we conclude that $h(A) = g(0)\le g(1) = h(B)$.

#### Examples

Using Proposition 13, Theorem 15 and Remark 16, one may easily verify that the representer theorem holds for a variety of regularizers. In particular, functional description (23) subsumes the special case of monotone spectral functions.

Definition 18. Let $r = \min\{d,n\}$. A function $\Omega : M_{d,n}\to\mathbb{R}$ is called monotone spectral if there exists a function $h : \mathbb{R}^r_+\to\mathbb{R}$ such that

$$\Omega(W) = h(\sigma_1(W),\dots,\sigma_r(W)) \qquad \forall W\in M_{d,n},$$

where $\sigma_1(W)\ge\dots\ge\sigma_r(W)\ge 0$ denote the singular values of matrix $W$ and

$$h(\sigma_1,\dots,\sigma_r)\le h(\tau_1,\dots,\tau_r) \quad \text{whenever } \sigma_i\le\tau_i \text{ for all } i\in\mathbb{N}_r.$$

Corollary 19. Assume that the function $\Omega : M_{d,n}\to\mathbb{R}$ is monotone spectral. Then $\Omega$ satisfies property (22).

Proof. Clearly, $\Omega(W) = h\big(\lambda\big((W^\top W)^{1/2}\big)\big)$, where $\lambda : S^n_+\to\mathbb{R}^r_+$ maps a matrix to the ordered vector of its $r$ highest eigenvalues. Let $A,B\in S^n_+$ such that $A\preceq B$. Weyl's monotonicity theorem (Horn and Johnson, 1985) states that if $A\preceq B$ then the spectra of $A$ and $B$ are ordered. Thus, $\lambda(A^{1/2})\le\lambda(B^{1/2})$ and hence $h(\lambda(A^{1/2}))\le h(\lambda(B^{1/2}))$. Applying Remark 16, the assertion follows.

We note that related results to the above corollary have appeared in Abernethy et al. (2009). They apply to the setting of (15) when the $X_i$ are rank one matrices.
Interesting examples of monotone spectral functions are the Schatten $L_p$ norms and prenorms,

$$
\Omega(W) = \|W\|_p := \|\sigma(W)\|_p,
$$

where $p \in [0, +\infty)$ and $\sigma(W)$ denotes the $\min\{d,n\}$-dimensional vector of the singular values of $W$. For instance, we have already mentioned in Section 4.1 that the representer theorem holds when the regularizer is the trace norm (the $L_1$ norm of the spectrum). But it also holds for the rank of a matrix.
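As a concrete illustration of the definition above, here is a small sketch that evaluates Schatten (pre)norms from the singular values. It assumes NumPy; the function name `schatten`, the reading of $p = 0$ as the count of nonzero singular values, and the rank tolerance are assumptions made for this example rather than conventions fixed by the paper.

```python
import numpy as np

def schatten(W, p):
    """Schatten p-(pre)norm of W: the l_p (pre)norm of its singular values.

    Assumption for this sketch: p = 0 is read as the number of nonzero
    singular values, i.e. the rank, using a tolerance of the same form as
    the default in numpy.linalg.matrix_rank.
    """
    s = np.linalg.svd(W, compute_uv=False)
    if p == 0:
        tol = s.max() * max(W.shape) * np.finfo(float).eps if s.size else 0.0
        return float(np.sum(s > tol))
    return float(np.sum(s ** p) ** (1.0 / p))

rng = np.random.default_rng(1)
# A 6 x 4 matrix of rank 2, built as a product of two thin factors.
W = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

trace_norm = schatten(W, 1)   # L_1 norm of the spectrum
frobenius = schatten(W, 2)    # L_2 norm of the spectrum
rank = schatten(W, 0)         # number of nonzero singular values
```

The values for $p = 1$ and $p = 2$ should agree with `np.linalg.norm(W, 'nuc')` and `np.linalg.norm(W, 'fro')` up to rounding.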

Rank minimization is an NP-hard optimization problem (Vandenberghe and Boyd, 1996), but the representer theorem has some interesting implications: first, the problem can be reformulated in terms of reproducing kernels and, second, an equivalent problem in as few variables as the total sample size can be obtained.

If we exclude spectral functions, the functions that remain are invariant under left multiplication with an orthogonal matrix. Examples of such functions are Schatten norms and prenorms composed with right matrix scaling,

$$
\Omega(W) = \|W M\|_p, \qquad (25)
$$

where $M \in M_{n,n}$. In this case, the corresponding $h$ is the function $S \mapsto \big\|\sqrt{\lambda(M^\top S M)}\big\|_p$. To see that this function is matrix nondecreasing, observe that if $A, B \in S^n_+$ and $A \preceq B$ then $0 \preceq M^\top A M \preceq M^\top B M$ and hence $\lambda(M^\top A M) \leq \lambda(M^\top B M)$ by Weyl's monotonicity theorem (Horn and Johnson, 1985). Therefore, $\big\|\sqrt{\lambda(M^\top A M)}\big\|_p \leq \big\|\sqrt{\lambda(M^\top B M)}\big\|_p$. For instance, the matrix $M$ above can be used to select a subset of the columns of $W$.

In addition, more complicated structures can be obtained by summation of matrix nondecreasing functions and by taking minima or maxima over sets. For example, we can obtain a regularizer such as

$$
\Omega(W) = \min\Big\{ \sum_{k \in \mathbb{N}_K} \|W(I_k)\|_1 : \{I_1, \dots, I_K\} \in \mathcal{P} \Big\},
$$

where $\mathcal{P}$ is the set of partitions of $\mathbb{N}_n$ in $K$ subsets and $W(I_k)$ denotes the submatrix of $W$ formed by just the columns indexed by $I_k$. This regularizer is an extension of the trace norm ($K = 1$) and can be used for multi-task learning via dimensionality reduction on multiple subspaces (Argyriou et al., 2008b).

Yet another example of a valid regularizer is the one considered in (Evgeniou et al., 2005, Sec. 3.1), which encourages the tasks to be close to each other, namely

$$
\Omega(W) = \sum_{t \in \mathbb{N}_n} \Big\| w_t - \frac{1}{n} \sum_{s \in \mathbb{N}_n} w_s \Big\|^2.
$$

This regularizer immediately verifies property (22), and so by Theorem 15 it is a matrix nondecreasing function of $W^\top W$.
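Property (22) for regularizers of the form (25) can be probed numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes NumPy, builds $P$ with $W^\top P = 0$ by projecting a random matrix onto the orthogonal complement of the column space of $W$, and uses three illustrative choices of $M$ (a random matrix, a column selector, and the centering matrix $I_n - \frac{1}{n} 1 1^\top$). The helper name `schatten`, the dimensions and the tolerances are hypothetical. It also checks that the task-closeness regularizer equals $\|W M\|_2^2$ for the centering matrix.

```python
import numpy as np

def schatten(W, p):
    """Schatten p-(pre)norm for p > 0, computed from the singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(s ** p) ** (1.0 / p))

rng = np.random.default_rng(2)
d, n = 7, 3
W = rng.standard_normal((d, n))

# Remove from a random G its component in the column space of W, so W^T P = 0.
G = rng.standard_normal((d, n))
P = G - W @ (np.linalg.pinv(W) @ G)
orthogonality_error = float(np.abs(W.T @ P).max())

M_random = rng.standard_normal((n, n))
M_select = np.diag([1.0, 0.0, 1.0])            # keeps columns 1 and 3 of W
M_center = np.eye(n) - np.ones((n, n)) / n     # subtracts the mean column

# Property (22): Omega(W + P) - Omega(W) should be nonnegative for Omega(W) = ||W M||_p.
gaps = {}
for name, M in (("random", M_random), ("select", M_select), ("center", M_center)):
    for p in (0.5, 1, 2):
        gaps[(name, p)] = schatten((W + P) @ M, p) - schatten(W @ M, p)

# Task-closeness regularizer: sum over columns of the squared distance to the mean column.
mean_column = W.mean(axis=1)
closeness = float(sum(np.sum((W[:, t] - mean_column) ** 2) for t in range(n)))
identity_gap = abs(closeness - schatten(W @ M_center, 2) ** 2)
```

A random check of this kind cannot prove the property; it only fails to find a counterexample for the sampled matrices.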
One can also verify that this regularizer is the square of the form (25) with $p = 2$ and $M = I_n - \frac{1}{n} 1 1^\top$, where $1$ denotes the $n$-dimensional vector all of whose components are equal to one.

Finally, it is worth noting that the representer theorem does not apply to a family of mixed matrix norms of the form

$$
\Omega(W) = \sum_{i \in \mathbb{N}_d} \|w^i\|_p^2,
$$

where $w^i$ denotes the $i$-th row of $W$ and $p > 1$, $p \neq 2$.

Conclusion

We have characterized the classes of vector and matrix regularizers which lead to certain forms of the solution of the associated interpolation problems. In the vector case, we have proved the necessity of a well-known sufficient condition for the standard representer theorem, which is encountered in many learning and statistical estimation problems. In the matrix case, we have described a novel class of regularizers which lead to a modified representer theorem. This class, which relies upon the notion of matrix nondecreasing function, includes and extends significantly the vector class. To motivate the need for our study, we have discussed some examples of regularizers which have been recently used in the context of multi-task learning and collaborative filtering.

In the future, it would be valuable to study in more detail special cases of the matrix regularizers which we have encountered, such as those based on orthogonally invariant functions. It would also be interesting to investigate how the presence of additional constraints affects the representer theorem. In particular, we have in mind the possibility that the matrix may be constrained to be in a convex cone, such as the set of positive semidefinite matrices. Finally, we leave to future studies the extension of the ideas presented here to the case in which matrices are replaced by operators between two Hilbert spaces.

Acknowledgments

The work of the first and third authors was supported by EPSRC Grant EP/D052807/1 and by the IST Programme of the European Union, under the PASCAL Network of Excellence (IST grant). The second author is supported by an NSF DMS grant.

Appendix A.

Here we collect some auxiliary results which are used in the above analysis. The first lemma states a basic property of connectedness through rotations.

Lemma 20 Let $w, v \in \mathbb{R}^d$ and $d \geq 2$. Then there exists $U \in O_d$ with determinant 1 such that $v = Uw$ if and only if $\|w\| = \|v\|$.

Proof If $v = Uw$ we have that $v^\top v = w^\top w$. Conversely, if $\|w\| = \|v\| \neq 0$, we may choose orthonormal vectors $\{x_\ell : \ell \in \mathbb{N}_{d-1}\} \perp w$ and $\{z_\ell : \ell \in \mathbb{N}_{d-1}\} \perp v$ and form the matrices $R = \begin{pmatrix} w & x_1 & \dots & x_{d-1} \end{pmatrix}$ and $S = \begin{pmatrix} v & z_1 & \dots & z_{d-1} \end{pmatrix}$. We have that $R^\top R = S^\top S$. We wish to solve the equation $UR = S$. For this purpose we choose $U = S R^{-1}$ and note that $U \in O_d$ because

$$
U^\top U = (R^{-1})^\top S^\top S R^{-1} = (R^{-1})^\top R^\top R R^{-1} = I.
$$

Since $d \geq 2$, in the case that $\det(U) = -1$ we can simply change the sign of one of the $x_\ell$ or $z_\ell$ to get $\det(U) = 1$ as required.

The second lemma concerns the monotonicity of the trace norm.

Lemma 21 Let $W, P \in M_{d,n}$ such that $W^\top P = 0$. Then $\|W + P\|_1 \geq \|W\|_1$.

Proof Weyl's monotonicity theorem (Horn and Johnson, 1985) implies that if $A, B \in S^n_+$ and $A \succeq B$ then $\lambda(A) \geq \lambda(B)$ and hence $\operatorname{tr} A^{1/2} \geq \operatorname{tr} B^{1/2}$. We apply this fact to the matrices $W^\top W + P^\top P$ and $W^\top W$ to obtain that

$$
\|W + P\|_1 = \operatorname{tr}\big((W + P)^\top (W + P)\big)^{1/2} = \operatorname{tr}\big(W^\top W + P^\top P\big)^{1/2} \geq \operatorname{tr}\big(W^\top W\big)^{1/2} = \|W\|_1.
$$
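The construction in the proof of Lemma 20 can be followed step by step in code. This is a minimal sketch under stated assumptions: it assumes NumPy, obtains the orthonormal complements from a complete QR factorization (one convenient choice; the lemma only requires that such vectors exist), and the function name `rotation_mapping` and the test vectors are hypothetical.

```python
import numpy as np

def rotation_mapping(w, v):
    """Return U in O_d with det(U) = 1 and U @ w = v.

    Assumes ||w|| = ||v|| != 0 and d >= 2, as in Lemma 20.
    """
    d = w.shape[0]
    # Complete QR of a single column: the last d-1 columns of Q are an
    # orthonormal basis of the orthogonal complement of that column.
    X = np.linalg.qr(w.reshape(d, 1), mode="complete")[0][:, 1:]
    Z = np.linalg.qr(v.reshape(d, 1), mode="complete")[0][:, 1:]
    R = np.column_stack([w, X])   # R = (w x_1 ... x_{d-1})
    S = np.column_stack([v, Z])   # S = (v z_1 ... z_{d-1})
    U = S @ np.linalg.inv(R)      # solves U R = S
    if np.linalg.det(U) < 0:
        # Flip the sign of one z_l; this keeps S^T S unchanged and negates det(U).
        S[:, -1] *= -1.0
        U = S @ np.linalg.inv(R)
    return U

w = np.array([3.0, 0.0, 4.0])
v = np.array([0.0, 5.0, 0.0])     # same Euclidean norm as w
U = rotation_mapping(w, v)
det_U = float(np.linalg.det(U))
```

Because the first columns of $R$ and $S$ are left untouched by the sign flip, the identity $Uw = v$ is preserved when the determinant is corrected.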
