Introduction: Overview of Kernel Methods


1 Introduction: Overview of Kernel Methods
Statistical Data Analysis with Positive Definite Kernels
Kenji Fukumizu
Institute of Statistical Mathematics, ROIS; Department of Statistical Science, Graduate University for Advanced Studies
October 6-10, 2008, Kyushu University

2 Outline
- Basic idea of kernel methods
  - Linear and nonlinear data analysis
  - Essence of kernel methodology
- Kernel PCA: nonlinear extension of PCA
- Ridge regression and its kernelization
2 / 25

3 Outline
- Basic idea of kernel methods
  - Linear and nonlinear data analysis
  - Essence of kernel methodology
- Kernel PCA: nonlinear extension of PCA
- Ridge regression and its kernelization
3 / 25

4 Nonlinear Data Analysis I
Classical linear methods: the data are expressed by a matrix
$$X = \begin{pmatrix} X_1^1 & X_1^2 & \cdots & X_1^m \\ X_2^1 & X_2^2 & \cdots & X_2^m \\ \vdots & \vdots & & \vdots \\ X_N^1 & X_N^2 & \cdots & X_N^m \end{pmatrix}$$
($m$-dimensional, $N$ data points), and linear operations (matrix operations) are used for data analysis, e.g.
- Principal component analysis (PCA)
- Canonical correlation analysis (CCA)
- Linear regression analysis
- Fisher discriminant analysis (FDA)
- Logistic regression, etc.
4 / 25

5 Nonlinear Data Analysis II
Are linear methods sufficient? A nonlinear transform can help.
Example 1: classification. Data that are linearly inseparable in the original coordinates $(x_1, x_2)$ become linearly separable after the transform
$$(x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2).$$
[Figure: the same two classes, linearly inseparable in $(x_1, x_2)$, linearly separable in $(z_1, z_2, z_3)$.]
5 / 25
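
To make this concrete, here is a minimal sketch (Python/NumPy; the circular class boundary and the threshold 0.5 are invented for illustration): two classes that no line separates in $(x_1, x_2)$ are separated by a hyperplane after the quadratic map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes separated by a circle: not linearly separable in (x1, x2).
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 0.5).astype(int)

# Quadratic feature map (z1, z2, z3) = (x1^2, x2^2, sqrt(2) x1 x2).
Z = np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])

# In z-space the classes are separated by the hyperplane z1 + z2 = 0.5.
pred = (Z[:, 0] + Z[:, 1] > 0.5).astype(int)
print((pred == y).mean())  # 1.0: a linear rule suffices in feature space
```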

6 Example 2: dependence of two variables
The correlation coefficient
$$\rho_{XY} = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} = \frac{E[(X - E[X])(Y - E[Y])]}{\sqrt{E[(X - E[X])^2]\, E[(Y - E[Y])^2]}}$$
captures only linear dependence, so transforming the data to incorporate higher-order moments seems attractive.
6 / 25
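
A small illustration of why (an assumed example, not from the slides): $Y = X^2$ with $X$ symmetric around zero is completely determined by $X$, yet the linear correlation is essentially zero; correlation after a nonlinear transform reveals the dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)
Y = X**2                                # fully dependent on X, but nonlinearly

print(np.corrcoef(X, Y)[0, 1])          # ~ 0: linear correlation misses it
print(np.corrcoef(X**2, Y)[0, 1])       # 1.0: visible after transforming X
```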

7 Outline
- Basic idea of kernel methods
  - Linear and nonlinear data analysis
  - Essence of kernel methodology
- Kernel PCA: nonlinear extension of PCA
- Ridge regression and its kernelization
7 / 25

8 Feature space for transforming data
Kernel methodology = a systematic way of analyzing data by transforming them into a high-dimensional feature space and applying linear methods in that space.
Which type of space serves as a feature space?
- The space should incorporate various kinds of nonlinear information about the original data.
- The inner product of the feature space is essential for data analysis (seen in the next subsection).
8 / 25

9 Computational problem of the inner product
For example, how about this?
$$(X, Y, Z) \mapsto (X, Y, Z, X^2, Y^2, Z^2, XY, YZ, ZX, \ldots)$$
But for high-dimensional data, this expansion makes the feature space very large. E.g., if $X$ is 100-dimensional and moments up to the third order are used, the dimensionality of the feature space is
$${}_{100}C_1 + {}_{100}C_2 + {}_{100}C_3 = 166{,}750.$$
This causes a serious computational problem in working with the inner product of the feature space. We need a cleverer way of computing it: the kernel method.
9 / 25
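
The kernel method sidesteps the explicit expansion. A minimal sketch (Python/NumPy, using the homogeneous degree-2 polynomial kernel as the example; the slide itself does not name a specific kernel): the inner product of two 10,000-dimensional feature vectors equals a single 100-dimensional dot product, squared.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)   # 100-dimensional inputs

def phi(v):
    """Explicit degree-2 feature map: all products v_i * v_j (10,000 coords)."""
    return np.outer(v, v).ravel()

explicit = phi(x) @ phi(y)       # inner product in the huge feature space
kernel = (x @ y) ** 2            # same value from one 100-dim dot product
print(np.isclose(explicit, kernel))  # True
```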

10 Inner product by positive definite kernel
A positive definite kernel gives efficient computation of the inner product: with a special choice of the feature space $H$ and feature map
$$\Phi : \mathcal{X} \to H, \quad x \mapsto \Phi(x),$$
we have a function $k$ (a positive definite kernel) such that
$$\langle \Phi(X_i), \Phi(X_j) \rangle = k(X_i, X_j).$$
Many linear methods use only the inner product, without needing the explicit form of the vector $\Phi(X)$.
10 / 25
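
As a concrete instance (an assumed example; the Gaussian kernel is one standard positive definite kernel, not something this slide specifies), the Gram matrix of pairwise kernel values is all that the linear methods below will need, and its positive semidefiniteness can be checked numerically.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = k(X_i, X_j) for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_gram(X)
print((np.linalg.eigvalsh(K) >= -1e-10).all())  # True: K is positive semidefinite
```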

11 Outline
- Basic idea of kernel methods
  - Linear and nonlinear data analysis
  - Essence of kernel methodology
- Kernel PCA: nonlinear extension of PCA
- Ridge regression and its kernelization
11 / 25

12 Review of PCA I
Principal Component Analysis (PCA): given $m$-dimensional data $X_1, \ldots, X_N$, find the $d$ directions that maximize the variance.
Purpose: represent the structure of the data in a low-dimensional space.
12 / 25

13 Review of PCA II
The first principal direction:
$$u_1 = \arg\max_{\|u\|=1} \frac{1}{N} \sum_{i=1}^N \Big\{ u^T \Big( X_i - \frac{1}{N} \sum_{j=1}^N X_j \Big) \Big\}^2 = \arg\max_{\|u\|=1} u^T V u,$$
where $V$ is the variance-covariance matrix:
$$V = \frac{1}{N} \sum_{i=1}^N \Big( X_i - \frac{1}{N} \sum_{j=1}^N X_j \Big) \Big( X_i - \frac{1}{N} \sum_{j=1}^N X_j \Big)^T.$$
- Let $u_1, \ldots, u_m$ be the unit eigenvectors of $V$, in descending order of eigenvalue.
- The $p$-th principal axis is $u_p$.
- The $p$-th principal component of $X_i$ is $u_p^T X_i$.
Observation: PCA can be done if we can compute inner products: both the quadratic form $u^T V u$ and the principal components $u_p^T X_i$ reduce to inner products between the unit vectors and the data.
13 / 25
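
A minimal sketch of this review in code (Python/NumPy; the random dataset is assumed for illustration):

```python
import numpy as np

def pca(X, d):
    """Top-d principal axes u_p and components u_p^T X_i of data X (N x m)."""
    Xc = X - X.mean(axis=0)                  # center the data
    V = Xc.T @ Xc / len(X)                   # variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(V)     # ascending eigenvalues
    U = eigvecs[:, ::-1][:, :d]              # unit eigenvectors, descending order
    return U, X @ U                          # p-th column: u_p and u_p^T X_i

X = np.random.default_rng(0).normal(size=(100, 5))
U, components = pca(X, d=2)
```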

14 Kernel PCA I
$X_1, \ldots, X_N$: $m$-dimensional data. Transform the data by a feature map $\Phi$ into a feature space $H$:
$$X_1, \ldots, X_N \mapsto \Phi(X_1), \ldots, \Phi(X_N).$$
Assume that the feature space has the inner product $\langle \cdot, \cdot \rangle$. Apply PCA to the transformed data: maximize the variance of the projection onto the unit vector $f$,
$$\max_{\|f\|=1} \mathrm{Var}[\langle f, \Phi(X) \rangle] = \max_{\|f\|=1} \frac{1}{N} \sum_{i=1}^N \big\langle f, \tilde{\Phi}(X_i) \big\rangle^2,$$
where $\tilde{\Phi}(X_i) = \Phi(X_i) - \frac{1}{N} \sum_{j=1}^N \Phi(X_j)$.
Note: it suffices to use $f = \sum_{i=1}^N a_i \tilde{\Phi}(X_i)$; the direction orthogonal to $\mathrm{Span}\{\tilde{\Phi}(X_1), \ldots, \tilde{\Phi}(X_N)\}$ does not contribute.
14 / 25

15 Kernel PCA II
The PCA solution:
$$\max_a\; a^T \tilde{K}^2 a \quad \text{subject to} \quad a^T \tilde{K} a = 1,$$
where $\tilde{K}$ is the $N \times N$ matrix with $\tilde{K}_{ij} = \langle \tilde{\Phi}(X_i), \tilde{\Phi}(X_j) \rangle$.
Note:
$$\frac{1}{N} \sum_{i=1}^N \langle f, \tilde{\Phi}(X_i) \rangle^2 = \frac{1}{N} \sum_{i=1}^N \Big( \sum_{j=1}^N a_j \langle \tilde{\Phi}(X_j), \tilde{\Phi}(X_i) \rangle \Big)^2 = \frac{1}{N} a^T \tilde{K}^2 a,$$
$$\|f\|^2 = \Big\langle \sum_{i=1}^N a_i \tilde{\Phi}(X_i), \sum_{i=1}^N a_i \tilde{\Phi}(X_i) \Big\rangle = a^T \tilde{K} a.$$
The first principal component of the data point $X_i$ is
$$\langle \tilde{\Phi}(X_i), \hat{f} \rangle = \sqrt{\lambda_1}\, u_1^i,$$
where $\tilde{K} = \sum_{i=1}^N \lambda_i u_i u_i^T$ is the eigendecomposition and $u_1^i$ denotes the $i$-th entry of $u_1$.
15 / 25

16 Kernel PCA III
Observation: PCA in the feature space can be done if we can compute $\langle \tilde{\Phi}(X_i), \tilde{\Phi}(X_j) \rangle$, or equivalently $\langle \Phi(X_i), \Phi(X_j) \rangle = k(X_i, X_j)$. The principal direction is obtained in the form $f = \sum_i a_i \tilde{\Phi}(X_i)$, i.e., in the linear hull of the data.
Note:
$$\tilde{K}_{ij} = \langle \tilde{\Phi}(X_i), \tilde{\Phi}(X_j) \rangle = \langle \Phi(X_i), \Phi(X_j) \rangle - \frac{1}{N} \sum_{b=1}^N \langle \Phi(X_i), \Phi(X_b) \rangle - \frac{1}{N} \sum_{a=1}^N \langle \Phi(X_a), \Phi(X_j) \rangle + \frac{1}{N^2} \sum_{a,b=1}^N \langle \Phi(X_a), \Phi(X_b) \rangle$$
$$= k(X_i, X_j) - \frac{1}{N} \sum_{b=1}^N k(X_i, X_b) - \frac{1}{N} \sum_{a=1}^N k(X_a, X_j) + \frac{1}{N^2} \sum_{a,b=1}^N k(X_a, X_b).$$
16 / 25
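
Putting slides 14-16 together, a minimal kernel PCA sketch (Python/NumPy; the degree-2 polynomial kernel in the usage example is an assumed choice). The centering matrix $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$ reproduces the $\tilde{K}$ formula above as $\tilde{K} = HKH$.

```python
import numpy as np

def kernel_pca(K, d):
    """First d kernel principal components from a Gram matrix
    K[i, j] = k(X_i, X_j); the p-th column holds sqrt(lambda_p) * u_p."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N       # centering matrix
    K_tilde = H @ K @ H                       # K~_ij = <Phi~(X_i), Phi~(X_j)>
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    lam = eigvals[::-1][:d]                   # descending eigenvalues
    U = eigvecs[:, ::-1][:, :d]               # corresponding unit eigenvectors
    return U * np.sqrt(np.maximum(lam, 0))

X = np.random.default_rng(0).normal(size=(50, 3))
K = (X @ X.T) ** 2                            # k(x, y) = (x^T y)^2, for example
Z = kernel_pca(K, d=2)                        # nonlinear principal components
```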

17 Outline
- Basic idea of kernel methods
  - Linear and nonlinear data analysis
  - Essence of kernel methodology
- Kernel PCA: nonlinear extension of PCA
- Ridge regression and its kernelization
17 / 25

18 Review: Linear Regression I
Data: $(X_1, Y_1), \ldots, (X_N, Y_N)$, where $X_i$ is the explanatory variable (covariate, $m$-dimensional) and $Y_i$ is the response variable (1-dimensional).
Regression model: find the best linear relation
$$Y_i = a^T X_i + \varepsilon_i.$$
18 / 25

19 Review: Linear Regression II
Least squares method:
$$\min_a \sum_{i=1}^N \big( Y_i - a^T X_i \big)^2.$$
Matrix expression:
$$X = \begin{pmatrix} X_1^1 & X_1^2 & \cdots & X_1^m \\ X_2^1 & X_2^2 & \cdots & X_2^m \\ \vdots & \vdots & & \vdots \\ X_N^1 & X_N^2 & \cdots & X_N^m \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{pmatrix}.$$
Solution:
$$\hat{a} = (X^T X)^{-1} X^T Y, \qquad \hat{y} = \hat{a}^T x = Y^T X (X^T X)^{-1} x.$$
Observation: linear regression can be done if we can compute the inner products $X^T X$, $\hat{a}^T x$, and so on.
19 / 25

20 Ridge Regression
Ridge regression: find a linear relation by
$$\min_a \sum_{i=1}^N \big( Y_i - a^T X_i \big)^2 + \lambda \|a\|^2,$$
where $\lambda$ is the regularization coefficient.
Solution:
$$\hat{a} = (X^T X + \lambda I_m)^{-1} X^T Y,$$
and for a general $x$,
$$\hat{y}(x) = \hat{a}^T x = Y^T X (X^T X + \lambda I_m)^{-1} x.$$
Ridge regression is useful when $(X^T X)^{-1}$ does not exist, or the inversion is numerically unstable.
20 / 25
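
A minimal sketch of the closed form (Python/NumPy; the data are invented for illustration). Solving the linear system is preferable to forming the inverse explicitly.

```python
import numpy as np

def ridge(X, Y, lam):
    """a_hat = (X^T X + lam * I_m)^{-1} X^T Y."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=30)
a_hat = ridge(X, Y, lam=0.1)
x_new = rng.normal(size=5)
y_hat = a_hat @ x_new                 # prediction y_hat(x) = a_hat^T x
```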

21 Kernelization of Ridge Regression I
Data: $(X_1, Y_1), \ldots, (X_N, Y_N)$ ($Y_i$: 1-dimensional). Transform $X_i$ by a feature map $\Phi$ into a feature space $H$:
$$X_1, \ldots, X_N \mapsto \Phi(X_1), \ldots, \Phi(X_N).$$
Assume that the feature space has the inner product $\langle \cdot, \cdot \rangle_H$. Apply ridge regression to the transformed data: find the vector $f$ such that
$$\min_{f \in H} \sum_{i=1}^N \big( Y_i - \langle f, \Phi(X_i) \rangle_H \big)^2 + \lambda \|f\|_H^2.$$
Similarly to kernel PCA, we can assume $f = \sum_{j=1}^N c_j \Phi(X_j)$:
$$\min_c \sum_{i=1}^N \Big( Y_i - \sum_{j=1}^N c_j \langle \Phi(X_j), \Phi(X_i) \rangle_H \Big)^2 + \lambda \Big\| \sum_{j=1}^N c_j \Phi(X_j) \Big\|_H^2.$$
21 / 25

22 Kernelization of Ridge Regression II
Solution:
$$\hat{c} = (K + \lambda I_N)^{-1} Y, \quad \text{where } K_{ij} = \langle \Phi(X_i), \Phi(X_j) \rangle_H = k(X_i, X_j).$$
For a general $x$,
$$\hat{y}(x) = \langle \hat{f}, \Phi(x) \rangle_H = \Big\langle \sum\nolimits_j \hat{c}_j \Phi(X_j), \Phi(x) \Big\rangle_H = Y^T (K + \lambda I_N)^{-1} \mathbf{k}(x),$$
where
$$\mathbf{k}(x) = \begin{pmatrix} \langle \Phi(X_1), \Phi(x) \rangle \\ \vdots \\ \langle \Phi(X_N), \Phi(x) \rangle \end{pmatrix} = \begin{pmatrix} k(X_1, x) \\ \vdots \\ k(X_N, x) \end{pmatrix}.$$
22 / 25
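
A minimal sketch of kernel ridge regression (Python/NumPy; the Gaussian kernel and the sine data are assumed choices for illustration, not prescribed by the slides):

```python
import numpy as np

def kernel_ridge_fit(K, Y, lam):
    """c_hat = (K + lam * I_N)^{-1} Y for a Gram matrix K_ij = k(X_i, X_j)."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), Y)

def kernel_ridge_predict(c_hat, k_x):
    """y_hat(x) = sum_j c_hat_j k(X_j, x) = c_hat^T k(x)."""
    return c_hat @ k_x

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)                        # Gaussian kernel, sigma = 1
c_hat = kernel_ridge_fit(K, Y, lam=0.1)

x_new = np.array([0.5])
k_x = np.exp(-((X - x_new) ** 2).sum(-1) / 2)
print(kernel_ridge_predict(c_hat, k_x))    # roughly sin(0.5) ~ 0.48
```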

23 Kernelization of Ridge Regression III
Proof. The matrix expression gives
$$\sum_{i=1}^N \Big( Y_i - \sum_{j=1}^N c_j \langle \Phi(X_j), \Phi(X_i) \rangle_H \Big)^2 + \lambda \Big\| \sum_{j=1}^N c_j \Phi(X_j) \Big\|_H^2$$
$$= (Y - Kc)^T (Y - Kc) + \lambda c^T K c = c^T (K^2 + \lambda K) c - 2 Y^T K c + Y^T Y.$$
Setting the gradient with respect to $c$ to zero gives $2(K^2 + \lambda K)c - 2KY = 0$, i.e., $K\{(K + \lambda I_N)c - Y\} = 0$, so the optimal $c$ is given by $\hat{c} = (K + \lambda I_N)^{-1} Y$. Inserting this into $\hat{y}(x) = \sum_j \hat{c}_j \langle \Phi(X_j), \Phi(x) \rangle_H$, we have the claim.
23 / 25

24 Kernelization of Ridge Regression IV
Observation: ridge regression in the feature space can be done if we can compute the inner product $\langle \Phi(X_i), \Phi(X_j) \rangle = k(X_i, X_j)$. The resulting solution is of the form $f = \sum_i c_i \Phi(X_i)$, i.e., it lies in the linear hull of the data; the orthogonal directions do not contribute to the objective function.
24 / 25

25 Kernel methodology
- A feature space $H$ with inner product $\langle \cdot, \cdot \rangle$.
- Mapping of the data into the feature space: $X_1, \ldots, X_N \mapsto \Phi(X_1), \ldots, \Phi(X_N) \in H$.
- If the computation of the inner product $\langle \Phi(X_i), \Phi(X_j) \rangle$ is tractable, various linear methods can be extended to the feature space. This gives methods of nonlinear data analysis.
How can we prepare such a feature space? Positive definite kernels!
25 / 25
