Introduction: Overview of Kernel Methods



Introduction: Overview of Kernel Methods
Statistical Data Analysis with Positive Definite Kernels
Kenji Fukumizu
Institute of Statistical Mathematics, ROIS / Department of Statistical Science, Graduate University for Advanced Studies
October 6-10, 2008, Kyushu University

Outline
- Basic idea of kernel methods
  - Linear and nonlinear data analysis
  - Essence of kernel methodology
- Kernel PCA: nonlinear extension of PCA
- Ridge regression and its kernelization

Basic idea of kernel methods: Linear and nonlinear data analysis

Nonlinear Data Analysis I

Classical linear methods: the data are expressed by a matrix

$$X = \begin{pmatrix} X_1^1 & X_1^2 & \cdots & X_1^m \\ X_2^1 & X_2^2 & \cdots & X_2^m \\ \vdots & & & \vdots \\ X_N^1 & X_N^2 & \cdots & X_N^m \end{pmatrix} \quad (m\text{-dimensional}, \; N \text{ data}),$$

and linear operations (matrix operations) are used for data analysis, e.g.
- Principal component analysis (PCA)
- Canonical correlation analysis (CCA)
- Linear regression analysis
- Fisher discriminant analysis (FDA)
- Logistic regression, etc.

Nonlinear Data Analysis II

Are linear methods sufficient? A nonlinear transform can help.

Example 1: classification. [Figure: a dataset that is linearly inseparable in the $(x_1, x_2)$ plane becomes linearly separable after mapping to $(z_1, z_2, z_3)$.]

$$(x_1, x_2) \;\mapsto\; (z_1, z_2, z_3) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)$$

(Unclear? Watch http://jp.youtube.com/watch?v=3licbrzprza)
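As an illustration (not from the slides), here is a minimal NumPy sketch of this quadratic feature map: points labeled by a circle in $(x_1, x_2)$ cannot be separated by a line, but after the transform a single linear threshold splits them perfectly. The dataset and the threshold are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes separated by a circle: no line in (x1, x2) separates them.
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 0.5).astype(int)

# The quadratic feature map from the slide: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2).
Z = np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])

# In feature space the classes are split by the plane z1 + z2 = 0.5:
# a single linear threshold now classifies perfectly.
pred = (Z[:, 0] + Z[:, 1] > 0.5).astype(int)
print(np.mean(pred == y))  # 1.0
```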

Example 2: dependence of two data. The correlation

$$\rho_{XY} = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} = \frac{E[(X - E[X])(Y - E[Y])]}{\sqrt{E[(X - E[X])^2]\, E[(Y - E[Y])^2]}}$$

captures only linear dependence, so transforming the data to incorporate high-order moments seems attractive.
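A small NumPy illustration (my own, not from the slides) of why higher-order moments matter: for a standard normal $X$ and $Y = X^2$, the linear correlation is near zero even though $Y$ is a deterministic function of $X$, while the correlation after the nonlinear transform $X \mapsto X^2$ is 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=100_000)
Y = X**2  # completely determined by X, yet linearly uncorrelated with it

def corr(a, b):
    # Correlation from the definition: Cov[a, b] / sqrt(Var[a] Var[b]).
    ca, cb = a - a.mean(), b - b.mean()
    return np.mean(ca * cb) / np.sqrt(np.mean(ca**2) * np.mean(cb**2))

print(abs(corr(X, Y)) < 0.05)      # True: linear correlation misses the dependence
print(round(corr(X**2, Y), 6))     # 1.0: the transform X -> X^2 exposes it
```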

Basic idea of kernel methods: Essence of kernel methodology

Feature space for transforming data

Kernel methodology = a systematic way of analyzing data by transforming them into a high-dimensional feature space and applying linear methods on that space.

Which type of space serves as a feature space?
- The space should incorporate various nonlinear information of the original data.
- The inner product of the feature space is essential for data analysis (seen in the next subsection).

Computational problem of the inner product

For example, how about

$$(X, Y, Z) \;\mapsto\; (X, Y, Z, X^2, Y^2, Z^2, XY, YZ, ZX, \ldots)?$$

But for high-dimensional data, this expansion makes the feature space very huge! E.g., if $X$ is 100-dimensional and the moments up to the third order are used, the dimensionality of the feature space is

$$\binom{100}{1} + \binom{100}{2} + \binom{100}{3} = 166750.$$

This causes a serious computational problem in working with the inner product of the feature space. We need a cleverer way of computing it: the kernel method.
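The count above can be checked directly; `math.comb` is Python's binomial coefficient:

```python
from math import comb

# Dimensionality of the feature space of monomials in distinct coordinates,
# degrees 1 to 3, for a 100-dimensional input (the count from the slide).
dim = comb(100, 1) + comb(100, 2) + comb(100, 3)
print(dim)  # 166750
```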

Inner product by positive definite kernel

A positive definite kernel gives efficient computation of the inner product. With a special choice of the feature space $H$ and feature map

$$\Phi : \mathcal{X} \to H \;(\text{feature space}), \quad x \mapsto \Phi(x),$$

we have a function $k(x, y)$, the positive definite kernel, such that

$$\langle \Phi(X_i), \Phi(X_j) \rangle = k(X_i, X_j).$$

Many linear methods use only the inner product, without needing the explicit form of the vector $\Phi(X)$.
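A concrete instance (the particular kernel is my choice; the slide does not name one): for the quadratic feature map on $\mathbb{R}^2$ from Example 1, the homogeneous polynomial kernel $k(x, y) = (x^T y)^2$ reproduces the feature-space inner product without ever forming $\Phi$.

```python
import numpy as np

# phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), so <phi(x), phi(y)> = (x . y)^2.

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    # Homogeneous quadratic kernel: computed in the input space, no phi needed.
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.isclose(np.dot(phi(x), phi(y)), k(x, y)))  # True
```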

Kernel PCA: Nonlinear extension of PCA

Review of PCA I

Principal Component Analysis (PCA): given m-dimensional data $X_1, \ldots, X_N$, find d directions that maximize the variance. Purpose: represent the structure of the data in a low-dimensional space.

Review of PCA II

The first principal direction:

$$u_1 = \arg\max_{\|u\|=1} \frac{1}{N} \sum_{i=1}^N \Big\{ u^T \Big( X_i - \frac{1}{N} \sum_{j=1}^N X_j \Big) \Big\}^2 = \arg\max_{\|u\|=1} u^T V u,$$

where $V$ is the variance-covariance matrix:

$$V = \frac{1}{N} \sum_{i=1}^N \Big( X_i - \frac{1}{N} \sum_{j=1}^N X_j \Big) \Big( X_i - \frac{1}{N} \sum_{j=1}^N X_j \Big)^T.$$

- Let $u_1, \ldots, u_m$ be the eigenvectors of $V$, in descending order of eigenvalue.
- The p-th principal axis is $u_p$.
- The p-th principal component of $X_i$ is $u_p^T X_i$.

Observation: PCA can be done if we can compute inner products: both the covariance matrix $V$ and the projection of the data onto the unit eigenvectors are built from inner products.
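The recipe above can be sketched in NumPy (a minimal illustration with synthetic data, not code from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: N = 200 samples in m = 3 dimensions, variance largest along axis 1.
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])

# Variance-covariance matrix V of the centered data.
Xc = X - X.mean(axis=0)
V = Xc.T @ Xc / len(X)

# Eigenvectors of V in descending order of eigenvalue: the principal axes u_p.
eigvals, eigvecs = np.linalg.eigh(V)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
U = eigvecs[:, order]

# The p-th principal component of X_i is u_p^T X_i.
components = Xc @ U
print(U[:, 0])  # first principal axis: close to +/-(1, 0, 0) for this data
```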

Kernel PCA I

$X_1, \ldots, X_N$: m-dimensional data. Transform the data by a feature map $\Phi$ into a feature space $H$:

$$X_1, \ldots, X_N \;\longmapsto\; \Phi(X_1), \ldots, \Phi(X_N).$$

Assume that the feature space has the inner product $\langle \cdot, \cdot \rangle$. Apply PCA to the transformed data: maximize the variance of the projection onto a unit vector $f$,

$$\max_{\|f\|=1} \widehat{\mathrm{Var}}[\langle f, \Phi(X) \rangle] = \max_{\|f\|=1} \frac{1}{N} \sum_{i=1}^N \big\langle f, \tilde{\Phi}(X_i) \big\rangle^2,$$

where $\tilde{\Phi}(X_i) = \Phi(X_i) - \frac{1}{N} \sum_{j=1}^N \Phi(X_j)$.

Note: it suffices to use $f = \sum_{i=1}^N a_i \tilde{\Phi}(X_i)$; the direction orthogonal to $\mathrm{Span}\{\tilde{\Phi}(X_1), \ldots, \tilde{\Phi}(X_N)\}$ does not contribute.

Kernel PCA II

The PCA solution:

$$\max_a \; a^T \tilde{K}^2 a \quad \text{subject to} \quad a^T \tilde{K} a = 1,$$

where $\tilde{K}$ is the $N \times N$ matrix with $\tilde{K}_{ij} = \langle \tilde{\Phi}(X_i), \tilde{\Phi}(X_j) \rangle$.

Note:

$$\frac{1}{N} \sum_{i=1}^N \langle f, \tilde{\Phi}(X_i) \rangle^2 = \frac{1}{N} \sum_{i=1}^N \Big( \sum_{j=1}^N a_j \langle \tilde{\Phi}(X_j), \tilde{\Phi}(X_i) \rangle \Big)^2 = \frac{1}{N} a^T \tilde{K}^2 a,$$

$$\|f\|^2 = \Big\langle \sum_{i=1}^N a_i \tilde{\Phi}(X_i), \sum_{i=1}^N a_i \tilde{\Phi}(X_i) \Big\rangle = a^T \tilde{K} a.$$

The first principal component of the data $X_i$ is

$$\langle \tilde{\Phi}(X_i), \hat{f} \rangle = \sqrt{\lambda_1}\, u_{1i}, \quad \text{where } \tilde{K} = \sum_{i=1}^N \lambda_i u_i u_i^T \text{ is the eigendecomposition.}$$

Kernel PCA III

Observation: PCA in the feature space can be done if we can compute $\langle \tilde{\Phi}(X_i), \tilde{\Phi}(X_j) \rangle$, or equivalently $\langle \Phi(X_i), \Phi(X_j) \rangle = k(X_i, X_j)$. The principal direction is obtained in the form $f = \sum_i a_i \tilde{\Phi}(X_i)$, i.e., in the linear hull of the data.

Note:

$$\tilde{K}_{ij} = \langle \tilde{\Phi}(X_i), \tilde{\Phi}(X_j) \rangle = \langle \Phi(X_i), \Phi(X_j) \rangle - \frac{1}{N} \sum_{b=1}^N \langle \Phi(X_i), \Phi(X_b) \rangle - \frac{1}{N} \sum_{a=1}^N \langle \Phi(X_a), \Phi(X_j) \rangle + \frac{1}{N^2} \sum_{a,b=1}^N \langle \Phi(X_a), \Phi(X_b) \rangle$$

$$= k(X_i, X_j) - \frac{1}{N} \sum_{b=1}^N k(X_i, X_b) - \frac{1}{N} \sum_{a=1}^N k(X_a, X_j) + \frac{1}{N^2} \sum_{a,b=1}^N k(X_a, X_b).$$
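Putting the three kernel PCA slides together, a compact NumPy sketch (the Gaussian kernel and the synthetic data are my own illustrative choices, not from the slides):

```python
import numpy as np

def kernel_pca(X, k, d=2):
    """Kernel PCA: centered Gram matrix, then its eigendecomposition."""
    N = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    # Centering formula from the slide; equivalently H K H with H = I - (1/N) 11^T.
    H = np.eye(N) - np.ones((N, N)) / N
    Kc = H @ K @ H
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:d]
    lam, U = eigvals[order], eigvecs[:, order]
    # p-th principal component of X_i is sqrt(lambda_p) * u_{p,i}.
    return U * np.sqrt(np.maximum(lam, 0.0))

# Gaussian (RBF) kernel: one common positive definite kernel (my choice here).
rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z = kernel_pca(X, rbf, d=2)
print(Z.shape)  # (50, 2)
```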

Ridge regression and its kernelization

Review: Linear Regression I

Linear regression. Data: $(X_1, Y_1), \ldots, (X_N, Y_N)$.
- $X_i$: explanatory variable, covariate (m-dimensional)
- $Y_i$: response variable (1-dimensional)

Regression model: find the best linear relation $Y_i = a^T X_i + \varepsilon_i$.

Review: Linear Regression II

Least squares method:

$$\min_a \sum_{i=1}^N \| Y_i - a^T X_i \|^2.$$

Matrix expression:

$$X = \begin{pmatrix} X_1^1 & X_1^2 & \cdots & X_1^m \\ X_2^1 & X_2^2 & \cdots & X_2^m \\ \vdots & & & \vdots \\ X_N^1 & X_N^2 & \cdots & X_N^m \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{pmatrix}.$$

Solution:

$$\hat{a} = (X^T X)^{-1} X^T Y, \qquad \hat{y} = \hat{a}^T x = Y^T X (X^T X)^{-1} x.$$

Observation: linear regression can be done if we can compute inner products such as $X^T X$, $\hat{a}^T x$, and so on.
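A quick NumPy check of the least-squares solution on synthetic data (illustration only; the coefficients and noise level are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 100, 3
X = rng.normal(size=(N, m))
a_true = np.array([1.0, -2.0, 0.5])
Y = X @ a_true + 0.01 * rng.normal(size=N)   # small noise

# Least-squares solution a^ = (X^T X)^{-1} X^T Y
# (np.linalg.solve avoids forming the explicit inverse).
a_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(a_hat)  # close to a_true = [1.0, -2.0, 0.5]
```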

Ridge Regression

Ridge regression: find a linear relation by

$$\min_a \sum_{i=1}^N \| Y_i - a^T X_i \|^2 + \lambda \|a\|^2.$$

Solution:

$$\hat{a} = (X^T X + \lambda I_m)^{-1} X^T Y,$$

where $\lambda$ is the regularization coefficient. For a general $x$,

$$\hat{y}(x) = \hat{a}^T x = Y^T X (X^T X + \lambda I_m)^{-1} x.$$

Ridge regression is useful when $(X^T X)^{-1}$ does not exist, or when the inversion is numerically unstable.
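A short NumPy sketch (mine, not from the slides) showing the ridge solution staying well-defined in exactly the case the slide mentions, where $X^T X$ is singular:

```python
import numpy as np

def ridge(X, Y, lam):
    # a^ = (X^T X + lambda I_m)^{-1} X^T Y: well-posed even when X^T X is singular.
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

rng = np.random.default_rng(0)
# Fewer samples than features, so X^T X (10 x 10, rank <= 5) is singular
# and plain least squares is not defined.
X = rng.normal(size=(5, 10))
Y = rng.normal(size=5)
a_hat = ridge(X, Y, lam=1.0)
print(a_hat.shape)  # (10,)
```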

Kernelization of Ridge Regression I

$(X_1, Y_1), \ldots, (X_N, Y_N)$ ($Y_i$: 1-dimensional). Transform $X_i$ by a feature map $\Phi$ into a feature space $H$:

$$X_1, \ldots, X_N \;\longmapsto\; \Phi(X_1), \ldots, \Phi(X_N).$$

Assume that the feature space has the inner product $\langle \cdot, \cdot \rangle_H$. Apply ridge regression to the transformed data: find the vector $f \in H$ such that

$$\min_{f \in H} \sum_{i=1}^N \| Y_i - \langle f, \Phi(X_i) \rangle_H \|^2 + \lambda \|f\|_H^2.$$

Similarly to kernel PCA, we can assume $f = \sum_{j=1}^N c_j \Phi(X_j)$:

$$\min_c \sum_{i=1}^N \Big\| Y_i - \sum_{j=1}^N c_j \langle \Phi(X_j), \Phi(X_i) \rangle_H \Big\|^2 + \lambda \Big\| \sum_{j=1}^N c_j \Phi(X_j) \Big\|_H^2.$$

Kernelization of Ridge Regression II

Solution:

$$\hat{c} = (K + \lambda I_N)^{-1} Y, \quad \text{where } K_{ij} = \langle \Phi(X_i), \Phi(X_j) \rangle_H = k(X_i, X_j).$$

For a general $x$,

$$\hat{y}(x) = \langle f, \Phi(x) \rangle_H = \sum_j \hat{c}_j \langle \Phi(X_j), \Phi(x) \rangle_H = Y^T (K + \lambda I_N)^{-1} \mathbf{k},$$

where

$$\mathbf{k} = \begin{pmatrix} \langle \Phi(X_1), \Phi(x) \rangle \\ \vdots \\ \langle \Phi(X_N), \Phi(x) \rangle \end{pmatrix} = \begin{pmatrix} k(X_1, x) \\ \vdots \\ k(X_N, x) \end{pmatrix}.$$
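The two formulas above translate directly into NumPy; the Gaussian kernel, the sine target, and the value of $\lambda$ below are illustrative choices of mine, not from the slides:

```python
import numpy as np

def kernel_ridge_fit(X, Y, k, lam):
    # c^ = (K + lambda I_N)^{-1} Y with K_ij = k(X_i, X_j).
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), Y)

def kernel_ridge_predict(X, c, k, x):
    # y^(x) = sum_j c^_j k(X_j, x), i.e. c^ dotted with the vector k(., x).
    return sum(cj * k(xj, x) for cj, xj in zip(c, X))

# Illustrative setup: fit y = sin(x) on [0, pi] with a Gaussian kernel.
rbf = lambda x, y: np.exp(-(x - y) ** 2)
X = np.linspace(0.0, np.pi, 20)
Y = np.sin(X)
c = kernel_ridge_fit(X, Y, rbf, lam=1e-3)
print(kernel_ridge_predict(X, c, rbf, np.pi / 2))  # close to sin(pi/2) = 1
```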

Kernelization of Ridge Regression III

Proof. The matrix expression gives

$$\sum_{i=1}^N \Big\| Y_i - \sum_{j=1}^N c_j \langle \Phi(X_j), \Phi(X_i) \rangle_H \Big\|^2 + \lambda \Big\| \sum_{j=1}^N c_j \Phi(X_j) \Big\|_H^2$$

$$= (Y - Kc)^T (Y - Kc) + \lambda c^T K c = c^T (K^2 + \lambda K) c - 2 Y^T K c + Y^T Y.$$

It follows that the optimal $c$ is given by $\hat{c} = (K + \lambda I_N)^{-1} Y$. Inserting this into $\hat{y}(x) = \sum_j \hat{c}_j \langle \Phi(X_j), \Phi(x) \rangle_H$, we have the claim.

Kernelization of Ridge Regression IV

Observation: ridge regression in the feature space can be done if we can compute the inner product $\langle \Phi(X_i), \Phi(X_j) \rangle = k(X_i, X_j)$. The resulting coefficient is of the form $f = \sum_i c_i \Phi(X_i)$, i.e., it lies in the linear hull of the data; the orthogonal directions do not contribute to the objective function.

Kernel methodology
- A feature space $H$ with inner product $\langle \cdot, \cdot \rangle$.
- Mapping of the data into the feature space: $X_1, \ldots, X_N \mapsto \Phi(X_1), \ldots, \Phi(X_N) \in H$.
- If the computation of the inner product $\langle \Phi(X_i), \Phi(X_j) \rangle$ is tractable, various linear methods can be extended to the feature space, giving methods of nonlinear data analysis.
- How can we prepare such a feature space? A positive definite kernel!