Data visualization and dimensionality reduction using kernel maps with a reference point
Johan Suykens
K.U. Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium
Email: [email protected]
International Conference on Computational Harmonic Analysis (ICCHA 2007), Shanghai, June 2007
Contents
- Context: support vector machines and kernel based learning
- Core problems: least squares support vector machines
- Classification and kernel principal component analysis
- Data visualization
- Kernel eigenmap methods
- Kernel maps with a reference point: linear system solution
- Examples
Living in a data world
(Figure: application domains - biomedical, energy, process industry, bio-informatics, multimedia, traffic.)
Support vector machines and kernel methods: context
- With new technologies (e.g. microarrays, proteomics), massive high-dimensional data sets become available.
- Tasks and objectives: predictive modelling, knowledge discovery and integration, data fusion (classification, feature selection, prior knowledge incorporation, correlation analysis, ranking, robustness).
- Supervised, unsupervised or semi-supervised learning, depending on the given data and problem.
- Need for modelling techniques that can operate on different data types (sequences, graphs, numerical, categorical, ...).
- Linear as well as nonlinear models.
- Reliable methods: numerically, computationally, statistically.
Kernel based learning: interdisciplinary challenges
(Figure: SVM & kernel methods at the intersection of neural networks, data mining, linear algebra, pattern recognition, mathematics, machine learning, statistics, optimization, signal processing, and systems and control theory.)
Estimation in Reproducing Kernel Hilbert Spaces (RKHS)
Variational problem [Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]: find a function $f$ such that
$\min_{f \in \mathcal{H}} \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \|f\|_K^2$
with $L(\cdot,\cdot)$ the loss function and $\|f\|_K$ the norm in the RKHS $\mathcal{H}$ defined by $K$.
Representer theorem: for a convex loss function the solution is of the form
$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$.
Reproducing property: $f(x) = \langle f, K_x \rangle_K$ with $K_x(\cdot) = K(x, \cdot)$.
Some special cases:
- $L(y, f(x)) = (y - f(x))^2$: regularization network
- $L(y, f(x)) = |y - f(x)|_{\epsilon}$: SVM regression with the $\epsilon$-insensitive loss function (figure: $\epsilon$-insensitive tube)
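For the squared loss, the representer theorem reduces training to one linear system: plugging $f(x) = \sum_j \alpha_j K(x, x_j)$ into the variational problem gives $(K + \lambda N I)\alpha = y$. The sketch below illustrates this regularization-network fit in NumPy; the Gaussian RBF kernel and the parameter values are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma2=1.0):
    # Gaussian RBF kernel matrix: K_ij = exp(-||x_i - x_j||^2 / sigma2)
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / sigma2)

def fit_regularization_network(X, y, lam=1e-2, sigma2=1.0):
    # Representer theorem: f(x) = sum_i alpha_i K(x, x_i);
    # the squared loss leads to the linear system (K + lam*N*I) alpha = y.
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma2)
    return np.linalg.solve(K + lam * N * np.eye(N), y)

def predict(Xtest, Xtrain, alpha, sigma2=1.0):
    # Evaluate f at new points using the fitted expansion coefficients.
    return rbf_kernel(Xtest, Xtrain, sigma2) @ alpha
```

For example, alpha = fit_regularization_network(X, y) followed by predict(Xnew, X, alpha) evaluates the fitted function at new inputs.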
Different views on kernel based models
(Figure: SVM / LS-SVM, kriging, RKHS and Gaussian processes as related model classes.)
Some early history on RKHS: Moore; 1940: Aronszajn; 1951: Krige; 1970: Parzen; 1971: Kimeldorf & Wahba.
Obtaining complementary insights from different perspectives: kernels are used in different methodologies:
- Support vector machines (SVM): optimization approach (primal/dual)
- Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
- Gaussian processes (GP): probabilistic/Bayesian approach
SVMs: living in two worlds...
Primal space (feature map $\varphi$ from input space to feature space): $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$.
Dual space: $y(x) = \mathrm{sign}[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b]$ with $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ (the "kernel trick").
(Figure: network interpretations, with hidden units $\varphi_1(x), \ldots, \varphi_{n_h}(x)$ and weights $w_1, \ldots, w_{n_h}$ in the primal, and kernel units $K(x, x_1), \ldots, K(x, x_{\#sv})$ with weights $\alpha_1, \ldots, \alpha_{\#sv}$ in the dual.)
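The dual representation only needs the stored support vectors, their labels and their multipliers. A minimal sketch of evaluating such a decision function is given below; the RBF kernel is again just an illustrative choice.

```python
import numpy as np

def svm_decision(x, sv_X, sv_y, alpha, b, sigma2=1.0):
    # Dual-form classifier: y(x) = sign( sum_i alpha_i y_i K(x, x_i) + b )
    k = np.exp(-np.sum((sv_X - x) ** 2, axis=1) / sigma2)  # K(x, x_i) for all support vectors
    return np.sign(np.dot(alpha * sv_y, k) + b)
```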
Least Squares Support Vector Machines: core problems
Regression (RR):
$\min_{w,b,e} w^T w + \gamma \sum_i e_i^2$ s.t. $y_i = w^T \varphi(x_i) + b + e_i, \ \forall i$
Classification (FDA):
$\min_{w,b,e} w^T w + \gamma \sum_i e_i^2$ s.t. $y_i (w^T \varphi(x_i) + b) = 1 - e_i, \ \forall i$
Principal component analysis (PCA):
$\min_{w,b,e} w^T w - \gamma \sum_i e_i^2$ s.t. $e_i = w^T \varphi(x_i) + b, \ \forall i$
Canonical correlation analysis / partial least squares (CCA/PLS):
$\min_{w,v,b,d,e,r} w^T w + v^T v + \nu_1 \sum_i e_i^2 + \nu_2 \sum_i r_i^2 - \gamma \sum_i e_i r_i$ s.t. $e_i = w^T \varphi_1(x_i) + b$ and $r_i = v^T \varphi_2(y_i) + d, \ \forall i$
Also: partially linear models, spectral clustering, subspace algorithms, ...
LS-SVM classifier
Preserve the support vector machine methodology [Vapnik, 1995], but simplify via least squares and equality constraints [Suykens, 1999].
Primal problem:
$\min_{w,b,e} \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2$ such that $y_i [w^T \varphi(x_i) + b] = 1 - e_i, \ i = 1, \ldots, N$
Dual problem:
$\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_N \end{bmatrix}$
where $\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j)$ and $y = [y_1; \ldots; y_N]$.
LS-SVM classifiers perform very well on 20 UCI data sets [Van Gestel et al., Machine Learning 2004].
Winning results in a WCCI 2006 competition [Cawley, 2006].
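Training thus amounts to solving one symmetric linear system of size N+1. A small sketch, assuming an RBF kernel and placeholder parameter values, could look as follows.

```python
import numpy as np

def lssvm_train(X, y, gamma=10.0, sigma2=1.0):
    # Solve the LS-SVM classifier dual system:
    #   [ 0   y^T             ] [ b     ]   [ 0   ]
    #   [ y   Omega + I/gamma ] [ alpha ] = [ 1_N ]
    # with Omega_ij = y_i y_j K(x_i, x_j).
    N = X.shape[0]
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    Omega = np.outer(y, y) * np.exp(-d2 / sigma2)
    A = np.block([[np.zeros((1, 1)), y[None, :].astype(float)],
                  [y[:, None].astype(float), Omega + np.eye(N) / gamma]])
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(N))))
    return sol[1:], sol[0]                      # alpha, b

def lssvm_predict(xstar, X, y, alpha, b, sigma2=1.0):
    # Classify a new point with the dual expansion sign(sum_i alpha_i y_i K(x, x_i) + b).
    k = np.exp(-np.sum((X - xstar)**2, axis=1) / sigma2)
    return np.sign(np.dot(alpha * y, k) + b)
```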
Kernel PCA: primal and dual problem
(Figure: linear PCA versus kernel PCA with an RBF kernel.)
Primal problem [Suykens et al., 2003]:
$\min_{w,b,e} \frac{1}{2} w^T w - \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2$ such that $e_i = w^T \varphi(x_i) + b, \ i = 1, \ldots, N$.
Dual problem = kernel PCA [Schölkopf et al., 1998]:
$\Omega_c \alpha = \lambda \alpha$ with $\lambda = 1/\gamma$,
where $\Omega_{c,ij} = (\varphi(x_i) - \hat{\mu}_{\varphi})^T (\varphi(x_j) - \hat{\mu}_{\varphi})$ is the centered kernel matrix.
The underlying LS-SVM model allows making out-of-sample extensions.
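In the dual, kernel PCA is an eigen-decomposition of the centered kernel matrix. A brief sketch, assuming a Gaussian kernel and no additional eigenvector normalisation beyond what numpy returns:

```python
import numpy as np

def kernel_pca(X, n_components=2, sigma2=1.0):
    # Dual kernel PCA: eigenvectors of the centered kernel matrix Omega_c.
    N = X.shape[0]
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-d2 / sigma2)
    C = np.eye(N) - np.ones((N, N)) / N       # centering matrix
    Kc = C @ K @ C                            # Omega_c
    lam, alpha = np.linalg.eigh(Kc)           # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:n_components]
    scores = Kc @ alpha[:, idx]               # projections of the training points
    return lam[idx], alpha[:, idx], scores
```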
Core models + additional constraints
Monotonicity constraints [Pelckmans et al., 2005]:
$\min_{w,b,e} w^T w + \gamma \sum_{i=1}^{N} e_i^2$ s.t. $y_i = w^T \varphi(x_i) + b + e_i$ $(i = 1, \ldots, N)$ and $w^T \varphi(x_i) \le w^T \varphi(x_{i+1})$ $(i = 1, \ldots, N-1)$
Structure detection [Pelckmans et al., 2005; Tibshirani, 1996]:
$\min_{w,e,t} \rho \sum_{p=1}^{P} t_p + \sum_{p=1}^{P} w^{(p)T} w^{(p)} + \gamma \sum_{i=1}^{N} e_i^2$ s.t. $y_i = \sum_{p=1}^{P} w^{(p)T} \varphi^{(p)}(x_i^{(p)}) + e_i$ $(\forall i)$ and $-t_p \le w^{(p)T} \varphi^{(p)}(x_i^{(p)}) \le t_p$ $(\forall i, p)$
Autocorrelated errors [Espinoza et al., 2006]:
$\min_{w,b,r,e} w^T w + \gamma \sum_{i=1}^{N} r_i^2$ s.t. $y_i = w^T \varphi(x_i) + b + e_i$ $(i = 1, \ldots, N)$ and $e_i = \rho e_{i-1} + r_i$ $(i = 2, \ldots, N)$
Spectral clustering [Alzate & Suykens, 2006; Chung, 1997; Shi & Malik, 2000]:
$\min_{w,b,e} w^T w - \gamma e^T D^{-1} e$ s.t. $e_i = w^T \varphi(x_i) + b$ $(i = 1, \ldots, N)$
Dimensionality reduction and data visualization
- Traditionally: commonly used techniques include principal component analysis, multidimensional scaling, and self-organizing maps.
- More recently: isomap, locally linear embedding, Hessian locally linear embedding, diffusion maps, Laplacian eigenmaps ("kernel eigenmap methods" and manifold learning) [Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006].
- Relevant issues: learning and generalization [Cucker & Smale, 2002; Poggio et al., 2004]; model representations and out-of-sample extensions; convex versus non-convex problems and computational complexity [Smale, 1997].
- Kernel maps with a reference point (KMref) [Suykens, 2007]: data visualization and dimensionality reduction by solving a linear system.
(Figure: a given 3D data set and the corresponding 2D KMref result.)
A criterion related to locally linear embedding
Given a training data set $\{x_i\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^p$. Dimensionality reduction to $\{z_i\}_{i=1}^{N}$ with $z_i \in \mathbb{R}^d$ ($d = 2$ or $d = 3$).
Objective:
$\min_{z_i \in \mathbb{R}^d} -\frac{\gamma}{2} \sum_{i=1}^{N} \|z_i\|_2^2 + \frac{1}{2} \sum_{i=1}^{N} \|z_i - \sum_{j=1}^{N} s_{ij} z_j\|_2^2$
where e.g. $s_{ij} = \exp(-\|x_i - x_j\|_2^2 / \sigma^2)$.
The solution follows from the eigenvalue problem $R z = \gamma z$ with $z = [z_1; z_2; \ldots; z_N]$ and $R = (I - P)^T (I - P)$, where $P = [s_{ij} I_d]$.
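Because $P = [s_{ij} I_d]$ has Kronecker structure, the $Nd \times Nd$ eigenvalue problem decouples into an $N \times N$ problem per embedding coordinate. The sketch below forms $R$ and returns its eigen-decomposition; row-normalising the similarities so that they act as reconstruction weights is an added assumption, not something stated on the slide.

```python
import numpy as np

def lle_like_eigenproblem(X, sigma2=1.0):
    # R = (I - S)^T (I - S) with s_ij = exp(-||x_i - x_j||^2 / sigma2);
    # candidate embedding coordinates are eigenvectors of R (R z = gamma z).
    N = X.shape[0]
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    S = np.exp(-d2 / sigma2)
    np.fill_diagonal(S, 0.0)
    S = S / S.sum(axis=1, keepdims=True)      # assumption: rows normalised as weights
    R = (np.eye(N) - S).T @ (np.eye(N) - S)
    return np.linalg.eigh(R)                  # eigenvalues gamma, eigenvectors z
```

Which eigenvector to pick as a coordinate is exactly the selection issue raised on a later slide.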
Introducing a core model
Realize the nonlinear mapping $x \mapsto z$ through a least squares support vector machine regression:
$\min_{z, w_j, e_{i,j}} -\frac{\gamma}{2} z^T z + \frac{1}{2}(z - Pz)^T (z - Pz) + \frac{\nu}{2} \sum_{j=1}^{d} w_j^T w_j + \frac{\eta}{2} \sum_{i=1}^{N} \sum_{j=1}^{d} e_{i,j}^2$
such that $c_{i,j}^T z = w_j^T \varphi_j(x_i) + e_{i,j}$, $i = 1, \ldots, N$; $j = 1, \ldots, d$.
Primal model representation with evaluation at a point $x_* \in \mathbb{R}^p$: $\hat{z}_{*,j} = w_j^T \varphi_j(x_*)$, with $w_j \in \mathbb{R}^{n_{h_j}}$ and feature maps $\varphi_j(\cdot): \mathbb{R}^p \to \mathbb{R}^{n_{h_j}}$ $(j = 1, \ldots, d)$.
Kernel maps and eigenvalue problem
The solution follows from an eigenvalue problem; e.g. for $d = 2$:
$\left( R + V_1 (\tfrac{1}{\nu}\Omega_1 + \tfrac{1}{\eta} I)^{-1} V_1^T + V_2 (\tfrac{1}{\nu}\Omega_2 + \tfrac{1}{\eta} I)^{-1} V_2^T \right) z = \gamma z$
with kernel matrices $\Omega_1, \Omega_2$: $\Omega_{1,ij} = K_1(x_i, x_j) = \varphi_1(x_i)^T \varphi_1(x_j)$, $\Omega_{2,ij} = K_2(x_i, x_j) = \varphi_2(x_i)^T \varphi_2(x_j)$, and matrices $V_1 = [c_{1,1}\, c_{2,1} \ldots c_{N,1}]$, $V_2 = [c_{1,2}\, c_{2,2} \ldots c_{N,2}]$.
However, selecting the best solution from this pool of $2N$ candidates is not straightforward (the best solution is not necessarily given by the largest or smallest eigenvalue).
Kernel maps with reference point: problem statement
Kernel maps with a reference point:
- LS-SVM core part: realize the dimensionality reduction $x \mapsto z$
- reference point $q$ (e.g. the first point; sacrificed in the visualization)
Example, $d = 2$:
$\min_{z, w_1, w_2, b_1, b_2, e_{i,1}, e_{i,2}} -\frac{\gamma}{2} z^T z + \frac{1}{2}(z - P_D z)^T (z - P_D z) + \frac{\nu}{2}(w_1^T w_1 + w_2^T w_2) + \frac{\eta}{2} \sum_{i=1}^{N} (e_{i,1}^2 + e_{i,2}^2)$
such that
$c_{1,1}^T z = q_1 + e_{1,1}$
$c_{1,2}^T z = q_2 + e_{1,2}$
$c_{i,1}^T z = w_1^T \varphi_1(x_i) + b_1 + e_{i,1}$, $i = 2, \ldots, N$
$c_{i,2}^T z = w_2^T \varphi_2(x_i) + b_2 + e_{i,2}$, $i = 2, \ldots, N$
Coordinates in the low dimensional space: $z = [z_1; z_2; \ldots; z_N] \in \mathbb{R}^{dN}$.
Regularization term: $(z - P_D z)^T (z - P_D z) = \sum_{i=1}^{N} \|z_i - \sum_{j=1}^{N} s_{ij} D z_j\|_2^2$ with $D$ a diagonal matrix and $s_{ij} = \exp(-\|x_i - x_j\|_2^2 / \sigma^2)$.
Kernel maps with reference point: solution
The unique solution to the problem is given by the linear system
$\begin{bmatrix} U & -V_1 M_1^{-1} 1_{N-1} & -V_2 M_2^{-1} 1_{N-1} \\ 1_{N-1}^T M_1^{-1} V_1^T & -1_{N-1}^T M_1^{-1} 1_{N-1} & 0 \\ 1_{N-1}^T M_2^{-1} V_2^T & 0 & -1_{N-1}^T M_2^{-1} 1_{N-1} \end{bmatrix} \begin{bmatrix} z \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} \eta (q_1 c_{1,1} + q_2 c_{1,2}) \\ 0 \\ 0 \end{bmatrix}$
with matrices
$U = (I - P_D)^T (I - P_D) - \gamma I + V_1 M_1^{-1} V_1^T + V_2 M_2^{-1} V_2^T + \eta c_{1,1} c_{1,1}^T + \eta c_{1,2} c_{1,2}^T$
$M_1 = \frac{1}{\nu}\Omega_1 + \frac{1}{\eta} I$, $M_2 = \frac{1}{\nu}\Omega_2 + \frac{1}{\eta} I$
$V_1 = [c_{2,1} \ldots c_{N,1}]$, $V_2 = [c_{2,2} \ldots c_{N,2}]$
and kernel matrices $\Omega_1, \Omega_2 \in \mathbb{R}^{(N-1)\times(N-1)}$: $\Omega_{1,ij} = K_1(x_i, x_j) = \varphi_1(x_i)^T \varphi_1(x_j)$, $\Omega_{2,ij} = K_2(x_i, x_j) = \varphi_2(x_i)^T \varphi_2(x_j)$, for positive definite kernel functions $K_1(\cdot,\cdot)$, $K_2(\cdot,\cdot)$.
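The following sketch assembles and solves this $d = 2$ system with NumPy. Beyond what the slide states, it assumes that $c_{i,j}$ is the unit vector selecting coordinate $j$ of $z_i$ in the stacked vector $z$, that $P_D$ is the Kronecker product of the similarity matrix with $D$, Gaussian kernels for $K_1$ and $K_2$, and $D = I$ by default; all parameter values are placeholders.

```python
import numpy as np

def kmref_embed(X, q=(1.0, -1.0), sigma2=1.0, sig1=1.0, sig2=0.5,
                nu=1.0, eta=1.0, gamma=0.0, D=None):
    # Assumptions: c_{i,j} selects coordinate j of z_i in the stacked vector
    # z = [z_1; ...; z_N]; P_D = kron(S, D); Gaussian kernels for K_1, K_2.
    N, d = X.shape[0], 2
    D = np.eye(d) if D is None else D
    # similarities s_ij = exp(-||x_i - x_j||^2 / sigma2)
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    S = np.exp(-d2 / sigma2)
    PD = np.kron(S, D)                                   # blocks of P_D z: sum_j s_ij D z_j
    E = np.eye(N * d)
    c = lambda i, j: E[:, (i - 1) * d + (j - 1)]         # unit vector c_{i,j} (1-based indices)
    V1 = np.stack([c(i, 1) for i in range(2, N + 1)], axis=1)
    V2 = np.stack([c(i, 2) for i in range(2, N + 1)], axis=1)
    Xr = X[1:]                                           # x_2, ..., x_N (x_1 is the reference point)
    d2r = np.sum(Xr**2, 1)[:, None] + np.sum(Xr**2, 1)[None, :] - 2 * Xr @ Xr.T
    M1 = np.exp(-d2r / sig1) / nu + np.eye(N - 1) / eta  # M_1 = (1/nu) Omega_1 + (1/eta) I
    M2 = np.exp(-d2r / sig2) / nu + np.eye(N - 1) / eta
    one = np.ones(N - 1)
    I = np.eye(N * d)
    U = ((I - PD).T @ (I - PD) - gamma * I
         + V1 @ np.linalg.solve(M1, V1.T) + V2 @ np.linalg.solve(M2, V2.T)
         + eta * np.outer(c(1, 1), c(1, 1)) + eta * np.outer(c(1, 2), c(1, 2)))
    # block system in (z, b1, b2), obtained by eliminating alpha_1 and alpha_2
    A = np.zeros((N * d + 2, N * d + 2))
    A[:N*d, :N*d] = U
    A[:N*d, N*d] = -V1 @ np.linalg.solve(M1, one)
    A[:N*d, N*d + 1] = -V2 @ np.linalg.solve(M2, one)
    A[N*d, :N*d] = np.linalg.solve(M1, one) @ V1.T       # 1^T M1^{-1} V1^T (M1 symmetric)
    A[N*d, N*d] = -one @ np.linalg.solve(M1, one)
    A[N*d + 1, :N*d] = np.linalg.solve(M2, one) @ V2.T
    A[N*d + 1, N*d + 1] = -one @ np.linalg.solve(M2, one)
    rhs = np.zeros(N * d + 2)
    rhs[:N*d] = eta * (q[0] * c(1, 1) + q[1] * c(1, 2))
    sol = np.linalg.solve(A, rhs)
    return sol[:N*d].reshape(N, d), sol[N*d], sol[N*d + 1]   # z (N x 2), b1, b2
```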
Kernel maps with reference point: model representations
The primal and dual model representations allow making out-of-sample extensions. Evaluation at a point $x_* \in \mathbb{R}^p$:
$\hat{z}_{*,1} = w_1^T \varphi_1(x_*) + b_1 = \frac{1}{\nu} \sum_{i=2}^{N} \alpha_{i,1} K_1(x_i, x_*) + b_1$
$\hat{z}_{*,2} = w_2^T \varphi_2(x_*) + b_2 = \frac{1}{\nu} \sum_{i=2}^{N} \alpha_{i,2} K_2(x_i, x_*) + b_2$
Estimated coordinates for visualization: $\hat{z}_* = [\hat{z}_{*,1}; \hat{z}_{*,2}]$.
$\alpha_1, \alpha_2 \in \mathbb{R}^{N-1}$ are the unique solutions of the linear systems $M_1 \alpha_1 = V_1^T z - b_1 1_{N-1}$ and $M_2 \alpha_2 = V_2^T z - b_2 1_{N-1}$, with $\alpha_1 = [\alpha_{2,1}; \ldots; \alpha_{N,1}]$, $\alpha_2 = [\alpha_{2,2}; \ldots; \alpha_{N,2}]$ and $1_{N-1} = [1; 1; \ldots; 1]$.
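Continuing the sketch above, embedding a new point only requires recovering $\alpha_1, \alpha_2$ from the two $(N-1)$-dimensional systems and evaluating the kernel expansions; the Gaussian kernels and parameter values are again assumptions.

```python
import numpy as np

def kmref_out_of_sample(xstar, X, z, b1, b2, sig1=1.0, sig2=0.5, nu=1.0, eta=1.0):
    # z is the (N, 2) training embedding; x_1 is the reference point.
    N = X.shape[0]
    Xr = X[1:]                                               # x_2, ..., x_N
    d2 = np.sum(Xr**2, 1)[:, None] + np.sum(Xr**2, 1)[None, :] - 2 * Xr @ Xr.T
    M1 = np.exp(-d2 / sig1) / nu + np.eye(N - 1) / eta
    M2 = np.exp(-d2 / sig2) / nu + np.eye(N - 1) / eta
    alpha1 = np.linalg.solve(M1, z[1:, 0] - b1)              # M1 alpha1 = V1^T z - b1 1
    alpha2 = np.linalg.solve(M2, z[1:, 1] - b2)
    k1 = np.exp(-np.sum((Xr - xstar)**2, 1) / sig1)           # K_1(x_i, x_*)
    k2 = np.exp(-np.sum((Xr - xstar)**2, 1) / sig2)
    return np.array([alpha1 @ k1 / nu + b1, alpha2 @ k2 / nu + b2])
```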
Proof - Lagrangian
Only equality constraints are present: the optimal model representation and solution are obtained in a systematic and straightforward way.
Lagrangian:
$\mathcal{L}(z, w_1, w_2, b_1, b_2, e_{i,1}, e_{i,2}; \beta_{1,1}, \beta_{1,2}, \alpha_{i,1}, \alpha_{i,2}) = -\frac{\gamma}{2} z^T z + \frac{1}{2}(z - P_D z)^T (z - P_D z) + \frac{\nu}{2}(w_1^T w_1 + w_2^T w_2) + \frac{\eta}{2} \sum_{i=1}^{N}(e_{i,1}^2 + e_{i,2}^2) + \beta_{1,1}(c_{1,1}^T z - q_1 - e_{1,1}) + \beta_{1,2}(c_{1,2}^T z - q_2 - e_{1,2}) + \sum_{i=2}^{N} \alpha_{i,1}(c_{i,1}^T z - w_1^T \varphi_1(x_i) - b_1 - e_{i,1}) + \sum_{i=2}^{N} \alpha_{i,2}(c_{i,2}^T z - w_2^T \varphi_2(x_i) - b_2 - e_{i,2})$
Conditions for optimality [Fletcher, 1987]: $\partial \mathcal{L}/\partial z = 0$, $\partial \mathcal{L}/\partial w_1 = 0$, $\partial \mathcal{L}/\partial w_2 = 0$, $\partial \mathcal{L}/\partial b_1 = 0$, $\partial \mathcal{L}/\partial b_2 = 0$, $\partial \mathcal{L}/\partial e_{1,1} = 0$, $\partial \mathcal{L}/\partial e_{1,2} = 0$, $\partial \mathcal{L}/\partial e_{i,1} = 0$, $\partial \mathcal{L}/\partial e_{i,2} = 0$, $\partial \mathcal{L}/\partial \beta_{1,1} = 0$, $\partial \mathcal{L}/\partial \beta_{1,2} = 0$, $\partial \mathcal{L}/\partial \alpha_{i,1} = 0$, $\partial \mathcal{L}/\partial \alpha_{i,2} = 0$.
Proof - conditions for optimality
$\partial \mathcal{L}/\partial z = -\gamma z + (I - P_D)^T (I - P_D) z + \beta_{1,1} c_{1,1} + \beta_{1,2} c_{1,2} + \sum_{i=2}^{N} \alpha_{i,1} c_{i,1} + \sum_{i=2}^{N} \alpha_{i,2} c_{i,2} = 0$
$\partial \mathcal{L}/\partial w_1 = \nu w_1 - \sum_{i=2}^{N} \alpha_{i,1} \varphi_1(x_i) = 0$
$\partial \mathcal{L}/\partial w_2 = \nu w_2 - \sum_{i=2}^{N} \alpha_{i,2} \varphi_2(x_i) = 0$
$\partial \mathcal{L}/\partial b_1 = -\sum_{i=2}^{N} \alpha_{i,1} = 0$, i.e. $1_{N-1}^T \alpha_1 = 0$
$\partial \mathcal{L}/\partial b_2 = -\sum_{i=2}^{N} \alpha_{i,2} = 0$, i.e. $1_{N-1}^T \alpha_2 = 0$
$\partial \mathcal{L}/\partial e_{1,1} = \eta e_{1,1} - \beta_{1,1} = 0$
$\partial \mathcal{L}/\partial e_{1,2} = \eta e_{1,2} - \beta_{1,2} = 0$
$\partial \mathcal{L}/\partial e_{i,1} = \eta e_{i,1} - \alpha_{i,1} = 0, \ i = 2, \ldots, N$
$\partial \mathcal{L}/\partial e_{i,2} = \eta e_{i,2} - \alpha_{i,2} = 0, \ i = 2, \ldots, N$
$\partial \mathcal{L}/\partial \beta_{1,1} = c_{1,1}^T z - q_1 - e_{1,1} = 0$
$\partial \mathcal{L}/\partial \beta_{1,2} = c_{1,2}^T z - q_2 - e_{1,2} = 0$
$\partial \mathcal{L}/\partial \alpha_{i,1} = c_{i,1}^T z - w_1^T \varphi_1(x_i) - b_1 - e_{i,1} = 0, \ i = 2, \ldots, N$
$\partial \mathcal{L}/\partial \alpha_{i,2} = c_{i,2}^T z - w_2^T \varphi_2(x_i) - b_2 - e_{i,2} = 0, \ i = 2, \ldots, N$.
Proof - elimination step
Eliminate $w_1, w_2, e_{i,1}, e_{i,2}$; express everything in terms of the kernel functions; express the resulting set of equations in terms of $z, b_1, b_2, \alpha_1, \alpha_2$. One obtains
$-\gamma z + (I - P_D)^T (I - P_D) z + V_1 \alpha_1 + V_2 \alpha_2 + \eta c_{1,1} c_{1,1}^T z + \eta c_{1,2} c_{1,2}^T z = \eta (q_1 c_{1,1} + q_2 c_{1,2})$
and
$V_1^T z - \frac{1}{\nu} \Omega_1 \alpha_1 - \frac{1}{\eta} \alpha_1 - b_1 1_{N-1} = 0$
$V_2^T z - \frac{1}{\nu} \Omega_2 \alpha_2 - \frac{1}{\eta} \alpha_2 - b_2 1_{N-1} = 0$
$\beta_{1,1} = \eta (c_{1,1}^T z - q_1)$
$\beta_{1,2} = \eta (c_{1,2}^T z - q_2)$.
The dual model representation follows from the conditions for optimality.
Model selection by validation
Model selection criterion:
$\min_{\Theta} \sum_{i,j} \left( \frac{\hat{z}_i^T \hat{z}_j}{\|\hat{z}_i\|_2 \|\hat{z}_j\|_2} - \frac{x_i^T x_j}{\|x_i\|_2 \|x_j\|_2} \right)^2$
Tuning parameters $\Theta$:
- kernel tuning parameters in $s_{ij}$, $K_1$, $K_2$, ($K_3$)
- regularization constants $\nu$, $\eta$ (take $\gamma = 0$)
- choice of the diagonal matrix $D$
- choice of the reference point $q$, e.g. $q \in \{[+1;+1], [+1;-1], [-1;+1], [-1;-1]\}$
Stable results; finding a good range is satisfactory.
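The criterion compares cosine similarities between pairs of embedded validation points with those of the corresponding inputs. A compact sketch of evaluating it for one candidate parameter setting:

```python
import numpy as np

def kmref_validation_score(Z, X):
    # Sum over all pairs of squared differences between cosine similarities
    # in the embedding (Z) and in the input space (X).
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.sum((Zn @ Zn.T - Xn @ Xn.T) ** 2)
```

Grid-searching the tuning parameters listed above and keeping the setting with the smallest score reproduces the validation procedure described on the slide.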
KMref: spiral example
(Figure: 3D spiral data and the 2D KMref result; training data (blue *), validation data (magenta o), test data (red +).)
Model selection: $\min \sum_{i,j} \left( \frac{\hat{z}_i^T \hat{z}_j}{\|\hat{z}_i\|_2 \|\hat{z}_j\|_2} - \frac{x_i^T x_j}{\|x_i\|_2 \|x_j\|_2} \right)^2$
KMref: swiss roll example
(Figure: given 3D swiss roll data and the KMref result, a 2D projection.)
6 training data, 1 validation data
KMref: visualizing gene distributions
(Figure: KMref 3D projection of the Alon colon cancer microarray data set.)
Dimension of the input space: 62. Number of genes: 15 (training: 5, validation: 5, test: 5).
Model selection: $\sigma^2 = 10^4$, $\sigma_1^2 = 10^3$, $\sigma_2^2 = 0.5\,\sigma_1^2$, $\sigma_3^2 = 0.1\,\sigma_1^2$, $\eta = 1$, $\nu = 1$, $D = \mathrm{diag}\{1, 5, 1\}$, $q = [+1; -1; -1]$.
KMref: Santa Fe laser data
(Figure: the original time series $\{y_t\}_{t=1}^{T}$ over discrete time $k$, and the 3D KMref projection.)
Construct $y_t^{t-m} = [y_t; y_{t-1}; y_{t-2}; \ldots; y_{t-m}]$ with $m = 9$; the given data $\{y_t^{t-m}\}_{t=m+1}^{m+N_{tot}}$ lie in a $p = 10$ dimensional space.
2 validation data (first part), 7 training data points
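Building the delay-embedded vectors used in this example is a one-liner; the sketch below constructs $[y_t; y_{t-1}; \ldots; y_{t-m}]$ for every admissible $t$, with the newest-to-oldest ordering following the slide's definition.

```python
import numpy as np

def delay_embedding(y, m=9):
    # Rows are [y_t, y_{t-1}, ..., y_{t-m}]; with m = 9 each row lives in R^10.
    y = np.asarray(y)
    return np.stack([y[t - m:t + 1][::-1] for t in range(m, len(y))])
```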
Conclusions
- Trend: kernelizing classical methods (FDA, PCA, CCA, ICA, ...)
- Kernel methods: complementary views from (LS-)SVM, RKHS, GP
- Least squares support vector machines as core problems in supervised and unsupervised learning, and beyond
- LS-SVM provides a methodology for optimization modelling
- Kernel maps with a reference point: LS-SVM core part
- Computational complexity: similar to regression/classification
- Reference point: converts the eigenvalue problem into a linear system
Read more: Matlab demo file:
