Kernel methods for exploratory data analysis and community detection




Kernel methods for exploratory data analysis and community detection. Johan Suykens, KU Leuven, ESAT-SCD/SISTA, Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium. Email: johan.suykens@esat.kuleuven.be, http://www.esat.kuleuven.be/scd/ VUB Leerstoel 2012-2013, Oct. 24, 2012.

Overview: principal component analysis; kernel principal component analysis and LS-SVM, with sparse and robust extensions; kernel spectral clustering; community detection in complex networks; data visualization using kernel maps with a reference point.

Principal component analysis (1). Given a data cloud, potentially in a high dimensional input space: assume an ellipsoidal data cloud; search for direction(s) in the data of maximal variance.

Principal component analysis (2). Given data $\{x_i\}_{i=1}^N$ with $x_i \in \mathbb{R}^n$ (assumed zero mean), find projected variables $w^T x_i$ with maximal variance: $\max_w E\{(w^T x)^2\} = w^T E\{xx^T\} w = w^T C w$, with covariance matrix $C = E\{xx^T\}$ and $E\{\cdot\}$ the expected value. For $N$ given data points one has $C \simeq \frac{1}{N} \sum_{i=1}^N x_i x_i^T$. Problem: the optimal solution for $w$ in the above problem is unbounded. Therefore an additional constraint should be taken: a common choice is to impose $w^T w = 1$.

Principal component analysis (3). The problem formulation then becomes: $\max_w w^T C w$ subject to $w^T w = 1$. This constrained optimization problem is solved by taking the Lagrangian $\mathcal{L}(w; \lambda) = \frac{1}{2} w^T C w - \lambda (w^T w - 1)$ with Lagrange multiplier $\lambda$. Setting $\partial \mathcal{L}/\partial w = 0$ and $\partial \mathcal{L}/\partial \lambda = 0$ yields the eigenvalue problem $C w = \lambda w$ with $C = C^T$ (after absorbing a factor 2 into $\lambda$).
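
As a quick check of this derivation, a minimal numpy sketch (toy data and all names here are hypothetical, not from the lecture) that solves $Cw = \lambda w$ and verifies that $\lambda_{\max}$ equals the variance of the projected data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))  # toy data, N x n
X = X - X.mean(axis=0)                  # zero-mean, as assumed above

C = X.T @ X / X.shape[0]                # sample covariance (1/N) sum_i x_i x_i^T
lam, U = np.linalg.eigh(C)              # C symmetric: real eigenvalues, orthogonal U
w = U[:, np.argmax(lam)]                # principal direction, satisfying w^T w = 1
print(lam.max(), np.var(X @ w))         # lambda_max equals the projected variance
```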

Principal component analysis (4). [Figure: ellipsoidal data cloud in the $(x^{(1)}, x^{(2)})$ plane with principal axes $u_1$, $u_2$ of lengths $\lambda_1^{1/2}$, $\lambda_2^{1/2}$, centered at $\mu$.] Illustration of an eigenvalue decomposition $C u = \lambda u$: the eigenvalues $\lambda_i$ are real and positive; the eigenvectors $u_1$ and $u_2$ are orthogonal with respect to each other; the maximal variance solution is the direction $u_1$ corresponding to $\lambda_{\max} = \lambda_1$; note that $\mu = 0$ (the data should be made zero-mean beforehand).

Principal component analysis: dimensionality reduction (1). Aim: decreasing the dimensionality of the given input space by mapping vectors $x \in \mathbb{R}^n$ to $z \in \mathbb{R}^m$ with $m < n$. A point $x$ is mapped to $z$ in the lower dimensional space by $z^{(j)} = u_j^T x$, where the $u_j$ are the eigenvectors corresponding to the $m$ largest eigenvalues and $z = [z^{(1)}\, z^{(2)} \ldots z^{(m)}]^T$. The error resulting from the dimensionality reduction is characterized by the neglected eigenvalues, i.e. $\sum_{i=m+1}^{n} \lambda_i$.

Principal component analysis: dimensionality reduction (2). [Figure: eigenvalue spectrum, $\lambda_i$ versus $i$.] In this example, reducing the original 6-dimensional space to a 2-dimensional space is a good choice because the two largest eigenvalues $\lambda_1$ and $\lambda_2$ are much larger than the other ones.

Principal component analysis: reconstruction problem (1). Consider $x \in \mathbb{R}^n$ and $z \in \mathbb{R}^m$ with $m < n$ (dimensionality reduction). Encoder mapping: $z = G(x)$; decoder mapping: $\hat{x} = F(z)$. Objective: squared distortion error (reconstruction error) $\min E = \frac{1}{N} \sum_{i=1}^N \|x_i - \hat{x}_i\|_2^2 = \frac{1}{N} \sum_{i=1}^N \|x_i - F(G(x_i))\|_2^2$. Taking the mappings $F, G$ linear corresponds to linear PCA analysis.
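
A small sketch under the same assumptions (linear encoder and decoder built from the top-$m$ eigenvectors; the toy data is invented for illustration), verifying that the distortion equals the sum of the neglected eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 6)) @ rng.normal(size=(6, 6))  # toy data
X = X - X.mean(axis=0)

lam, U = np.linalg.eigh(X.T @ X / len(X))
lam, U = lam[::-1], U[:, ::-1]           # sort by decreasing eigenvalue

m = 2
G = lambda x: x @ U[:, :m]               # linear encoder  z = U_m^T x
F = lambda z: z @ U[:, :m].T             # linear decoder  x_hat = U_m z
E = np.mean(np.sum((X - F(G(X))) ** 2, axis=1))
print(E, lam[m:].sum())                  # distortion = sum of neglected eigenvalues
```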

Principal component analysis: reconstruction problem (2). [Figure: information bottleneck $x \to G(x) \to z \to F(z) \to \hat{x}$.]

Principal component analysis: denoising example. Images: $16 \times 15$ pixels ($n = 240$). Training on $N$ clean digits (the digit 9 is not contained in the training set). [Figure: test data; (bottom-left) denoised digit 9 after reconstruction using principal components; (bottom-right) using 8 principal components.]

Kernel principal component analysis. [Figure: toy data in the $(x^{(1)}, x^{(2)})$ plane; linear PCA versus kernel PCA (RBF kernel).] Kernel PCA [Schölkopf et al., 1998]: by eigenvalue decomposition of the kernel matrix $\begin{bmatrix} K(x_1,x_1) & \cdots & K(x_1,x_N) \\ \vdots & & \vdots \\ K(x_N,x_1) & \cdots & K(x_N,x_N) \end{bmatrix}$.

Kernel PCA: primal and dual problem. Underlying primal problem [Suykens et al., IEEE-TNN 2003]. Primal problem: $\min_{w,b,e} \frac{1}{2} w^T w \pm \frac{\gamma}{2} \sum_{i=1}^N e_i^2$ s.t. $e_i = w^T \varphi(x_i) + b$, $i = 1, \ldots, N$. (Lagrange) dual problem = kernel PCA: $\Omega_c \alpha = \lambda \alpha$ with $\lambda = 1/\gamma$, where $\Omega_{c,ij} = (\varphi(x_i) - \hat{\mu}_\varphi)^T (\varphi(x_j) - \hat{\mu}_\varphi)$ is the centered kernel matrix. Interpretation: 1. pool of candidate components (objective function equals zero); 2. select relevant components (components with high variance).
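
A minimal kernel PCA sketch along these lines, assuming an RBF kernel and illustrative parameter choices (not the lecture's settings): eigendecompose the centered kernel matrix $\Omega_c$ and project onto the leading components.

```python
import numpy as np

def rbf_kernel(X, Y, sigma2=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))            # toy data
N = len(X)

K = rbf_kernel(X, X)
Mc = np.eye(N) - np.ones((N, N)) / N     # centering in feature space
Omega_c = Mc @ K @ Mc                    # centered kernel matrix Omega_c
lam, alpha = np.linalg.eigh(Omega_c)
alpha, lam = alpha[:, ::-1], lam[::-1]   # components by decreasing lambda = 1/gamma

scores = Omega_c @ alpha[:, :2]          # projections onto the first two kernel PCs
```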

Kernel PCA: model representations. The model $M$ can be evaluated in primal or dual form: (P): $\hat{e}^* = w^T \varphi(x^*) + b$; (D): $\hat{e}^* = \sum_i \alpha_i K(x^*, x_i) + b$, which can be evaluated at any point $x^* \in \mathbb{R}^d$, where $K(x^*, x_i) = \varphi(x^*)^T \varphi(x_i)$ with $K(\cdot,\cdot)$ a positive definite kernel and feature map $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{n_h}$.

Generalizations to Kernel PCA: other loss functions. Consider a general loss function $L$: $\min_{w,b,e} \frac{1}{2} w^T w - \frac{\gamma}{2} \sum_{i=1}^N L(e_i)$ s.t. $e_i = w^T \varphi(x_i) + b$, $i = 1, \ldots, N$. This gives generalizations of KPCA that lead to robustness and sparseness, e.g. with the Vapnik $\epsilon$-insensitive loss or the Huber loss function [Alzate & Suykens, 2006]. Weighted least squares versions and incorporation of constraints: $\min_{w,b,e} \frac{1}{2} w^T w - \frac{\gamma}{2} \sum_{i=1}^N v_i e_i^2$ s.t. $e_i = w^T \varphi(x_i) + b$, $i = 1, \ldots, N$, and $\sum_{i=1}^N e_i e_i^{(1)} = 0, \ldots, \sum_{i=1}^N e_i e_i^{(l-1)} = 0$: find the $l$-th PC w.r.t. $l-1$ orthogonality constraints (previous PCs $e_i^{(j)}$). The solution is given by a generalized eigenvalue problem.

Robustness: Kernel Component Analysis. [Figure: original image; corrupted image; KPCA reconstruction; KCA reconstruction.] Weighted LS-SVM: robustness and sparsity [Alzate & Suykens, IEEE-TNN 2008].

Generalizations to Kernel PCA: sparseness. [Figure: top, denoising result; bottom, the different support vectors (in black) per principal component vector PC1, PC2, PC3.] Sparse kernel PCA using the $\epsilon$-insensitive loss [Alzate & Suykens, 2006].

Spectral graph clustering. Minimal cut: given the graph $G = (V, E)$, find clusters $A_1, A_2$: $\min_{q_i \in \{-1,+1\}} \frac{1}{2} \sum_{i,j} w_{ij} (q_i - q_j)^2$ with cluster membership indicator $q_i$ ($q_i = 1$ if $i \in A_1$, $q_i = -1$ if $i \in A_2$) and $W = [w_{ij}]$ the weighted adjacency matrix. [Figure: six-node toy graph showing a cut of size 1 (minimal cut) and a cut of size 2.]

Spectral graph clustering. Relaxation to the min-cut spectral clustering problem: $\min_{\tilde{q}^T \tilde{q} = 1} \tilde{q}^T L \tilde{q}$ with $L = D - W$ the unnormalized graph Laplacian, degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$, $d_i = \sum_j w_{ij}$, giving $L \tilde{q} = \lambda \tilde{q}$. Cluster membership indicators: $\hat{q}_i = \mathrm{sign}(\tilde{q}_i - \theta)$ with threshold $\theta$. Normalized cut: $L \tilde{q} = \lambda D \tilde{q}$ [Fiedler, 1973; Shi & Malik, 2000; Ng et al., 2002; Chung, 1997; von Luxburg, 2007]. Discrete version to continuous problem (Laplace operator) [Belkin & Niyogi, 2003; von Luxburg et al., 2008; Smale & Zhou, 2007].
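
A compact illustration of this relaxation on an invented toy graph: build $L = D - W$, take the eigenvector of the second smallest eigenvalue (the Fiedler vector), and threshold it.

```python
import numpy as np

# toy graph: two triangles joined by a single edge (cut of size 1)
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))
L = D - W                                 # unnormalized graph Laplacian

lam, Q = np.linalg.eigh(L)
q = Q[:, 1]                               # Fiedler vector (2nd smallest eigenvalue)
print(np.sign(q))                         # threshold theta = 0 separates the triangles
```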

Spectral clustering + K-means. [Figure.]

Kernel spectral clustering: case of two clusters. Underlying model (primal representation): $\hat{e}^* = w^T \varphi(x^*) + b$ with $\hat{q}^* = \mathrm{sign}[\hat{e}^*]$ the estimated cluster indicator at any $x^* \in \mathbb{R}^d$. Primal problem: training on given data $\{x_i\}_{i=1}^N$: $\min_{w,b,e} \frac{1}{2} w^T w - \frac{\gamma}{2} \sum_{i=1}^N v_i e_i^2$ subject to $e_i = w^T \varphi(x_i) + b$, $i = 1, \ldots, N$, with positive weights $v_i$ (which will be related to the inverse degree matrix). [Alzate & Suykens, IEEE-PAMI, 2010]

Lagrangian and conditions for optimality. Lagrangian: $\mathcal{L}(w, b, e; \alpha) = \frac{1}{2} w^T w - \frac{\gamma}{2} \sum_{i=1}^N v_i e_i^2 + \sum_{i=1}^N \alpha_i (e_i - w^T \varphi(x_i) - b)$. Conditions for optimality: $\partial \mathcal{L}/\partial w = 0 \Rightarrow w = \sum_i \alpha_i \varphi(x_i)$; $\partial \mathcal{L}/\partial b = 0 \Rightarrow \sum_i \alpha_i = 0$; $\partial \mathcal{L}/\partial e_i = 0 \Rightarrow \alpha_i = \gamma v_i e_i$, $i = 1, \ldots, N$; $\partial \mathcal{L}/\partial \alpha_i = 0 \Rightarrow e_i = w^T \varphi(x_i) + b$, $i = 1, \ldots, N$. Eliminate $w, b, e$ and write the solution in $\alpha$.

Kernel-based model representation. Dual problem: $V M_V \Omega \alpha = \lambda \alpha$ with $\lambda = 1/\gamma$, where $M_V = I_N - \frac{1}{1_N^T V 1_N} 1_N 1_N^T V$ is a weighted centering matrix and $\Omega = [\Omega_{ij}]$ is the kernel matrix with $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$. Dual model representation: $\hat{e}^* = \sum_{i=1}^N \alpha_i K(x_i, x^*) + b$ with $K(x_i, x^*) = \varphi(x_i)^T \varphi(x^*)$.

Choice of the weights $v_i$. Take $V = D^{-1}$ where $D = \mathrm{diag}\{d_1, \ldots, d_N\}$ and $d_i = \sum_{j=1}^N \Omega_{ij}$. This gives the generalized eigenvalue problem $M_D \Omega \alpha = \lambda D \alpha$ with $M_D = I_N - \frac{1}{1_N^T D^{-1} 1_N} 1_N 1_N^T D^{-1}$. This is a modified version of random walks spectral clustering. Note that $\mathrm{sign}[e_i] = \mathrm{sign}[\alpha_i]$ if $\gamma v_i > 0$ (on training data)... but $\mathrm{sign}[e^*]$ applies beyond the training data.
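
A sketch of this generalized eigenvalue problem on invented toy data (the RBF kernel and all parameters are assumptions; taking the largest eigenvalue as the informative one is a simplification that works for well-separated clusters):

```python
import numpy as np
from scipy.linalg import eig

def rbf_kernel(X, Y, sigma2=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),    # two well-separated toy clusters
               rng.normal(3, 0.3, (50, 2))])
N = len(X)

Omega = rbf_kernel(X, X)
d = Omega.sum(axis=1)                          # degrees d_i = sum_j Omega_ij
MD = np.eye(N) - np.outer(np.ones(N), 1 / d) / (1 / d).sum()

lam, A = eig(MD @ Omega, np.diag(d))           # M_D Omega alpha = lambda D alpha
alpha = A[:, np.argmax(lam.real)].real         # leading nontrivial eigenvector
labels = np.sign(alpha)                        # sign[alpha_i] = sign[e_i] on training data
```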

Kernel spectral clustering: more clusters. Case of $k$ clusters: additional sets of constraints: $\min_{w^{(l)}, e^{(l)}, b_l} \frac{1}{2} \sum_{l=1}^{k-1} w^{(l)T} w^{(l)} - \frac{1}{2} \sum_{l=1}^{k-1} \gamma_l\, e^{(l)T} D^{-1} e^{(l)}$ subject to $e^{(1)} = \Phi_{N n_h} w^{(1)} + b_1 1_N$, $e^{(2)} = \Phi_{N n_h} w^{(2)} + b_2 1_N$, ..., $e^{(k-1)} = \Phi_{N n_h} w^{(k-1)} + b_{k-1} 1_N$, where $e^{(l)} = [e_1^{(l)}; \ldots; e_N^{(l)}]$ and $\Phi_{N n_h} = [\varphi(x_1)^T; \ldots; \varphi(x_N)^T] \in \mathbb{R}^{N \times n_h}$. Dual problem: $M_D \Omega \alpha^{(l)} = \lambda D \alpha^{(l)}$, $l = 1, \ldots, k-1$. [Alzate & Suykens, IEEE-PAMI, 2010]

Primal and dual model representations. For $k$ clusters there are $k-1$ sets of constraints (index $l = 1, \ldots, k-1$); the model $M$ has the two forms (P): $\mathrm{sign}[\hat{e}^{(l)*}] = \mathrm{sign}[w^{(l)T} \varphi(x^*) + b_l]$ and (D): $\mathrm{sign}[\hat{e}^{(l)*}] = \mathrm{sign}[\sum_j \alpha_j^{(l)} K(x^*, x_j) + b_l]$. Note: additional sets of constraints also appear in multi-class and vector-valued output LS-SVMs [Suykens et al., 1999]. Advantages: out-of-sample extensions, model selection procedures, large scale methods.
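
A minimal sketch of the dual out-of-sample rule in (D); the function and array names are hypothetical:

```python
import numpy as np

def out_of_sample_codes(K_test_train, alphas, biases):
    """Dual out-of-sample scores e^(l)(x*) = sum_j alpha_j^(l) K(x*, x_j) + b_l.

    K_test_train: (N_test, N_train) kernel evaluations K(x*, x_j)
    alphas:       (N_train, k-1) dual eigenvectors alpha^(l)
    biases:       (k-1,) bias terms b_l
    Returns one (k-1)-dimensional sign codeword per test point.
    """
    E = K_test_train @ alphas + biases   # score variables, one column per l
    return np.sign(E)                    # cluster encoding via sign coding
```

Each test codeword is then compared (e.g. in Hamming distance) with the codewords found on the training data in order to assign a cluster.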

Out-of-sample extension and coding. [Figure: training data and out-of-sample cluster assignments in the $(x^{(1)}, x^{(2)})$ plane.]

Piecewise constant eigenvectors and extension (1). Definition [Meila & Shi, 2001]: a vector $\alpha$ is called piecewise constant relative to a partition $(A_1, \ldots, A_k)$ iff $\alpha_i = \alpha_j$ for all $x_i, x_j \in A_p$, $p = 1, \ldots, k$. Proposition [Alzate & Suykens, 2010]: assume (i) a training set $D = \{x_i\}_{i=1}^N$ and validation set $D^v = \{x_m^v\}_{m=1}^{N_v}$ i.i.d. sampled from the same underlying distribution; (ii) a set of $k$ clusters $\{A_1, \ldots, A_k\}$ with $k > 2$; (iii) an isotropic kernel function such that $K(x, z) = 0$ when $x$ and $z$ belong to different clusters; (iv) the eigenvectors $\alpha^{(l)}$ for $l = 1, \ldots, k-1$ are piecewise constant. Then validation set points belonging to the same cluster are collinear in the $(k-1)$-dimensional subspace spanned by the columns of $E^v \in \mathbb{R}^{N_v \times (k-1)}$, where $E^v_{ml} = e^{(l)}_m = \sum_{i=1}^N \alpha_i^{(l)} K(x_i, x_m^v) + b_l$.

Piecewise constant eigenvectors and extension (2). Key aspect of the proof: for $x^* \in A_p$ one has $e^{(l)*} = \sum_{i=1}^N \alpha_i^{(l)} K(x_i, x^*) + b^{(l)} = c_p^{(l)} \sum_{i \in A_p} K(x_i, x^*) + \sum_{i \notin A_p} \alpha_i^{(l)} K(x_i, x^*) + b^{(l)} \simeq c_p^{(l)} \sum_{i \in A_p} K(x_i, x^*) + b^{(l)}$. Model selection to determine the kernel parameters and $k$: look for line structures in the space $(e_i^{(1)}, e_i^{(2)}, \ldots, e_i^{(k-1)})$, evaluated on validation data (aiming for good generalization). Choice of kernel: Gaussian RBF kernel; $\chi^2$-kernel for images.

Model selection (looking for lines): toy problem. [Figure: validation-set projections $(e^{(1)}_{i,\mathrm{val}}, e^{(2)}_{i,\mathrm{val}})$ for $\sigma^2 = 0.5$, BLF = 0.56 and for $\sigma^2 = 0.6$, BLF = 1.0; validation set and train + validation + test data in the $(x^{(1)}, x^{(2)})$ plane.]

Model selection (looking for lines): toy problem 2. [Figure: validation-set projections for $\sigma^2 = 0.2$, BLF = 0.49 and for $\sigma^2 = 0.3$, BLF = 1.0; validation set and train + validation + test data in three dimensions.]

Example: image segmentation (looking for lines). [Figure: validation-set projections $(e^{(1)}_{i,\mathrm{val}}, e^{(2)}_{i,\mathrm{val}}, e^{(3)}_{i,\mathrm{val}})$ showing line structures.]

[Table: Berkeley segmentation examples. Columns: Image ID, Image, Proposed method, Nyström method, Human. Image IDs listed: 4586, 4249, 6762, 479, 9673, 6296, 982, 396, 29587, 3773.]

Example: power grid - identifying customer profiles (1). Power load: 245 substations, hourly data (5 years), $d = 43{,}824$. Periodic AR modelling: dimensionality reduction $43{,}824 \to 24$. K-means clustering applied after dimensionality reduction. [Figure: eight cluster-mean daily profiles, normalized load versus hour.]

Example: power grid - identifying customer profiles (2). Application of kernel spectral clustering, directly on $d = 43{,}824$. Model selection on the kernel parameter and the number of clusters [Alzate, Espinoza, De Moor, Suykens, 2009]. [Figure: seven cluster-mean daily profiles, normalized load versus hour.]

Example: power grid - identifying customer profiles (3). [Figure: three representative cluster profiles, normalized load versus hour.] Electricity load: 245 substations in the Belgian grid (1/2 train, 1/2 validation). $x_i \in \mathbb{R}^{43{,}824}$: spectral clustering on high dimensional data (5 years). Three of the seven detected clusters: 1: residential profile, morning and evening peaks; 2: business profile, peaked around noon; 3: industrial profile, increasing in the morning, oscillating in the afternoon and evening.

Kernel spectral clustering: sparse kernel models. [Figure: original image; binary clustering; sparse kernel model.] Incomplete Cholesky decomposition: $\|\Omega - G G^T\|_2 \leq \eta$ with $G \in \mathbb{R}^{N \times R}$ and $R \ll N$. Image (Berkeley image dataset): $321 \times 481$ (154,401 pixels), 75 SV. $e^{(l)*} = \sum_{i \in S_{SV}} \alpha_i^{(l)} K(x_i, x^*) + b_l$.
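
A sketch of the pivoted incomplete Cholesky step (a standard greedy variant; the tolerance and names are illustrative, not the lecture's implementation). The factor $G$ can then replace $\Omega$ in the eigenvalue problem, so that only a reduced set of points acts as support vectors:

```python
import numpy as np

def incomplete_cholesky(Omega, eta=1e-3, max_rank=None):
    """Pivoted incomplete Cholesky: Omega ~= G G^T with R << N columns."""
    N = Omega.shape[0]
    R = max_rank or N
    G = np.zeros((N, R))
    diag = Omega.diagonal().astype(float).copy()  # diagonal of the residual
    for r in range(R):
        if diag.sum() < eta:              # residual trace below tolerance eta
            return G[:, :r]
        i = int(np.argmax(diag))          # greedy pivot selection
        G[:, r] = (Omega[:, i] - G[:, :r] @ G[i, :r]) / np.sqrt(diag[i])
        diag -= G[:, r] ** 2
    return G
```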

Highly sparse kernel models on images. Application on images: $x_i \in \mathbb{R}^3$ (r, g, b values per pixel), $i = 1, \ldots, N$, pre-processed into $z_i \in \mathbb{R}^8$ (quantization to 8 colors). $\chi^2$-kernel to compare two local color histograms ($5 \times 5$ pixel window). $N > 100{,}000$: select a subset of size $M \ll N$ based on the quadratic Renyi entropy, as in the fixed-size method [Suykens et al., 2002]. Highly sparse representations: #SV = $3k$. Completion of the cluster indicators based on out-of-sample extensions $\mathrm{sign}[\hat{e}^{(l)*}] = \mathrm{sign}[\sum_{j \in S_{SV}} \alpha_j^{(l)} K(x^*, x_j) + b_l]$ applied to the full image. [Alzate & Suykens, Neurocomputing 2011]
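
A sketch of the entropy-based subset selection (a simple random-swap variant of the fixed-size idea; the kernel width, subset size, and iteration count are assumed, and the usual bandwidth convolution constant of the Parzen estimate is dropped):

```python
import numpy as np

def renyi_entropy(S, sigma2=1.0):
    """Quadratic Renyi entropy estimate -log((1/M^2) sum_ij K(s_i, s_j))."""
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    return -np.log(np.exp(-d2 / sigma2).mean())

def fixed_size_subset(X, M=50, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=M, replace=False)
    H = renyi_entropy(X[idx])
    for _ in range(iters):
        trial = idx.copy()
        trial[rng.integers(M)] = rng.integers(len(X))  # propose a single swap
        if len(np.unique(trial)) < M:                  # reject duplicate entries
            continue
        H_new = renyi_entropy(X[trial])
        if H_new > H:                                  # keep entropy-increasing swaps
            idx, H = trial, H_new
    return idx
```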

Highly sparse kernel models: toy example. [Figure: toy data in the $(x^{(1)}, x^{(2)})$ plane and projections $(e^{(1)}_i, e^{(2)}_i)$; only $3k = 9$ support vectors.]

Highly sparse kernel models: toy example 2. [Figure: toy data, out-of-sample cluster boundaries, and projections $(\hat{e}^{(1)}_i, \hat{e}^{(2)}_i, \hat{e}^{(3)}_i)$; only $3k = 12$ support vectors.]

Highly sparse kernel models: image segmentation. [Figure: segmented image and projections $(e^{(1)}_i, e^{(2)}_i, e^{(3)}_i)$; only $3k = 12$ support vectors.]

Hierarchical kernel spectral clustering: looking at different scales; use of model selection and validation data [Alzate & Suykens, Neural Networks, 2012]. [Figure.]

Kernel spectral clustering: adding prior knowledge. For a pair of points $x_a$, $x_b$: $c = 1$ for a must-link, $c = -1$ for a cannot-link constraint. Primal problem [Alzate & Suykens, IJCNN 2009]: $\min_{w^{(l)}, e^{(l)}, b_l} \frac{1}{2} \sum_{l=1}^{k-1} w^{(l)T} w^{(l)} - \frac{1}{2} \sum_{l=1}^{k-1} \gamma_l\, e^{(l)T} D^{-1} e^{(l)}$ subject to $e^{(l)} = \Phi_{N n_h} w^{(l)} + b_l 1_N$ and $w^{(l)T} \varphi(x_a) = c\, w^{(l)T} \varphi(x_b)$ for $l = 1, \ldots, k-1$. Dual problem: yields a rank-one downdate of the kernel matrix.

Adding prior knowledge. [Figure: original image; segmentation without constraints.]

Adding prior knowledge. [Figure: original image; segmentation with constraints.]

Semi-supervised learning. $N$ unlabeled data, with additional labels on $M - N$ data: $X = \{x_1, \ldots, x_N, x_{N+1}, \ldots, x_M\}$. Binary classification by using a binary spectral clustering core model [Alzate & Suykens, WCCI 2012]: $\min_{w,e,b} \frac{1}{2} w^T w - \frac{\gamma}{2} e^T D^{-1} e + \frac{\rho}{2} \sum_{m=N+1}^{M} (e_m - y_m)^2$ subject to $e_i = w^T \varphi(x_i) + b$, $i = 1, \ldots, M$. The dual solution is characterized by a linear system. Other approaches in semi-supervised learning, e.g. [Belkin et al., 2006].

Modularity and community detection. Modularity for the two-group case [Newman, 2006]: $Q = \frac{1}{4m} \sum_{i,j} \left(A_{ij} - \frac{d_i d_j}{2m}\right) q_i q_j$ with $A$ the adjacency matrix, $d_i$ the degree of node $i$, $m = \frac{1}{2} \sum_i d_i$, $q_i = 1$ if node $i$ belongs to group 1 and $q_i = -1$ for group 2. Use of modularity within kernel spectral clustering [Langone et al., 2012]: used at the level of model validation; finding a representative subgraph using a fixed-size method by maximizing the expansion factor $\frac{|N(G)|}{|G|}$ with a subgraph $G$ and its neighborhood $N(G)$ [Maiya, 2010]; definition of data in unweighted networks: $x_i = A(:, i)$; use of a community kernel function [Kang, 2009].
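
The two-group modularity is straightforward to compute; a minimal sketch with an invented toy graph:

```python
import numpy as np

def modularity(A, q):
    """Two-group modularity Q = (1/4m) sum_ij (A_ij - d_i d_j / 2m) q_i q_j."""
    d = A.sum(axis=1)
    m = d.sum() / 2.0                     # m = (1/2) sum_i d_i
    B = A - np.outer(d, d) / (2.0 * m)    # modularity matrix
    return float(q @ B @ q) / (4.0 * m)

# toy check: two triangles joined by one edge, split into its two triangles
A = np.array([[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]], float)
print(modularity(A, np.array([1, 1, 1, -1, -1, -1])))  # positive: a good split
```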

Protein interaction network (1). [Figure: Pajek network visualization.] Yeast interaction network: 2114 nodes, 4480 edges [Barabási et al., 2001].

Protein interaction network (2). Yeast interaction network: 2114 nodes, 4480 edges [Barabási et al., 2001]. KSC community detection on a representative subgraph [Langone et al., 2012]: 7 detected clusters. [Figure: modularity versus number of clusters.]

Power grid network (1). [Figure: Pajek network visualization.] Western USA power grid: 4941 nodes, 6594 edges [Watts & Strogatz, 1998].

Power grid network (2). Western USA power grid: 4941 nodes, 6594 edges [Watts & Strogatz, 1998]. KSC community detection on a representative subgraph [Langone et al., 2012]: 6 detected clusters. [Figure: modularity versus number of clusters.]

Evolving networks. Binary clustering case: adding a memory effect [Langone et al., 2012]: $\min_{w,e,b} \frac{1}{2} w^T w - \frac{\gamma}{2} e^T D^{-1} e - \nu\, w^T w_{\mathrm{old}}$ subject to $e_i = w^T \varphi(x_i) + b$, $i = 1, \ldots, N$, with $w_{\mathrm{old}}$ the previous result in time. This aims at including temporal smoothness; a smoothed modularity criterion.

Dimensionality reduction and data visualization. Traditionally, commonly used techniques are e.g. principal component analysis (PCA), multi-dimensional scaling (MDS), and self-organizing maps (SOM). More recently: isomap, locally linear embedding (LLE), Hessian locally linear embedding, diffusion maps, Laplacian eigenmaps ("kernel eigenmap methods" and manifold learning) [Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]. Kernel maps with a reference point [Suykens, IEEE-TNN 2008]: data visualization and dimensionality reduction by solving a linear system.

Kernel maps with reference point: formulation. Kernel maps with a reference point [Suykens, IEEE-TNN 2008]: LS-SVM core part realizing the dimensionality reduction $x \mapsto z$; regularization term $(z - P_D z)^T (z - P_D z) = \sum_{i=1}^N \| z_i - \sum_{j=1}^N s_{ij} D z_j \|_2^2$ with $D$ a diagonal matrix and $s_{ij} = \exp(-\|x_i - x_j\|_2^2 / \sigma^2)$; reference point $q$ (e.g. the first point; sacrificed in the visualization). Example for $d = 2$: $\min_{z, w_1, w_2, b_1, b_2, e_{i,1}, e_{i,2}} \frac{1}{2} (z - P_D z)^T (z - P_D z) + \frac{\nu}{2} (w_1^T w_1 + w_2^T w_2) + \frac{\eta}{2} \sum_{i=1}^N (e_{i,1}^2 + e_{i,2}^2)$ such that $c_{1,1}^T z = q_1 + e_{1,1}$, $c_{1,2}^T z = q_2 + e_{1,2}$, $c_{i,1}^T z = w_1^T \varphi_1(x_i) + b_1 + e_{i,1}$ and $c_{i,2}^T z = w_2^T \varphi_2(x_i) + b_2 + e_{i,2}$ for $i = 2, \ldots, N$. Coordinates in the low dimensional space: $z = [z_1; z_2; \ldots; z_N] \in \mathbb{R}^{dN}$.

Model selection by validation. Model selection criterion: $\min_\Theta \sum_{i,j} \left( \frac{\hat{z}_i^T \hat{z}_j}{\|\hat{z}_i\|_2 \|\hat{z}_j\|_2} - \frac{x_i^T x_j}{\|x_i\|_2 \|x_j\|_2} \right)^2$. Tuning parameters $\Theta$: kernel tuning parameters in $s_{ij}$, $K_1$, $K_2$, ($K_3$); regularization constants $\nu$, $\eta$; choice of the diagonal matrix $D$; choice of the reference point $q$, e.g. $q \in \{[+1;+1], [+1;-1], [-1;+1], [-1;-1]\}$. Stable results; finding a good range is satisfactory.
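
A sketch of this criterion (names hypothetical); in practice it is evaluated on validation data over a grid of tuning parameters $\Theta$ and the minimizer is retained:

```python
import numpy as np

def kmref_criterion(Z_hat, X):
    """Sum over pairs (i, j) of squared differences between the cosine
    similarities of the embedded points z_hat and of the inputs x."""
    Zn = Z_hat / np.linalg.norm(Z_hat, axis=1, keepdims=True)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return float(((Zn @ Zn.T - Xn @ Xn.T) ** 2).sum())
```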

Kernel maps: spiral example. [Figure: spiral data and embeddings for reference points $q = [+1; -1]$ and $q = [-1; -1]$; training data (blue), validation data (magenta o), test data (red +).] Model selection: $\min \sum_{i,j} \left( \frac{\hat{z}_i^T \hat{z}_j}{\|\hat{z}_i\|_2 \|\hat{z}_j\|_2} - \frac{x_i^T x_j}{\|x_i\|_2 \|x_j\|_2} \right)^2$.

Kernel maps: swiss roll example. [Figure: given 3D swiss roll data; kernel map result in 2D.] Matlab demo: http://www.esat.kuleuven.be/sista/lssvmlab/kmref/demoswisskmref.m

Kernel maps: visualizing gene distribution. [Figure: 3D projections.] Alon colon cancer microarray data set. Dimension of the input space: 62. Number of genes: 1500 (training: 500, validation: 500, test: 500).

Kernel maps: time-series data visualization. Santa Fe laser data. [Figure: time series and 3D embedding.] Data windows $\{[y_k, y_{k-1}, \ldots, y_{k-9}]\}_k$: train, validation, and test sets. Tuning parameters (kernel & regularization) based on the validation set. The model is able to make out-of-sample extensions.

Conclusions. From PCA to KPCA: an LS-SVM model framework with primal-dual setting; out-of-sample extensions, model selection procedures, and large scale methods. From spectral clustering to kernel spectral clustering: applications in complex networks. Data visualization problems: learning and generalization; a reference point converts the eigenvalue problem into a linear system.

Acknowledgements (1). Colleagues at ESAT-SCD (especially the research units: systems, models, control; biomedical data processing; bioinformatics): C. Alzate, A. Argyriou, J. De Brabanter, K. De Brabanter, L. De Lathauwer, B. De Moor, M. Diehl, Ph. Dreesen, M. Espinoza, T. Falck, D. Geebelen, X. Huang, B. Hunyadi, A. Installe, V. Jumutc, P. Karsmakers, R. Langone, J. Lopez, J. Luts, R. Mall, S. Mehrkanoon, M. Moonen, Y. Moreau, K. Pelckmans, J. Puertas, L. Shi, M. Signoretto, P. Tsiaflakis, V. Van Belle, R. Van de Plas, S. Van Huffel, J. Vandewalle, T. van Waterschoot, C. Varon, S. Yu, and others. Topics of this lecture: Carlos Alzate and Rocco Langone. Support from ERC AdG A-DATADRIVE-B, KU Leuven, GOA-MaNet, COE Optimization in Engineering OPTEC, IUAP DYSCO, FWO projects, IWT, IBBT eHealth, COST.

Acknowledgements (2).

Thank you.