Kernel methods for complex networks and big data
Johan Suykens
KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium
Email: johan.suykens@esat.kuleuven.be
http://www.esat.kuleuven.be/stadius/
STATLEARN 2015, Grenoble
Introduction and motivation
Data world
[figure: example network with nodes grouped into communities C1, C2, C3, C4]
Challenges
- data-driven general methodology
- scalability
- need for new mathematical frameworks
Sparsity
Big data requires sparsity.
Different paradigms
- SVM & kernel methods
- convex optimization?
- sparsity & compressed sensing
Sparsity through regularization or loss function
Sparsity: through regularization or loss function
Through regularization: model $\hat{y} = w^T x + b$,
$$\min_w \sum_j |w_j| + \gamma \sum_i e_i^2 \quad \Rightarrow \quad \text{sparse } w$$
Through loss function: model $\hat{y} = \sum_i \alpha_i K(x, x_i) + b$,
$$\min \; w^T w + \gamma \sum_i L(e_i) \quad \Rightarrow \quad \text{sparse } \alpha$$
with $L(\cdot)$ e.g. the $\varepsilon$-insensitive loss (zero on $[-\varepsilon, +\varepsilon]$).
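To make the two routes concrete, a minimal sketch (not from the talk) using scikit-learn on synthetic data: $l_1$ regularization produces a sparse primal weight vector $w$, while the $\varepsilon$-insensitive loss of SVR produces a sparse dual vector $\alpha$ (few support vectors). Data, parameters and library choice are illustrative assumptions.

```python
# Illustrative sketch: two routes to sparsity on synthetic regression data.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.standard_normal(200)

# Route 1: l1 regularization on the primal weights -> sparse w
lasso = Lasso(alpha=0.1).fit(X, y)
print("nonzero w:", np.sum(lasso.coef_ != 0), "of", X.shape[1])

# Route 2: epsilon-insensitive loss in the dual -> sparse alpha
svr = SVR(kernel="rbf", C=1.0, epsilon=0.2).fit(X, y)
print("support vectors:", len(svr.support_), "of", X.shape[0])
```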
Function estimation in RKHS
Find a function $f$ such that
$$\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \|f\|_K^2$$
with $L(\cdot,\cdot)$ the loss function and $\|f\|_K$ the norm in the RKHS $\mathcal{H}$ defined by the kernel $K$.
Representer theorem: for a convex loss function, the solution is of the form
$$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i).$$
Reproducing property: $f(x) = \langle f, K_x \rangle_K$ with $K_x(\cdot) = K(x, \cdot)$.
[Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]
Learning with primal and dual model representations
Learning models from data: alternative views
Consider a model $\hat{y} = f(x; w)$, given input/output data $\{(x_i, y_i)\}_{i=1}^N$:
$$\min_w \; w^T w + \gamma \sum_{i=1}^{N} (y_i - f(x_i; w))^2.$$
Rewrite the problem as
$$\min_{w,e} \; w^T w + \gamma \sum_{i=1}^{N} e_i^2 \quad \text{subject to} \quad e_i = y_i - f(x_i; w), \; i = 1, \dots, N.$$
Construct the Lagrangian, take the conditions for optimality, and express the solution and the model in terms of the Lagrange multipliers $\alpha_i$.
For a model $f(x; w) = \sum_{j=1}^{h} w_j \varphi_j(x) = w^T \varphi(x)$ one then obtains
$$\hat{f}(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) \quad \text{with} \quad K(x, x_i) = \varphi(x)^T \varphi(x_i).$$
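A minimal numerical sketch of the dual route, assuming $f(x; w) = w^T \varphi(x)$ (no bias) and an RBF kernel: the optimality conditions give $w = \sum_i \alpha_i \varphi(x_i)$ and reduce, up to constant rescalings of $\alpha$ absorbed here, to the linear system $(\Omega + I/\gamma)\alpha = y$ with $\Omega_{ij} = K(x_i, x_j)$. All data and parameters are illustrative.

```python
# Sketch: solve the dual linear system and evaluate f_hat(x) = sum_i alpha_i K(x, x_i).
import numpy as np

def rbf(A, B, sigma2=1.0):
    # RBF kernel matrix between the rows of A and the rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)

gamma = 10.0
Omega = rbf(X, X)
alpha = np.linalg.solve(Omega + np.eye(100) / gamma, y)  # dual variables

Xt = np.linspace(-3, 3, 50)[:, None]
yhat = rbf(Xt, X) @ alpha                                # dual model evaluation
```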
Least Squares Support Vector Machines: core models
Regression:
$$\min_{w,b,e} \; w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \; \forall i$$
Classification:
$$\min_{w,b,e} \; w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i (w^T \varphi(x_i) + b) = 1 - e_i, \; \forall i$$
Kernel PCA ($V = I$), kernel spectral clustering ($V = D^{-1}$):
$$\min_{w,b,e} \; w^T w - \gamma \sum_i v_i e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; \forall i$$
Kernel canonical correlation analysis / partial least squares:
$$\min_{w,v,b,d,e,r} \; w^T w + v^T v + \nu \sum_i (e_i - r_i)^2 \quad \text{s.t.} \quad \begin{cases} e_i = w^T \varphi^{(1)}(x_i) + b \\ r_i = v^T \varphi^{(2)}(y_i) + d \end{cases}$$
[Suykens & Vandewalle, 1999; Suykens et al., 2002; Alzate & Suykens, 2010]
Probability and quantum mechanics
Kernel pmf estimation:
Primal: $\min_{w,p} \frac{1}{2} \langle w, w \rangle$ subject to $p_i = \langle w, \varphi(x_i) \rangle$, $i = 1, \dots, N$, and $\sum_{i=1}^N p_i = 1$
Dual: $p_i = \dfrac{\sum_{j=1}^{N} K(x_j, x_i)}{\sum_{i=1}^{N} \sum_{j=1}^{N} K(x_j, x_i)}$
Quantum measurement: state vector $\psi$, measurement operators $M_i$:
Primal: $\min_{w,p} \frac{1}{2} \langle w | w \rangle$ subject to $p_i = \mathrm{Re}(\langle w | M_i | \psi \rangle)$, $i = 1, \dots, N$, and $\sum_{i=1}^N p_i = 1$
Dual: $p_i = \langle \psi | M_i | \psi \rangle$ (Born rule, orthogonal projective measurement)
[Suykens, Physical Review A, 2013]
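The kernel pmf dual above is just a normalized kernel column sum; a tiny sketch with an assumed RBF kernel and synthetic data:

```python
# Sketch of the kernel pmf dual expression: p_i = sum_j K(x_j, x_i) / sum_ij K(x_j, x_i).
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)

p = K.sum(axis=0) / K.sum()   # normalized kernel column sums
assert np.isclose(p.sum(), 1.0)
```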
SVMs: living in two worlds
Primal space (parametric): $\hat{y} = \mathrm{sign}[w^T \varphi(x) + b]$, with feature map $\varphi(\cdot)$ mapping the input space to the feature space.
Dual space (nonparametric): $\hat{y} = \mathrm{sign}[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b]$, with $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ (Mercer theorem, the "kernel trick").
[figure: the same classifier viewed as a parametric network in the features $\varphi_1(x), \dots, \varphi_{n_h}(x)$ with weights $w_1, \dots, w_{n_h}$, and as a nonparametric expansion in the kernels $K(x, x_1), \dots, K(x, x_{\#sv})$ with coefficients $\alpha_1, \dots, \alpha_{\#sv}$]
Linear model: solving in primal or dual?
Inputs $x \in \mathbb{R}^d$, output $y \in \mathbb{R}$, training set $\{(x_i, y_i)\}_{i=1}^N$.
Model:
$(P): \; \hat{y} = w^T x + b$, with $w \in \mathbb{R}^d$
$(D): \; \hat{y} = \sum_i \alpha_i x_i^T x + b$, with $\alpha \in \mathbb{R}^N$
Few inputs, many data points ($d \ll N$): solve in the primal, $w \in \mathbb{R}^d$; the dual has $\alpha \in \mathbb{R}^N$ and a large $N \times N$ kernel matrix.
Many inputs, few data points ($d \gg N$): solve in the dual, $\alpha \in \mathbb{R}^N$ with a small $N \times N$ kernel matrix.
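A hedged sketch of this trade-off for linear ridge regression (bias omitted to keep both systems minimal; data and $\lambda$ are illustrative): the primal solves a $d \times d$ system, the dual an $N \times N$ system, and the predictions coincide.

```python
# Primal vs dual solution of the same linear ridge-regression model.
import numpy as np

rng = np.random.default_rng(2)
N, d, lam = 200, 5, 1.0
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.05 * rng.standard_normal(N)

# Primal: w = (X^T X + lam I)^{-1} X^T y   -- cheap when d << N
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual: alpha = (X X^T + lam I)^{-1} y     -- cheap when N << d
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)

x_new = rng.standard_normal(d)
print(w @ x_new, alpha @ (X @ x_new))      # identical up to numerical error
```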
From linear to nonlinear model: feature map and kernel
Model:
$(P): \; \hat{y} = w^T \varphi(x) + b$
$(D): \; \hat{y} = \sum_i \alpha_i K(x_i, x) + b$
Mercer theorem: $K(x, z) = \varphi(x)^T \varphi(z)$
Feature map $\varphi(x) = [\varphi_1(x); \varphi_2(x); \dots; \varphi_h(x)]$
Kernel function $K(x, z)$ (e.g. linear, polynomial, RBF, ...)
Use of feature map and positive definite kernel [Cortes & Vapnik, 1995]
Optimization in infinite dimensional spaces for LS-SVM [Signoretto, De Lathauwer, Suykens, 2011]
Sparsity by fixed-size kernel method
Fixed-size method: steps
1. selection of a subset from the data
2. kernel matrix on the subset
3. eigenvalue decomposition of the kernel matrix
4. approximation of the feature map based on the eigenvectors (Nyström approximation)
5. estimation of the model in the primal using the approximate feature map (applicable to large data sets)
[Suykens et al., 2002] (LS-SVM book)
Selection of subset:
- random
- quadratic Rényi entropy
Subset selection using quadratic Rényi entropy: example
[figure: two-dimensional toy data; subsets selected by random sampling versus quadratic Rényi entropy based selection]
Fixed-size method: using quadratic Rényi entropy
Algorithm:
1. Given N training data.
2. Choose a working set of fixed size M (i.e. M support vectors), typically M ≪ N.
3. Randomly select a support vector $x^*$ from the working set of M support vectors.
4. Randomly select a point $x_t$ from the training data and swap it with $x^*$. If the entropy increases by taking $x_t$ instead of $x^*$, accept $x_t$ into the working set of M SVs; otherwise reject $x_t$ (return it to the training data pool) and keep $x^*$ in the working set.
5. Calculate the entropy value for the present working set. The quadratic Rényi entropy equals
$$H_R = -\log \frac{1}{M^2} \sum_{i,j} \Omega^{(M,M)}_{ij}.$$
[Suykens et al., 2002]
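A sketch of this swap-based selection loop, under illustrative assumptions (RBF kernel, fixed bandwidth, fixed iteration budget); it is not the authors' exact code:

```python
# Entropy-maximizing subset selection by random swaps.
import numpy as np

def renyi_entropy(S, sigma2=1.0):
    # H_R = -log( (1/M^2) sum_ij K(x_i, x_j) ) for an RBF kernel on subset S
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    return -np.log(np.exp(-d2 / (2 * sigma2)).mean())

rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 2))
M = 50
idx = rng.choice(len(X), M, replace=False)       # initial working set

for _ in range(2000):
    i = rng.integers(M)                          # SV candidate to replace
    j = rng.integers(len(X))                     # candidate replacement
    if j in idx:
        continue
    trial = idx.copy(); trial[i] = j
    if renyi_entropy(X[trial]) > renyi_entropy(X[idx]):
        idx = trial                              # accept: entropy increased

subset = X[idx]
```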
Nyström method
Big kernel matrix: $\Omega^{(N,N)} \in \mathbb{R}^{N \times N}$; small kernel matrix: $\Omega^{(M,M)} \in \mathbb{R}^{M \times M}$ (on the subset).
Eigenvalue decompositions: $\Omega^{(N,N)} \tilde{U} = \tilde{U} \tilde{\Lambda}$ and $\Omega^{(M,M)} U = U \Lambda$.
Relation to the eigenvalues and eigenfunctions of the integral equation
$$\int K(x, x') \phi_i(x) \, p(x) \, dx = \lambda_i \phi_i(x')$$
with
$$\hat{\lambda}_i = \frac{1}{M} \lambda_i, \qquad \hat{\phi}_i(x_k) = \sqrt{M} \, u_{ki}, \qquad \hat{\phi}_i(x') = \frac{\sqrt{M}}{\lambda_i} \sum_{k=1}^{M} u_{ki} K(x_k, x').$$
[Williams & Seeger, 2001] (Nyström method in GP)
Fixed-size method: estimation in primal
For the feature map $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^h$, obtain an approximation $\hat{\varphi}(\cdot): \mathbb{R}^d \to \mathbb{R}^M$ based on the eigenvalue decomposition of the kernel matrix, with
$$\hat{\varphi}_i(x') = \sqrt{\hat{\lambda}_i} \, \hat{\phi}_i(x')$$
(on a subset of size M ≪ N). Estimate in the primal:
$$\min_{w,b} \; \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} (y_i - w^T \hat{\varphi}(x_i) - b)^2.$$
A sparse representation is obtained: $w \in \mathbb{R}^M$ with M ≪ N and M ≪ h.
[Suykens et al., 2002; De Brabanter et al., CSDA 2010]
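An end-to-end sketch of steps 2 to 5, assuming an RBF kernel and, for brevity, a random subset and no bias term; combining the formulas above, the Nyström feature is $\hat{\varphi}_i(x') = \frac{1}{\sqrt{\lambda_i}} \sum_k u_{ki} K(x_k, x')$:

```python
# Nystrom feature map on a subset of size M, then ridge regression in the primal.
import numpy as np

def rbf(A, B, sigma2=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma2))

rng = np.random.default_rng(4)
N, M, gamma = 2000, 100, 10.0
X = rng.uniform(-3, 3, (N, 1))
y = np.sinc(X[:, 0]) + 0.1 * rng.standard_normal(N)

sub = X[rng.choice(N, M, replace=False)]       # step 1: subset
lam, U = np.linalg.eigh(rbf(sub, sub))         # steps 2-3: small eigenproblem
lam = np.clip(lam, 1e-10, None)                # guard against tiny eigenvalues

def phi_hat(Z):
    # step 4: Nystrom features, column i scaled by 1/sqrt(lam_i)
    return rbf(Z, sub) @ U / np.sqrt(lam)

P = phi_hat(X)                                 # step 5: primal ridge solve
w = np.linalg.solve(P.T @ P + np.eye(M) / gamma, P.T @ y)
yhat = P @ w
```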
Fixed-size method: performance in classification

                  pid         spa         mgt         adu          ftc
N                 768         4601        19020       45222        581012
N_cv              512         3068        13000       33000        531012
N_test            256         1533        6020        12222        50000
d                 8           57          11          14           54
FS-LSSVM (# SV)   150         200         1000        500          500
C-SVM (# SV)      290         800         7000        11085        185000
ν-SVM (# SV)      331         1525        7252        12205        165205
RBF FS-LSSVM      76.7(3.43)  92.5(0.67)  86.6(0.51)  85.21(0.21)  81.8(0.52)
Lin FS-LSSVM      77.6(0.78)  90.9(0.75)  77.8(0.23)  83.9(0.17)   75.61(0.35)
RBF C-SVM         75.1(3.31)  92.6(0.76)  85.6(1.46)  84.81(0.20)  81.5(no cv)
Lin C-SVM         76.1(1.76)  91.9(0.82)  77.3(0.53)  83.5(0.28)   75.24(no cv)
RBF ν-SVM         75.8(3.34)  88.7(0.73)  84.2(1.42)  83.9(0.23)   81.6(no cv)
Maj. Rule         64.8(1.46)  60.6(0.58)  65.8(0.28)  83.4(0.1)    51.23(0.20)

Fixed-size (FS) LS-SVM: good performance and sparsity wrt C-SVM and ν-SVM.
It remains challenging to achieve high performance with very sparse models.
[De Brabanter et al., CSDA 2010]
Two stages of sparsity
- Stage 1 (primal): fixed-size model estimation, based on subset selection and the Nyström approximation.
- Stage 2 (dual): further sparsification by reweighted $l_1$.
Synergy between parametric and kernel-based models.
[Mall & Suykens, IEEE-TNNLS, in press]; reweighted $l_1$ [Candes et al., 2008]
Other possible approaches with improved sparsity: SCAD [Fan & Li, 2001]; coefficient-based $l_q$ ($0 < q \leq 1$) [Shi et al., 2013]; two-level $l_1$ [Huang et al., 2014]
Big data: representative subsets using k-nn graphs (1) Convert the large scale dataset into a sparse undirected k-nn graph using a distributed network generation framework Julia language (http://julialang.org/) Large N N kernel matrix Ω for data set D with N data points Batch cluster-based approach: a batch subset D p D is loaded per node with P p=1d p = D; related matrix slice X p and Ω p. MapReduce and AllReduce settings implementable using Hadoop or Spark (see also [Agarwal et al., JMLR 2014] Terascale linear learning) Computational complexity: complexity for construction of the kernel matrix reduced from O(N 2 ) to O(N 2 (1 + log N)/P) for P nodes [Mall, Jumutc, Langone, Suykens, IEEE Bigdata 2014] Kernel methods for complex networks and big data - Johan Suykens 22
Big data: representative subsets using k-NN graphs (2)
Map: compute the mean and standard deviation of the data per node (for the bandwidth via Silverman's rule of thumb); compute the slice $\Omega_p$; sort the columns of $\Omega_p$ in ascending order (sortperm in Julia); pick the indices of the top k values.
Reduce: merge the k-NN subgraphs into the aggregated k-NN graph.
[Mall, Jumutc, Langone, Suykens, IEEE Bigdata 2014]
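A hedged single-machine illustration of the map step (the reference implementation is in Julia; numpy's argsort plays the role of sortperm, and the bandwidth, batch split and merge are simplified assumptions):

```python
# Per-batch kernel slice and top-k neighbor extraction, then a naive "reduce".
import numpy as np

def map_knn(batch, data, k, sigma2):
    # kernel slice Omega_p between this node's batch and the full data
    d2 = ((batch[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    Omega_p = np.exp(-d2 / (2 * sigma2))
    order = np.argsort(Omega_p, axis=1)   # ascending, as with sortperm
    return order[:, -(k + 1):-1]          # top-k indices, excluding the self-match

rng = np.random.default_rng(5)
data = rng.standard_normal((1000, 3))
P, k = 4, 10
batches = np.array_split(data, P)
# "reduce": stack the per-node neighbor lists into one k-NN edge table
edges = np.vstack([map_knn(b, data, k, 1.0) for b in batches])
```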
Big data: representative subsets using k-NN graphs (3)
Steps:
1. MapReduce for representative subsets using k-NN graphs
2. Subset selection using FURS [Mall et al., SNAM 2013] (a deterministic method which aims for good modelling of dense regions)
3. Use in fixed-size LS-SVM [Mall & Suykens, IEEE-TNNLS, in press]

Generalization (test) error ± standard deviation:

Dataset  N        SR           SRE          FURS k=10    FURS k=100   FURS k=500
4G       28,500   0.395±0.235  0.898±0.375  0.252±0.023  0.331±0.063  0.414±0.070
5G       81,000   0.264±0.225  0.298±0.142  0.358±0.296  0.082±0.016  0.086±0.009
Magic    19,020   25.32±3.913  28.44±4.446  33.07±2.076  31.05±4.143  36.13±3.062
Shuttle  58,000   2.437±1.104  2.330±0.958  4.223±0.998  1.482±0.600  1.980±0.715
Skin     245,057  0.578±0.387  0.254±0.078  3.277±1.133  0.494±0.078  0.772±0.080

SR: stratified random sampling; SRE: stratified Rényi entropy; FURS k=10: FURS based on k-NN graphs with k = 10.
[Mall, Jumutc, Langone, Suykens, IEEE Bigdata 2014]
Kernel-based models for spectral clustering
Spectral graph clustering (1)
Minimal cut: given the graph G = (V, E), find clusters $A_1, A_2$:
$$\min_{q_i \in \{-1,+1\}} \; \frac{1}{2} \sum_{i,j} w_{ij} (q_i - q_j)^2$$
with cluster membership indicator $q_i$ ($q_i = 1$ if $i \in A_1$, $q_i = -1$ if $i \in A_2$) and $W = [w_{ij}]$ the weighted adjacency matrix.
[figure: 6-node example graph showing a cut of size 1 (the minimal cut) and a cut of size 2]
Spectral graph clustering (2)
Min-cut spectral clustering problem:
$$\min_{\tilde{q}^T \tilde{q} = 1} \; \tilde{q}^T L \tilde{q}$$
with $L = D - W$ the unnormalized graph Laplacian and degree matrix $D = \mathrm{diag}(d_1, \dots, d_N)$, $d_i = \sum_j w_{ij}$, giving $L \tilde{q} = \lambda \tilde{q}$.
Cluster membership indicators: $\hat{q}_i = \mathrm{sign}(\tilde{q}_i - \theta)$ with threshold $\theta$.
Normalized cut: $L \tilde{q} = \lambda D \tilde{q}$.
[Fiedler, 1973; Shi & Malik, 2000; Ng et al., 2002; Chung, 1997; von Luxburg, 2007]
From the discrete version to a continuous problem (Laplace operator): [Belkin & Niyogi, 2003; von Luxburg et al., 2008; Smale & Zhou, 2007]
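A minimal sketch of spectral bipartitioning with the unnormalized Laplacian, on an assumed toy graph: threshold the Fiedler vector (the eigenvector of the second-smallest eigenvalue) at $\theta = 0$.

```python
# Spectral bipartition via the Fiedler vector of L = D - W.
import numpy as np

# two triangles, nodes 0-2 and 3-5, joined by one bridge edge (2,3)
W = np.zeros((6, 6))
for i, j in [(0,1),(0,2),(1,2),(3,4),(3,5),(4,5),(2,3)]:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))
L = D - W                                 # unnormalized graph Laplacian
vals, vecs = np.linalg.eigh(L)            # eigenvalues in ascending order
fiedler = vecs[:, 1]                      # second-smallest eigenvalue
labels = np.sign(fiedler)                 # threshold theta = 0
print(labels)                             # splits {0,1,2} from {3,4,5}
```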
Kernel spectral clustering (KSC): case of two clusters
Primal problem: training on given data $\{x_i\}_{i=1}^N$:
$$\min_{w,b,e} \; \frac{1}{2} w^T w - \gamma \frac{1}{2} e^T V e \quad \text{subject to} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N$$
with weighting matrix $V$ and $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^h$ the feature map.
Dual:
$$V M_V \Omega \alpha = \lambda \alpha$$
with $\lambda = 1/\gamma$, weighted centering matrix $M_V = I_N - \frac{1}{1_N^T V 1_N} 1_N 1_N^T V$, and kernel matrix $\Omega = [\Omega_{ij}]$ with $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$.
Taking $V = D^{-1}$ with degree matrix $D = \mathrm{diag}\{d_1, \dots, d_N\}$ and $d_i = \sum_{j=1}^N \Omega_{ij}$ relates to the random walks algorithm.
[Alzate & Suykens, IEEE-PAMI, 2010]
Kernel spectral clustering: more clusters
Case of k clusters: additional sets of constraints:
$$\min_{w^{(l)}, e^{(l)}, b_l} \; \frac{1}{2} \sum_{l=1}^{k-1} w^{(l)T} w^{(l)} - \frac{1}{2} \sum_{l=1}^{k-1} \gamma_l \, e^{(l)T} D^{-1} e^{(l)}$$
subject to
$$e^{(l)} = \Phi_{N \times n_h} w^{(l)} + b_l 1_N, \quad l = 1, \dots, k-1$$
where $e^{(l)} = [e_1^{(l)}; \dots; e_N^{(l)}]$ and $\Phi_{N \times n_h} = [\varphi(x_1)^T; \dots; \varphi(x_N)^T] \in \mathbb{R}^{N \times n_h}$.
Dual problem:
$$M_D \Omega \alpha^{(l)} = \lambda D \alpha^{(l)}, \quad l = 1, \dots, k-1.$$
[Alzate & Suykens, IEEE-PAMI, 2010]
Primal and dual model representations
k clusters, k − 1 sets of constraints (index l = 1, ..., k − 1). Model:
$(P): \; \mathrm{sign}[\hat{e}^{(l)}_*] = \mathrm{sign}[w^{(l)T} \varphi(x_*) + b_l]$
$(D): \; \mathrm{sign}[\hat{e}^{(l)}_*] = \mathrm{sign}[\sum_j \alpha_j^{(l)} K(x_*, x_j) + b_l]$
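A hedged numerical sketch of binary KSC with $V = D^{-1}$ (random-walk weighting), not the authors' implementation: solve the dual eigenvalue problem, take the leading eigenvector, and assign new points by the sign of the score $e(x)$. The centering choice of $b$ and the use of the leading eigenvector are assumptions of this sketch.

```python
# Binary KSC: train on X, then out-of-sample assignment via the dual model.
import numpy as np

def rbf(A, B, sigma2=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma2))

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 0.4, (60, 2)), rng.normal(2, 0.4, (60, 2))])
N = len(X)

Omega = rbf(X, X)
d_inv = 1.0 / Omega.sum(axis=1)                        # V = D^{-1}
M_D = np.eye(N) - np.outer(np.ones(N), d_inv) / d_inv.sum()

vals, vecs = np.linalg.eig(np.diag(d_inv) @ M_D @ Omega)
alpha = np.real(vecs[:, np.argmax(np.real(vals))])     # leading eigenvector
b = -(d_inv @ (Omega @ alpha)) / d_inv.sum()           # weighted centering of e

def assign(Z):
    # out-of-sample extension: sign[sum_j alpha_j K(x, x_j) + b]
    return np.sign(rbf(Z, X) @ alpha + b)

print(assign(np.array([[-2.0, 0.0], [2.0, 0.0]])))     # opposite signs expected
```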
Advantages of kernel-based setting
- model-based approach
- out-of-sample extensions: applying the model to new data
- consider training, validation and test data (the training problem corresponds to an eigenvalue decomposition problem)
- model selection procedures
- sparse representations and large scale methods
Out-of-sample extension and coding
[figure: two-dimensional toy example of out-of-sample cluster assignment]
Model selection: toy example
[figure: validation-set projections $e^{(1)}_{i,\mathrm{val}}$ versus $e^{(2)}_{i,\mathrm{val}}$ and the resulting clusterings on train + validation + test data; $\sigma^2 = 0.5$ gives BLF = 0.56 (bad), $\sigma^2 = 0.16$ gives BLF = 1.0 (good)]
Example: image segmentation
[figure: image segmentation example with validation-set projections $(e^{(1)}_{i,\mathrm{val}}, e^{(2)}_{i,\mathrm{val}}, e^{(3)}_{i,\mathrm{val}})$]
Kernel spectral clustering: sparse kernel models
[figure: original image, binary clustering, and sparse kernel model result]
- Incomplete Cholesky decomposition: $\Omega \approx G G^T$ with $G \in \mathbb{R}^{N \times R}$ and $R \ll N$
- Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV
- Time complexity $O(R^2 N^2)$ in [Alzate & Suykens, 2008]
- Time complexity $O(R^2 N)$ in [Novak, Alzate, Langone, Suykens, 2014]
Incomplete Cholesky decomposition and reduced set
For the KSC problem $M_D \Omega \alpha = \lambda D \alpha$, solve the approximation
$$U^T M_D U \Lambda^2 \zeta = \lambda \zeta$$
from $\Omega \approx G G^T$, the singular value decomposition $G = U \Lambda V^T$, and $\zeta = U^T \alpha$. A smaller matrix of size $R \times R$ is obtained instead of $N \times N$. The pivots are used as the subset $\{\tilde{x}_j\}$ of the data.
Reduced set method [Scholkopf et al., 1999]: approximation of $w = \sum_{i=1}^N \alpha_i \varphi(x_i)$ by $\tilde{w} = \sum_{j=1}^M \beta_j \varphi(\tilde{x}_j)$ in the sense
$$\min_{\beta} \; \|w - \tilde{w}\|_2^2 + \nu \sum_j |\beta_j|.$$
Sparser solutions are obtained by adding the $l_1$ penalty, reweighted $l_1$, or group Lasso.
[Alzate & Suykens, 2008, 2011; Mall & Suykens, 2014]
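A sketch of the plain reduced-set step (without the $l_1$ penalty): expanding $\|w - \tilde{w}\|^2$ in kernels gives the linear system $K_{mm} \beta = K_{mn} \alpha$. The random subset below is an assumption standing in for the incomplete-Cholesky pivots used in the slides.

```python
# Reduced-set approximation: fit beta so that w_tilde approximates w.
import numpy as np

def rbf(A, B, sigma2=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma2))

rng = np.random.default_rng(7)
N, M = 500, 40
X = rng.standard_normal((N, 2))
alpha = rng.standard_normal(N)                 # dual coefficients of w

Xt = X[rng.choice(N, M, replace=False)]        # reduced set {x_tilde_j}
K_mm = rbf(Xt, Xt)
K_mn = rbf(Xt, X)
beta = np.linalg.solve(K_mm + 1e-8 * np.eye(M), K_mn @ alpha)
# w_tilde = sum_j beta_j phi(x_tilde_j) now approximates w with M << N terms
```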
Big data: representative subsets using k-NN graphs
Steps:
1. MapReduce for representative subsets using k-NN graphs
2. Subset selection using FURS [Mall et al., SNAM 2013] (a deterministic method which aims for good modelling of dense regions)
3. Use in kernel spectral clustering (KSC)

Quality metrics: Davies-Bouldin (DB) and silhouette (SIL):

Dataset        Points     M     Random (DB / SIL)          Rényi entropy (DB / SIL)   FURS (DB / SIL)
4G             28,500     285   0.804±0.167 / 0.667±0.059  2.909±2.820 / 0.594±0.148  0.616 / 0.765
5G             81,000     810   2.011±0.998 / 0.649±0.049  0.885±0.276 / 0.601±0.054  0.808 / 0.764
Batch          13,910     139   4.287±0.764 / 0.503±0.066  3.969±0.783 / 0.460±0.134  1.868 / 0.669
House          34,112     342   0.612±0.154 / 0.679±0.073  0.507±0.028 / 0.579±0.006  0.407 / 0.751
Mopsi Finland  13,467     135   0.897±0.935 / 0.824±0.085  0.526±0.223 / 0.886±0.010  0.568 / 0.920
KDDCupBio      145,751    1458  DB: 2.932±1.205            DB: 4.233±1.82             DB: 3.883
Power          2,049,280  2,050 DB: 2.558±1.374            DB: 2.0619±0.760           DB: 1.921

[Mall, Jumutc, Langone, Suykens, IEEE Bigdata 2014]
Semi-supervised learning using KSC (1)
N unlabeled data, with additional labels on M − N data points:
$$X = \{x_1, \dots, x_N, x_{N+1}, \dots, x_M\}$$
Kernel spectral clustering as core model (binary case [Alzate & Suykens, WCCI 2012]; multi-way/multi-class [Mehrkanoon et al., TNNLS, in press]):
$$\min_{w,e,b} \; \frac{1}{2} w^T w - \gamma \frac{1}{2} e^T D^{-1} e + \rho \frac{1}{2} \sum_{m=N+1}^{M} (e_m - y_m)^2$$
subject to $e_i = w^T \varphi(x_i) + b$, $i = 1, \dots, M$.
The dual solution is characterized by a linear system. Suitable for clustering as well as classification.
Other approaches in semi-supervised learning and manifold learning: e.g. [Belkin et al., 2006]
Semi-supervised learning using KSC (2)

Dataset    size    n_L/n_U      test (%)     FS semi-KSC  RD semi-KSC  Lap-SVMp
Spambase   4597    368/736      919 (20%)    0.885±0.01   0.883±0.01   0.880±0.03
Satimage   6435    1030/1030    1287 (20%)   0.864±0.006  0.831±0.009  0.834±0.007
Ring       7400    592/592      1480 (20%)   0.975±0.005  0.974±0.005  0.972±0.006
Magic      19020   761/1522     3804 (20%)   0.836±0.006  0.829±0.006  0.827±0.005
Cod-rna    331152  1325/1325    66230 (20%)  0.957±0.006  0.947±0.008  0.951±0.001
Covertype  581012  2760/2760    29050 (5%)   0.715±0.005  0.684±0.008  0.697±0.001
                   2760/27600                0.729±0.04   0.709±0.05
                   2760/82800                0.739±0.04   0.716±0.03
                   2760/138000               0.742±0.05   0.723±0.06

FS semi-KSC: fixed-size semi-supervised KSC; RD semi-KSC: other subset selection, related to [Lee & Mangasarian, 2001]; Lap-SVM: Laplacian support vector machine [Belkin et al., 2006]
[Mehrkanoon & Suykens, 2014]
Semi-supervised learning using KSC (3)
[figure: original image; KSC segmentation; semi-supervised KSC segmentation given a few labels]
[Mehrkanoon, Alzate, Mall, Langone, Suykens, IEEE-TNNLS, in press]
Fixed-size kernel methods for complex networks
Modularity and community detection
Modularity for the two-group case [Newman, 2006]:
$$Q = \frac{1}{4m} \sum_{i,j} \left( A_{ij} - \frac{d_i d_j}{2m} \right) q_i q_j$$
with $A$ the adjacency matrix, $d_i$ the degree of node i, $m = \frac{1}{2} \sum_i d_i$, $q_i = 1$ if node i belongs to group 1 and $q_i = -1$ for group 2.
Use of modularity within kernel spectral clustering [Langone et al., 2012]:
- use modularity at the level of model validation
- find a representative subgraph using a fixed-size method, by maximizing the expansion factor $|N(G)|/|G|$ with a subgraph $G$ and its neighborhood $N(G)$ [Maiya, 2010]
- definition of the data in unweighted networks: $x_i = A(:, i)$; use of a community kernel function [Kang, 2009]
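A short sketch evaluating the two-group modularity $Q$ for a given split, directly from the formula above; the toy graph is an illustrative assumption:

```python
# Two-group modularity Q of a given +/-1 partition.
import numpy as np

def modularity(A, q):
    # A: symmetric 0/1 adjacency matrix; q: +/-1 group indicator vector
    d = A.sum(axis=1)
    m = d.sum() / 2.0
    B = A - np.outer(d, d) / (2.0 * m)    # modularity matrix
    return (q @ B @ q) / (4.0 * m)

# two triangles joined by one edge, split along the bridge
A = np.zeros((6, 6))
for i, j in [(0,1),(0,2),(1,2),(3,4),(3,5),(4,5),(2,3)]:
    A[i, j] = A[j, i] = 1.0
q = np.array([1, 1, 1, -1, -1, -1])
print(modularity(A, q))   # positive: better than a random split
```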
Protein interaction network (1)
[figure: Pajek visualization of the yeast protein interaction network: 2114 nodes, 4480 edges [Barabasi et al., 2001]]
Protein interaction network (2)
- Yeast interaction network: 2114 nodes, 4480 edges [Barabasi et al., 2001]
- KSC community detection on a representative subgraph [Langone et al., 2012]: 7 detected clusters
[figure: modularity as a function of the number of clusters]
Power grid network (1)
[figure: Pajek visualization of the Western USA power grid: 4941 nodes, 6594 edges [Watts & Strogatz, 1998]]
Power grid network (2)
- Western USA power grid: 4941 nodes, 6594 edges [Watts & Strogatz, 1998]
- KSC community detection on a representative subgraph [Langone et al., 2012]: 16 detected clusters
[figure: modularity as a function of the number of clusters]
KSC for big data networks (1)
- YouTube network: YouTube social network where users form friendships with each other and can create groups which other users can join.
- RoadCA network: a road network of California; intersections and endpoints are represented by nodes, and the roads connecting them are represented as edges.
- LiveJournal network: a free online social network where users are bloggers who declare friendships among themselves.

Dataset      Nodes      Edges
YouTube      1,134,890  2,987,624
roadCA       1,965,206  5,533,214
LiveJournal  3,997,962  34,681,189

[Mall, Langone, Suykens, Entropy, special issue on Big data, 2013]
BAF-KSC: KSC for big data networks (2)
- Select a representative training subgraph using FURS (Fast and Unique Representative Subset selection [Mall et al., 2013]: selects nodes with high degree centrality belonging to different dense regions, using a deactivation and activation procedure).
- Perform model selection using BAF (Balanced Angular Fit: makes use of a cosine similarity measure related to the e-projection values on validation nodes).
- Train the KSC model by solving a small eigenvalue problem of size min(0.15N, 5000).
- Apply the out-of-sample extension to find the cluster memberships of the remaining nodes.
[Mall, Langone, Suykens, Entropy, special issue on Big data, 2013]
KSC for big data networks (3)
Methods compared: BAF-KSC [Mall, Langone, Suykens, 2013]; Louvain [Blondel et al., 2008]; Infomap [Lancichinetti & Fortunato, 2009]; CNM [Clauset, Newman, Moore, 2004].
Cl = number of clusters, Q = modularity, Con = conductance:

            BAF-KSC             Louvain              Infomap              CNM
Dataset     Cl  Q     Con       Cl    Q     Con      Cl    Q     Con     Cl   Q      Con
Openflight  5   0.533 0.002     109   0.61  0.02     18    0.58  0.005   84   0.60   0.016
PGPnet      8   0.58  0.002     105   0.88  0.045    84    0.87  0.03    193  0.85   0.041
Metabolic   10  0.22  0.028     10    0.43  0.03     41    0.41  0.05    11   0.42   0.021
HepTh       6   0.45  0.0004    172   0.65  0.004    171   0.3   0.004   6    0.423  0.0004
HepPh       5   0.56  0.0004    82    0.72  0.007    69    0.62  0.06    6    0.48   0.0007
Enron       10  0.4   0.002     1272  0.62  0.05     1099  0.37  0.27    6    0.25   0.0045
Epinion     10  0.22  0.0003    33    0.006 0.0003   17    0.18  0.0002  10   0.14   0.0
Condmat     6   0.28  0.0002    1030  0.79  0.03     1086  0.79  0.025   8    0.38   0.0003

Networks: flight network (Openflights), trust-based network (PGPnet), biological network (Metabolic), citation networks (HepTh, HepPh), communication network (Enron), review-based network (Epinion), collaboration network (Condmat) [snap.stanford.edu].
BAF-KSC usually finds a smaller number of clusters and achieves lower conductance.
Multilevel hierarchical KSC for complex networks (1)
Generate a series of affinity matrices over different levels: communities at level h become nodes for the next level h + 1.
[figure: multilevel aggregation of a network into coarser levels]
Multilevel hierarchical KSC for complex networks (2)
Start at ground level 0 and compute cosine-based distances on the validation nodes:
$$S^{(0)}_{\mathrm{val}}(i, j) = 1 - \frac{e_i^T e_j}{\|e_i\| \, \|e_j\|}$$
Select the projections $e_j$ satisfying $S^{(0)}_{\mathrm{val}}(i, j) < t^{(0)}$ with $t^{(0)}$ a threshold value. Keep these nodes as the first cluster at level 0 and remove them from the matrix $S^{(0)}_{\mathrm{val}}$ to obtain a reduced matrix.
Iterative procedure over the different levels h: communities at level h become nodes for the next level h + 1, by creating a set of matrices at different levels of the hierarchy:
$$S^{(h)}_{\mathrm{val}}(i, j) = \frac{1}{|C_i^{(h-1)}| \, |C_j^{(h-1)}|} \sum_{m \in C_i^{(h-1)}} \sum_{l \in C_j^{(h-1)}} S^{(h-1)}_{\mathrm{val}}(m, l)$$
and working with suitable threshold values $t^{(h)}$ for each level.
[Mall, Langone, Suykens, PLOS ONE, 2014]
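A hedged sketch of the level-h aggregation step above: the coarser similarity matrix is the block average of the level h − 1 matrix over the communities found there (the toy matrix and clusters are assumptions):

```python
# Block-average a similarity matrix over the clusters of the previous level.
import numpy as np

def aggregate(S_prev, clusters):
    # S_prev: (n x n) similarity matrix at level h-1
    # clusters: list of index arrays, the communities found at level h-1
    k = len(clusters)
    S = np.zeros((k, k))
    for i, Ci in enumerate(clusters):
        for j, Cj in enumerate(clusters):
            S[i, j] = S_prev[np.ix_(Ci, Cj)].mean()   # block average
    return S

rng = np.random.default_rng(8)
S0 = rng.random((6, 6)); S0 = (S0 + S0.T) / 2
clusters = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]
S1 = aggregate(S0, clusters)    # 3 x 3 similarity matrix for the next level
```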
Multilevel hierarchical KSC for complex networks (3)
MH-KSC on the PGP network:
[figure: clusterings at fine, intermediate, and coarse levels of the hierarchy]
Multilevel hierarchical KSC finds high quality clusters at coarse as well as fine and intermediate levels of the hierarchy.
[Mall, Langone, Suykens, PLOS ONE, 2014]
Multilevel hierarchical KSC for complex networks (4)
[figure: clusterings obtained by Louvain, Infomap, and OSLOM]
Louvain, Infomap, and OSLOM seem biased toward a particular scale in comparison with MH-KSC, based upon the ARI, VI, and Q metrics.
Conclusions
- synergies between parametric and kernel-based modelling
- sparse kernel models using the fixed-size method
- supervised, semi-supervised and unsupervised learning applications
- large scale complex networks and big data
- software: see the ERC AdG A-DATADRIVE-B website, www.esat.kuleuven.be/stadius/adb/software.php
Acknowledgements
Current and former co-workers at ESAT-STADIUS: M. Agudelo, C. Alzate, A. Argyriou, R. Castro, M. Claesen, A. Daemen, J. De Brabanter, K. De Brabanter, J. de Haas, L. De Lathauwer, B. De Moor, F. De Smet, M. Espinoza, T. Falck, Y. Feng, E. Frandi, D. Geebelen, O. Gevaert, L. Houthuys, X. Huang, B. Hunyadi, A. Installe, V. Jumutc, Z. Karevan, P. Karsmakers, R. Langone, Y. Liu, X. Lou, R. Mall, S. Mehrkanoon, Y. Moreau, M. Novak, F. Ojeda, K. Pelckmans, N. Pochet, D. Popovic, J. Puertas, M. Seslija, L. Shi, M. Signoretto, P. Smyth, Q. Tran Dinh, V. Van Belle, T. Van Gestel, J. Vandewalle, S. Van Huffel, C. Varon, X. Xi, Y. Yang, S. Yu, and others.
Many others for joint work, discussions, invitations, and organizations.
Support from ERC AdG A-DATADRIVE-B, KU Leuven, GOA-MaNet, OPTEC, IUAP DYSCO, FWO projects, IWT, iMinds, BIL, COST.
Thank you