Feature Extraction and Selection. More Info == Better Performance? Curse of Dimensionality. Feature Space. High-Dimensional Spaces



Feature Extraction and Selection
APR Course, Delft, The Netherlands
Marco Loog

More Info == Better Performance?
- Why and how should we lower the number of features?

Feature Space
- A p-dimensional space in which each dimension is a feature, containing N [labeled] samples [objects]
- [Figure: samples of classes A and B plotted against the features]

Curse of Dimensionality
- Problem: too few samples in too many dimensions [the curse of dimensionality]
- Let's discuss histogram-based density estimation with increasingly finer binning
- Anyway: in high-dimensional spaces, our 2D/3D intuition does not work anymore...

High-Dimensional Spaces
- Example: a neighborhood capturing a fixed fraction r of uniformly distributed data in a unit hypercube
- In R^p, such a neighborhood has sides of r^(1/p)
- E.g. sides of 0.89: not a small block anymore!

High-Dimensional Spaces
- Example: all points are boundary points
- For normally distributed samples, the fraction on the convex hull grows with the dimensionality, reaching 95% in the example
- [Figure: fraction on boundary vs. dimensionality]
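The hypercube example can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `edge_length` and the choice of a 10% fraction are assumptions made for illustration, not values taken from the slides.

```python
def edge_length(fraction: float, p: int) -> float:
    """Side of a sub-cube of the unit hypercube in R^p that is expected
    to contain `fraction` of uniformly distributed data."""
    # Volume of the sub-cube is side**p, so side = fraction**(1/p).
    return fraction ** (1.0 / p)


# Assumed illustration: capture 10% of the data in increasing dimensionality.
for p in (1, 2, 3, 10, 20):
    print(f"p = {p:2d}: side = {edge_length(0.1, p):.2f}")
```

Under this assumed fraction, the required side length approaches the full width of the cube as p grows, which is the effect the slide describes.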

High-Dimensional Spaces
- Example: points tend to have equal distances
- For the squared distance to the mean, consider the coefficient of variation
- For squared Euclidean distances of points in R^p drawn from a standard normal, the distribution is approximately normal
- [Figure: coefficient of variation vs. dimensionality]

High-Dimensional Spaces
For classification purposes, this means that for increasing dimensionality p:
- local, distance-based methods suffer most, e.g. NN-methods
- global, more restricted models suffer less, e.g. linear models
So...
- controlling classifier complexity is important [later...]
- p should be kept as low as possible: dimensionality reduction

Dimensionality Reduction
- Problem: too few samples in too many dimensions [the curse of dimensionality]
- Solution: drop dimensions [features]
  - Feature selection
  - Feature extraction
- Questions: Which dimensions could be dropped? What is the best subset of features to keep?

Dimensionality Reduction
Other uses:
- Fewer parameters give faster algorithms, and parameters are easier to estimate
- Explaining which measurements are useful and which are not [reducing redundancy]
- Visualisation of data can be a powerful tool when designing pattern recognition systems

Feature Selection vs Extraction
- Feature selection: select d out of p measurements
- Feature extraction: map p measurements to d measurements
- [Figure: selection and extraction drawn as mappings f of the p measurements x]
- Think of selection and extraction as finding a mapping
We need:
- a criterion function, e.g. error, class overlap, information loss, ...
- an optimization or search algorithm to find the mapping given the criterion
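The claim that points tend to have equal distances can be explored by simulation. The sketch below makes two assumptions: the data are standard normal, and the distance is measured to the true mean (the origin). In that case the squared distance follows a chi-square distribution with p degrees of freedom, whose coefficient of variation is sqrt(2/p). The function name and sample size are illustrative.

```python
import math
import random
import statistics


def squared_distance_cv(p, n_samples=2000, seed=0):
    """Estimate the coefficient of variation (std / mean) of the squared
    distance to the origin for standard normal samples in R^p."""
    rng = random.Random(seed)
    sq = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p))
          for _ in range(n_samples)]
    return statistics.pstdev(sq) / statistics.fmean(sq)


# Compare the simulated value with the chi-square prediction sqrt(2 / p).
for p in (1, 10, 100):
    print(p, round(squared_distance_cv(p), 3), round(math.sqrt(2.0 / p), 3))
```

A shrinking coefficient of variation means the distances become relatively more alike, which is why distance-based methods such as nearest neighbors lose discriminating power first.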

Criteria
- The optimal criterion: final performance of the entire system [maybe calculated using cross-validation]
- Approximate performance predictors: calculate an easy-to-use criterion that gives an indication of how well a more powerful / realistic criterion may perform

Linear Feature Extraction
- Unsupervised: Principal Component Analysis [PCA]
- Supervised: Linear Discriminant Analysis [LDA]
- PCA is the most widely used feature extraction method
- LDA might be a good second

PCA
Principal component analysis [PCA, 1901]: find directions in data which...
- retain as much variation as possible
- make projected data uncorrelated
- minimise squared reconstruction error

PCA Example
- E.g. NIST digits
- [Figure: % variance retained vs. d]
- Intrinsic dimensionality?

PCA Example
- For image data, principal components might also be interpretable...
- Here: largest occurring variations between digits

Remarks on PCA
Principal component analysis:
- global and linear
- unsupervised [but can be performed on average per-class covariance matrix: klm in PRTools]
- might need a lot of data to estimate Σ well
Danger: the criterion is not necessarily related to the goal; it might discard important directions
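The three PCA properties listed above can be illustrated with an eigendecomposition of the sample covariance matrix. This is a sketch using NumPy, not the PRTools implementation; the function name `pca`, the synthetic data and the choice d = 2 are assumptions for illustration.

```python
import numpy as np


def pca(X, d):
    """Top-d principal directions of X (rows are samples).

    Returns the data mean, a (p, d) matrix of directions, and the
    fraction of total variance retained by those directions."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / X.shape[0]                  # sample covariance (1/n)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    retained = eigvals[:d].sum() / eigvals.sum()
    return mean, eigvecs[:, :d], retained


# Synthetic data with very different spreads per feature (assumed example).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
mean, W, retained = pca(X, d=2)
Z = (X - mean) @ W                                # projected, uncorrelated data
print(f"variance retained with d = 2: {retained:.1%}")
```

Plotting `retained` against d gives the kind of variance-retained curve the PCA example slide refers to.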

Supervised Linear Feature Extraction
- If a desired output y [or label ω] is present for each x, supervised criteria can be used
- One illustration only: Linear Discriminant Analysis [LDA, or in PRTools terms fisherm]

LDA [or Fisher mapping]
- Find basis vector a for {x} such that in the projections, the classes are maximally separated
- Choose a to maximise the Fisher criterion: $J_F(a) = \frac{a^T S_B a}{a^T S_W a}$
- Solution: eigenanalysis on $S_W^{-1} S_B$

LDA
- Map down to a maximum of C - 1 dimensions [C: number of classes]. Why?
- To avoid fitting noise, can do PCA first
- If the system is underdetermined [n < p], first doing PCA is required
- Example: NIST digits

Nonlinear Feature Extraction
- Large collection of possible mappings
- Usually need an optimization algorithm
- Here: only unsupervised methods

Nonlinear Feature Extraction
- Kernel Principal Component Analysis [KPCA]
- Topographic mapping
  - Principal curve [PC]
  - Self-organising map [SOM]
  - Generative Topographic Map [GTM]
- Mixture of subspaces [MFA/MPCA]
- Autoregressive neural network
- Embedding
  - Classical scaling [CS]
  - Multidimensional scaling [MDS]
  - Isometric mapping [Isomap]
  - Locally linear embedding [LLE]
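The Fisher mapping can be sketched as an eigenanalysis of the within-scatter inverse times the between-scatter. The code below is an illustration with NumPy and is not PRTools' fisherm; the function name `lda`, the scatter definitions weighted by class frequency, and the two-class toy data are assumptions. It also assumes the within-scatter matrix is invertible, which is why PCA may be needed first when n < p.

```python
import numpy as np


def lda(X, y, d=None):
    """Fisher mapping: directions maximising between- over within-class scatter.

    Returns a (p, d) matrix with d <= C - 1 columns."""
    classes = np.unique(y)
    n, p = X.shape
    m = X.mean(axis=0)
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        m_c = Xc.mean(axis=0)
        S_W += (Xc - m_c).T @ (Xc - m_c) / n      # (n_c / n) * class covariance
        diff = (m_c - m)[:, None]
        S_B += (len(Xc) / n) * (diff @ diff.T)
    # Eigenanalysis of S_W^{-1} S_B; S_B has rank at most C - 1.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    max_d = len(classes) - 1
    d = max_d if d is None else min(d, max_d)
    return eigvecs.real[:, order[:d]]


# Two classes separated along the first feature, noisy second feature.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(200, 2)) * np.array([1.0, 5.0])
X1 = rng.normal(size=(200, 2)) * np.array([1.0, 5.0]) + np.array([3.0, 0.0])
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)
A = lda(X, y)
print(A.ravel())
```

On this toy problem PCA would favour the high-variance second feature, while the Fisher direction follows the first feature, where the class means differ.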

KPCA
- Use kernel trick, like in support vector classifier
- In effect: add results of nonlinear operations on features as features, and apply standard PCA
- Example: polynomial kernel $K(x, y) = (x^T y + 1)^d$
- [Figure: polynomial kernel for increasing degree, up to d = 3]
- Similarly: kernel LDA, kernel CCA, ...

Autoencoding Neural Network
- Feedforward neural networks predicting their input
- Bottleneck layer: feature extraction
- Criterion like PCA [reconstruction error]
- Training: like standard NNs [back-propagation, ...]
- [Figure: network mapping input x through bottleneck z to its reconstruction]

Autoencoding Neural Network
- With multiple hidden layers: nonlinear feature extraction

Embedding
- Find new representation directly, such that some properties [e.g. distances between samples] are preserved as well as possible

Isomap
- Euclidean distance not suitable for preserving topology
- Construct neighborhood graph
- Calculate distance D_ij between samples over the graph
- Perform classical scaling using D

LLE
- Globally preserve local structure around each sample
- Step I: for each sample x_i, find weights w that best reconstruct it linearly from its k neighbors x_n(1), ..., x_n(k)
- Minimise: $\varepsilon_i = \| x_i - \sum_{j=1}^{k} w_j x_{n(j)} \|^2$
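Returning to KPCA: the kernel trick amounts to running PCA on a centered kernel matrix instead of on the covariance matrix. The sketch below assumes a polynomial kernel with an additive constant `c` (set to 1 here as an assumption) and returns the projections of the training samples only; the function name `kernel_pca` is hypothetical.

```python
import numpy as np


def kernel_pca(X, n_components, degree=2, c=1.0):
    """Kernel PCA with a polynomial kernel K(x, y) = (x^T y + c)^degree.

    Returns the (n, n_components) projections of the training samples."""
    K = (X @ X.T + c) ** degree
    n = K.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)      # centering matrix
    Kc = J @ K @ J                                # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)      # guard tiny negative values
    # Projection on component j is sqrt(lambda_j) times the unit eigenvector.
    return eigvecs[:, order] * np.sqrt(lam)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([3.0, 2.0, 1.0])
Z = kernel_pca(X, n_components=2, degree=2)
print(Z.shape)
```

With `degree=1` the kernel is linear and the projections coincide, up to sign, with ordinary PCA scores, which is a convenient sanity check of the construction.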

LLE
- To calculate w for sample x_i: the constrained minimisation has a closed-form solution in terms of Q, the k x k local Gram matrix
- Note: often, Q is regularised, Q = Q + rI

LLE
- Result of Step I: sparse N x N matrix W
- Step II: find a projection z_i for each sample x_i by minimizing a reconstruction cost over Z, where Z contains the z_i as its columns
- Under the constraints on Z, the solutions are eigenvectors of M, corresponding to the smallest eigenvalues
- Discard the smallest eigenvalue to satisfy the constraints
- [Figure: LLE example embeddings for different settings of d, k and r]

Summary
- Feature extraction, like selection: useful for visualization, necessary because of the curse of dimensionality
- Feature extraction: linear vs. nonlinear, supervised vs. unsupervised
- PCA is the most important feature extraction method, but t-SNE is doing quite well as well :)

Feature Selection
We need:
- Criterion function, e.g. error, class overlap, information loss
- Search algorithm, e.g. pick best single feature at each time
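Step I of LLE can be sketched for a single sample. The formulas used here are the standard LLE ones and are an assumption with respect to the slides, whose equations are not available: the weights are constrained to sum to one, Q holds the inner products of the differences between x_i and its neighbors, and the solution is proportional to the solution of Q w = 1. The function name `lle_weights` and the default r are illustrative.

```python
import numpy as np


def lle_weights(x_i, neighbors, r=1e-3):
    """Reconstruction weights of x_i from its k neighbors (rows of `neighbors`).

    Assumed formulation: minimise ||x_i - sum_j w_j x_n(j)||^2 subject to
    sum_j w_j = 1, using the regularised local Gram matrix Q + r I."""
    diffs = x_i - neighbors                       # (k, p) differences
    Q = diffs @ diffs.T                           # k x k local Gram matrix
    Q = Q + r * np.eye(len(neighbors))            # regularisation, Q = Q + rI
    w = np.linalg.solve(Q, np.ones(len(neighbors)))
    return w / w.sum()                            # enforce sum-to-one


# A point halfway between its two neighbors gets equal weights.
x = np.array([0.0, 0.0])
nbrs = np.array([[1.0, 0.0], [-1.0, 0.0]])
w = lle_weights(x, nbrs)
print(w)
```

The regularisation matters whenever k exceeds the local dimensionality, because Q is then singular, as it is in this two-neighbor example.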

Criteria
- Actual classification performance: best possible criterion, but very expensive!
- Approximate performance predictors: calculate an easy-to-use measure that gives an indication of real performance

Probabilistic Criteria
- Probabilistic distance: expresses distance between two distributions, e.g. divergence
  $J_D(\omega_1, \omega_2) = \int [\, p(x|\omega_1) - p(x|\omega_2) \,] \ln \frac{p(x|\omega_1)}{p(x|\omega_2)} \, dx$
- Advantage: bound for classification error
- Drawback: needs reasonable estimate of the p(x|ω_i)'s...

Scatter Matrices
- m: mean of all n samples (C classes)
- m_i: mean of the n_i samples of class i
- Σ: covariance matrix of all n samples
- Σ_i: covariance matrix of all n_i samples of class i

Scatter Matrices
- Σ: total scatter, overall width
- S_W: average class width; the smaller, the better
- S_B: average distance between class means; the larger, the better
- Total scatter: Σ, equals sum of within and between, $\Sigma = S_W + S_B$
- Within-scatter: $S_W = \sum_{i=1}^{C} \frac{n_i}{n} \Sigma_i$
- Between-scatter: $S_B = \sum_{i=1}^{C} \frac{n_i}{n} (m_i - m)(m_i - m)^T$

Heuristic Scatter-based Criteria
Scatter-based classification performance indicators:
- $J_1 = \mathrm{trace}(S_W + S_B) = \mathrm{trace}(\Sigma)$
- $J_2 = \mathrm{trace}(S_B / S_W)$
- $J_3 = \det(\Sigma) / \det(S_W)$
- $J_4 = \mathrm{trace}(S_W) / \mathrm{trace}(S_B)$

Yet Another Criterion
- Mahalanobis distance
- Assumes Gaussian distributions with equal covariance matrix Σ: $D_M = (m_1 - m_2)^T \Sigma^{-1} (m_1 - m_2)$
- Multi-class, e.g. sum (maha-s) or minimum (maha-m)
- 1D case: Fisher criterion $J_F = \frac{(m_1 - m_2)^2}{\sigma_1^2 + \sigma_2^2}$
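The scatter matrices and some of the scatter-based criteria can be computed directly from labeled data. This sketch assumes covariances normalised by the sample count (1/n), so that the total scatter equals the sum of within and between scatter exactly, and it reads trace(S_B / S_W) as the trace of the inverse of S_W times S_B. Function names and the three-class toy data are illustrative.

```python
import numpy as np


def scatter_matrices(X, y):
    """Return total, within-class and between-class scatter (1/n normalised)."""
    n, p = X.shape
    m = X.mean(axis=0)
    total = (X - m).T @ (X - m) / n
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = len(Xc)
        m_c = Xc.mean(axis=0)
        Sigma_c = (Xc - m_c).T @ (Xc - m_c) / n_c  # class covariance
        S_W += (n_c / n) * Sigma_c
        diff = (m_c - m)[:, None]
        S_B += (n_c / n) * (diff @ diff.T)
    return total, S_W, S_B


def scatter_criteria(total, S_W, S_B):
    """Scatter-based indicators; J2 interpreted as trace(S_W^{-1} S_B)."""
    return {
        "J1": np.trace(S_W + S_B),
        "J2": np.trace(np.linalg.solve(S_W, S_B)),
        "J3": np.linalg.det(total) / np.linalg.det(S_W),
    }


rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(size=(100, 2)) + mu for mu in means])
y = np.repeat([0, 1, 2], 100)
total, S_W, S_B = scatter_matrices(X, y)
print(scatter_criteria(total, S_W, S_B))
```

Evaluating such a criterion on a candidate feature subset only requires the corresponding rows and columns of the scatter matrices, which makes it far cheaper than training a classifier.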


Now, the Search Algorithms
- Feature selection: select a subset of d out of p measurements which optimises the chosen criterion
- Simplest solution: look at all possible subsets
- Problem: there are $\binom{p}{d} = \frac{p!}{(p-d)!\,d!}$ subsets

Search Algorithms
- Sub-optimal algorithms: select one feature [or a few features] at a time
- Simplest: best individual d
- ...but these are not necessarily the best d!
- d best or best d?

More Sub-Optimal Strategies
Forward selection
- Start with empty feature set
- One at a time, keep adding the feature that gives best performance considering the entire chosen feature set

More Sub-Optimal Strategies
Backward selection
- Same as forward selection... but then the other way round

More Sub-Optimal Strategies
Plus-l-take-away-r
- Start with empty set [if l > r] or entire set [if l < r]
- Keep adding best l and removing worst r [...or vice versa]
- What may be the benefit?
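Forward selection and the subset count can be sketched in a few lines. The criterion here is a made-up toy function (an assumption for illustration only) in which features 0 and 1 are individually the strongest but nearly redundant, so the best two individual features are not the best pair.

```python
from math import comb


def forward_selection(p, d, criterion):
    """Greedy forward selection of d out of p features.

    `criterion` takes a list of feature indices and returns a score
    (higher is better) for the whole candidate set."""
    selected = []
    remaining = list(range(p))
    while len(selected) < d:
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_criterion(subset):
    """Hypothetical criterion: additive gains minus a redundancy penalty."""
    gain = {0: 3.0, 1: 2.9, 2: 2.0, 3: 0.5}
    score = sum(gain[f] for f in subset)
    if 0 in subset and 1 in subset:
        score -= 2.8                              # features 0 and 1 overlap
    return score


print("subsets of 10 out of 20 features:", comb(20, 10))
print("forward selection picks:", forward_selection(4, 2, toy_criterion))
```

Because each step scores the entire chosen set, forward selection avoids the redundant pair here, although in general it remains a sub-optimal strategy.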

Branch & Bound
- Backward search
- Preset # features
- Initial bound, backtrack, use backward search again, consider sets with criterion values better than bound
- Optimal if criterion is monotonic
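The steps above can be sketched as a recursive backward search with pruning. This is one possible reading of the slide rather than a reference implementation: it assumes a monotonic criterion (removing a feature never increases the score), and the function name, the toy criterion and its weights are hypothetical.

```python
from itertools import combinations


def branch_and_bound(p, d, criterion):
    """Find the best subset of d out of p features for a monotonic criterion.

    Starts from the full set and removes features in increasing index order,
    pruning any branch whose score is already no better than the bound."""
    best = {"score": float("-inf"), "subset": None}

    def search(subset, last_removed):
        removable = [f for f in subset if f > last_removed]
        if len(removable) < len(subset) - d:
            return                                # cannot reach size d from here
        score = criterion(subset)
        if score <= best["score"]:
            return                                # monotonicity: subsets cannot do better
        if len(subset) == d:
            best["score"], best["subset"] = score, subset   # new bound
            return
        for f in removable:
            search(tuple(g for g in subset if g != f), f)

    search(tuple(range(p)), -1)
    return best["subset"], best["score"]


def monotonic_criterion(subset):
    """Hypothetical monotonic score: non-negative weights plus a pair bonus."""
    weights = [0.2, 1.5, 0.7, 0.9, 0.1, 1.1]
    score = sum(weights[f] for f in subset)
    if 0 in subset and 4 in subset:
        score += 2.0                              # features 0 and 4 work well together
    return score


subset, score = branch_and_bound(6, 3, monotonic_criterion)
print(subset, score)
```

If the criterion is not monotonic, the pruning step may discard the true optimum, which is why the optimality guarantee is conditional.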
