Data Mining in Bioinformatics Day 3: Feature Selection

1 Data Mining in Bioinformatics Day 3: Feature Selection Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls Universität Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

2 What is feature selection?
Abundance of features
- Usually, our output variable Y does not depend on all of our input features X.
- Why is this? X usually includes all features that could determine Y according to our prior knowledge, but we do not know for sure which of them actually do.
- In fact, we perform supervised learning to determine this dependence between input variables and output variables.
- (Supervised) feature selection means selecting the relevant subset of features for a particular learning task.

3 Why feature selection?
Reasons for feature selection
- to detect causal features
- to remove noisy features
- to reduce the set of features that has to be observed (cost, speed, data understanding)
Two modes of feature selection
- Filter approaches: select interesting features a priori, based on a quality function (information criterion)
- Wrapper approaches: select features that are interesting for one particular classifier

4 Optimisation problem
Combinatorial problem
- Given a set of features D and a quality function q, we try to find the subset S of D of cardinality n that maximises q:
$$\operatorname*{arg\,max}_{S \subseteq D,\ |S| = n} q(S) \tag{1}$$
Exponential runtime effort
- Enumerating all possibilities means scoring $\binom{|D|}{n}$ subsets. This number grows exponentially in n (for n up to |D|/2), so exhaustive search is intractable for large D and n.
- In practice, we have to find a workaround!
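To make the cost of exhaustive enumeration concrete, here is a minimal sketch of the brute-force search. The function name `exhaustive_selection`, the toy scores, and the additive quality function are illustrative assumptions, not part of the lecture.

```python
from itertools import combinations
from math import comb


def exhaustive_selection(features, q, n):
    """Return the size-n subset of `features` that maximises q, by brute force."""
    return max(combinations(features, n), key=lambda subset: q(frozenset(subset)))


# Hypothetical toy quality function: the sum of made-up per-feature scores.
scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7}
best = exhaustive_selection(scores, lambda s: sum(scores[f] for f in s), n=2)

# Number of subsets that would have to be scored for |D| = 1000 features.
subset_counts = {n: comb(1000, n) for n in (1, 2, 5, 10)}
```

Even for a moderate n the entries of `subset_counts` become astronomically large, which is why the following slides turn to greedy strategies.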

5 Greedy selection
Take the currently best one
- Greedy selection is an alternative to exhaustive enumeration.
- The idea is to iteratively add the currently most informative feature to the selected set, or to remove the currently most uninformative feature from the solution set.
- These two variants of greedy feature selection are referred to as forward feature selection and backward elimination.

6 Greedy selection
Forward Feature Selection (S is the selected set, C the pool of remaining candidates)
1: S ← ∅, C ← D
2: repeat
3:   j ← arg max_{j ∈ C} q(S ∪ {j})
4:   S ← S ∪ {j}
5:   C ← C \ {j}
6: until |S| = n
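The forward selection loop can be sketched in a few lines of Python. This is an illustration under assumptions: the names `forward_selection`, `explains`, and `coverage` are hypothetical, and the coverage-style quality function is a toy stand-in for whatever q is used in practice.

```python
def forward_selection(features, q, n):
    """Greedily grow a selected set until it holds n features."""
    selected = []
    remaining = list(features)
    while len(selected) < n and remaining:
        # Pick the candidate whose addition gives the highest quality.
        best = max(remaining, key=lambda j: q(frozenset(selected) | {j}))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: each feature "explains" a set of samples.
explains = {"f1": {1, 2, 3}, "f2": {3, 4}, "f3": {1, 2}, "f4": {5}}


def coverage(subset):
    """Quality of a subset: number of distinct samples it explains."""
    return len(set().union(*(explains[f] for f in subset)))


chosen = forward_selection(explains, coverage, n=2)
```

Each round scores only the remaining candidates, so the number of calls to q is at most n times |D| instead of one call per subset.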

7 Greedy selection
Backward Elimination (S is the current solution set, R collects the removed features)
1: S ← D, R ← ∅
2: repeat
3:   j ← arg max_{j ∈ S} q(S \ {j})
4:   S ← S \ {j}
5:   R ← R ∪ {j}
6: until |S| = n
Optimality of greedy selection
- Guaranteed to be optimal if q decomposes over the elements of S:
$$q(S) = \sum_{X \in S} q(X) \tag{2}$$
- Near-optimal if q is submodular (more details later)
- Otherwise there is no guarantee for optimality
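A matching sketch for backward elimination follows. The function name and the additive toy scores are assumptions for illustration. Because this q decomposes over single features, the greedy result coincides with the best subset of size 2.

```python
def backward_elimination(features, q, n):
    """Greedily remove features until only n remain."""
    selected = list(features)
    removed = []
    while len(selected) > n:
        # Drop the feature whose removal leaves the highest-quality set.
        worst = max(selected, key=lambda j: q(frozenset(selected) - {j}))
        selected.remove(worst)
        removed.append(worst)
    return selected, removed


# Hypothetical per-feature scores; q is additive, so it decomposes over S.
scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7}


def additive_q(subset):
    return sum(scores[f] for f in subset)


kept, dropped = backward_elimination(scores, additive_q, n=2)
```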

8 Correlation Coefficient
Definition
The correlation coefficient ρ_{X,Y} between two random variables X and Y with expected values µ_X and µ_Y and standard deviations σ_X and σ_Y is defined as:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{3}$$
$$= \frac{E\left((X - \mu_X)(Y - \mu_Y)\right)}{\sigma_X \sigma_Y}, \tag{4}$$
where E is the expected value operator and cov means covariance.
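As a filter criterion, the correlation coefficient can be estimated from samples by replacing expectations with empirical means. The sketch below does this in plain Python; the function name `pearson` and the toy data are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Empirical correlation coefficient of two equally long samples."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / m
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / m)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / m)
    return cov / (std_x * std_y)


# Hypothetical feature that is a perfect linear function of the label.
feature = [1.0, 2.0, 3.0, 4.0]
label = [2.0, 4.0, 6.0, 8.0]
rho = pearson(feature, label)
```

Note that the sketch divides by zero if either sample is constant, and that ranking features by correlation typically uses the absolute value, since strong negative correlation is just as informative.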

9 Mutual Information
Definition
Given two random variables X and Y, we define the mutual information I as
$$I(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\, p(y)}\right), \tag{5}$$
where X is the input variable, Y is the output variable, p(x, y) is the (joint) probability of observing x and y, and p(x) and p(y) are the marginal probabilities of observing x and y, respectively. log is usually the logarithm with base 2.
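For discrete variables, the mutual information can be estimated by plugging empirical frequencies into the definition. The following sketch assumes discrete, hashable values; the function name and the toy columns are illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X, Y) in bits from paired discrete samples."""
    m = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / m) * log2((c / m) / ((px[x] / m) * (py[y] / m)))
        for (x, y), c in joint.items()
    )


# Hypothetical toy data: one feature copies the label, the other ignores it.
labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]
uninformative = [0, 1, 0, 1]

mi_informative = mutual_information(informative, labels)
mi_uninformative = mutual_information(uninformative, labels)
```

Only value pairs that actually occur contribute to the sum, which matches the usual convention that 0 log 0 = 0.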

10 HSIC
Definition
- The Hilbert-Schmidt Independence Criterion (HSIC) measures the dependence of two random variables.
- Given two random variables X and Y, an empirical estimate of the HSIC from m samples can be computed (up to a normalisation constant) as
$$\operatorname{trace}(KHLH) \tag{6}$$
where
- K is a kernel matrix on X,
- L is a kernel matrix on Y,
- H is a centering matrix with $H(i, j) = \delta(i, j) - \frac{1}{m}$.
Properties
- For universal kernels such as the Gaussian kernel, HSIC(X, Y) = 0 iff X and Y are independent.
- The larger HSIC(X, Y), the larger the dependence between X and Y.
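The trace expression can be evaluated without forming H explicitly, since multiplying by H on both sides just centers the rows and columns of a kernel matrix. The sketch below assumes one-dimensional inputs and a Gaussian kernel with a hypothetical bandwidth `sigma`; it returns the unnormalised value trace(KHLH).

```python
from math import exp


def gaussian_kernel_matrix(values, sigma=1.0):
    """Kernel matrix with entries exp(-(a - b)^2 / (2 sigma^2))."""
    return [[exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values] for a in values]


def center(K):
    """Return HKH for H = I - (1/m) * ones, assuming K is symmetric."""
    m = len(K)
    row_means = [sum(row) / m for row in K]
    total_mean = sum(row_means) / m
    return [
        [K[i][j] - row_means[i] - row_means[j] + total_mean for j in range(m)]
        for i in range(m)
    ]


def hsic(xs, ys, sigma=1.0):
    """Unnormalised empirical HSIC, trace(KHLH)."""
    Kc = center(gaussian_kernel_matrix(xs, sigma))
    L = gaussian_kernel_matrix(ys, sigma)
    m = len(xs)
    # trace(KHLH) = trace((HKH) L) = sum_ij (HKH)_ij * L_ij for symmetric L.
    return sum(Kc[i][j] * L[i][j] for i in range(m) for j in range(m))


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
dependent = hsic(xs, xs)           # Y is a copy of X
constant = hsic(xs, [7.0] * 6)     # Y carries no information about X
```

A constant Y yields an all-ones kernel matrix, so the centered sum vanishes and the estimate is zero up to rounding, while the dependent case gives a clearly positive value.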

11 Submodular Functions I
Definition
A function q on a set D is said to be submodular if
$$q(S \cup \{X\}) - q(S) \geq q(T \cup \{X\}) - q(T) \tag{7}$$
where X ∈ D, S ⊆ D, T ⊆ D, and S ⊆ T.
This is referred to as the property of "diminishing returns": if S is a subset of T, then S benefits at least as much from adding X as T does.
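For a small ground set, the diminishing-returns property can be checked by brute force. The sketch below tests the inequality for every pair S ⊆ T and every single element x outside T, which is the usual element-wise form of the definition; the function names and the two toy set functions are assumptions.

```python
from itertools import combinations


def powerset(items):
    """All subsets of `items` as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def is_submodular(ground_set, q):
    """Check diminishing returns for all S <= T and all x not in T."""
    ground = frozenset(ground_set)
    subsets = powerset(ground)
    for T in subsets:
        for S in subsets:
            if not S <= T:
                continue
            for x in ground - T:
                gain_S = q(S | {x}) - q(S)
                gain_T = q(T | {x}) - q(T)
                if gain_S < gain_T:
                    return False
    return True


# Hypothetical coverage function: each element covers a set of items.
covers = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}


def coverage(subset):
    return len(set().union(*(covers[e] for e in subset)))


def squared_size(subset):
    # Gains grow with the set size, so this is not submodular.
    return len(subset) ** 2


coverage_ok = is_submodular(covers, coverage)
squared_ok = is_submodular(covers, squared_size)
```

The check enumerates all pairs of subsets, so it is only practical for a handful of elements; it is meant as a way to build intuition, not as a tool for real feature sets.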

12 Submodular Functions II
Near-optimality (Nemhauser, Wolsey, and Fisher, 1978)
If q is a submodular, nondecreasing set function and q(∅) = 0, then the greedy algorithm is guaranteed to find a set S such that
$$q(S) \geq \left(1 - \frac{1}{e}\right) \max_{|T| = |S|} q(T) \tag{8}$$
Since 1 - 1/e ≈ 0.632, this means that the solution of greedy selection reaches at least 63% of the quality of the optimal solution.

13 Submodular Functions III
Example: Sensor Placement
- Imagine our features form a graph G = (D, E).
- Imagine the features are possible locations for a sensor.
- Each sensor may cover a node v and its neighbourhood N(v), that is,
$$q(S) = \left| \bigcup_{v \in S} \left( N(v) \cup \{v\} \right) \right|$$
- Now we want to pick locations in the graph such that our sensors cover as large an area of the graph as possible.
q fulfills the following properties:
- q(∅) = 0
- q is non-decreasing
- q is submodular
Hence greedy selection will lead to near-optimal sensor placement!
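The sensor placement example can be sketched as greedy maximisation of a coverage function on a small graph. The graph below (a path with one extra branch) and the function names are hypothetical.

```python
# Hypothetical undirected graph as an adjacency mapping.
graph = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"c", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}


def covered(sensors):
    """Nodes covered by sensors: each sensor covers itself and its neighbours."""
    area = set()
    for v in sensors:
        area |= graph[v] | {v}
    return area


def greedy_placement(k):
    """Place k sensors, each time adding the location that covers the most nodes."""
    sensors = []
    for _ in range(k):
        candidates = [v for v in graph if v not in sensors]
        best = max(candidates, key=lambda v: len(covered(sensors + [v])))
        sensors.append(best)
    return sensors


placement = greedy_placement(2)
```

On this toy graph two greedily placed sensors already cover all six nodes; in general the greedy result is only guaranteed to be within the factor 1 - 1/e of the optimum.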

14 Wrapper methods
Two flavours:
- embedded: the selection process is integrated into the learning algorithm
- not embedded (wrapper): the learning algorithm is employed as a quality measure
Wrappers:
- Simple wrapper: do prediction using one feature only and use classification accuracy as the measure of quality.
- Extend this to groups of features by heuristic search strategies (greedy, Monte Carlo, etc.).
Embedded:
- Typical example: decision trees!
- l_0-norm SVM
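The simple wrapper can be sketched by scoring each feature with the accuracy of a classifier trained on that feature alone. The classifier here is a nearest-centroid rule evaluated by leave-one-out, chosen only to keep the example dependency-free; the data and all names are assumptions.

```python
def single_feature_accuracy(values, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier on one feature."""
    correct = 0
    for i in range(len(values)):
        train = [(v, y) for k, (v, y) in enumerate(zip(values, labels)) if k != i]
        centroids = {}
        for cls in {y for _, y in train}:
            members = [v for v, y in train if y == cls]
            centroids[cls] = sum(members) / len(members)
        predicted = min(centroids, key=lambda cls: abs(values[i] - centroids[cls]))
        correct += predicted == labels[i]
    return correct / len(values)


# Hypothetical data: rows are samples, columns are features.
X = [
    [1.0, 5.0],
    [1.2, 1.0],
    [0.8, 4.0],
    [3.0, 1.5],
    [3.2, 4.5],
    [2.9, 0.5],
]
y = [0, 0, 0, 1, 1, 1]

accuracy = {j: single_feature_accuracy([row[j] for row in X], y) for j in range(2)}
ranking = sorted(accuracy, key=accuracy.get, reverse=True)
```

Passing the same accuracy estimate as the quality function q to a greedy search extends this to groups of features.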

15 l_0-norm SVM
3 steps
1. Train a regular linear SVM (using l_1-norm or l_2-norm regularization).
2. Re-scale the input variables by multiplying them by the absolute values of the components of the weight vector w obtained.
3. Iterate the first 2 steps until convergence.
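The three steps can be sketched as follows. To stay self-contained, the "regular linear SVM" is replaced here by a very small trainer (full-batch subgradient descent on the l_2-regularised hinge loss, without bias term), and a fixed number of outer iterations stands in for a convergence test. All names, hyperparameters, and data are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Stand-in trainer: subgradient descent on the regularised hinge loss."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # sample violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def l0_svm_scaling(X, y, iterations=5):
    """Repeatedly train and rescale inputs by |w|; returns per-feature scales."""
    d = len(X[0])
    scale = [1.0] * d
    for _ in range(iterations):
        X_scaled = [[xj * sj for xj, sj in zip(xi, scale)] for xi in X]
        w = train_linear_svm(X_scaled, y)                    # step 1
        scale = [sj * abs(wj) for sj, wj in zip(scale, w)]   # step 2
    return scale


# Hypothetical data: feature 0 separates the classes, feature 1 is small noise.
X = [[2.0, 0.1], [1.5, -0.2], [-2.0, 0.15], [-1.5, -0.1]]
y = [1, 1, -1, -1]
scale = l0_svm_scaling(X, y)
```

Features whose scale is driven towards zero are effectively removed, which is how the repeated rescaling approximates a sparsity-inducing l_0 penalty.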

16 Unsupervised feature selection
Problem setting
- Even without a target variable y, we can select features that are informative according to some criterion.
Criteria (Guyon and Elisseeff, 2003)
- Saliency: a feature is salient if it has a high variance or range
- Entropy: a feature has high entropy if the distribution of examples is uniform
- Smoothness: a feature in a time series is smooth if on average its local curvature is moderate
- Density: a feature is in a high-density region if it is highly connected with many other variables
- Reliability: a feature is reliable if the measurement error bars are smaller than the variability
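Two of these criteria, saliency and entropy, are easy to compute per feature without any labels. The sketch below uses the population variance for saliency and the empirical entropy of the observed values, which assumes discrete features; the column names and data are made up.

```python
from collections import Counter
from math import log2
from statistics import pvariance


def entropy(values):
    """Empirical entropy in bits of a discrete sample."""
    m = len(values)
    return -sum((c / m) * log2(c / m) for c in Counter(values).values())


# Hypothetical feature columns.
columns = {
    "constant": [1, 1, 1, 1],
    "binary": [0, 1, 0, 1],
    "spread": [0, 1, 2, 3],
}

variances = {name: pvariance(col) for name, col in columns.items()}
entropies = {name: entropy(col) for name, col in columns.items()}
by_saliency = sorted(columns, key=variances.get, reverse=True)
```

A constant feature scores zero on both criteria and would be discarded first under either ranking.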

17 Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff
1. Do you have domain knowledge? If yes, construct a better set of ad-hoc features.
2. Are your features commensurate? If no, consider normalizing them.
3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.

18 Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff
5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
7. Do you suspect your data is "dirty" (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.

19 Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff
8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the probe method as a stopping criterion, or use the l_0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.

20 Feature Selection in Practice
Catalog of 10 questions by Guyon and Elisseeff
10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several bootstraps.

21 Number of features
What if we don't know a reasonable choice of n?
Use the probe method (Bi et al., 2003, Stoppiglia et al., 2003, Tusher et al., 2003):
- Insert fake features (= probes) into the set of features.
- Fake features can be drawn randomly from a Gaussian distribution, or they can be created in a nonparametric manner by randomly shuffling existing features.
- Stop feature selection when you select the first fake feature or when the proportion of fake features exceeds a certain threshold.
HSIC-based stopping criterion:
- Stop feature selection when there is no more dependence between features X and labels Y (Gretton et al., NIPS 2007).
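A minimal sketch of the probe method with shuffled probes and the "stop at the first fake feature" rule is shown below. Ranking by absolute correlation, the `probe_` naming scheme, the fixed seed, and the toy data are all assumptions made for illustration.

```python
import random
from math import sqrt


def abs_correlation(xs, ys):
    """Absolute Pearson correlation, or 0.0 if either input is constant."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / sqrt(var_x * var_y)


def probe_selection(features, labels, seed=0):
    """Rank real features and shuffled probes; keep features above the first probe."""
    rng = random.Random(seed)
    scored = dict(features)
    for name, values in features.items():
        shuffled = list(values)
        rng.shuffle(shuffled)  # nonparametric probe: a shuffled copy
        scored["probe_" + name] = shuffled
    # Rounding makes exact ties deterministic; the stable sort then keeps
    # real features (inserted first) ahead of probes with equal scores.
    ranking = sorted(
        scored,
        key=lambda name: round(abs_correlation(scored[name], labels), 12),
        reverse=True,
    )
    selected = []
    for name in ranking:
        if name.startswith("probe_"):
            break  # stop at the first fake feature
        selected.append(name)
    return selected


labels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
features = {
    "signal": [2 * v + 1 for v in labels],
    "noise": [0.3, -1.2, 0.8, 0.1, -0.5, 1.4, -0.9, 0.2],
}
selected = probe_selection(features, labels)
```

Whether the `noise` feature survives depends on how its score compares with the random probes, which is exactly the variability that the threshold-based variant of the stopping rule is meant to absorb.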

22 Revealing examples
Can presumably redundant variables help each other?
- Noise reduction and consequently better class separation may be obtained by adding variables that are presumably redundant.
How does correlation impact variable redundancy?
- Perfectly correlated variables are truly redundant in the sense that no additional information is gained by adding them.
- Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity.
Can a variable that is useless by itself be useful with others?
- Two variables that are useless by themselves can be useful together.
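The last point is illustrated by the classic XOR construction, sketched here with a plug-in mutual information estimate. The function and variable names are assumptions; the label is the XOR of two binary features.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X, Y) in bits from paired discrete samples."""
    m = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / m) * log2((c / m) / ((px[x] / m) * (py[y] / m)))
        for (x, y), c in joint.items()
    )


x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]  # label is the XOR of both features

alone_1 = mutual_information(x1, y)
alone_2 = mutual_information(x2, y)
together = mutual_information(list(zip(x1, x2)), y)
```

Each feature alone shares no information with the label, while the pair determines it completely, so any filter that scores features one at a time would discard both.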

23 References and further reading
References
[1] Isabelle Guyon and Andre Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 2003.

24 The end
See you tomorrow! Next topic: Text Mining