Bagged Ensembles of Support Vector Machines for Gene Expression Data Analysis
Giorgio Valentini
DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Italy
INFM, Istituto Nazionale di Fisica della Materia

Marco Muselli
IEIIT, Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, Italy
muselli@ice.ge.cnr.it

Francesca Ruffino
DIMA, Dipartimento di Matematica, Università di Genova, Italy

Abstract — Extracting information from gene expression data is a difficult task, as these data are characterized by very high dimensionality, small sample sizes and a large degree of biological variability. A possible way of dealing with the curse of dimensionality is offered by feature selection algorithms, while the variance problems arising from small samples and biological variability can be addressed through ensemble methods based on resampling techniques. These two approaches have been combined to improve the accuracy of Support Vector Machines (SVM) in the classification of malignant tissues from DNA microarray data. Proper measures have been introduced to assess the accuracy and the confidence of the predictions. The results show that bagged ensembles of SVM are more reliable and achieve equal or better classification accuracy with respect to single SVM, and that feature selection methods can further enhance classification accuracy.

I. INTRODUCTION

DNA microarray technology provides fundamental insights into the mRNA levels of large sets of genes, offering an approximate picture of the proteins of a cell at a given time [13]. The large amount of gene expression data produced requires statistical and machine learning methods to analyze and extract significant knowledge from DNA microarray experiments.
Typical problems arising from this analysis range from the prediction of malignancies [15], [17] (a classification problem from a machine learning point of view), to the functional discovery of new classes or subclasses of diseases [1] (an unsupervised learning problem), to the identification of groups of genes responsible for or correlated with malignancies or polygenic diseases [11] (a feature selection problem). Several supervised methods have been applied to the analysis of cDNA microarrays and high density oligonucleotide chips. These methods include decision trees, Fisher linear discriminant, Multi-Layer Perceptrons (MLP), Nearest-Neighbour classifiers, linear discriminant analysis, Parzen windows and others [5], [8], [10], [12], [14]. In particular, Support Vector Machines (SVM) have recently been applied to the analysis of DNA microarray gene expression data in order to classify functional groups of genes, normal and malignant tissues, and multiple tumor types [5], [9], [17]. Other works pointed out the importance of feature selection methods to reduce the high dimensionality of the input space and to select the most relevant genes associated with specific functional classes [11]. Furthermore, ensembles of learning machines are well-suited for gene expression data analysis, as they can reduce both the variance due to the low cardinality of the available training sets and the bias due to specific characteristics of the learning algorithm [7]. Indeed, in recent works, combinations of binary classifiers (one-versus-all and all-pairs) and Error Correcting Output Coding (ECOC) ensembles of MLP, as well as ensemble methods based on resampling techniques, such as bagging and boosting, have been applied to the analysis of DNA microarray data [8], [15], [17]. In this work we show that the combination of feature selection methods and bagged ensembles of SVM can enhance the accuracy and the reliability of predictions based on gene expression data.
In the next section the standard technique for training SVM with soft margin is presented, together with a description of the considered feature selection method. Then, the procedure for bagging SVM is introduced, examining different possible choices for the combination of classifiers. Finally, proper measures are employed to evaluate the performance of the proposed approach on two data sets available on-line, concerning tumor detection based on gene expression data produced by DNA microarrays.

II. SVM TRAINING AND FEATURE SELECTION

We can represent the output of a single experiment with a DNA microarray as a pair (x, y), where x ∈ R^d is a vector containing the expression levels of d selected genes and y ∈ {−1, +1} is a binary variable determining the classification of the considered cell. As an example, y = +1 can be used to denote a tumoral cell and y = −1 a normal cell. Every cell is thus associated with an input vector x containing its gene expression levels. When n different experiments are performed, we obtain a collection of n pairs T = {(x_j, y_j) : j = 1, ..., n} (the training set); suppose, without loss of generality, that the first n+ pairs have y_j = +1, whereas the remaining n− = n − n+ possess a negative output y_j = −1. The target of a machine learning method is to construct from the pairs {(x_j, y_j)} a classifier, i.e. a decision function h : R^d → {−1, +1} that gives the correct classification y = h(x) for every cell (determined by x). To achieve this target, many available techniques generate a discriminant function f : R^d → R from the sample T at hand
and build h by employing the formula

h(x) = sign(f(x))    (1)

where the function sign(z) gives as output +1 if z ≥ 0 and −1 otherwise. Among these techniques, SVM [6] turn out to be a promising approach, due to their theoretical motivations and their practical efficiency. They employ the following expression for the discriminant function

f(x) = b + Σ_{j=1}^{n} α_j y_j K(x_j, x)    (2)

where the scalars α_j are obtained, in the soft margin version, through the solution of the following quadratic programming problem: minimize the cost function

W(α) = (1/2) Σ_{j=1}^{n} Σ_{k=1}^{n} α_j α_k y_j y_k K(x_j, x_k) − Σ_{j=1}^{n} α_j

subject to the constraints

Σ_{j=1}^{n} α_j y_j = 0,    0 ≤ α_j ≤ C for j = 1, ..., n

where C is a regularization parameter. The symmetric function K(·,·) must be chosen among the kernels of Reproducing Kernel Hilbert Spaces [16]; three possible choices are:

Linear kernel: K(u, v) = u · v
Polynomial kernel: K(u, v) = (u · v + 1)^γ
Gaussian kernel: K(u, v) = exp(−‖u − v‖² / σ²)

Since the point α of minimum of the quadratic programming problem can have several null components α_j = 0, the sum in Eq. 2 receives the contribution of only a subset V of the patterns x_j in T, called support vectors. The bias b in the SVM classifier is usually set to

b = (1/|V|) Σ_{x_k ∈ V} ( y_k − Σ_{x_j ∈ V} α_j y_j K(x_j, x_k) )

where |V| denotes the number of elements of the set V. The accuracy of a classifier is affected by the dimension d of the input vector: roughly, the greater d, the lower the probability of correctly classifying a pattern x. For this reason, feature selection methods are employed to choose a subset of relevant inputs (genes) for the problem at hand, so as to reduce the number of components x_i.
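As an illustration only (not code from the paper), the three kernels and the discriminant function of Eq. 2 might be sketched in NumPy as follows; the coefficients α_j and the bias b are assumed to be already available, e.g. from a quadratic programming solver, and all function names are our own.

```python
import numpy as np

# The three kernels of Sec. II; gamma and sigma are the kernel hyper-parameters.
def linear_kernel(u, v):
    return np.dot(u, v)

def polynomial_kernel(u, v, gamma=2):
    return (np.dot(u, v) + 1.0) ** gamma

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / sigma ** 2)

def discriminant(x, support_x, support_y, alpha, b, kernel):
    """f(x) = b + sum_j alpha_j y_j K(x_j, x)  (Eq. 2),
    summed over the support vectors only (the patterns with alpha_j > 0)."""
    return b + sum(a * y * kernel(xj, x)
                   for a, y, xj in zip(alpha, support_y, support_x))

def classify(x, support_x, support_y, alpha, b, kernel):
    """h(x) = sign(f(x))  (Eq. 1), with sign(0) taken as +1."""
    return 1 if discriminant(x, support_x, support_y, alpha, b, kernel) >= 0 else -1
```

The kernel is passed as a plain function, so the same discriminant code serves the linear, polynomial and Gaussian variants.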
A simple feature selection method, originally proposed in [10], associates with every gene expression level x_i a quantity c_i given by

c_i = (μ+_i − μ−_i) / (σ+_i + σ−_i)

where μ+_i and μ−_i are the mean values of x_i across all the input patterns in T with positive and negative output, respectively:

μ+_i = (1/n+) Σ_{j=1}^{n+} x_ji,    μ−_i = (1/n−) Σ_{j=n+ +1}^{n} x_ji    (3)

having denoted with x_ji the ith component of the input vector x_j. Similarly, σ+_i and σ−_i are the standard deviations of x_i computed on the sets of pairs with positive and negative output, respectively. Then, the genes are ranked according to their c_i value, and the first m and the last m genes are selected, thus obtaining a set of 2m inputs. The main problem of this approach is the underlying assumption that the expression patterns of the genes are independent: indeed, it fails to detect the role of coordinately expressed genes in carcinogenic processes. Eq. 3 can also be used to compute the weights for weighted gene voting [10], a minor variant of diagonal linear discriminant analysis [8].

III. BAGGED ENSEMBLES OF SVM

The low cardinality of the available data and the large degree of biological variability in gene expression suggest applying variance-reduction methods, such as bagging, to these tasks. Denote with {T_b}, b = 1, ..., B, a set of B bootstrapped samples, whose elements are drawn with replacement from the training set T according to a uniform probability distribution. Let f_b be the discriminant function obtained by applying the soft-margin SVM learning algorithm to the bootstrapped sample T_b. The corresponding decision function h_b is computed as usual through Eq. 1. The generalization ability of the classifiers h_b (base learners) can be improved by aggregating them through the standard formula (for two-class classification problems) [3]:

h_st(x) = sign( Σ_{b=1}^{B} h_b(x) )    (4)

In this way the decision function h_st(x) of the bagged ensemble selects the most voted class among the B classifiers h_b.
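Returning to the gene ranking of Sec. II, a minimal sketch (our own illustration, assuming a samples-by-genes NumPy matrix X and labels y in {−1, +1}) could be:

```python
import numpy as np

def golub_scores(X, y):
    """Signal-to-noise score c_i = (mu+_i - mu-_i) / (sigma+_i + sigma-_i)
    for every gene, i.e. for every column of X."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

def select_genes(X, y, m):
    """Indices of the m top-ranked and m bottom-ranked genes (2m inputs in all)."""
    order = np.argsort(golub_scores(X, y))        # ascending c_i
    return np.concatenate([order[-m:], order[:m]])
```

Genes with large positive c_i are over-expressed in the positive class and genes with large negative c_i in the negative class; keeping m of each yields the 2m-input representation used in the experiments.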
Other choices of the discriminant function for the bagged ensemble are possible, some of which lead to the above standard decision function h_st(x) through Eq. 1. The following three expressions also allow evaluating the quality of the classification offered by the bagged ensemble:

f_avg(x) = (1/B) Σ_{b=1}^{B} f_b(x)

f_win(x) = (1/|B̃|) Σ_{b ∈ B̃} f_b(x)

f_max(x) = h_st(x) · max_{b ∈ B̃} |f_b(x)|

where the set B̃ = {b : h_b(x) = h_st(x)} contains the indices b of the base learners that vote for the class h_st(x). Note that f_avg(x) is the average of all the f_b(x), whereas f_win(x) and f_max(x) are, respectively, the average of the discriminant functions of the classifiers having indices in B̃ and the signed maximum of their absolute values.
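A possible NumPy rendering of these combination rules (an illustrative sketch under our own naming, taking as input the vector of base-learner outputs f_b(x) for one pattern x):

```python
import numpy as np

def combine(F):
    """Given F = (f_1(x), ..., f_B(x)), return the majority-vote class h_st(x)
    of Eq. 4 together with f_avg(x), f_win(x) and f_max(x) of Sec. III."""
    H = np.where(F >= 0, 1, -1)           # base decisions h_b(x) = sign(f_b(x))
    h_st = 1 if H.sum() >= 0 else -1      # most voted class (ties broken as +1)
    winners = F[H == h_st]                # learners whose index lies in the set B~
    f_avg = F.mean()                      # average of all discriminant values
    f_win = winners.mean()                # average over the winning learners only
    f_max = h_st * np.abs(winners).max()  # signed maximum absolute value
    return h_st, f_avg, f_win, f_max
```

Note that sign(f_win(x)) and sign(f_max(x)) reproduce h_st(x) by construction, since both are computed from learners that voted for the winning class.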
Fig. 1. Results obtained with single SVMs for different numbers of selected genes. Colon data set: (a) Success and acceptance rate (b) Extremal and median margin. Leukemia data set: (c) Success and acceptance rate (d) Extremal and median margin.

The corresponding decision functions are given by

h_avg(x) = sign(f_avg(x))
h_win(x) = sign(f_win(x)) = h_st(x)
h_max(x) = sign(f_max(x)) = h_st(x)

While h_win(x) and h_max(x) are equivalent to the standard choice h_st(x), h_avg(x) selects the class associated with the average of the discriminant functions computed by the base learners. Thus, the decision of each classifier in the ensemble is weighted by its prediction strength, measured by the value of the discriminant function f_b; on the contrary, in the decision function h_st(x) each base learner receives the same weight.

IV. ASSESSMENT OF CLASSIFIER QUALITY

Besides the success rate

Succ = (1/(2n)) Σ_{j=1}^{n} |y_j + h(x_j)|

which estimates the probability of correct classification, several alternative measures can be used to assess the quality of classifiers producing a discriminant function f(x). These measures can then be directly applied to evaluate the confidence of the classification performed by single SVM and bagged ensembles of SVM. Generalizing a definition introduced in [10], [11], a first choice is the extremal margin M_ext, defined as

M_ext = (θ+ − θ−) / ( max_{1≤j≤n} f(x_j) − min_{1≤j≤n} f(x_j) )    (5)
Fig. 2. Comparison of results obtained with single and bagged SVM on the Leukemia data set, when varying the number of selected genes: (a) Success rate (b) Acceptance rate (c) Extremal margin (d) Median margin.

where the quantities θ+ and θ− are given by

θ+ = min_{1≤j≤n+} f(x_j),    θ− = max_{n+ +1≤j≤n} f(x_j)

It can be easily seen that the larger the value of M_ext, the more confident is the classifier; note that if there are no classification errors M_ext is positive. An alternative measure, less sensitive to outliers, is the median margin M_med, defined as

M_med = (λ+ − λ−) / ( max_{1≤j≤n} f(x_j) − min_{1≤j≤n} f(x_j) )

where λ+ and λ− are the median values of f(x) for the positive and negative class, respectively:

λ+ = min{λ ∈ R : |J+_λ| ≤ n+/2},    λ− = max{λ ∈ R : |J−_λ| ≤ n−/2}    (6)

The sets J+_λ (resp. J−_λ) contain the indices j of the input patterns x_j in the training set for which the discriminant function f(x_j) is greater (resp. lower) than λ:

J+_λ = {j : f(x_j) > λ},    J−_λ = {j : f(x_j) < λ}

Finally, the acceptance rate Acc measures the fraction of samples that are correctly classified with high confidence. It is defined by the expression

Acc = (|J+_θ| + |J−_{−θ}|) / n    (7)

where θ = max{−θ+, θ−} defines the smallest symmetric rejection zone yielding zero error. It is important to remark that the acceptance rate is highly sensitive to the presence of outliers.
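As an illustrative sketch (our own code, not the authors'), the four measures can be computed from the vector of discriminant values on a labelled sample; here λ+ and λ− are taken directly as the class-wise medians of f, a simplification of Eq. 6, and the rejection threshold θ is clamped at 0 when the classes are already separated (an assumption on our part):

```python
import numpy as np

def assess(f_vals, y):
    """Succ, M_ext, M_med and Acc of Sec. IV, given the discriminant values
    f_vals on n patterns with labels y in {-1, +1}."""
    h = np.where(f_vals >= 0, 1, -1)
    succ = np.abs(y + h).sum() / (2.0 * len(y))   # success rate
    spread = f_vals.max() - f_vals.min()
    theta_p = f_vals[y == 1].min()                # theta+
    theta_m = f_vals[y == -1].max()               # theta-
    m_ext = (theta_p - theta_m) / spread          # extremal margin (Eq. 5)
    lam_p = np.median(f_vals[y == 1])             # lambda+ (class-wise median)
    lam_m = np.median(f_vals[y == -1])            # lambda-
    m_med = (lam_p - lam_m) / spread              # median margin
    theta = max(-theta_p, theta_m, 0.0)           # smallest zero-error rejection zone
    acc = ((f_vals > theta).sum() + (f_vals < -theta).sum()) / float(len(y))
    return succ, m_ext, m_med, acc
```

On a separable sample all four quantities are high: Succ and Acc equal 1, and both margins are positive.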
Fig. 3. Comparison of results obtained with single and bagged SVM on the Colon data set, when varying the number of selected genes: (a) Success rate (b) Acceptance rate (c) Extremal margin (d) Median margin.

V. NUMERICAL EXPERIMENTS

Here we present the results of the classification of DNA microarray data using the proposed techniques. We applied linear SVM classifiers to separate normal and malignant tissues, with and without feature selection. We then compare the results obtained with single and bagged SVM, using in all cases the filter method for feature selection described in Sec. II.

A. Data sets

The proposed approach has been tested on DNA microarray data available on-line. In particular, we used the Colon cancer data set [2], constituted by 62 samples including 22 normal and 40 colon cancer tissues. The data matrix contains expression values of 2000 genes and has been preprocessed by taking the logarithm of all values and by normalizing the feature (gene) vectors. This has been performed by subtracting the mean over all training values, dividing by the corresponding standard deviation and finally passing the result through a squashing arctan function to diminish the importance of outliers. The whole data set has been randomly split into a training and a test set of equal size, each with the same proportion of normal and malignant examples. We also compared the different classifiers on the Leukemia data set [10]. It comprises two variants of leukemia, ALL and AML, for a total of 72 examples split into a training set of 38 samples and a test set of 34 samples, with 7129 different genes.
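The Colon preprocessing described above can be sketched as follows (an illustration under our assumptions: X is a samples-by-genes matrix of raw positive expression values, and the standardisation is applied per gene):

```python
import numpy as np

def preprocess(X):
    """Log-transform, standardise each gene over the training values,
    then squash with arctan to diminish the importance of outliers."""
    X = np.log(X)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.arctan(X)
```

A stratified half/half split of the 62 samples, drawing the same fraction from each class, would then preserve the normal/malignant proportions in both the training and the test set.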
B. Results

Fig. 1 summarizes the results obtained with single SVMs, varying the number of genes selected with the filter method described in Sec. II and using the measures for classifier assessment introduced in Sec. IV. With the Colon data set, the accuracy does not change significantly when the feature selection method is applied; however, the prediction is more reliable, as attested by the higher values of Acc and M_med (Fig. 1a and 1b), when the number of inputs lies beyond 256. On the contrary, we obtain the highest success rate on the Leukemia data set with only 16 selected genes; the corresponding acceptance rate is also significantly high (Fig. 1c). The extremal margin is negative but very close to 0, showing that the Leukemia data set is nearly linearly separable, with a relatively high confidence (Fig. 1d). Figs. 2 and 3 compare the results obtained with bagged ensembles of SVM (for the different choices of the decision function) with those achieved by single SVMs. On the Leukemia data set, bagging does not seem to improve the success rate, even if the predictions are more reliable, especially when a small number of selected genes is used (Fig. 2). On the contrary, bagging significantly improves the success rate scored on the Colon data set, both with and without feature selection (Fig. 3a). Considering the acceptance rate, there is no significant difference between single SVMs and bagged SVM employing f_avg or f_win, whereas bagged SVM adopting f_max achieve the highest values of Acc if the number of genes is less than or equal to 512; for higher values the opposite situation occurs (Fig. 3b). While bagged SVM (especially when f_max is used) show better values of the extremal margin with respect to single SVM when small numbers of genes are selected, we observe the opposite behavior when the number of considered genes is relatively large (Fig. 3c). Finally, bagged ensembles show clearly larger median margins with respect to single SVMs, confirming an overall higher reliability (Fig. 3d).
Summarizing, bagged ensembles seem to be more accurate and confident in their predictions than single SVMs. The simple gene selection method adopted is effective with the Leukemia data set, both when single and bagged SVM are used, while the accuracy on the Colon data set seems to be independent of the application of feature selection. The results obtained with single SVMs are comparable to those presented in [11]; however, the application of the recursive feature elimination method allows better results to be achieved than those obtained with bagged ensembles of SVM, at least on the Leukemia data set. Nonetheless, it is difficult to establish whether a statistically significant difference between the two approaches exists, given the small size of the available samples.

VI. CONCLUSIONS

The results show that bagged ensembles of SVM are more reliable than single SVMs in classifying DNA microarray data. Moreover, they obtain equivalent or better accuracy in separating normal from malignant tissues, at least on the Colon and Leukemia data sets. In fact, bagging is a variance-reduction method able to improve the stability of classifiers [4], especially when the training set at hand has small size and large dimensionality, as in the present case. Despite its simplicity, the feature selection method used in our experiments allows better values of the success rate to be achieved. However, it does not take into account the interactions between the expression levels of different genes. In order to capture this effect, we plan to employ more refined gene selection methods [11], in combination with bagging, to further improve the accuracy and the reliability of predictions based on DNA microarray data.

ACKNOWLEDGMENT

This work was partially funded by INFM, unità di Genova.

REFERENCES

[1] A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503-511, 2000.
[2] U. Alon et al.
Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS, 96:6745-6750, 1999.
[3] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[4] L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801-849, 1998.
[5] M. Brown et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. PNAS, 97(1):262-267, 2000.
[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[7] T.G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 1-15. Springer-Verlag, 2000.
[8] S. Dudoit, J. Fridlyand, and T. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. JASA, 97(457):77-87, 2002.
[9] T.S. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906-914, 2000.
[10] T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.
[11] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1/3):389-422, 2002.
[12] J. Khan et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6):673-679, 2001.
[13] D.J. Lockhart and E.A. Winzeler. Genomics, gene expression and DNA arrays. Nature, 405:827-836, 2000.
[14] P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Gene functional classification from heterogeneous data. In Fifth International Conference on Computational Molecular Biology, 2001.
[15] G. Valentini.
Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artificial Intelligence in Medicine, 26(3):283-306, 2002.
[16] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, USA, 1990.
[17] C. Yeang et al. Molecular classification of multiple tumor types. In ISMB 2001, Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology, Copenhagen, Denmark, 2001. Oxford University Press.
More informationAn Introduction to the Use of Bayesian Network to Analyze Gene Expression Data
n Introduction to the Use of ayesian Network to nalyze Gene Expression Data Cristina Manfredotti Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co. Università degli Studi Milano-icocca
More informationEnsemble Learning of Colorectal Cancer Survival Rates
Ensemble Learning of Colorectal Cancer Survival Rates Chris Roadknight School of Computing Science University of Nottingham Malaysia Campus Malaysia Chris.roadknight@nottingham.edu.my Uwe Aickelin School
More informationAUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
More informationCase Study Report: Building and analyzing SVM ensembles with Bagging and AdaBoost on big data sets
Case Study Report: Building and analyzing SVM ensembles with Bagging and AdaBoost on big data sets Ricardo Ramos Guerra Jörg Stork Master in Automation and IT Faculty of Computer Science and Engineering
More informationLesson19: Comparing Predictive Accuracy of two Forecasts: Th. Diebold-Mariano Test
Lesson19: Comparing Predictive Accuracy of two Forecasts: The Diebold-Mariano Test Dipartimento di Ingegneria e Scienze dell Informazione e Matematica Università dell Aquila, umberto.triacca@univaq.it
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS
ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan
More informationA Hybrid Data Mining Technique for Improving the Classification Accuracy of Microarray Data Set
I.J. Information Engineering and Electronic Business, 2012, 2, 43-50 Published Online April 2012 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijieeb.2012.02.07 A Hybrid Data Mining Technique for Improving
More informationRobust Feature Selection Using Ensemble Feature Selection Techniques
Robust Feature Selection Using Ensemble Feature Selection Techniques Yvan Saeys, Thomas Abeel, and Yves Van de Peer Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium and
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More informationSupervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationBetter credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
More informationEnvironmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
More informationMachine Learning CS 6830. Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu
Machine Learning CS 6830 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu What is Learning? Merriam-Webster: learn = to acquire knowledge, understanding, or skill
More informationBeating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationMachine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
More informationMaking Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research
More informationEarly defect identification of semiconductor processes using machine learning
STANFORD UNIVERISTY MACHINE LEARNING CS229 Early defect identification of semiconductor processes using machine learning Friday, December 16, 2011 Authors: Saul ROSA Anton VLADIMIROV Professor: Dr. Andrew
More informationData Mining: A Preprocessing Engine
Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,
More informationDATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
More informationLecture 2: The SVM classifier
Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationAdvanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
More informationModel Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
More informationBIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
More informationGeneralizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel
Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision
More informationA Learning Algorithm For Neural Network Ensembles
A Learning Algorithm For Neural Network Ensembles H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto Instituto de Física Rosario (CONICET-UNR) Blvd. 27 de Febrero 210 Bis, 2000 Rosario. República
More informationSimple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,
More informationData Mining Techniques for Prognosis in Pancreatic Cancer
Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree
More informationNew Ensemble Combination Scheme
New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,
More informationFeature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
More informationClassification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
More informationChapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
More informationENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
More informationSUPPORT VECTOR MACHINE (SVM) is the optimal
130 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY 2008 Multiclass Posterior Probability Support Vector Machines Mehmet Gönen, Ayşe Gönül Tanuğur, and Ethem Alpaydın, Senior Member, IEEE
More informationEnsemble Approach for the Classification of Imbalanced Data
Ensemble Approach for the Classification of Imbalanced Data Vladimir Nikulin 1, Geoffrey J. McLachlan 1, and Shu Kay Ng 2 1 Department of Mathematics, University of Queensland v.nikulin@uq.edu.au, gjm@maths.uq.edu.au
More informationMolecular Genetics: Challenges for Statistical Practice. J.K. Lindsey
Molecular Genetics: Challenges for Statistical Practice J.K. Lindsey 1. What is a Microarray? 2. Design Questions 3. Modelling Questions 4. Longitudinal Data 5. Conclusions 1. What is a microarray? A microarray
More informationStatistical issues in the analysis of microarray data
Statistical issues in the analysis of microarray data Daniel Gerhard Institute of Biostatistics Leibniz University of Hannover ESNATS Summerschool, Zermatt D. Gerhard (LUH) Analysis of microarray data
More informationKnowledge Discovery and Data Mining. Structured vs. Non-Structured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.
More informationLeast Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
More informationAcknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues
Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationL25: Ensemble learning
L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna
More informationSub-class Error-Correcting Output Codes
Sub-class Error-Correcting Output Codes Sergio Escalera, Oriol Pujol and Petia Radeva Computer Vision Center, Campus UAB, Edifici O, 08193, Bellaterra, Spain. Dept. Matemàtica Aplicada i Anàlisi, Universitat
More informationSupport Vector Machine (SVM)
Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationA Simple Introduction to Support Vector Machines
A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear
More informationA Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
More informationNon-Parametric Tests (I)
Lecture 5: Non-Parametric Tests (I) KimHuat LIM lim@stats.ox.ac.uk http://www.stats.ox.ac.uk/~lim/teaching.html Slide 1 5.1 Outline (i) Overview of Distribution-Free Tests (ii) Median Test for Two Independent
More informationApplied Multivariate Analysis - Big data analytics
Applied Multivariate Analysis - Big data analytics Nathalie Villa-Vialaneix nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of
More informationChapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
More information