Bagged Ensembles of Support Vector Machines for Gene Expression Data Analysis
Giorgio Valentini
DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Italy
INFM, Istituto Nazionale di Fisica della Materia

Marco Muselli
IEIIT, Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, Italy
muselli@ice.ge.cnr.it

Francesca Ruffino
DIMA, Dipartimento di Matematica, Università di Genova, Italy

Abstract — Extracting information from gene expression data is a difficult task, as these data are characterized by very high dimensionality, small sample sizes and a large degree of biological variability. A possible way of dealing with the curse of dimensionality is offered by feature selection algorithms, while the variance problems arising from small samples and biological variability can be addressed through ensemble methods based on resampling techniques. These two approaches have been combined to improve the accuracy of Support Vector Machines (SVM) in the classification of malignant tissues from DNA microarray data. Proper measures have been introduced to assess the accuracy and the confidence of the predictions. The results show that bagged ensembles of SVM are more reliable and achieve equal or better classification accuracy with respect to single SVM, and that feature selection methods can further enhance classification accuracy.

I. INTRODUCTION

DNA microarray technology provides fundamental insights into the mRNA levels of large sets of genes, offering an approximate picture of the proteins of a cell at a given time [13]. The large amount of gene expression data produced requires statistical and machine learning methods to analyze and extract significant knowledge from DNA microarray experiments.
Typical problems arising from this analysis range from the prediction of malignancies [15], [17] (a classification problem from a machine learning point of view), to the functional discovery of new classes or subclasses of diseases [1] (an unsupervised learning problem), to the identification of groups of genes responsible for or correlated with malignancies or polygenic diseases [11] (a feature selection problem). Several supervised methods have been applied to the analysis of cDNA microarrays and high density oligonucleotide chips. These methods include decision trees, Fisher linear discriminant, Multi-Layer Perceptrons (MLP), Nearest-Neighbour classifiers, linear discriminant analysis, Parzen windows and others [5], [8], [10], [12], [14]. In particular, Support Vector Machines (SVM) have recently been applied to the analysis of DNA microarray gene expression data in order to classify functional groups of genes, normal and malignant tissues, and multiple tumor types [5], [9], [17]. Other works pointed out the importance of feature selection methods to reduce the high dimensionality of the input space and to select the most relevant genes associated with specific functional classes [11]. Furthermore, ensembles of learning machines are well-suited for gene expression data analysis, as they can reduce both the variance due to the low cardinality of the available training sets and the bias due to specific characteristics of the learning algorithm [7]. Indeed, in recent works, combinations of binary classifiers (one-versus-all and all-pairs) and Error Correcting Output Coding (ECOC) ensembles of MLP, as well as ensemble methods based on resampling techniques, such as bagging and boosting, have been applied to the analysis of DNA microarray data [8], [15], [17]. In this work we show that the combination of feature selection methods and bagged ensembles of SVM can enhance the accuracy and the reliability of predictions based on gene expression data.
In the next section the standard technique for training SVM with soft margin is presented, together with a description of the considered feature selection method. Then, the procedure for bagging SVM is introduced, examining different possible choices for the combination of classifiers. Finally, proper measures are employed to evaluate the performance of the proposed approach on two data sets available on-line, concerning tumor detection based on gene expression data produced by DNA microarrays.

II. SVM TRAINING AND FEATURE SELECTION

We can represent the output of a single experiment with a DNA microarray as a pair (x, y), where x ∈ R^d is a vector containing the expression levels of d selected genes and y ∈ {−1, +1} is a binary variable determining the classification of the considered cell. As an example, y = +1 can be used to denote a tumoral cell and y = −1 a normal cell. Every cell is thus associated with an input vector x containing its gene expression levels. When n different experiments are performed, we obtain a collection of n pairs T = {(x_j, y_j) : j = 1, ..., n} (the training set); suppose, without loss of generality, that the first n+ pairs have y_j = +1, whereas the remaining n− = n − n+ possess a negative output y_j = −1. The target of a machine learning method is to construct from the pairs {(x_j, y_j)} a classifier, i.e. a decision function h : R^d → {−1, +1} that gives the correct classification y = h(x) for every cell (determined by x). To achieve this target, many available techniques generate a discriminant function f : R^d → R from the sample T at hand
and build h by employing the formula

h(x) = sign(f(x))    (1)

where the function sign(z) gives as output +1 if z ≥ 0 and −1 otherwise. Among these techniques, SVM [6] turn out to be a promising approach, due to their theoretical motivations and their practical efficiency. They employ the following expression for the discriminant function

f(x) = b + Σ_{j=1}^{n} α_j y_j K(x_j, x)    (2)

where the scalars α_j are obtained, in the soft margin version, through the solution of the following quadratic programming problem: minimize the cost function

W(α) = (1/2) Σ_{j=1}^{n} Σ_{k=1}^{n} α_j α_k y_j y_k K(x_j, x_k) − Σ_{j=1}^{n} α_j

subject to the constraints

Σ_{j=1}^{n} α_j y_j = 0,    0 ≤ α_j ≤ C for j = 1, ..., n

where C is a regularization parameter. The symmetric function K(·,·) must be chosen among the kernels of Reproducing Kernel Hilbert Spaces [16]; three possible choices are:

Linear kernel: K(u, v) = u · v
Polynomial kernel: K(u, v) = (u · v + 1)^γ
Gaussian kernel: K(u, v) = exp(−‖u − v‖² / σ²)

Since the point α of minimum of the quadratic programming problem can have several null components α_j = 0, the sum in Eq. 2 receives the contribution of only a subset V of the patterns x_j in T, called support vectors. The bias b in the SVM classifier is usually set to

b = (1/|V|) Σ_{x_k ∈ V} ( y_k − Σ_{x_j ∈ V} α_j y_j K(x_j, x_k) )

where |V| denotes the number of elements of the set V. The accuracy of a classifier is affected by the dimension d of the input vector: roughly, the greater d, the lower the probability of correctly classifying a pattern x. For this reason, feature selection methods are employed to choose a subset of relevant inputs (genes) for the problem at hand, so as to reduce the number of components x_i.
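As an illustration only (not code from the paper), the three kernels and the discriminant function of Eq. 2 might be sketched in NumPy as follows; the coefficients α_j and the bias b are assumed to be already available, e.g. from a quadratic programming solver, and all function names are our own.

```python
import numpy as np

# The three kernels of Sec. II; gamma and sigma are the kernel hyper-parameters.
def linear_kernel(u, v):
    return np.dot(u, v)

def polynomial_kernel(u, v, gamma=2):
    return (np.dot(u, v) + 1.0) ** gamma

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / sigma ** 2)

def discriminant(x, support_x, support_y, alpha, b, kernel):
    """f(x) = b + sum_j alpha_j y_j K(x_j, x)  (Eq. 2),
    summed over the support vectors only (the patterns with alpha_j > 0)."""
    return b + sum(a * y * kernel(xj, x)
                   for a, y, xj in zip(alpha, support_y, support_x))

def classify(x, support_x, support_y, alpha, b, kernel):
    """h(x) = sign(f(x))  (Eq. 1), with sign(0) taken as +1."""
    return 1 if discriminant(x, support_x, support_y, alpha, b, kernel) >= 0 else -1
```

The kernel is passed as a plain function, so the same discriminant code serves the linear, polynomial and Gaussian variants.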
A simple feature selection method, originally proposed in [10], associates with every gene expression level x_i a quantity c_i given by

c_i = (μ+_i − μ−_i) / (σ+_i + σ−_i)

where μ+_i and μ−_i are the mean values of x_i across all the input patterns in T with positive and negative output, respectively:

μ+_i = (1/n+) Σ_{j=1}^{n+} x_ji,    μ−_i = (1/n−) Σ_{j=n+ +1}^{n} x_ji    (3)

having denoted with x_ji the ith component of the input vector x_j. Similarly, σ+_i and σ−_i are the standard deviations of x_i computed on the sets of pairs with positive and negative output, respectively. Then, the genes are ranked according to their c_i value, and the first m and the last m genes are selected, thus obtaining a set of 2m inputs. The main problem of this approach is the underlying assumption that the expression patterns of the genes are independent: indeed, it fails to detect the role of coordinately expressed genes in carcinogenic processes. Eq. 3 can also be used to compute the weights for weighted gene voting [10], a minor variant of diagonal linear discriminant analysis [8].

III. BAGGED ENSEMBLES OF SVM

The low cardinality of the available data and the large degree of biological variability in gene expression suggest applying variance-reduction methods, such as bagging, to these tasks. Denote with {T_b}, b = 1, ..., B, a set of B bootstrapped samples, whose elements are drawn with replacement from the training set T according to a uniform probability distribution. Let f_b be the discriminant function obtained by applying the soft-margin SVM learning algorithm to the bootstrapped sample T_b. The corresponding decision function h_b is computed as usual through Eq. 1. The generalization ability of the classifiers h_b (base learners) can be improved by aggregating them through the standard formula (for two-class classification problems) [3]:

h_st(x) = sign( Σ_{b=1}^{B} h_b(x) )    (4)

In this way the decision function h_st(x) of the bagged ensemble selects the most voted class among the B classifiers h_b.
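Returning to the gene ranking of Sec. II, a minimal sketch (our own illustration, assuming a samples-by-genes NumPy matrix X and labels y in {−1, +1}) could be:

```python
import numpy as np

def golub_scores(X, y):
    """Signal-to-noise score c_i = (mu+_i - mu-_i) / (sigma+_i + sigma-_i)
    for every gene, i.e. for every column of X."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

def select_genes(X, y, m):
    """Indices of the m top-ranked and m bottom-ranked genes (2m inputs in all)."""
    order = np.argsort(golub_scores(X, y))        # ascending c_i
    return np.concatenate([order[-m:], order[:m]])
```

Genes with large positive c_i are over-expressed in the positive class and genes with large negative c_i in the negative class; keeping m of each yields the 2m-input representation used in the experiments.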
Other choices of the discriminant function for the bagged ensemble are possible, some of which lead to the above standard decision function h_st(x) through Eq. 1. The following three expressions also allow evaluating the quality of the classification offered by the bagged ensemble:

f_avg(x) = (1/B) Σ_{b=1}^{B} f_b(x)

f_win(x) = (1/|B̃|) Σ_{b ∈ B̃} f_b(x)

f_max(x) = h_st(x) · max_{b ∈ B̃} |f_b(x)|

where the set B̃ = {b : h_b(x) = h_st(x)} contains the indices b of the base learners that vote for the class h_st(x). Note that f_avg(x) is the average of all the f_b(x), whereas f_win(x) and f_max(x) are, respectively, the average of the discriminant functions of the classifiers having indices in B̃ and the signed maximum of their absolute values.
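A possible NumPy rendering of these combination rules (an illustrative sketch under our own naming, taking as input the vector of base-learner outputs f_b(x) for one pattern x):

```python
import numpy as np

def combine(F):
    """Given F = (f_1(x), ..., f_B(x)), return the majority-vote class h_st(x)
    of Eq. 4 together with f_avg(x), f_win(x) and f_max(x) of Sec. III."""
    H = np.where(F >= 0, 1, -1)           # base decisions h_b(x) = sign(f_b(x))
    h_st = 1 if H.sum() >= 0 else -1      # most voted class (ties broken as +1)
    winners = F[H == h_st]                # learners whose index lies in the set B~
    f_avg = F.mean()                      # average of all discriminant values
    f_win = winners.mean()                # average over the winning learners only
    f_max = h_st * np.abs(winners).max()  # signed maximum absolute value
    return h_st, f_avg, f_win, f_max
```

Note that sign(f_win(x)) and sign(f_max(x)) reproduce h_st(x) by construction, since both are computed from learners that voted for the winning class.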
Fig. 1. Results obtained with single SVMs for different numbers of selected genes. Colon data set: (a) Success and acceptance rate (b) Extremal and median margin. Leukemia data set: (c) Success and acceptance rate (d) Extremal and median margin.

The corresponding decision functions are given by

h_avg(x) = sign(f_avg(x))
h_win(x) = sign(f_win(x)) = h_st(x)
h_max(x) = sign(f_max(x)) = h_st(x)

While h_win(x) and h_max(x) are equivalent to the standard choice h_st(x), h_avg(x) selects the class associated with the average of the discriminant functions computed by the base learners. Thus, the decision of each classifier in the ensemble is weighted by its prediction strength, measured by the value of the discriminant function f_b; on the contrary, in the decision function h_st(x) each base learner receives the same weight.

IV. ASSESSMENT OF CLASSIFIER QUALITY

Besides the success rate

Succ = (1/(2n)) Σ_{j=1}^{n} |y_j + h(x_j)|

which estimates the probability of correct classification, several alternative measures can be used to assess the quality of classifiers producing a discriminant function f(x). These measures can then be directly applied to evaluate the confidence of the classification performed by single SVM and bagged ensembles of SVM. Generalizing a definition introduced in [10], [11], a first choice is the extremal margin M_ext, defined as

M_ext = (θ+ − θ−) / ( max_{1≤j≤n} f(x_j) − min_{1≤j≤n} f(x_j) )    (5)
Fig. 2. Comparison of results obtained with single and bagged SVM on the Leukemia data set, when varying the number of selected genes: (a) Success rate (b) Acceptance rate (c) Extremal margin (d) Median margin.

where the quantities θ+ and θ− are given by

θ+ = min_{1≤j≤n+} f(x_j),    θ− = max_{n+ +1≤j≤n} f(x_j)

It can be easily seen that the larger the value of M_ext, the more confident is the classifier; note that if there are no classification errors M_ext is positive. An alternative measure, less sensitive to outliers, is the median margin M_med, defined as

M_med = (λ+ − λ−) / ( max_{1≤j≤n} f(x_j) − min_{1≤j≤n} f(x_j) )

where λ+ and λ− are the median values of f(x) for the positive and negative class, respectively:

λ+ = min{λ ∈ R : |J+_λ| ≤ n+/2},    λ− = max{λ ∈ R : |J−_λ| ≤ n−/2}    (6)

The sets J+_λ (resp. J−_λ) contain the indices j of the input patterns x_j in the training set for which the discriminant function f(x_j) is greater (resp. lower) than λ:

J+_λ = {j : f(x_j) > λ},    J−_λ = {j : f(x_j) < λ}

Finally, the acceptance rate Acc measures the fraction of samples that are correctly classified with high confidence. It is defined by the expression

Acc = (|J+_θ| + |J−_{−θ}|) / n    (7)

where θ = max{−θ+, θ−} defines the smallest symmetric rejection zone yielding zero error. It is important to remark that the acceptance rate is highly sensitive to the presence of outliers.
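As an illustrative sketch (our own code, not the authors'), the four measures can be computed from the vector of discriminant values on a labelled sample; here λ+ and λ− are taken directly as the class-wise medians of f, a simplification of Eq. 6, and the rejection threshold θ is clamped at 0 when the classes are already separated (an assumption on our part):

```python
import numpy as np

def assess(f_vals, y):
    """Succ, M_ext, M_med and Acc of Sec. IV, given the discriminant values
    f_vals on n patterns with labels y in {-1, +1}."""
    h = np.where(f_vals >= 0, 1, -1)
    succ = np.abs(y + h).sum() / (2.0 * len(y))   # success rate
    spread = f_vals.max() - f_vals.min()
    theta_p = f_vals[y == 1].min()                # theta+
    theta_m = f_vals[y == -1].max()               # theta-
    m_ext = (theta_p - theta_m) / spread          # extremal margin (Eq. 5)
    lam_p = np.median(f_vals[y == 1])             # lambda+ (class-wise median)
    lam_m = np.median(f_vals[y == -1])            # lambda-
    m_med = (lam_p - lam_m) / spread              # median margin
    theta = max(-theta_p, theta_m, 0.0)           # smallest zero-error rejection zone
    acc = ((f_vals > theta).sum() + (f_vals < -theta).sum()) / float(len(y))
    return succ, m_ext, m_med, acc
```

On a separable sample all four quantities are high: Succ and Acc equal 1, and both margins are positive.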
Fig. 3. Comparison of results obtained with single and bagged SVM on the Colon data set, when varying the number of selected genes: (a) Success rate (b) Acceptance rate (c) Extremal margin (d) Median margin.

V. NUMERICAL EXPERIMENTS

Here we present the results of the classification of DNA microarray data using the proposed techniques. We applied linear SVM classifiers to separate normal and malignant tissues, with and without feature selection. We then compare the results obtained with single and bagged SVM, using in all cases the filter method for feature selection described in Sec. II.

A. Data sets

The proposed approach has been tested on DNA microarray data available on-line. In particular, we used the Colon cancer data set [2], constituted by 62 samples including 22 normal and 40 colon cancer tissues. The data matrix contains expression values of 2000 genes and has been preprocessed by taking the logarithm of all values and by normalizing the feature (gene) vectors. This has been performed by subtracting the mean over all training values, dividing by the corresponding standard deviation and finally passing the result through a squashing arctan function to diminish the importance of outliers. The whole data set has been randomly split into a training and a test set of equal size, each with the same proportion of normal and malignant examples. We also compared the different classifiers on the Leukemia data set [10]. It comprises two variants of leukemia, ALL and AML, for a total of 72 examples split into a training set of 38 samples and a test set of 34 samples, with 7129 different genes.
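The Colon preprocessing described above can be sketched as follows (an illustration under our assumptions: X is a samples-by-genes matrix of raw positive expression values, and the standardisation is applied per gene):

```python
import numpy as np

def preprocess(X):
    """Log-transform, standardise each gene over the training values,
    then squash with arctan to diminish the importance of outliers."""
    X = np.log(X)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.arctan(X)
```

A stratified half/half split of the 62 samples, drawing the same fraction from each class, would then preserve the normal/malignant proportions in both the training and the test set.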
B. Results

Fig. 1 summarizes the results obtained with single SVMs, varying the number of genes selected with the filter method described in Sec. II and using the measures for classifier assessment introduced in Sec. IV. With the Colon data set, the accuracy does not change significantly when the feature selection method is applied; however, the prediction is more reliable, as attested by the higher values of Acc and M_med (Fig. 1a and 1b), when the number of inputs lies beyond 256. On the contrary, we obtain the highest success rate on the Leukemia data set with only 16 selected genes; the corresponding acceptance rate is also significantly high (Fig. 1c). The extremal margin is negative but very close to 0, showing that the Leukemia data set is nearly linearly separable, with a relatively high confidence (Fig. 1d). Figs. 2 and 3 compare the results obtained with bagged ensembles of SVM (for the different choices of the decision function) with those achieved by single SVMs. On the Leukemia data set, bagging does not seem to improve the success rate, even if the predictions are more reliable, especially when a small number of selected genes is used (Fig. 2). On the contrary, bagging significantly improves the success rate scored on the Colon data set, both with and without feature selection (Fig. 3a). Considering the acceptance rate, there is no significant difference between single SVMs and bagged SVM employing f_avg or f_win, whereas bagged SVM adopting f_max achieve the highest values of Acc if the number of genes is less than or equal to 512; for higher values the opposite situation occurs (Fig. 3b). While bagged SVM (especially when f_max is used) show better values of the extremal margin with respect to single SVM when small numbers of genes are selected, we observe the opposite behavior when the number of considered genes is relatively large (Fig. 3c). Finally, bagged ensembles show clearly larger median margins with respect to single SVMs, confirming an overall higher reliability (Fig. 3d).
Summarizing, bagged ensembles seem to be more accurate and confident in their predictions than single SVMs. The simple gene selection method adopted is effective with the Leukemia data set, both when single and bagged SVM are used, while the accuracy on the Colon data set seems to be independent of the application of feature selection. The results obtained with single SVMs are comparable to those presented in [11]; however, the application of the recursive feature elimination method allows better results to be achieved than those obtained with bagged ensembles of SVM, at least on the Leukemia data set. Nonetheless, it is difficult to establish whether a statistically significant difference between the two approaches exists, given the small size of the available samples.

VI. CONCLUSIONS

The results show that bagged ensembles of SVM are more reliable than single SVMs in classifying DNA microarray data. Moreover, they obtain equivalent or better accuracy in separating normal from malignant tissues, at least on the Colon and Leukemia data sets. In fact, bagging is a variance-reduction method able to improve the stability of classifiers [4], especially when the training set at hand has small size and large dimensionality, as in the present case. Despite its simplicity, the feature selection method used in our experiments allows better values of the success rate to be achieved. However, it does not take into account the interactions between the expression levels of different genes. In order to capture this effect, we plan to employ more refined gene selection methods [11], in combination with bagging, to further improve the accuracy and the reliability of predictions based on DNA microarray data.

ACKNOWLEDGMENT

This work was partially funded by INFM, unità di Genova.

REFERENCES

[1] A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503-511, 2000.
[2] U. Alon et al.
Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS, 96:6745-6750, 1999.
[3] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[4] L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801-849, 1998.
[5] M. Brown et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. PNAS, 97(1):262-267, 2000.
[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[7] T.G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 1-15. Springer-Verlag, 2000.
[8] S. Dudoit, J. Fridlyand, and T. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. JASA, 97(457):77-87, 2002.
[9] T.S. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906-914, 2000.
[10] T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.
[11] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1/3):389-422, 2002.
[12] J. Khan et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6):673-679, 2001.
[13] D.J. Lockhart and E.A. Winzeler. Genomics, gene expression and DNA arrays. Nature, 405:827-836, 2000.
[14] P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Gene functional classification from heterogeneous data. In Fifth International Conference on Computational Molecular Biology, 2001.
[15] G. Valentini.
Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artificial Intelligence in Medicine, 26(3):283-306, 2002.
[16] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, USA, 1990.
[17] C. Yeang et al. Molecular classification of multiple tumor types. In ISMB 2001, Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology, Copenhagen, Denmark, 2001. Oxford University Press.
More informationAn Introduction to the Use of Bayesian Network to Analyze Gene Expression Data
n Introduction to the Use of ayesian Network to nalyze Gene Expression Data Cristina Manfredotti Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co. Università degli Studi Milano-icocca
More informationEnsemble Learning of Colorectal Cancer Survival Rates
Ensemble Learning of Colorectal Cancer Survival Rates Chris Roadknight School of Computing Science University of Nottingham Malaysia Campus Malaysia Chris.roadknight@nottingham.edu.my Uwe Aickelin School
More informationAUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
More informationCase Study Report: Building and analyzing SVM ensembles with Bagging and AdaBoost on big data sets
Case Study Report: Building and analyzing SVM ensembles with Bagging and AdaBoost on big data sets Ricardo Ramos Guerra Jörg Stork Master in Automation and IT Faculty of Computer Science and Engineering
More informationLesson19: Comparing Predictive Accuracy of two Forecasts: Th. Diebold-Mariano Test
Lesson19: Comparing Predictive Accuracy of two Forecasts: The Diebold-Mariano Test Dipartimento di Ingegneria e Scienze dell Informazione e Matematica Università dell Aquila, umberto.triacca@univaq.it
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS
ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan
More informationA Hybrid Data Mining Technique for Improving the Classification Accuracy of Microarray Data Set
I.J. Information Engineering and Electronic Business, 2012, 2, 43-50 Published Online April 2012 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijieeb.2012.02.07 A Hybrid Data Mining Technique for Improving
More informationRobust Feature Selection Using Ensemble Feature Selection Techniques
Robust Feature Selection Using Ensemble Feature Selection Techniques Yvan Saeys, Thomas Abeel, and Yves Van de Peer Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium and
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More informationSupervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationBetter credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
More informationEnvironmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
More informationMachine Learning CS 6830. Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu
Machine Learning CS 6830 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu What is Learning? Merriam-Webster: learn = to acquire knowledge, understanding, or skill
More informationBeating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationMachine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
More informationMaking Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research
More informationEarly defect identification of semiconductor processes using machine learning
STANFORD UNIVERISTY MACHINE LEARNING CS229 Early defect identification of semiconductor processes using machine learning Friday, December 16, 2011 Authors: Saul ROSA Anton VLADIMIROV Professor: Dr. Andrew
More informationData Mining: A Preprocessing Engine
Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,
More informationDATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
More informationLecture 2: The SVM classifier
Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationAdvanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
More informationModel Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
More informationBIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
More informationGeneralizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel
Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision
More informationA Learning Algorithm For Neural Network Ensembles
A Learning Algorithm For Neural Network Ensembles H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto Instituto de Física Rosario (CONICET-UNR) Blvd. 27 de Febrero 210 Bis, 2000 Rosario. República
More informationSimple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,
More informationData Mining Techniques for Prognosis in Pancreatic Cancer
Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree
More informationNew Ensemble Combination Scheme
New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,
More informationFeature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
More informationClassification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
More informationChapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
More informationENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
More informationSUPPORT VECTOR MACHINE (SVM) is the optimal
130 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY 2008 Multiclass Posterior Probability Support Vector Machines Mehmet Gönen, Ayşe Gönül Tanuğur, and Ethem Alpaydın, Senior Member, IEEE
More informationEnsemble Approach for the Classification of Imbalanced Data
Ensemble Approach for the Classification of Imbalanced Data Vladimir Nikulin 1, Geoffrey J. McLachlan 1, and Shu Kay Ng 2 1 Department of Mathematics, University of Queensland v.nikulin@uq.edu.au, gjm@maths.uq.edu.au
More informationMolecular Genetics: Challenges for Statistical Practice. J.K. Lindsey
Molecular Genetics: Challenges for Statistical Practice J.K. Lindsey 1. What is a Microarray? 2. Design Questions 3. Modelling Questions 4. Longitudinal Data 5. Conclusions 1. What is a microarray? A microarray
More informationStatistical issues in the analysis of microarray data
Statistical issues in the analysis of microarray data Daniel Gerhard Institute of Biostatistics Leibniz University of Hannover ESNATS Summerschool, Zermatt D. Gerhard (LUH) Analysis of microarray data
More informationKnowledge Discovery and Data Mining. Structured vs. Non-Structured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.
More informationLeast Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
More informationAcknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues
Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationL25: Ensemble learning
L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna
More informationSub-class Error-Correcting Output Codes
Sub-class Error-Correcting Output Codes Sergio Escalera, Oriol Pujol and Petia Radeva Computer Vision Center, Campus UAB, Edifici O, 08193, Bellaterra, Spain. Dept. Matemàtica Aplicada i Anàlisi, Universitat
More informationSupport Vector Machine (SVM)
Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationA Simple Introduction to Support Vector Machines
A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear
More informationA Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
More informationNon-Parametric Tests (I)
Lecture 5: Non-Parametric Tests (I) KimHuat LIM lim@stats.ox.ac.uk http://www.stats.ox.ac.uk/~lim/teaching.html Slide 1 5.1 Outline (i) Overview of Distribution-Free Tests (ii) Median Test for Two Independent
More informationApplied Multivariate Analysis - Big data analytics
Applied Multivariate Analysis - Big data analytics Nathalie Villa-Vialaneix nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of
More informationChapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
More information