Bagged Ensembles of Support Vector Machines for Gene Expression Data Analysis


Giorgio Valentini
DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Italy
INFM, Istituto Nazionale di Fisica della Materia, Italy

Marco Muselli
IEIIT, Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, Italy
muselli@ice.ge.cnr.it

Francesca Ruffino
DIMA, Dipartimento di Matematica, Università di Genova, Italy

Abstract
Extracting information from gene expression data is a difficult task, as these data are characterized by very high-dimensional, small-sized samples and a large degree of biological variability. A possible way of dealing with the curse of dimensionality is offered by feature selection algorithms, while the variance problems arising from small samples and biological variability can be addressed through ensemble methods based on resampling techniques. These two approaches have been combined to improve the accuracy of Support Vector Machines (SVMs) in the classification of malignant tissues from DNA microarray data. Proper measures have been introduced to assess the accuracy and the confidence of the predictions performed. The presented results show that bagged ensembles of SVMs are more reliable and achieve equal or better classification accuracy than single SVMs, and that feature selection methods can further enhance classification accuracy.

I. INTRODUCTION

DNA microarray technology provides fundamental insights into the mRNA levels of large sets of genes, offering in such a way an approximate picture of the proteins of a cell at one time [13]. The large amount of gene expression data produced requires statistical and machine learning methods to analyze and extract significant knowledge from DNA microarray experiments. Typical problems arising from this analysis range from the prediction of malignancies [15], [17] (a classification problem from a machine learning point of view), to the functional discovery of new classes or subclasses of diseases [1] (an unsupervised learning problem), to the identification of groups of genes responsible for or correlated with malignancies or polygenic diseases [11] (a feature selection problem).

Several supervised methods have been applied to the analysis of cDNA microarrays and high-density oligonucleotide chips. These methods include decision trees, Fisher linear discriminant, Multi-Layer Perceptrons (MLP), nearest-neighbour classifiers, linear discriminant analysis, Parzen windows and others [5], [8], [10], [12], [14]. In particular, Support Vector Machines (SVMs) have recently been applied to the analysis of DNA microarray gene expression data in order to classify functional groups of genes, normal and malignant tissues, and multiple tumor types [5], [9], [17]. Other works pointed out the importance of feature selection methods to reduce the high dimensionality of the input space and to select the most relevant genes associated with specific functional classes [11]. Furthermore, ensembles of learning machines are well-suited for gene expression data analysis, as they can reduce the variance due to the low cardinality of the available training sets, and the bias due to specific characteristics of the learning algorithm [7].
Indeed, in recent works, combinations of binary classifiers (one-versus-all and all-pairs) and Error Correcting Output Coding (ECOC) ensembles of MLPs, as well as ensemble methods based on resampling techniques, such as bagging and boosting, have been applied to the analysis of DNA microarray data [8], [15], [17].

In this work we show that the combination of feature selection methods and bagged ensembles of SVMs can enhance the accuracy and the reliability of predictions based on gene expression data. In the next section the standard technique for training SVMs with soft margin is presented, together with a description of the considered method for feature selection. Then the procedure for bagging SVMs is introduced, examining different possible choices for the combination of classifiers. Finally, proper measures are employed to evaluate the performance of the proposed approach on two data sets available online, concerning tumor detection based on gene expression data produced by DNA microarrays.

II. SVM TRAINING AND FEATURE SELECTION

We can represent the output of a single experiment with a DNA microarray as a pair $(x, y)$, being $x \in \mathbb{R}^d$ a vector containing the expression levels for $d$ selected genes and $y \in \{-1, +1\}$ a binary variable determining the classification of the considered cell. As an example, $y = +1$ can be used to denote a tumoral cell and $y = -1$ a normal cell. It is then evident that in our analysis every cell is associated with an input vector $x$ containing the gene expression levels.

When $n$ different experiments are performed, we obtain a collection of $n$ pairs $T = \{(x_j, y_j) : j = 1, \dots, n\}$ (training set); suppose, without loss of generality, that the first $n^+$ pairs have $y_j = +1$, whereas the remaining $n^- = n - n^+$ possess a negative output $y_j = -1$. The target of a machine learning method is to construct from the pairs $\{(x_j, y_j)\}_{j=1}^{n}$ a classifier, i.e. a decision function $h : \mathbb{R}^d \to \{-1, +1\}$, that gives the correct classification $y = h(x)$ for every cell (determined by $x$). To achieve this target, many available techniques generate a discriminant function $f : \mathbb{R}^d \to \mathbb{R}$ from the sample $T$ at hand and build $h$ by employing the formula

$$h(x) = \mathrm{sign}(f(x)) \qquad (1)$$

where the function $\mathrm{sign}(z)$ gives as output $+1$ if $z \ge 0$ and $-1$ otherwise. Among these techniques, SVMs [6] turn out to be a promising approach, due to their theoretical motivations and their practical efficiency. They employ the following expression for the discriminant function

$$f(x) = b + \sum_{j=1}^{n} \alpha_j y_j K(x_j, x) \qquad (2)$$

where the scalars $\alpha_j$ are obtained, in the soft-margin version, through the solution of the following quadratic programming problem: minimize the cost function

$$W(\alpha) = \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_j \alpha_k y_j y_k K(x_j, x_k) - \sum_{j=1}^{n} \alpha_j$$

subject to the constraints

$$\sum_{j=1}^{n} \alpha_j y_j = 0, \qquad 0 \le \alpha_j \le C \quad \text{for } j = 1, \dots, n$$

being $C$ a regularization parameter. The symmetric function $K(\cdot, \cdot)$ must be chosen among the kernels of Reproducing Kernel Hilbert Spaces [16]; three possible choices are:

Linear kernel: $K(u, v) = u \cdot v$
Polynomial kernel: $K(u, v) = (u \cdot v + 1)^\gamma$
Gaussian kernel: $K(u, v) = \exp(-\|u - v\|^2 / \sigma^2)$

Since the point $\alpha$ of minimum of the quadratic programming problem can have several null components $\alpha_j = 0$, the sum in Eq. 2 receives the contribution of a subset $V$ of patterns $x_j$ in $T$, called support vectors. The bias $b$ in the SVM classifier is usually set to

$$b = \frac{1}{|V|} \sum_{x_j \in V} \Big( y_j - \sum_{x_k \in V} \alpha_k y_k K(x_k, x_j) \Big)$$

where $|V|$ denotes the number of elements of the set $V$.

The accuracy of a classifier is affected by the dimension $d$ of the input vector; roughly, the greater $d$ is, the lower the probability of correctly classifying a pattern $x$. For this reason, feature selection methods are employed to choose a subset of relevant inputs (genes) for the problem at hand, so as to reduce the number of components $x_i$. A simple feature selection method, originally proposed in [10], associates with every gene expression level $x_i$ a quantity $c_i$ given by

$$c_i = \frac{\mu_i^+ - \mu_i^-}{\sigma_i^+ + \sigma_i^-}$$

where $\mu_i^+$ and $\mu_i^-$ are the mean values of $x_i$ across all the input patterns in $T$ with positive and negative output, respectively:

$$\mu_i^+ = \frac{1}{n^+} \sum_{j=1}^{n^+} x_{ji}, \qquad \mu_i^- = \frac{1}{n^-} \sum_{j=n^++1}^{n} x_{ji} \qquad (3)$$

having denoted with $x_{ji}$ the $i$th component of the input vector $x_j$. Similarly, $\sigma_i^+$ and $\sigma_i^-$ are the standard deviations of $x_i$ computed in the sets of pairs with positive and negative output, respectively. Then the genes are ranked according to their $c_i$ value, and the first $m$ and the last $m$ genes are selected, thus obtaining a set of $2m$ inputs. The main problem of this approach is the underlying independence assumption on the expression pattern of each gene: indeed, it fails to detect the role of coordinately expressed genes in carcinogenic processes. Eq. 3 can also be used to compute the weights for weighted gene voting [10], a minor variant of diagonal linear discriminant analysis [8].
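As a concrete illustration of this filter, the following is a minimal Python sketch, assuming a NumPy expression matrix with samples in rows and genes in columns; the function name golub_snr_selection and all defaults are ours, not the paper's.

```python
import numpy as np

def golub_snr_selection(X, y, m):
    """Rank genes by the signal-to-noise score c_i = (mu+_i - mu-_i) / (sd+_i + sd-_i)
    and keep the m most positively and the m most negatively correlated genes.
    X: (n_samples, n_genes) expression matrix; y: labels in {-1, +1}."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    pos, neg = X[y == +1], X[y == -1]
    c = (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))
    order = np.argsort(c)                               # ascending c_i
    selected = np.concatenate([order[-m:], order[:m]])  # first m and last m genes
    return selected, c

# Hypothetical usage: keep 2m = 16 genes, then train a linear soft-margin SVM
# on the reduced inputs, e.g. with sklearn.svm.SVC(kernel="linear", C=1.0).
```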
III. BAGGED ENSEMBLES OF SVM

The low cardinality of the available data and the large degree of biological variability in gene expression suggest applying variance-reduction methods, such as bagging, to these tasks. Denote with $\{T_b\}_{b=1}^{B}$ a set of $B$ bootstrapped samples, whose elements are drawn with replacement from the training set $T$ according to a uniform probability distribution. Let $f_b$ be the discriminant function obtained by applying the soft-margin SVM learning algorithm on the bootstrapped sample $T_b$. The corresponding decision function $h_b$ is computed as usual through Eq. 1.

The generalization ability of the classifiers $h_b$ (base learners) can be improved by aggregating them through the standard formula (for two-class classification problems) [3]:

$$h_{st}(x) = \mathrm{sign}\Big( \sum_{b=1}^{B} h_b(x) \Big) \qquad (4)$$

In this way the decision function $h_{st}(x)$ of the bagged ensemble selects the most voted class among the $B$ classifiers $h_b$. Other choices of discriminant function for the bagged ensemble are possible, some of which lead to the above standard decision function $h_{st}(x)$ through Eq. 1. The following three expressions also allow the quality of the classification offered by the bagged ensemble to be evaluated:

$$f_{avg}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$$

$$f_{win}(x) = \frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}} f_b(x)$$

$$f_{max}(x) = h_{st}(x) \max_{b \in \mathcal{B}} |f_b(x)|$$

where the set $\mathcal{B} = \{b : h_b(x) = h_{st}(x)\}$ contains the indices $b$ of the base learners that vote for the class $h_{st}(x)$. Note that $f_{avg}(x)$ is the average of the $f_b(x)$, whereas $f_{win}(x)$ and $f_{max}(x)$ are, respectively, the average of the discriminant functions of the classifiers having indices in $\mathcal{B}$ and the signed maximum of their absolute values.
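To make the scheme concrete, here is a minimal Python sketch of bagged soft-margin SVMs with the combination rules above, assuming scikit-learn is available; the names bag_svms and combine, and all parameter defaults, are ours rather than the paper's.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils import resample

def bag_svms(X, y, B=51, C=1.0, seed=0):
    """Train B linear soft-margin SVMs on bootstrap replicates of (X, y).
    B is chosen odd so that the majority vote h_st cannot tie."""
    rng = np.random.RandomState(seed)
    return [SVC(kernel="linear", C=C).fit(*resample(X, y, random_state=rng))
            for _ in range(B)]

def combine(models, X):
    """Return h_st (majority vote) and the f_avg, f_win, f_max discriminants."""
    F = np.array([m.decision_function(X) for m in models])  # shape (B, n)
    H = np.sign(F)
    h_st = np.sign(H.sum(axis=0))        # Eq. 4: most voted class
    f_avg = F.mean(axis=0)               # average of all f_b
    agree = (H == h_st)                  # base learners voting for h_st
    f_win = np.where(agree, F, 0.0).sum(axis=0) / agree.sum(axis=0)
    f_max = h_st * np.where(agree, np.abs(F), 0.0).max(axis=0)
    return h_st, f_avg, f_win, f_max
```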

The corresponding decision functions are given by

$$h_{avg}(x) = \mathrm{sign}(f_{avg}(x))$$
$$h_{win}(x) = \mathrm{sign}(f_{win}(x)) = h_{st}(x)$$
$$h_{max}(x) = \mathrm{sign}(f_{max}(x)) = h_{st}(x)$$

While $h_{win}(x)$ and $h_{max}(x)$ are equivalent to the standard choice $h_{st}(x)$, $h_{avg}(x)$ selects the class associated with the average of the discriminant functions computed by the base learners. Thus, the decision of each classifier in the ensemble is weighted by its prediction strength, measured by the value of the discriminant function $f_b$; on the contrary, in the decision function $h_{st}(x)$ each base learner receives the same weight.

Fig. 1. Results obtained with single SVMs for different numbers of selected genes. Colon data set: (a) success and acceptance rate; (b) extremal and median margin. Leukemia data set: (c) success and acceptance rate; (d) extremal and median margin.

IV. ASSESSMENT OF CLASSIFIERS' QUALITY

Besides the success rate

$$Succ = \frac{1}{2n} \sum_{j=1}^{n} |y_j + h(x_j)|$$

which estimates the probability of correct classification (the complement of the generalization error), several alternative measures can be used to assess the quality of classifiers producing a discriminant function $f(x)$. These measures can then be directly applied to evaluate the confidence of the classification performed by single SVMs and bagged ensembles of SVMs. By generalizing a definition introduced in [10], [11], a first choice is the extremal margin $M_{ext}$, defined as

$$M_{ext} = \frac{\theta^+ - \theta^-}{\max_{1 \le j \le n} f(x_j) - \min_{1 \le j \le n} f(x_j)} \qquad (5)$$

where the quantities $\theta^+$ and $\theta^-$ are given by

$$\theta^+ = \min_{1 \le j \le n^+} f(x_j), \qquad \theta^- = \max_{n^+ + 1 \le j \le n} f(x_j)$$

It can be easily seen that the larger the value of $M_{ext}$ is, the more confident the classifier is; note that if there are no classification errors, $M_{ext}$ is positive.

Fig. 2. Comparison of results obtained with single and bagged SVMs on the Leukemia data set, when varying the number of selected genes: (a) success rate; (b) acceptance rate; (c) extremal margin; (d) median margin.

An alternative measure, less sensitive to outliers, is the median margin $M_{med}$, defined as

$$M_{med} = \frac{\lambda^+ - \lambda^-}{\max_{1 \le j \le n} f(x_j) - \min_{1 \le j \le n} f(x_j)}$$

where $\lambda^+$ and $\lambda^-$ are the median values of $f(x)$ for the positive and negative class, respectively:

$$\lambda^+ = \min\{\lambda \in \mathbb{R} : |J_\lambda^+| \le n^+/2\}, \qquad \lambda^- = \max\{\lambda \in \mathbb{R} : |J_\lambda^-| \le n^-/2\} \qquad (6)$$

The sets $J_\lambda^+$ (resp. $J_\lambda^-$) contain the indices $j$ of the input patterns $x_j$ in the training set for which the discriminant function $f(x_j)$ is greater (resp. lower) than $\lambda$:

$$J_\lambda^+ = \{j : f(x_j) > \lambda\}, \qquad J_\lambda^- = \{j : f(x_j) < \lambda\}$$

Finally, the acceptance rate $Acc$ measures the fraction of samples that are correctly classified with high confidence. It is defined by the expression

$$Acc = \frac{|J_\theta^+| + |J_{-\theta}^-|}{n} \qquad (7)$$

where $\theta = \max\{-\theta^+, \theta^-\}$ is the smallest symmetric rejection zone giving zero error. It is important to remark that the acceptance rate is highly sensitive to the presence of outliers.
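The following Python sketch computes the four measures for an array of discriminant values. It is an approximation under our own assumptions: the medians $\lambda^+$ and $\lambda^-$ of Eq. 6 are taken with np.median, and $\theta$ is clamped at zero so the rejection zone is never negative; quality_measures is an illustrative name.

```python
import numpy as np

def quality_measures(f, y):
    """Succ, M_ext, M_med and Acc for discriminant values f(x_j), labels y in {-1, +1}."""
    f, y = np.asarray(f, dtype=float), np.asarray(y)
    succ = np.abs(y + np.sign(f)).sum() / (2 * len(y))  # |y_j + h(x_j)| is 2 iff correct
    spread = f.max() - f.min()
    theta_pos = f[y == +1].min()         # theta+: lowest score among positives
    theta_neg = f[y == -1].max()         # theta-: highest score among negatives
    m_ext = (theta_pos - theta_neg) / spread             # Eq. 5
    m_med = (np.median(f[y == +1]) - np.median(f[y == -1])) / spread
    theta = max(-theta_pos, theta_neg, 0.0)  # smallest symmetric rejection zone (assumed >= 0)
    acc = ((f > theta).sum() + (f < -theta).sum()) / len(y)  # Eq. 7
    return succ, m_ext, m_med, acc
```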

Fig. 3. Comparison of results obtained with single and bagged SVMs on the Colon data set, when varying the number of selected genes: (a) success rate; (b) acceptance rate; (c) extremal margin; (d) median margin.

V. NUMERICAL EXPERIMENTS

Here we present the results of classifying DNA microarray data using the proposed techniques. We applied linear SVM classifiers to separate normal and malignant tissues, with and without feature selection. Then we compared the results obtained with single and bagged SVMs, using in all cases the filter method for feature selection described in Sec. II.

A. Data sets

The proposed approach has been tested on DNA microarray data available online. In particular, we used the Colon cancer data set [2], constituted by 62 samples including 22 normal and 40 colon cancer tissues. The data matrix contains expression values of 2000 genes and has been preprocessed by taking the logarithm of all values and by normalizing the feature (gene) vectors. This has been performed by subtracting the mean over all training values, dividing by the corresponding standard deviation, and finally passing the result through a squashing arctan function to diminish the importance of outliers. The whole data set has been randomly split into a training and a test set of equal size, each one with the same proportion of normal and malignant examples.

We also compared the different classifiers on the Leukemia data set [10]. It is composed of two variants of leukemia, ALL and AML, for a total of 72 examples split into a training set of 38 samples and a test set of 34 examples, with 7129 different genes.
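A small Python sketch of this preprocessing pipeline, under our assumptions: standardization is computed over the whole matrix for brevity, whereas the paper uses training values, and expression values are assumed strictly positive before the logarithm; preprocess_colon and the split parameters are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def preprocess_colon(X, y, seed=0):
    """Log-transform, standardize each gene, squash with arctan, then split
    into equally sized training/test sets with the same class proportions."""
    X = np.log(np.asarray(X, dtype=float))      # expression values assumed > 0
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # per-gene mean 0, std 1
    X = np.arctan(X)                            # diminish the weight of outliers
    return train_test_split(X, y, test_size=0.5, stratify=y, random_state=seed)
```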

B. Results

Fig. 1 summarizes the results with single SVMs, obtained by varying the number of genes selected with the filter method described in Sec. II and by using the measures for classifier assessment introduced in Sec. IV. With the Colon data set, the accuracy does not change significantly when the feature selection method is applied; however, the prediction is more reliable, as attested by the higher values of $Acc$ and $M_{med}$ (Fig. 1a and 1b), when the number of inputs lies beyond 256. On the contrary, we obtain the highest success rate on the Leukemia data set with only 16 selected genes; the corresponding acceptance rate is also significantly high (Fig. 1c). The extremal margin is negative but very close to 0, thus showing that the Leukemia data set is nearly linearly separable, with a relatively high confidence (Fig. 1d).

Figs. 2 and 3 compare the results obtained through the application of bagged ensembles of SVMs (for different choices of the decision function) with those achieved by single SVMs. On the Leukemia data set, bagging does not seem to improve the success rate, even if the predictions are more reliable, especially when a small number of selected genes is used (Fig. 2). On the contrary, bagging significantly improves the success rate scored on the Colon data set, both with and without feature selection (Fig. 3a). Considering the acceptance rate, there are no significant differences between bagged SVMs employing $f_{avg}$ or $f_{win}$ and single SVMs, whereas bagged SVMs adopting $f_{max}$ achieve the highest values of $Acc$ if the number of genes is less than or equal to 512; for higher values the opposite situation occurs (Fig. 3b). While bagged SVMs (especially when $f_{max}$ is used) show better values of the extremal margin than single SVMs when small numbers of genes are selected, we observe the opposite behavior if the number of considered genes is relatively large (Fig. 3c). Finally, bagged ensembles show clearly larger median margins than single SVMs, confirming an overall higher reliability (Fig. 3d).

Summarizing, bagged ensembles seem to be more accurate and confident in their predictions than single SVMs. The simple gene selection method adopted is effective with the Leukemia data set, both when single and bagged SVMs are used, while the accuracy for the Colon data set seems to be independent of the application of feature selection. The results obtained with single SVMs are comparable to those presented in [11]; however, the application of the recursive feature elimination method allows better results to be achieved than those obtained with bagged ensembles of SVMs, at least for the Leukemia data set. Anyway, it is difficult to establish whether a statistically significant difference between the two approaches exists, given the small size of the available samples.

VI. CONCLUSIONS

The results show that bagged ensembles of SVMs are more reliable than single SVMs in classifying DNA microarray data. Moreover, they obtain an equivalent or better accuracy in separating normal from malignant tissues, at least with the Colon and Leukemia data sets. In fact, bagging is a variance-reduction method which is able to improve the stability of classifiers [4], especially when the training set at hand has small size and large dimensionality, as in the present case. Despite its simplicity, the feature selection method used in our experiments allows better values of the success rate to be achieved. However, it does not take into account the interactions of the expression levels between different genes.
In order to manage this effect, we plan to employ more refined gene selection methods [11], in combination with bagging, to further improve the accuracy and the reliability of predictions based on DNA microarray data.

ACKNOWLEDGMENT

This work was partially funded by INFM, unità di Genova.

REFERENCES

[1] A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503-511, 2000.
[2] U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS, 96:6745-6750, 1999.
[3] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[4] L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801-849, 1998.
[5] M. Brown et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. PNAS, 97(1):262-267, 2000.
[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[7] T.G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 1-15. Springer-Verlag, 2000.
[8] S. Dudoit, J. Fridlyand, and T. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. JASA, 97(457):77-87, 2002.
[9] T.S. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906-914, 2000.
[10] T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.
[11] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422, 2002.
[12] J. Khan et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6):673-679, 2001.
[13] D.J. Lockhart and E.A. Winzeler. Genomics, gene expression and DNA arrays. Nature, 405:827-836, 2000.
[14] P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Gene functional classification from heterogeneous data. In Fifth International Conference on Computational Molecular Biology, 2001.
[15] G. Valentini. Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artificial Intelligence in Medicine, 26(3):283-306, 2002.
[16] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, USA, 1990.
[17] C. Yeang et al. Molecular classification of multiple tumor types. In ISMB 2001, Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology, Copenhagen, Denmark, 2001. Oxford University Press.
