FISHER DISCRIMINANT ANALYSIS WITH KERNELS
Sebastian Mika†, Gunnar Rätsch†, Jason Weston‡, Bernhard Schölkopf†, and Klaus-Robert Müller†
†GMD FIRST, Rudower Chaussee 5, Berlin, Germany
‡Royal Holloway, University of London, Egham, Surrey, UK
{mika, raetsch, bs,

Abstract. A non-linear classification technique based on Fisher's discriminant is proposed. The main ingredient is the kernel trick, which allows the efficient computation of Fisher's discriminant in feature space. The linear classification in feature space corresponds to a (powerful) non-linear decision function in input space. Large scale simulations demonstrate the competitiveness of our approach.

DISCRIMINANT ANALYSIS

In classification and other data analytic tasks it is often necessary to pre-process the data before applying the algorithm at hand, and it is common to first extract features suitable for the task to solve. Feature extraction for classification differs significantly from feature extraction for describing data. For example, PCA finds directions which have minimal reconstruction error by describing as much variance of the data as possible with m orthogonal directions. Considering only the first few of these directions, they need not (and in practice often will not) reveal the class structure that we need for proper classification. Discriminant analysis addresses the following question: given a data set with two classes, say, which is the best feature or feature set (either linear or non-linear) to discriminate the two classes? Classical approaches tackle this question by starting with the (theoretically) optimal Bayes classifier and, by assuming normal distributions for the classes, standard algorithms like quadratic or linear discriminant analysis, among them the famous Fisher discriminant, can be derived (e.g. [5]). Of course any model other than a Gaussian could be assumed for the class distributions; this, however, often sacrifices the simple closed form solution. Several modifications towards more general features have been proposed (e.g. [6]); for an introduction and review of existing methods see e.g. [3, 5, 8, 11].

In this work we propose to use the kernel idea [1], originally applied in Support Vector Machines [19, 14], kernel PCA [16] and other kernel based algorithms (cf. [14]), to define a non-linear generalization of Fisher's discriminant. Our method uses kernel feature spaces, yielding a highly flexible
algorithm which turns out to be competitive with Support Vector Machines. Note that there exists a variety of methods called Kernel Discriminant Analysis [8]. Most of them aim at replacing the parametric estimate of class conditional distributions by a non-parametric kernel estimate. Even if our approach might be viewed in this way too, it is important to note that it goes one step further by interpreting the kernel as a dot-product in another space. This allows a theoretically sound interpretation together with an appealing closed form solution. In the following we will first review Fisher's discriminant, apply the kernel trick, then report classification results and finally present our conclusions. In this paper we will focus on two-class problems only and on discriminants linear in the feature space.

FISHER'S LINEAR DISCRIMINANT

Let X1 = {x^1_1, ..., x^1_{l1}} and X2 = {x^2_1, ..., x^2_{l2}} be samples from two different classes and, with some abuse of notation, X = X1 ∪ X2 = {x_1, ..., x_l}. Fisher's linear discriminant is given by the vector w which maximizes

    J(w) = (w^T S_B w) / (w^T S_W w),                                        (1)

where

    S_B := (m_1 - m_2)(m_1 - m_2)^T   and   S_W := Σ_{i=1,2} Σ_{x ∈ X_i} (x - m_i)(x - m_i)^T        (2)

are the between and within class scatter matrices respectively, and the class means are defined by

    m_i := (1/l_i) Σ_{j=1}^{l_i} x^i_j.                                       (3)

The intuition behind maximizing J(w) is to find a direction which maximizes the separation of the projected class means (the numerator) while minimizing the variance of each class in this direction (the denominator). But there is also a well known statistical way to motivate (1):

Connection to the optimal linear Bayes classifier: The optimal Bayes classifier compares the a posteriori probabilities of all classes and assigns a pattern to the class with the maximal probability (e.g. [5]). However, the a posteriori probabilities are usually unknown and have to be estimated from a finite sample. For most classes of distributions this is a difficult task and often it is impossible to get a closed form estimate. However, by assuming normal distributions for all classes, one arrives at quadratic discriminant analysis (which essentially measures the Mahalanobis distance of a pattern to the class center). Simplifying the problem even further and assuming equal covariance structure for all classes, quadratic discriminant analysis becomes linear. For two-class problems it is easy to show that the vector w maximizing (1) points in the same direction as the discriminant in the corresponding Bayes optimal classifier.
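For concreteness, the following sketch (our illustration, not part of the paper; it assumes NumPy and two data matrices X1, X2 with one pattern per row) computes the plug-in estimates (2)-(3) and the direction maximizing (1), using the standard two-class closed form w proportional to S_W^{-1}(m_1 - m_2).

```python
import numpy as np

def fisher_linear_discriminant(X1, X2):
    """Fisher's linear discriminant for two classes.

    X1, X2 : arrays of shape (l_i, d) holding one pattern per row.
    Returns a unit vector w maximizing J(w) = (w^T S_B w) / (w^T S_W w).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Plug-in within-class scatter, Eq. (2): sum over both classes of (x - m_i)(x - m_i)^T.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # For two classes the maximizer of the Rayleigh quotient (1) is proportional
    # to S_W^{-1} (m_1 - m_2); lstsq tolerates a (near-)singular S_W.
    w = np.linalg.lstsq(S_W, m1 - m2, rcond=None)[0]
    return w / np.linalg.norm(w)
```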
Although it relies on heavy assumptions which are not true in many applications, Fisher's linear discriminant has proven very powerful. One reason is certainly that a linear model is rather robust against noise and most likely will not overfit. Crucial, however, is the estimation of the scatter matrices, which might be highly biased. Using simple plug-in estimates as in (2) when the number of samples is small compared to the dimensionality will result in high variability. Different ways to deal with such a situation by regularization have been proposed (e.g. [4, 7]) and we will return to this topic later.

FISHER'S DISCRIMINANT IN THE FEATURE SPACE

Clearly, for most real-world data a linear discriminant is not complex enough. To increase the expressiveness of the discriminant we could either try to use more sophisticated distributions in modeling the optimal Bayes classifier or look for non-linear directions (or both). As pointed out before, assuming general distributions will cause trouble. Here we restrict ourselves to finding non-linear directions by first mapping the data non-linearly into some feature space F and computing Fisher's linear discriminant there, thus implicitly yielding a non-linear discriminant in input space. Let Φ be a non-linear mapping to some feature space F. To find the linear discriminant in F we need to maximize

    J(w) = (w^T S_B^Φ w) / (w^T S_W^Φ w),                                     (4)

where now w ∈ F and S_B^Φ and S_W^Φ are the corresponding matrices in F, i.e.

    S_B^Φ := (m_1^Φ - m_2^Φ)(m_1^Φ - m_2^Φ)^T   and   S_W^Φ := Σ_{i=1,2} Σ_{x ∈ X_i} (Φ(x) - m_i^Φ)(Φ(x) - m_i^Φ)^T

with m_i^Φ := (1/l_i) Σ_{j=1}^{l_i} Φ(x^i_j).

Introducing kernel functions: Clearly, if F is very high- or even infinitely dimensional, this problem is impossible to solve directly. To overcome this limitation we use the same trick as in Kernel PCA [16] or Support Vector Machines. Instead of mapping the data explicitly, we seek a formulation of the algorithm which uses only dot-products (Φ(x) · Φ(y)) of the training patterns. As we are then able to compute these dot-products efficiently, we can solve the original problem without ever mapping explicitly to F. This can be achieved using Mercer kernels (e.g. [12]): these kernels k(x, y) compute a dot-product in some feature space F, i.e. k(x, y) = (Φ(x) · Φ(y)). Possible choices for k which have proven useful e.g. in Support Vector Machines or Kernel PCA are the Gaussian RBF kernel, k(x, y) = exp(-||x - y||^2 / c), or polynomial kernels, k(x, y) = (x · y)^d, for some positive constants c and d respectively.
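As an aside, both kernels named above are easy to evaluate on whole sets of patterns at once. The sketch below (our own illustration, assuming NumPy and data matrices with one pattern per row) computes the corresponding kernel matrices; it is reused in the later examples.

```python
import numpy as np

def rbf_kernel(X, Y, c):
    """Gaussian RBF kernel matrix, k(x, y) = exp(-||x - y||^2 / c)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / c)

def poly_kernel(X, Y, d):
    """Polynomial kernel matrix, k(x, y) = (x . y)^d."""
    return (X @ Y.T) ** d
```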
To find Fisher's discriminant in the feature space F, we first need a formulation of (4) in terms of dot products of input patterns only, which we then replace by some kernel function. From the theory of reproducing kernels we know that any solution w ∈ F must lie in the span of all training samples in F. Therefore we can find an expansion for w of the form

    w = Σ_{i=1}^{l} α_i Φ(x_i).                                               (5)

Using the expansion (5) and the definition of m_i^Φ we write

    w^T m_i^Φ = (1/l_i) Σ_{j=1}^{l} Σ_{k=1}^{l_i} α_j k(x_j, x^i_k) = α^T M_i,        (6)

where we defined (M_i)_j := (1/l_i) Σ_{k=1}^{l_i} k(x_j, x^i_k) and replaced the dot products by the kernel function. Now consider the numerator of (4). By using the definition of S_B^Φ and (6) it can be rewritten as

    w^T S_B^Φ w = α^T M α,                                                    (7)

where M := (M_1 - M_2)(M_1 - M_2)^T. Considering the denominator, using (5), the definition of m_i^Φ and a similar transformation as in (7), we find

    w^T S_W^Φ w = α^T N α,                                                    (8)

where we set N := Σ_{j=1,2} K_j (I - 1_{l_j}) K_j^T, K_j is an l x l_j matrix with (K_j)_{nm} := k(x_n, x^j_m) (this is the kernel matrix for class j), I is the identity and 1_{l_j} the matrix with all entries 1/l_j. Combining (7) and (8) we can find Fisher's linear discriminant in F by maximizing

    J(α) = (α^T M α) / (α^T N α).                                             (9)

This problem can be solved (analogously to the algorithm in the input space) by finding the leading eigenvector of N^{-1} M. We will call this approach the (non-linear) Kernel Fisher Discriminant (KFD). The projection of a new pattern x onto w is given by

    (w · Φ(x)) = Σ_{i=1}^{l} α_i k(x_i, x).                                   (10)

Numerical issues and regularization: Obviously, the proposed setting is ill-posed: we are estimating l-dimensional covariance structures from l samples. Besides numerical problems which can cause the matrix N not to be positive definite, we need a way of capacity control in F. To this end, we simply add a multiple of the identity matrix to N, i.e. replace N by N_μ, where

    N_μ := N + μ I.                                                           (11)
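Taken together, (5)-(11) translate into a short procedure. The following sketch (our illustration, not the authors' code; it assumes NumPy, the class-wise kernel blocks K_j defined above and a regularization constant mu of our choosing) computes the expansion coefficients α and the projections (10). Since M = (M_1 - M_2)(M_1 - M_2)^T has rank one, the leading eigenvector of N^{-1}M can be obtained directly as N^{-1}(M_1 - M_2), which the sketch exploits.

```python
import numpy as np

def train_kfd(K1, K2, mu=1e-3):
    """Kernel Fisher Discriminant from the class-wise kernel blocks.

    K1, K2 : arrays of shape (l, l_j) with (K_j)_{nm} = k(x_n, x^j_m); rows run over
             all l training patterns, columns over the patterns of class j.
    mu     : regularization constant in N_mu = N + mu*I, Eq. (11).
    Returns the expansion coefficients alpha of w = sum_i alpha_i Phi(x_i), Eq. (5).
    """
    l = K1.shape[0]
    M1, M2 = K1.mean(axis=1), K2.mean(axis=1)        # (M_i)_j = (1/l_i) sum_k k(x_j, x^i_k)
    N = np.zeros((l, l))
    for Kj in (K1, K2):
        lj = Kj.shape[1]
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T   # Eq. (8)
    N += mu * np.eye(l)                              # regularization, Eq. (11)
    # alpha is the leading eigenvector of N^{-1} M; because M is rank one,
    # this eigenvector is proportional to N^{-1} (M1 - M2).
    return np.linalg.solve(N, M1 - M2)

def kfd_project(alpha, K_new):
    """Projections (w . Phi(x)) = sum_i alpha_i k(x_i, x), Eq. (10); K_new is (l, n)."""
    return alpha @ K_new
```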
This regularization can be viewed in different ways: (i) it clearly makes the problem numerically more stable, since for μ large enough N_μ becomes positive definite; (ii) it can be seen in analogy to [4] as decreasing the bias in sample based estimation of eigenvalues; (iii) it imposes a regularization on ||α||^2 (remember that we are maximizing (9)), favoring solutions with small expansion coefficients. Although the influence of this regularization in the present setting is not yet fully understood, it shows connections to the regularizers used in Support Vector Machines (see also [14]). Furthermore, one might use other regularization type additives to N, e.g. penalizing ||w||^2 in analogy to SVM (by adding the full kernel matrix K_ij = k(x_i, x_j)).

EXPERIMENTS

Figure 1 shows an illustrative comparison of the feature found by KFD and the first and second (non-linear) features found by Kernel PCA [16] on a toy data set. For both we used a polynomial kernel of degree two, and for KFD the regularized within-class scatter (11). Depicted are the two classes (crosses and dots), the feature value (indicated by grey level) and contour lines of identical feature value. Each class consists of two noisy parabolic shapes mirrored along the x and y axes respectively. We see that the KFD feature discriminates the two classes in a nearly optimal way, whereas the Kernel PCA features, albeit describing interesting properties of the data set, do not separate the two classes well (although higher order Kernel PCA features might be discriminating, too).

Figure 1: Comparison of the feature found by KFD (left) and those found by Kernel PCA: first (middle) and second (right); details see text.

To evaluate the performance of our new approach we performed an extensive comparison to other state-of-the-art classifiers. The experimental setup was chosen in analogy to [10], and we compared the Kernel Fisher Discriminant to AdaBoost, regularized AdaBoost (also [10]) and Support Vector Machines (with Gaussian kernel). For KFD we used Gaussian kernels, too, and the regularized within-class scatter from (11). After the optimal direction w ∈ F was found, we computed projections onto it using (10). To estimate an optimal threshold on the extracted feature, one may use any classification technique, even one as simple as fitting a sigmoid [9]. Here we used a linear Support Vector Machine (which is optimized by gradient descent, as we only have one-dimensional samples).
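As a hypothetical end-to-end usage of the sketches above (not the authors' setup), one can train KFD on a small Gaussian toy problem and place the threshold midway between the projected class means; this is a much cruder choice than the linear SVM used for thresholding in the paper's experiments, but it keeps the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(-1.0, 0.7, size=(60, 2))   # toy class 1
X2 = rng.normal(+1.0, 0.7, size=(60, 2))   # toy class 2
X = np.vstack([X1, X2])

c = 1.0                                    # RBF width, chosen ad hoc for the toy data
K1, K2 = rbf_kernel(X, X1, c), rbf_kernel(X, X2, c)
alpha = train_kfd(K1, K2, mu=1e-3)

p1, p2 = kfd_project(alpha, K1), kfd_project(alpha, K2)
threshold = 0.5 * (p1.mean() + p2.mean())  # crude stand-in for the SVM-based threshold
sign = 1.0 if p1.mean() > p2.mean() else -1.0

x_new = np.array([[0.2, -0.3]])
score = kfd_project(alpha, rbf_kernel(X, x_new, c))[0]
label = 1 if sign * (score - threshold) > 0 else 2
print("predicted class:", label)
```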
A drawback of this SVM-based thresholding, however, is that we have another parameter to control, namely the regularization constant in the SVM. We used 13 artificial and real world datasets from the UCI, DELVE and STATLOG benchmark repositories (except for banana).¹ The problems which are not binary were partitioned into two-class problems. Then 100 partitions into test and training set (about 60%:40%) were generated. On each of these data sets we trained and tested all classifiers (see [10] for details). The results in Table 1 show the average test error over these 100 runs together with the standard deviation. To estimate the necessary parameters we ran 5-fold cross validation on the first five realizations of the training sets and took the model parameters to be the median over the five estimates.² Furthermore, we conducted preliminary experiments with KFD on the USPS dataset of handwritten digits, where we restricted the expansion of w in (5) to run only over the first 3000 training samples. We achieved a 10-class error of 3.7% with a Gaussian kernel, which is slightly superior to an SVM with Gaussian kernel (4.2% [13]).

Table 1: Comparison between KFD, a single RBF classifier, AdaBoost (AB), regularized AdaBoost (ABR) and Support Vector Machine (SVM) (see text); best method in bold face, second best emphasized. Rows: Banana, B.Cancer, Diabetes, German, Heart, Image, Ringnorm, F.Sonar, Splice, Thyroid, Titanic, Twonorm, Waveform; columns: RBF, AB, ABR, SVM, KFD, each giving the average test error in % with standard deviation (numeric entries omitted).

Experimental results: The experiments show that the Kernel Fisher Discriminant (plus a Support Vector Machine to estimate the threshold) is competitive with, or in some cases even superior to, the other algorithms on almost all data sets (an exception being Image). Interestingly, both SVM and KFD construct an (in some sense) optimal hyperplane in F, and we notice that the one given by the solution w of KFD is often superior to the one of SVM.

¹ The breast cancer domain was obtained from the University Medical Center, Inst. of Oncology, Ljubljana, Yugoslavia. Thanks to M. Zwitter and M. Soklic for the data.
² In fact we did two such runs, first with a coarse and then with a finer stepping over parameter space. The data sets can be obtained via http://www.first.gmd.de/~raetsch/.
DISCUSSION AND CONCLUSIONS

Fisher's discriminant is one of the standard linear techniques in statistical data analysis. However, linear methods are often too limited, and there have been several approaches in the past to derive more general class separability criteria (e.g. [6, 8, 5]). Our approach is very much in this spirit; however, since we compute the discriminant function in some feature space F (which is non-linearly related to input space), we are still able to find closed form solutions and maintain the theoretical beauty of Fisher's discriminant analysis. Furthermore, different kernels allow for high flexibility owing to the wide range of possible non-linearities. Our experiments show that KFD is competitive with other state of the art classification techniques. Furthermore, there is still much room for extensions and further theory, as linear discriminant analysis is an intensively studied field and many ideas previously developed in the input space carry over to feature space. Note that while the complexity of SVMs scales with the number of Support Vectors, KFD does not have a notion of SVs and its complexity scales with the number of training patterns. On the other hand, we speculate that some of the superior performance of KFD over SVM might be related to the fact that KFD uses all training samples in the solution, not only the difficult ones, i.e. the Support Vectors. Future work will be dedicated to finding suitable approximation schemes (e.g. [2, 15]) and numerical algorithms for obtaining the leading eigenvectors of large matrices. Further fields of study will include the construction of multi-class discriminants, a theoretical analysis of generalization error bounds for KFD, and the investigation of the connection between KFD and Support Vector Machines (cf. [18, 17]).

Acknowledgment. This work was supported in part by grants of the DFG (JA 379/5-1, 5-2, 7-1) and the EC STORM project. We thank A. Smola for helpful comments and discussions.

REFERENCES

[1] M. Aizerman, E. Braverman and L. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning, Automation and Remote Control, vol. 25, 1964.
[2] C. Burges, Simplified support vector decision rules, in L. Saitta, ed., Proceedings, 13th ICML, San Mateo, CA, 1996.
[3] P. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall, 1982.
[4] J. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association, vol. 84, no. 405, 1989.
[5] K. Fukunaga, Introduction to Statistical Pattern Recognition, San Diego: Academic Press, 2nd edn., 1990.
[6] T. Hastie and R. Tibshirani, Discriminant analysis by Gaussian mixtures, Journal of the Royal Statistical Society, 1996.
[7] T. Hastie, R. Tibshirani and A. Buja, Flexible discriminant analysis by optimal scoring, Journal of the American Statistical Association, 1994.
[8] G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, John Wiley & Sons, 1992.
[9] J. Platt, Probabilistic outputs for support vector machines and comparison to regularized likelihood methods, submitted to Advances in Large Margin Classifiers, A.J. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans, eds., MIT Press, 1999, to appear.
[10] G. Rätsch, T. Onoda and K.-R. Müller, Soft margins for AdaBoost, Tech. Rep., Royal Holloway College, University of London, UK, 1998, submitted to Machine Learning.
[11] B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
[12] S. Saitoh, Theory of Reproducing Kernels and its Applications, Longman Scientific & Technical, Harlow, England, 1988.
[13] B. Schölkopf, Support Vector Learning, Oldenbourg Verlag, 1997.
[14] B. Schölkopf, C. Burges and A. Smola, eds., Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
[15] B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch and A. Smola, Input space vs. feature space in kernel-based methods, IEEE Transactions on Neural Networks, 1999, Special Issue on VC Learning Theory and Its Applications, to appear.
[16] B. Schölkopf, A. Smola and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation, vol. 10, pp. 1299-1319, 1998.
[17] A. Shashua, On the relationship between the support vector machine for classification and sparsified Fisher's linear discriminant, Neural Processing Letters, vol. 9, no. 2, April 1999.
[18] S. Tong and D. Koller, Bayes optimal hyperplanes → maximal margin hyperplanes, submitted to IJCAI'99 Workshop on Support Vector Machines (robotics.stanford.edu/~koller/).
[19] V. Vapnik, The Nature of Statistical Learning Theory, New York: Springer Verlag, 1995.