# More Generality in Efficient Multiple Kernel Learning

Save this PDF as:

Size: px
Start display at page:

## Transcription

3 Three things are worth noting about the primal. First, we choose to use a non-convex formulation, as opposed to the convex k wt k w k/d k (Rakotomamonjy et al., 2008), since for general kernel combinations, w t k w k need not tend to zero when d k tends to zero. Second, we place r(d) in the objective and incorporate a scale parameter within it rather than having it as an equality constraint (typically k d k = 1 or k d2 k = 1). Third, the constraint d 0 can often be relaxed to a more general one which simply requires the learnt kernel to be positive definite. Conversely, the constraints can also be strengthened if prior knowledge is available. In either case, if d r exists then the gradient descent based optimization is still applicable. However, the projection back into the feasible set can get more expensive. In order to leverage existing large scale optimizers, we follow the standard procedure (Chapelle et al., 2002) of reformulating the primal as a nested two step optimization. In the outer loop, the kernel is learnt by optimizing over d while, in the inner loop, the kernel is held fixed and the SVM parameters are learnt. This can be achieved by rewriting the primal as follows Min d where T(d) subject to d 0 (3) T(d) = Min w,b 1 2 wt w + i l(y i,f(x i )) + r(d) We now need to prove that d T exists, and calculate it efficiently, if we are to utilize gradient descent in the outer loop. This can be achieved by moving to the dual formulation of T given by (for classification and regression respectively) and W C (d) = max 1 t α 1 α 2 αt YK d Yα + r(d) (4) subject to 1 t Yα = 0, 0 α C (5) W R (d) = max α 1 t Yα 1 2 αt K d α +r(d) ǫ1 t α (6) subject to 1 t α = 0, 0 α C (7) where K d is the kernel matrix for a given d and Y is a diagonal matrix with the labels on the diagonal. Note that we can write T = r+p and W = r+d with strong duality holding between P and D. Therefore, T(d) = W(d) for any given value of d, and it is sufficient for us to show that W is differentiable and calculate d W. Proof of the differentiability of W C and W R comes from Danskin s Theorem (Danskin, 1967). Since the feasible set is compact, the gradient can be shown to exist if k, r, d k and d r are smoothly varying functions of d and if α, the value of α that optimizes W, is unique. Furthermore, a straight forward extension of Lemma 2 in (Chapelle et al., 2002) can be used to show that W C and W R (as well as others obtained from loss functions for novelty detection, ranking, etc.) have derivatives given by T = W = r 1 H d k d k d 2α t α (8) k d k where H = YKY for classification and H = K for regression. Thus, in order to take a gradient step, all we need to do is obtain α. Note that since W C or W R are equivalent to their corresponding single kernel SVM duals with kernel matrix K d, α can be obtained by any SVM optimization package. The final algorithm is given in Algorithm 1 and we refer to it as Generalized MKL (GMKL). The step size s n is chosen based on the Armijo rule to guarantee convergence and the projection step, for the constraints d 0, is as simple as d max(0, d). Note that the algorithm is virtually unchanged from (Varma & Ray, 2007) apart from the more general form of the kernel k and regularizer r. If a faster rate of convergence was required, our assumptions could be suitably modified so as to take second order steps rather than perform gradient descent. Only very mild restrictions have been placed on the form of the learnt kernel k and regularizer r. As regards k, the only constraints that have been imposed are that K be strictly positive definite for all valid d and that d k exists and be continuous. Many kernels can be constructed that satisfy these properties. One can, of course, learn the standard sum of base kernels. More generally, products of base kernels, and other combinations which yield positive definite kernels, can also be learnt now. In addition, one can also tune kernel parameters in general kernels such as k d (x i,x j ) = (d 0 + m d mx t i A mx j ) n or Algorithm 1 Generalized MKL. 1: n 0 2: Initialize d 0 randomly. 3: repeat 4: K k(d n ) 5: Use an SVM solver of choice to solve the single kernel problem with ( kernel K and obtain ) α. 6: d n+1 k d n k sn r d k 1 H 2α t d k α 7: Project d n+1 onto the feasible set if any constraints are violated. 8: n n + 1 9: until converged

5 Table 1: Gender identification results. The final row summarizes the average number of features selected (in brackets) by each wrapper method and the resultant classification accuracy. See text for details. N d AdaBoost B&R 2007 OWL-QN LP-SVM S-SVM BAHSIC MKL GMKL ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± (12.6) 91 (221.3) 91 (58.3) 90.8 (252) 91.6 (146.3) 95.5 (69.6) UCI datasets, there can be as much as a 6% to 10% difference in performance between GMKL and MKL. We also demonstrate that GMKL performs better than the other methods considered. To generate feature selection results, we can vary the hyper-parameter C in the wrapper methods to select a desired number of features. However, this strategy does not yield good classification results even though globally optimal solutions are obtained. Low values of C encouraged greater sparsity but also permitted more classification errors. We obtained much better results by the theoretically suboptimal strategy of fixing C to a large value (chosen via cross-validation so as to minimize classification error), learning the classifier, taking the top ranked components of w (or d) and relearning the classifier using only the selected features. This technique was used to generate the results in Tables 1 and 2. For each dataset, the very last row summarizes the number of features selected (N s ) by each wrapper method and the resultant classification accuracy. When the number of desired features (N d ) is less than N s, the classification accuracy is determined using the N d top ranked features. Otherwise, when N d > N s, the table entry is left blank as the classification accuracy either plateaus or decreases as suboptimal features are added. In such a situation, it is better to choose only N s features and maintain accuracy Gender Identification We tackle the binary gender identification problem on the benchmark database of (Moghaddam & Yang, 2002). The database has images of 1044 male and 711 female faces giving a total of 1755 images in all. We follow the standard experimental setup and use 1053 images for training and 702 for testing. Results are averaged over 3 random splits of the data. Each image in the database has been pre-processed by (Moghaddam & Yang, 2002) to be aligned and has been scaled down to have dimensions Thus, each image has 252 pixels and we associate an RBF kernel with each pixel based on its grey scale intensity directly. For Generalized MKL, the 252 base kernels are combined by taking their product to get k d (x i,x j ) = 252 m=1 where x e dm(xim xjm)2 im and x jm represent the intensity of the m th pixel in image i and j respectively. For standard MKL, the same 252 base kernels are combined using the sum representation to get k d (x i,x j ) = 252 m=1 d me γm(xim xjm)2. Both methods are subject to l 1 regularization on d so that only a few kernels are selected. AdaBoost can also be used to combine the 252 weak classifiers derived from the individual base kernels. The method of (Baluja & Rowley, 2007) can be considered as a state-of-the-art boosting variant for this problem and operates on pairs of pixels. We use RBF kernels for BAHSIC as well while the other methods are linear. Table 1 lists the feature selection results. AdaBoost tended to perform the worst and selected only 12.6 features on average. The poor performance was due to the fact that each of the 252 SVMs was learnt independently. The weak classifier coefficients (i.e. kernel weights) did not influence the individual SVM parameters. By contrast, there is a very tight coupling between the two in MKL and GMKL and this ensures better performance. Of course, other forms of boosting do not have this limitation and the state-of-the-art boosting method of (Baluja & Rowley, 2007) performs better but is still significantly inferior to GMKL. The performance of the linear feature selection methods is quite variable. Logistic regression performs poorly as the implicit bias weight got set to zero due to the l 1 regularization. On the other hand, LP- SVM and Sparse-SVM perform even better than standard MKL (though not better than GMKL). BAHSIC, which is a non-linear filter method, follows the same

6 Table 2: UCI results with datasets having N training points and M features. See text for details. Ionosphere: N = 246, M = 34, Uniform MKL = 89.9 ± 2.5, Uniform GMKL = 94.6 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± (9.8) 89.2 (25.2) 92.6 (34.0) 92.9 (34.0) 88.1 (29.3) 94.4 (16.9) Parkinsons: N = 136, M = 22, Uniform MKL = 87.3 ± 3.9, Uniform GMKL = 91.0 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± (5.2) 83.6 (11.1) 87.2 (22.0) 87.2 (22.0) 88.3 (14.6) 92.7 (9.0) Musk: N = 333, M = 166, Uniform MKL = 90.2 ± 3.2, Uniform GMKL = 93.8 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± (31.1) 83.5 (86.7) 83.4 (166.0) 83.3 (166.0) 88.2 (73.2) 93.6 (57.9) Sonar: N = 145, M = 60, Uniform MKL = 82.9 ± 3.4, Uniform GMKL = 84.6 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± (18.7) 74.7 (39.3) 73.6 (60.0) 73.5 (60.0) 81.4 (38.6) 82.3 (20.4) Wpbc: N = 135, M = 34, Uniform MKL = 72.1 ± 5.4, Uniform GMKL = 77.0 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± (5.1) 78.3 (20.8) 79.0 (34.0) 78.9 (34.0) 69.3 (24.8) 80.0 (16.8)

7 trend and its performance is significantly worse than GMKL with identical kernels. This would suggest that wrapper methods based on the right feature representation should be preferable to filter methods which do not directly optimize for classification accuracy. The comparison between MKL and GMKL is even starker. GMKL achieves a classification accuracy of 93.2% using as few as 20 features and 95.1% using only 30 features. In comparison, standard MKL achieves just 83.8% and 86.3% respectively. This reinforces the observation that choosing the right kernel representation is much more important than converging to the globally optimal solution. Finally, the MKL and GMKL results for fixed uniform weights chosen via cross-validation are 92.6 ± 0.9 and 94.3 ± 0.1 respectively. Note that these results are obtained using all 252 features. GMKL can obtain a similar classification accuracy using as few as 25 features. In summary, there can be a difference of almost 10% between standard MKL and GMKL for a fixed small number of features. GMKL also does significantly better than BAHSIC and achieves nearly a 10x compression factor as compared to taking uniform weights UCI Datasets We also present comparative results on UCI datasets. We follow the standard experimental methodology (Rakotomamonjy et al., 2008) where 70% of the points are used for training and the remaining 30% for testing. Results are reported over 20 random splits of the data. All datasets are preprocessed to have zero mean and unit variance. An RBF kernel is assigned to each of the M features in a given dataset. The M RBF kernels are then combined linearly for standard MKL and by taking their product for GMKL. The learnt kernels are of the form k d (x i,x j ) = M m=1 d me γm(xim xjm)2 and k d (x i,x j ) = M m=1 e dm(xim xjm)2 respectively. Table 2 lists the feature selection results. The analysis Table 3: Comparison with the results in (Rakotomamonjy et al., 2008). GMKL achieves slightly better results but takes far fewer kernels as input. Database SimpleMKL GMKL Sonar 80.6 ± 5.1 (793) 82.3 ± 4.8 (60) Wpbc 76.7 ± 1.2 (442) 79.0 ± 3.5 (34) Ionosphere 91.5 ± 2.5 (442) 93.0 ± 2.1 (34) Liver 65.9 ± 2.3 (091) 72.7 ± 4.0 (06) Pima 76.5 ± 2.6 (117) 77.2 ± 2.1 (08) is similar to that of gender identification except that OWL-QN performs better as the implicit bias is not always being set to zero. For a fixed number of features, GMKL has the best classification results even as compared to MKL or BAHSIC (though the variance can be high for all the methods). Furthermore, a kernel with fixed uniform weights yields ballpark classification accuracies though GMKL can achieve the same results using far fewer features Comparison to SimpleMKL and HMKL The classification performance of standard MKL can be improved by adding extra base kernels which are either more informative or help better approximate a desired kernel function. However, this can lead to a more complex and costlier learning task. We therefore leave aside feature selection for the moment and compare our results to those reported in (Rakotomamonjy et al., 2008). Table 3 lists classification performance on the 5 datasets of (Rakotomamonjy et al., 2008). The number of kernels input to each method are reported in brackets. As can be seen, GMKL can achieve slightly better performance than SimpleMKL while training on far fewer kernels. Of course, one can reduce the number of kernels input to SimpleMKL but this will result in reduced accuracy. In the limit that only a single kernel is used per feature, we will get back the results of Table 2 where GMKL does much better than standard MKL. The results for Liver were obtained using l 2 regularization. This lead to a 7% improvement in performance over SimpleMKL as a sparse representation is not suitable for this database (there are only 6 features). Pima also has very few features but l 1 and l 2 regularization give comparable results. Finally, we also compare results to Hierarchical MKL (Bach, 2008). We use a quadratic kernel of the form k d (x i,x j ) = (1 + m d mx im x jm ) 2 as compared to the more powerful k d (x i,x j ) = m (1 + x imx jm ) 4 of (Bach, 2008) which decomposes to a linear combination of 5 M kernels. Nevertheless, as shown in Table 4, we achieve slightly better results with less computational cost. Table 4: Comparison between HKL and GMKL. Database N M HKL GMKL Magic ± ± 1.2 Spambase ± ± 0.8 Mushroom ± ± 0.0

8 6. Conclusions In this paper, we have shown how traditional MKL formulations can be easily extended to learn general kernel combinations subject to general regularization on the kernel parameters. While our focus was on binary classification our approach can be applied to other loss functions and even other formulations such as Local MKL, multi-class and multi-label MKL. Generalized kernel learning can be achieved very efficiently based on gradient descent optimization and existing large scale SVM solvers. As such, it is now possible to learn much richer feature representations as compared to standard MKL without sacrificing any efficiency in terms of speed of optimization. Our GMKL formulation based on products of kernels was shown to give good results for various feature selection problems not only as compared to traditional MKL but also as compared to leading wrapper and filter feature selection methods. Of course, taking products of kernels might not always be the right approach to every problem. In such cases, our formulation can be used to learn other appropriate representation including sums of kernels. Finally, it should be noted that the classification accuracy of GMKL with learnt weights tends to be much the same as that obtained using uniform weights chosen through cross-validation. The advantage in learning would therefore seem to lie in the fact that GMKL can learn to achieve the same classification accuracy but using far fewer features. Acknowledgements We are grateful to P. Anandan, Andrew Zisserman and our reviewers for providing helpful feedback. References Andrew, G., & Gao, J. (2007). Scalable training of L 1 - regularized log-linear models. ICML (pp ). Argyriou, A., Micchelli, C. A., & Pontil, M. (2005). Learning convex combinations of continuously parameterized basic kernels. COLT (pp ). Bach, F. R. (2008). Exploring large feature spaces with hierarchical multiple kernel learning. NIPS (pp ). Bach, F. R., Lanckriet, G. R. G., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. ICML (pp. 6 13). Baluja, S., & Rowley, H. (2007). Boosting sex identification performance. IJCV, 71, Bi, J., Zhang, T., & Bennet, K. P. (2004). Columngeneration boosting methods for mixture of kernels. Proc. SIGKDD (pp ). Chan, A. B., Vasconcelos, N., & Lanckriet, G. (2007). Direct convex relaxations of sparse SVM. ICML (pp ). Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for Support Vector Machines. Machine Learning, 46, Crammer, K., Keshet, J., & Singer, Y. (2002). Kernel design using boosting. NIPS (pp ). Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2001). On kernel-target alignment. NIPS (pp ). Danskin, J. M. (1967). The theorey of max-min and its applications to weapons allocation problems. Fung, G., & Mangasarian, O. L. (2002). A feature selection newton method for support vector machine classification (Technical Report 02-03). Univ. of Wisconsin. Kloft, M., Brefeld, U., Laskov, P., & Sonnenburg, S. (2008). Non-sparse Multiple Kernel Learning. NIPS Workshop on Kernel Learning. Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. JMLR, 5, Moghaddam, B., & Yang, M. H. (2002). Learning gender with support faces. IEEE PAMI, 24, Ong, C. S., Smola, A. J., & Williamson, R. C. (2005). Learning the kernel with hyperkernels. JMLR, 6, Rakotomamonjy, A., Bach, F., Grandvalet, Y., & Canu, S. (2008). Simplemkl. JMLR, 9, Song, L., Smola, A., Gretton, A., Borgwardt, K., & Bedo, J. (2007). Supervised feature selection via dependence estimation. ICML (pp ). Sonnenburg, S., Raetsch, G., Schaefer, C., & Schoelkopf, B. (2006). Large scale multiple kernel learning. JMLR, 7, Varma, M., & Ray, D. (2007). Learning the discriminative power-invariance trade-off. ICCV. Zien, A., & Ong, C. S. (2007). Multiclass multiple kernel learning. ICML (pp ).

### Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

### DUOL: A Double Updating Approach for Online Learning

: A Double Updating Approach for Online Learning Peilin Zhao School of Comp. Eng. Nanyang Tech. University Singapore 69798 zhao6@ntu.edu.sg Steven C.H. Hoi School of Comp. Eng. Nanyang Tech. University

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### Infinite Kernel Learning

Max Planck Institut für biologische Kybernetik Max Planck Institute for Biological Cybernetics Technical Report No. TR-78 Infinite Kernel Learning Peter Vincent Gehler and Sebastian Nowozin October 008

### Multiple Kernel Learning on the Limit Order Book

JMLR: Workshop and Conference Proceedings 11 (2010) 167 174 Workshop on Applications of Pattern Analysis Multiple Kernel Learning on the Limit Order Book Tristan Fletcher Zakria Hussain John Shawe-Taylor

### Maximum Margin Clustering

Maximum Margin Clustering Linli Xu James Neufeld Bryce Larson Dale Schuurmans University of Waterloo University of Alberta Abstract We propose a new method for clustering based on finding maximum margin

### Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering

### Direct Convex Relaxations of Sparse SVM

Antoni B. Chan abchan@ucsd.edu Nuno Vasconcelos nuno@ece.ucsd.edu Gert R. G. Lanckriet gert@ece.ucsd.edu Department of Electrical and Computer Engineering, University of California, San Diego, CA, 9037,

### Financial Forecasting with Gompertz Multiple Kernel Learning

2010 IEEE International Conference on Data Mining Financial Forecasting with Gompertz Multiple Kernel Learning Han Qin Computer and Information Science University of Oregon Eugene, OR 97403, USA qinhan@cs.uoregon.edu

### Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Nonlinear Optimization: Algorithms 3: Interior-point methods

Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### Large Scale Learning to Rank

Large Scale Learning to Rank D. Sculley Google, Inc. dsculley@google.com Abstract Pairwise learning to rank methods such as RankSVM give good performance, but suffer from the computational burden of optimizing

### Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University

### The Artificial Prediction Market

The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

### Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

### Support Vector Machine (SVM)

Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

### Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

### Lecture 2: The SVM classifier

Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

### Convex Optimization SVM s and Kernel Machines

Convex Optimization SVM s and Kernel Machines S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola and Stéphane Canu S.V.N.

### Lecture 6: Logistic Regression

Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

### Supervised Learning with Unsupervised Output Separation

Supervised Learning with Unsupervised Output Separation Nathalie Japkowicz School of Information Technology and Engineering University of Ottawa 150 Louis Pasteur, P.O. Box 450 Stn. A Ottawa, Ontario,

### Online Classification on a Budget

Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Distributed Machine Learning and Big Data

Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya

### Support Vector Machines

Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Learning with Local and Global Consistency

Learning with Local and Global Consistency Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 7276 Tuebingen, Germany

### Maximum-Margin Matrix Factorization

Maximum-Margin Matrix Factorization Nathan Srebro Dept. of Computer Science University of Toronto Toronto, ON, CANADA nati@cs.toronto.edu Jason D. M. Rennie Tommi S. Jaakkola Computer Science and Artificial

### Learning with Local and Global Consistency

Learning with Local and Global Consistency Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 7276 Tuebingen, Germany

### Author Gender Identification of English Novels

Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

### LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

### D-optimal plans in observational studies

D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

### Employer Health Insurance Premium Prediction Elliott Lui

Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than

### Tree based ensemble models regularization by convex optimization

Tree based ensemble models regularization by convex optimization Bertrand Cornélusse, Pierre Geurts and Louis Wehenkel Department of Electrical Engineering and Computer Science University of Liège B-4000

### A semi-supervised Spam mail detector

A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Adapting Codes and Embeddings for Polychotomies

Adapting Codes and Embeddings for Polychotomies Gunnar Rätsch, Alexander J. Smola RSISE, CSL, Machine Learning Group The Australian National University Canberra, 2 ACT, Australia Gunnar.Raetsch, Alex.Smola

### On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers

On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers Francis R. Bach Computer Science Division University of California Berkeley, CA 9472 fbach@cs.berkeley.edu Abstract

### Multiclass Classification. 9.520 Class 06, 25 Feb 2008 Ryan Rifkin

Multiclass Classification 9.520 Class 06, 25 Feb 2008 Ryan Rifkin It is a tale Told by an idiot, full of sound and fury, Signifying nothing. Macbeth, Act V, Scene V What Is Multiclass Classification? Each

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Scalable Developments for Big Data Analytics in Remote Sensing

Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### Large-Scale Similarity and Distance Metric Learning

Large-Scale Similarity and Distance Metric Learning Aurélien Bellet Télécom ParisTech Joint work with K. Liu, Y. Shi and F. Sha (USC), S. Clémençon and I. Colin (Télécom ParisTech) Séminaire Criteo March

### Classification using intersection kernel SVMs is efficient

Classification using intersection kernel SVMs is efficient Jitendra Malik UC Berkeley Joint work with Subhransu Maji and Alex Berg Fast intersection kernel SVMs and other generalizations of linear SVMs

### Lecture 5: Conic Optimization: Overview

EE 227A: Conve Optimization and Applications January 31, 2012 Lecture 5: Conic Optimization: Overview Lecturer: Laurent El Ghaoui Reading assignment: Chapter 4 of BV. Sections 3.1-3.6 of WTB. 5.1 Linear

### CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,

### max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x

Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study

### A Hybrid Forecasting Methodology using Feature Selection and Support Vector Regression

A Hybrid Forecasting Methodology using Feature Selection and Support Vector Regression José Guajardo, Jaime Miranda, and Richard Weber, Department of Industrial Engineering, University of Chile Abstract

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Machine learning challenges for big data

Machine learning challenges for big data Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure Joint work with R. Jenatton, J. Mairal, G. Obozinski, N. Le Roux, M. Schmidt - December 2012

### Towards Ultrahigh Dimensional Feature Selection for Big Data

Journal of Machine Learning Research 15 (2014) 1371-1429 Submitted 5/12; Revised 1/14; Published 4/14 Towards Ultrahigh Dimensional Feature Selection for Big Data Mingkui Tan School of Computer Engineering

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### IFT3395/6390. Machine Learning from linear regression to Neural Networks. Machine Learning. Training Set. t (3.5, -2,..., 127, 0,...

IFT3395/6390 Historical perspective: back to 1957 (Prof. Pascal Vincent) (Rosenblatt, Perceptron ) Machine Learning from linear regression to Neural Networks Computer Science Artificial Intelligence Symbolic

### Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

### Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

### Several Views of Support Vector Machines

Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

### C19 Machine Learning

C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### A Feature Selection Methodology for Steganalysis

A Feature Selection Methodology for Steganalysis Yoan Miche 1, Benoit Roue 2, Amaury Lendasse 1, Patrick Bas 12 1 Laboratory of Computer and Information Science Helsinki University of Technology P.O. Box

### Drug Store Sales Prediction

Drug Store Sales Prediction Chenghao Wang, Yang Li Abstract - In this paper we tried to apply machine learning algorithm into a real world problem drug store sales forecasting. Given store information,

### Transform-based Domain Adaptation for Big Data

Transform-based Domain Adaptation for Big Data Erik Rodner University of Jena Judy Hoffman Jeff Donahue Trevor Darrell Kate Saenko UMass Lowell Abstract Images seen during test time are often not from

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

### Optimization for Machine Learning

Optimization for Machine Learning Lecture 4: SMO-MKL S.V. N. (vishy) Vishwanathan Purdue University vishy@purdue.edu July 11, 2012 S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning

### Gaussian Processes in Machine Learning

Gaussian Processes in Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany carl@tuebingen.mpg.de WWW home page: http://www.tuebingen.mpg.de/ carl

### A User s Guide to Support Vector Machines

A User s Guide to Support Vector Machines Asa Ben-Hur Department of Computer Science Colorado State University Jason Weston NEC Labs America Princeton, NJ 08540 USA Abstract The Support Vector Machine

### Automatic Web Page Classification

Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

### On Multifont Character Classification in Telugu

On Multifont Character Classification in Telugu Venkat Rasagna, K. J. Jinesh, and C. V. Jawahar International Institute of Information Technology, Hyderabad 500032, INDIA. Abstract. A major requirement

### Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

### large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

### SUPPORT VECTOR MACHINE (SVM) is the optimal

130 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY 2008 Multiclass Posterior Probability Support Vector Machines Mehmet Gönen, Ayşe Gönül Tanuğur, and Ethem Alpaydın, Senior Member, IEEE

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Musical Pitch Identification

Musical Pitch Identification Manoj Deshpande Introduction Musical pitch identification is the fundamental problem that serves as a building block for many music applications such as music transcription,

### Enhanced Protein Fold Recognition through a Novel Data Integration Approach

Enhanced Protein Fold Recognition through a Novel Data Integration Approach Yiming Ying 1, Kaizhu Huang 2 and Colin Campbell 1 1 Department of Engineering Mathematics, University of Bristol, Bristol BS8

### Duality in General Programs. Ryan Tibshirani Convex Optimization 10-725/36-725

Duality in General Programs Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: duality in linear programs Given c R n, A R m n, b R m, G R r n, h R r : min x R n c T x max u R m, v R r b T

### Factorization Machines

Factorization Machines Steffen Rendle Department of Reasoning for Intelligence The Institute of Scientific and Industrial Research Osaka University, Japan rendle@ar.sanken.osaka-u.ac.jp Abstract In this

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.

Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Machine Learning in FX Carry Basket Prediction

Machine Learning in FX Carry Basket Prediction Tristan Fletcher, Fabian Redpath and Joe D Alessandro Abstract Artificial Neural Networks ANN), Support Vector Machines SVM) and Relevance Vector Machines

### Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

M A C H I N E L E A R N I N G P R O J E C T F I N A L R E P O R T F A L L 2 7 C S 6 8 9 CLASSIFICATION OF TRADING STRATEGIES IN ADAPTIVE MARKETS MARK GRUMAN MANJUNATH NARAYANA Abstract In the CAT Tournament,

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### A fast multi-class SVM learning method for huge databases

www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,

### Large-Scale Machine Learning with Stochastic Gradient Descent

Large-Scale Machine Learning with Stochastic Gradient Descent Léon Bottou NEC Labs America, Princeton NJ 08542, USA leon@bottou.org Abstract. During the last decade, the data sizes have grown faster than