STA 4273H: Statistical Machine Learning
Russ Salakhutdinov
Department of Statistics
rsalakhu@utstat.toronto.edu
http://www.cs.toronto.edu/~rsalakhu/
Lecture 6
Three Approaches to Classification
- Construct a discriminant function that directly maps each input vector to a specific class.
- Model the conditional probability distribution p(C_k | x) and then use this distribution to make optimal decisions. There are two approaches:
  - Discriminative Approach: Model p(C_k | x) directly, for example by representing it as a parametric model, and optimize the parameters using the training set (e.g. logistic regression).
  - Generative Approach: Model the class-conditional densities p(x | C_k) together with the prior probabilities p(C_k) for the classes. Infer the posterior probability using Bayes' rule:
    p(C_k | x) = p(x | C_k) p(C_k) / p(x).
We will consider the discriminative approach next.
Fixed Basis Functions
So far, we have considered classification models that work directly in the input space. All of the algorithms considered are equally applicable if we first make a fixed nonlinear transformation of the input space using a vector of basis functions φ(x).
Decision boundaries will be linear in the feature space, but will correspond to nonlinear boundaries in the original input space x.
Classes that are linearly separable in the feature space need not be linearly separable in the original input space.
Linear Basis Function Models
Original input space x and the corresponding feature space obtained using two Gaussian basis functions.
We define two Gaussian basis functions, with centers shown by the green crosses and contours shown by the green circles.
The linear decision boundary (right) is obtained using logistic regression, and corresponds to a nonlinear decision boundary in the input space (left, black curve).
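This mapping is easy to reproduce numerically. Below is a minimal sketch, assuming two hypothetical Gaussian basis centers and a synthetic two-class dataset (none of this is the exact setup used to produce the figure), with scikit-learn's LogisticRegression fit in the feature space:

import numpy as np
from sklearn.linear_model import LogisticRegression

def gaussian_basis(X, centers, s=0.5):
    # phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2)); X: (N, 2), centers: (M, 2) -> (N, M)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s ** 2))

# Two hypothetical basis centers (the "green crosses" in the figure).
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])

# Toy two-class data that is not linearly separable in the original x-space.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
t = ((np.linalg.norm(X - centers[0], axis=1) < 1.0) |
     (np.linalg.norm(X - centers[1], axis=1) < 1.0)).astype(int)

# Logistic regression in feature space: the boundary is linear in (phi1, phi2)
# but nonlinear when mapped back to the input space.
Phi = gaussian_basis(X, centers)
clf = LogisticRegression().fit(Phi, t)
print("training accuracy:", clf.score(Phi, t))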
Logistic Regression
Consider the problem of two-class classification. We have seen that the posterior probability of class C_1 can be written as a logistic sigmoid function of a linear function of the feature vector:
p(C_1 | φ) = y(φ) = σ(w^T φ), with p(C_2 | φ) = 1 - p(C_1 | φ),
where σ(a) = 1 / (1 + exp(-a)) is the logistic sigmoid function, and we omit the bias term for clarity.
This model is known as logistic regression (although it is a model for classification rather than regression).
Note that for generative models, we would first determine the class-conditional densities and class-specific priors, and then use Bayes' rule to obtain the posterior probabilities. Here we model p(C_1 | φ) directly.
ML for Logistic Regression
We observe a training dataset {φ_n, t_n}, n = 1, ..., N, with t_n ∈ {0, 1}.
Maximize the probability of getting the labels right, so the likelihood function takes the form:
p(t | w) = ∏_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n}, where y_n = σ(w^T φ_n).
Taking the negative log of the likelihood, we can define the cross-entropy error function (that we want to minimize):
E(w) = -ln p(t | w) = -∑_{n=1}^N { t_n ln y_n + (1 - t_n) ln(1 - y_n) }.
Differentiating and using the chain rule, together with dσ/da = σ(1 - σ):
∂E/∂y_n = (y_n - t_n) / ( y_n (1 - y_n) ).
Note that the factor involving the derivative of the logistic function cancels.
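For completeness, the chain-rule step that produces the cancellation can be written out as a standard derivation (with a_n = w^T φ_n):

\nabla_w E = \sum_{n=1}^{N} \frac{\partial E}{\partial y_n}\,\frac{\partial y_n}{\partial a_n}\,\nabla_w a_n
           = \sum_{n=1}^{N} \frac{y_n - t_n}{y_n (1 - y_n)}\; y_n (1 - y_n)\; \phi_n
           = \sum_{n=1}^{N} (y_n - t_n)\,\phi_n .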
ML for Logistic Regression
We therefore obtain:
∇E(w) = ∑_{n=1}^N (y_n - t_n) φ_n,   (prediction y_n, target t_n)
This takes exactly the same form as the gradient of the sum-of-squares error function for the linear regression model.
Unlike in linear regression, there is no closed-form solution, due to the nonlinearity of the logistic sigmoid function.
The error function is convex and can be optimized using standard gradient-based (or more advanced) optimization techniques. It is easy to adapt to the online learning setting.
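As a concrete illustration of this, here is a minimal batch gradient-descent sketch for the maximum-likelihood solution. The design matrix Phi, learning rate lr, and iteration count are hypothetical choices for the sketch, not prescribed by the slides:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_ml(Phi, t, lr=0.1, n_iters=1000):
    """Maximum-likelihood logistic regression via batch gradient descent.

    Phi : (N, M) matrix of basis-function values phi(x_n)
    t   : (N,) binary targets in {0, 1}
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)          # predictions y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)        # grad E(w) = sum_n (y_n - t_n) phi_n
        w -= lr * grad / N            # averaged gradient for a stable step size
    return w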
Multiclass Logistic Regression
For the multiclass case, we represent the posterior probabilities by a softmax transformation of linear functions of the input variables:
p(C_k | φ) = y_k(φ) = exp(a_k) / ∑_j exp(a_j), where a_k = w_k^T φ.
Unlike in generative models, here we will use maximum likelihood to determine the parameters of this discriminative model directly.
As usual, we observe a dataset {φ_n, t_n}, n = 1, ..., N, where we use 1-of-K encoding for the target vector t_n.
So if x_n belongs to class C_k, then t_n is a binary vector of length K containing a single 1 for element k (the correct class) and 0 elsewhere. For example, if we have K = 5 classes, then an input that belongs to class 2 would be given the target vector:
t_n = (0, 1, 0, 0, 0)^T.
Multiclass Logistic Regression
We can write down the likelihood function:
p(T | w_1, ..., w_K) = ∏_{n=1}^N ∏_{k=1}^K y_{nk}^{t_{nk}}, where y_{nk} = y_k(φ_n) and T is the N × K binary matrix of target variables.
Only the term corresponding to the correct class contributes for each data point.
Taking the negative logarithm gives the cross-entropy error function for the multiclass classification problem:
E(w_1, ..., w_K) = -ln p(T | w_1, ..., w_K) = -∑_{n=1}^N ∑_{k=1}^K t_{nk} ln y_{nk}.
Taking the gradient:
∇_{w_j} E = ∑_{n=1}^N (y_{nj} - t_{nj}) φ_n.
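A short numerical sketch of these formulas follows (variable names are hypothetical; the softmax subtracts the row maximum, which leaves its value unchanged but avoids overflow):

import numpy as np

def softmax(A):
    # Row-wise softmax with a stability shift.
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_nll_and_grad(W, Phi, T):
    """Cross-entropy E = -sum_{n,k} t_nk ln y_nk and its gradient.

    W   : (M, K) weight vectors w_k stored as columns
    Phi : (N, M) basis-function matrix
    T   : (N, K) 1-of-K target matrix
    """
    Y = softmax(Phi @ W)                 # y_nk = softmax_k(w_k^T phi_n)
    E = -np.sum(T * np.log(Y + 1e-12))   # cross-entropy error
    grad = Phi.T @ (Y - T)               # grad_{w_j} E = sum_n (y_nj - t_nj) phi_n
    return E, grad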
Special Case of Softmax
If we consider the softmax function for two classes:
p(C_1 | φ) = exp(a_1) / ( exp(a_1) + exp(a_2) ) = 1 / ( 1 + exp(-(a_1 - a_2)) ) = σ(a_1 - a_2).
So the logistic sigmoid is just a special case of the softmax function that avoids using redundant parameters:
- Adding the same constant to both a_1 and a_2 has no effect.
- The over-parameterization of the softmax arises because the probabilities must add up to one.
Recap
Generative approach: Determine the class-conditional densities and class-specific priors, and then use Bayes' rule to obtain the posterior probabilities.
- Different models can be trained separately on different machines.
- It is easy to add a new class without retraining all the other classes.
Discriminative approach: Train all of the model parameters to maximize the probability of getting the labels right. Model p(C_k | x) directly.
Bayesian Logistic Regression
We next look at the Bayesian treatment of logistic regression.
For the two-class problem, the likelihood takes the form:
p(t | w) = ∏_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n}, with y_n = σ(w^T φ_n).
Similar to Bayesian linear regression, we could start with a Gaussian prior:
p(w) = N(w | m_0, S_0).
However, the posterior distribution is no longer Gaussian, and we cannot analytically integrate over the model parameters w. We need to introduce some approximations.
Pictorial Illustration
Consider a simple (non-Gaussian) distribution. The plot shows the normalized distribution (in yellow), which is not Gaussian, together with the corresponding Gaussian approximation shown by the red curve.
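For concreteness, here is a small numerical sketch of this kind of picture, assuming (purely as an illustrative choice, not taken from the slides) the unnormalized density p(z) ∝ exp(-z^2/2) σ(20z + 4): find the mode, take the second derivative of the log density there, and use it as the precision of the approximating Gaussian.

import numpy as np
from scipy.optimize import minimize_scalar

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Unnormalized, non-Gaussian log density chosen for illustration only.
def log_p_unnorm(z):
    return -0.5 * z ** 2 + np.log(sigmoid(20 * z + 4))

# 1. Find the mode z0 of the distribution.
res = minimize_scalar(lambda z: -log_p_unnorm(z), bounds=(-5, 5), method="bounded")
z0 = res.x

# 2. Estimate A = -d^2/dz^2 log p(z) at the mode by finite differences.
eps = 1e-4
A = -(log_p_unnorm(z0 + eps) - 2 * log_p_unnorm(z0) + log_p_unnorm(z0 - eps)) / eps ** 2

# 3. The Laplace approximation is q(z) = N(z | z0, 1/A).
print(f"mode z0 = {z0:.3f}, precision A = {A:.3f}, variance = {1/A:.3f}")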
Recap: Computational Challenge of the Bayesian Framework
Remember: the big challenge is computing the posterior distribution. There are several main approaches:
- Analytical integration: If we use conjugate priors, the posterior distribution can be computed analytically (we saw this for Bayesian linear regression).
- Gaussian (Laplace) approximation: Approximate the posterior distribution with a Gaussian. Works well when there is a lot of data compared to the model complexity (as the posterior is close to Gaussian). We will consider the Laplace approximation next.
- Monte Carlo integration: The dominant current approach is Markov chain Monte Carlo (MCMC) -- simulate a Markov chain that converges to the posterior distribution. It can be applied to a wide variety of problems.
- Variational approximation: A cleverer way to approximate the posterior. It often works much faster than MCMC, but it is not as general.
Laplace Approximation
We will use the following notation:
p(z) = f(z) / Z, where Z = ∫ f(z) dz is the normalization constant.
We can evaluate f(z) point-wise, but cannot evaluate Z (for example, a posterior distribution whose normalizing constant is intractable).
Goal: Find a Gaussian approximation q(z) which is centered on a mode of the distribution p(z).
Laplace Approximation
We will use the following notation: p(z) = f(z) / Z.
At a stationary point z_0 the gradient of f(z) vanishes.
Consider a Taylor approximation of ln f(z) around z_0:
ln f(z) ≈ ln f(z_0) - (1/2) (z - z_0)^T A (z - z_0),
where A is the Hessian matrix:
A = -∇∇ ln f(z) |_{z = z_0}.
Exponentiating both sides:
f(z) ≈ f(z_0) exp{ -(1/2) (z - z_0)^T A (z - z_0) }.
Laplace Approximation
We will use the following notation: p(z) = f(z) / Z.
Using the Taylor approximation, we get:
f(z) ≈ f(z_0) exp{ -(1/2) (z - z_0)^T A (z - z_0) }.
Hence the Gaussian approximation to p(z) is:
q(z) = N(z | z_0, A^{-1}) = |A|^{1/2} / (2π)^{M/2} exp{ -(1/2) (z - z_0)^T A (z - z_0) },
where z_0 is the mode of p(z) and A is the Hessian:
A = -∇∇ ln f(z) |_{z = z_0}.
Laplace Approximation
Using the Taylor approximation, we get:
f(z) ≈ f(z_0) exp{ -(1/2) (z - z_0)^T A (z - z_0) }.
Bayesian inference: p(θ | D) = p(D | θ) p(θ) / p(D).
Identify: f(θ) = p(D | θ) p(θ) and Z = p(D).
The posterior is approximately Gaussian around the MAP estimate θ_MAP:
p(θ | D) ≈ N(θ | θ_MAP, A^{-1}), where A = -∇∇ ln p(D | θ) p(θ) |_{θ = θ_MAP}.
Laplace Approximation
Using the Taylor approximation of f(θ) = p(D | θ) p(θ), we can approximate the model evidence Z = p(D) = ∫ p(D | θ) p(θ) dθ using the Laplace approximation:
ln p(D) ≈ ln p(D | θ_MAP) + [ ln p(θ_MAP) + (M/2) ln 2π - (1/2) ln |A| ].
The first term is the data fit; the bracketed term is the Occam factor, which penalizes model complexity.
Bayesian Information Criterion
The BIC can be obtained from the Laplace approximation:
ln p(D) ≈ ln p(D | θ_MAP) - (D/2) ln N,
by taking the large-sample limit (N → ∞), where N is the number of data points and D denotes the number of well-determined parameters.
- Quick and easy; does not depend on the prior.
- Can use the maximum likelihood estimate instead of the MAP estimate.
- Danger: Counting parameters can be tricky (e.g. infinite models).
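As a small worked sketch (with hypothetical helper names and made-up example numbers), the BIC score of a fitted model is just the maximized log-likelihood penalized by (D/2) ln N:

import numpy as np

def bic_score(log_lik_max, n_params, n_data):
    """Large-N Laplace approximation to the log evidence:
    ln p(D) ~= ln p(D | theta_ML) - (D/2) ln N."""
    return log_lik_max - 0.5 * n_params * np.log(n_data)

# Example: compare two hypothetical models fit to N = 1000 points.
print(bic_score(log_lik_max=-420.0, n_params=5, n_data=1000))
print(bic_score(log_lik_max=-415.0, n_params=20, n_data=1000))  # better fit, bigger penalty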
Bayesian Logistic Regression
Remember the likelihood:
p(t | w) = ∏_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n}, with y_n = σ(w^T φ_n),
and the Gaussian prior: p(w) = N(w | m_0, S_0).
The log of the posterior takes the form:
ln p(w | t) = -(1/2) (w - m_0)^T S_0^{-1} (w - m_0)   [log-prior term]
              + ∑_{n=1}^N { t_n ln y_n + (1 - t_n) ln(1 - y_n) } + const.   [log-likelihood term]
We first maximize the log-posterior to get the MAP estimate w_MAP. The inverse of the covariance is given by the matrix of second derivatives of the negative log-posterior:
S_N^{-1} = S_0^{-1} + ∑_{n=1}^N y_n (1 - y_n) φ_n φ_n^T.
The Gaussian approximation to the posterior distribution is then:
q(w) = N(w | w_MAP, S_N).
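Here is a minimal numerical sketch of this Laplace approximation, assuming (as illustrative choices only) a prior N(0, alpha^{-1} I) and plain gradient ascent for the MAP estimate:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_logistic(Phi, t, alpha=1.0, lr=0.1, n_iters=2000):
    """Gaussian (Laplace) approximation q(w) = N(w | w_MAP, S_N) for
    Bayesian logistic regression with prior p(w) = N(0, alpha^{-1} I)."""
    N, M = Phi.shape
    w = np.zeros(M)
    # MAP estimate by gradient ascent on the log-posterior.
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (t - y) - alpha * w     # grad of log-likelihood + log-prior
        w += lr * grad / N
    # Posterior precision: S_N^{-1} = alpha I + sum_n y_n (1 - y_n) phi_n phi_n^T
    y = sigmoid(Phi @ w)
    S_N_inv = alpha * np.eye(M) + (Phi * (y * (1 - y))[:, None]).T @ Phi
    return w, np.linalg.inv(S_N_inv)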
Predictive Distribution
The predictive distribution for class C_1, given a new input x* with features φ* = φ(x*), is obtained by marginalizing with respect to the posterior distribution, which is itself approximated by a Gaussian q(w):
p(C_1 | φ*, t) = ∫ p(C_1 | φ*, w) p(w | t) dw ≈ ∫ σ(w^T φ*) q(w) dw,
with the corresponding probability for class C_2 given by p(C_2 | φ*, t) = 1 - p(C_1 | φ*, t).
This is still not tractable: the convolution of a Gaussian with the logistic sigmoid cannot be evaluated analytically.
Predictive Distribution
Note that the logistic function depends on w only through its projection onto φ*. Denoting a = w^T φ*, we have:
σ(w^T φ*) = ∫ δ(a - w^T φ*) σ(a) da,
where δ is the Dirac delta function. Hence
∫ σ(w^T φ*) q(w) dw = ∫ σ(a) p(a) da, where p(a) = ∫ δ(a - w^T φ*) q(w) dw.
This is a 1-dimensional integral. Let us characterize p(a).
The delta function imposes a linear constraint on w: it forms a marginal distribution from the joint q(w) by marginalizing out all directions orthogonal to φ*. Since q(w) is Gaussian, the marginal p(a) is also Gaussian.
Predictive Distribution
We can evaluate the mean and variance of the marginal p(a):
μ_a = E[a] = w_MAP^T φ*,
σ_a^2 = var[a] = φ*^T S_N φ*.
Hence we obtain the approximate predictive distribution:
p(C_1 | φ*, t) ≈ ∫ σ(a) N(a | μ_a, σ_a^2) da.
This has the same form as the predictive distribution for the Bayesian linear regression model. The integral is 1-dimensional and can be further approximated via the probit-based approximation:
∫ σ(a) N(a | μ_a, σ_a^2) da ≈ σ( κ(σ_a^2) μ_a ), where κ(σ^2) = (1 + π σ^2 / 8)^{-1/2}.
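A minimal sketch of this approximate predictive, assuming the Laplace posterior q(w) = N(w_MAP, S_N) from the previous slides (hypothetical variable names):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(phi_star, w_map, S_N):
    """Approximate p(C1 | phi*, t) via the probit-based approximation
    to the sigmoid-Gaussian integral."""
    mu_a = w_map @ phi_star                 # mean of a = w^T phi*
    var_a = phi_star @ S_N @ phi_star       # variance of a
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)
    return sigmoid(kappa * mu_a)

Note that a larger var_a shrinks kappa below one and so pulls the prediction toward 0.5, reflecting posterior uncertainty in w.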
Midterm Review
- Polynomial curve fitting: generalization, overfitting
- Decision theory: minimizing misclassification rate / minimizing the expected loss
- Loss functions for regression
Midterm Review
- Bernoulli and multinomial random variables (means, variances)
- Multivariate Gaussian distribution (form, mean, covariance)
- Maximum likelihood estimation for these distributions
- Exponential family / maximum likelihood estimation / sufficient statistics for the exponential family
- Linear basis function models / maximum likelihood and least squares
Midterm Review
- Regularized least squares: ridge regression
- Bias-variance decomposition (high-variance vs. low-bias regimes)
Midterm Review
- Bayesian inference: likelihood, prior, posterior
- Marginal likelihood / predictive distribution
- Bayesian linear regression: parameter estimation / posterior distribution / predictive distribution
- Marginal likelihood (normalizing constant)
- Bayesian model comparison / evidence approximation
- Matching data and model complexity
Midterm Review
- Classification models: discriminant functions
- Fisher's linear discriminant
- Perceptron algorithm
- Probabilistic generative models / Gaussian class conditionals / maximum likelihood estimation
Midterm Review
- Discriminative models / logistic regression / maximum likelihood estimation
- Laplace approximation
- BIC
- Bayesian logistic regression / predictive distribution