# These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Save this PDF as:

Size: px
Start display at page:

Download "These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop"

## Transcription

1 Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

2 Linear Models for Classification Goal: take input vector x and map it onto one of K discrete classes Input space divided into decision regions divided by decision surfaces or decision boundaries Consider linear models: those separable by (D-1) dimensional hyperplanes in the D-dimensional input space. Perfect separation: linearly-separable Recall simplest linear regression model: y(x) = w T x + w 0 Use activation function f(. ) to map function onto discrete classes y(x) = f ( w T ) x + w 0 Due to f(. ) these models are no longer linear in the parameters 2

3 Discriminant Functions 2 class case, simplest: y(x) = w T x + w 0 where w is a weight vector and wo is the bias Decision boundary is 0. Consider 2 points xa and xa lying on decision surface. Because y(xa) = y(xb) = 0 we have w T (x A x B ) = 0 thus vector w is orthogonal to every vector lying within decision surface If x is on decision surface then y(x)=0 indicating that: w T x w = w 0 w We thus see that bias wo decides location of decision surface 3

4 Geometry of linear discriminant y > 0 y = 0 y < 0 x 2 R 2 R 1 w x y(x) w x x 1 w 0 w Decision surface is perpendicular to w. Displacement controlled by w0. 4

5 Multiple Classes Not generally good idea to use multiple 2-class classifiers to do K- class classification Leads to ambiguous regions. See figure:? C 1 C 3 R 1 R 1 R 2 C 1? R 3 C 1 R 3 not C 1 C 2 C 2 R 2 C 3 not C 2 C 2 Left: distinguish points in Ck from those not in Ck. Right: pairwise separation of classes Ck and Cj. Both lead to ambiguous regions. 5

6 Single K-class classifier Single discriminant comprising K linear functions of form y k (x) = wk T x + w k0 Point x belongs in class Ck if yk(x) > yj(x) for all j k Decision boundary between Ck and Cj is had by yk(x) = yj(x) and corresponds to (D-1)-dimensional hyperplane (w k w j ) T x + (w k0 w j0 ) = 0 Decision region singly connected and convex (due to linearity of discriminant functions. See Bishop pg 184. R j R i R k x A ˆx x B 6

7 Least Squares Classification Approximates conditional expectation E[t x] Each class Ck described by its own linear model y k (x) = w T k x + w k0 Minimize sum-of-squared errors on set of models Works very poorly. E.g. very sensitive compared to logistic regression to outliers

8 Least Squares Classification Second example: even for easy dataset with clear decision boundaries one class is underrepresented (compare to logistic regression below) Behavior not surprising when we consider that it corresponds to ML under assumption of Gaussian conditional distribution. Binary target vectors are far from Gaussian

9 Fisher s linear discriminant Can look at classification as dimensionality reduction: take D dimensional input vector and project it down to 1 dimension using: y = w T x Use threshold such that y w0 is classified as C0 (Yields standard linear classifier from previous slide). Adjust weight vector to project with maximum class separation. Consider 2-class version: m 1 = 1 N 1 n C 1 x n m 2 = 1 N 2 n C 2 x n Simplest measure of class separation is separation of class means. Choose w to maximize m 2 m 1 = w T (m 2 m 1 ) where m k = w T m k si the mean of the projected data from class Ck. 9

10 Fisher s linear discriminant Still problematic: when there is strongly non-diagonal covariance between class distributions, separation is poor: Fishers: maximize separating between class means while minimizing variance within each class. Within class variance is: s 2 k = (y n m k ) 2 where y n = w T x n n C k Total within-class variance is s s 2 2 Fisher criterion is ratio of the between-class mean and the within-class J(w) = (m 2 m 1 ) 2 s s2 2 10

11 Geometry of Fisher s Left: samples from two classes along with histograms resulting from projection onto line joining the class means (reproduced from previous slide). Right: corresponding projection based on Fisher linear discriminant. 11

12 Fisher s criterion: Fisher s linear discriminant J(w) = (m 2 m 1 ) 2 s s2 2 Can relate this to weights w by rewriting as: where SB is between-class covariance and where SW is within-class covariance S W = n C 1 (x n m 1 )(x n m 1 ) T + n C 2 (x n m 2 )(x n m 2 ) T J(w) = wt S B w w T S W w S B = (m 2 m 1 )(m 2 m 1 ) T Differentiating J(w) with respect to w reveals maximum when: (w T S B w)s W w = (w T S W w)s B w SBw is always in the direction of (m2 - m1). Don t care about magnitiude of w so drop scalar factors Multiply both sides by yields Fisher s linear discriminant S 1 W w S 1 W (m 2 m 1 ) 12 (w T S B w) and (w T S W w)

13 Fisher s linear discriminant Least squares approach to linear discriminant was to make model predictions as close as possible to target values Fisher was derived based on maximum separation of classes Fisher can still be derived as special case of least squares Trick is to recode target for C1 to be N/N1 where N1 is the number of patterns in class C1 and N is the total number of patterns. For C2 target is -N/N2. Approximates reciprocal of the prior for the class. Read Fisher s criterion can be extended to >2 classes. Read

14 Rosenblatt (1962) Perceptron algorithm. Linear model with step activation function: y(x) = f(w T φ(x)) f(a) = { +1, a 0 1, a < 0. Train using perceptron criterion E P = n M w T φ n t n where M is the set of misclassified patterns. Note that direct misclassification using total number of miscalssified patterns will not work because of nonlinear f() (the gradient is ill-conditioned). Instead we seek a weight vector such that x n C 1, w T φ(x n ) > 0 and x n C 2, w T φ(x n ) < 0. 14

15 Perceptron algorithm. Total error function is piecewise linear (across all misclassified patterns). Stochastic gradient descent: w (τ+1) = w (τ) η E P (w) = w (τ) + ηφ n t n Update is not a function of w thus η can be equal to 1 Perceptron convergence theorem: If there exists an exact solution (i.e. if training set is indeed linearly separable) then PA will find a solution in finite number of steps. Attacked by Minsky and Papert in Perceptrons (1969). Attack valid only for single-layer perceptrons. Still, stopped research in neural computation for nearly a decade. 15

16 Geometry of Perceptron Convergence for 2 classes (red and blue). Black arrow = w; black line is decision boundary. On top left, the point circled in green is misclassified thus its feature vector is added to w. This is repeated on the bottom 16

17 Probabilistic Generative Models Model class-conditional densities Posterior probability for class C1: p(c 1 x) = where we have defined σ is the logistic sigmoid; Note that and that the inverse of σ is the logit function a = ln = ( σ 1 σ representing ratio of probabilities p(x C k ) p(x C 1 )p(c 1 ) p(x C 1 )p(c 1 ) + p(x C 2 )p(c 2 ) exp( a) = σ(a) ) a = ln p(x C 1)p(C 1 ) p(x C 2 )p(c 2 ) σ( a) = 1 σ(a) a = ln[p(c 1 x)/p(c 2 x)] 17

18 Probabilistic Generative Models Generalization to multiple classes: normalized exponential function p(c k x) = where = p(x C k )p(c k ) j p(x C j)p(c j ) exp(a k ) j exp(a j) a k = ln p(x C k )p(c k ) Also known as softmax function because it is a smoothed version of max. Different representations for class-conditional densities yield different consequences in how classification is done [see following slides...] 18

19 Continuous inputs First assume all classes share same covariance matrix and only 2 classes. This yields: p(c 1 x) = σ(w T x + w 0 ) where: w = Σ 1 (µ 1 µ 2 ) w 0 = 1 2 µt 1 Σ 1 µ µt 2 Σ 1 µ 2 + ln p(c 1) p(c 2 ) Quadratic term from Gaussians vanishes. priors p(ck) only enter via bias parameter. Thus make parallel shift in decision boundry. For general case of K classes we have: a k (x) = wk T x + w k0 where: w k = Σ 1 (µ k ) w k0 = 1 2 µt k Σ 1 µ k + ln p(c k ) 19

20 Continuous inputs Left: class-conditional densities for two classes, red and blue. Right, corresponding posterior probability p(c1 x) which is given by a logistic sigmoid of a linear function of x. On right: proportion of red yields p(c1 x), proportion of blue yields p(c2 x) = 1-p(C1 x) 20

21 Linear versus quadratic When covariance is shared by classes, decision boundary is linear. When covariances are unlinked, decision boundary is quadratic. See Bishop p for details Left: class conditional densities for three Gaussians. Green and red have same covariance matrix. Right: Decision boundary is linear where covariances are the same, quadratic where covariances differ. 21

22 Maximum likelihood Now have a parametric form for class-conditional densities p(x Ck). Can now determine values of parameters and priors p(ck). p(x n, C 1 ) = p(c 1 )p(x n C 1 ) = πn (x n µ 1, Σ) p(x n, C 2 ) = p(c 2 )p(x n C 2 ) = (1 π)n (x n µ 2, Σ) Yields likelihood: p(t π, µ 1, µ 2, Σ) = N [πn (x n µ 1, Σ)] t n [(1 π)n (x n µ 2, Σ)] 1 t n n=1 Maximize log-likelihood. For π relevant terms are: N {t n ln π + (1 t n ) ln(1 π)} n=1 set derivative with respect to π to 0 yields: π = 1 N N n=1 t n = N 1 N = N 1 N 1 + N 2 22

23 Probabilistic Discriminative Models Generative approach: For two-class classification, posterior of C1 is logistic sigmoid acting on linear function of x for many distributions p(x Ck). Used ML to determine parameters of densities as well as class priors p(ck). Then use Bayes to determine posterior class probabilities Discriminiative approach: Learn parameters of general linear model explicitly using ML. Maximize likelihood of conditional p(ck x) directly. Advantages of discriminiative approach: fewer parameters, better performance when class-conditional densities do not correspond well to true distributions 23

24 Fixed Basis Functions Left: original data is not linearly separable. Two Gaussian basis functions ϕ1 and ϕ1 are defined, as shown by green lines. Right: New linearly-separable space as created by the Gaussians. Line is nonlinear in the original variables. 24

25 Logistic regression Posterior probability of class C1written as a logistic sigmoid acting on linear function of feature vector ϕ p(c 1 φ) = y(φ) = σ(w T φ) where σ(. ) is the logistic sigmoid Called logistic regression but really a model of classification More compact than ML-fitting of Gaussians: for M parameters, Gaussian model uses 2M parameters for the means and M(M + 1) /2 parameters for shared covariance matrix. Grows quadratically. Use ML to determine parameters. Recall derivitive of logistic sigmoid is: p(t w) = dσ da N n=1 = σ(1 σ) y t n n { 1 yn } 1 tn 25

26 Logistic regression Negative log of likelihood yields cross entropy: N { E(w) = ln p(t, w) = tn ln y n + (1 t n ) ln(1 t n ) } n=1 Gradient with respect to w yields: E(w) = N (y n t n )φ n n=1 w ML = ( Φ T Φ ) 1 Φ T t Takes same form as the gradient for sum-of-squares error (eqn B3.13): N { ln p(t w, β) = tn w T φ(x n ) } φ(x n ) T n=1 However logistic sigmoid means no closed-form quadratic solution (here is sum of squared error for comparison:) w ML = ( Φ T Φ ) 1 Φ T t where Φ is the design matrix 26

27 Iterative reweighted least squares (IRLS) Efficient iterative optimization: Newton-Raphson w new = w old H 1 E(w) where H is the Hessian matrix comprising the second derivatives of E(w) with respect to w. For sum-of-squares error this can be done in one step because the error function is quadratic (see p207) For cross-entropy we get a similar set of normal equations for weighted least squares where the weighting depends on w This dependancy forces us to apply the update iteratively. (see p208) 27

28 Final notes on chapter 4 Logistic regression suffers from overfitting. ML solution corresponds to σ=0.5 which is where w T ϕ=0. Magnitude of w goes to infinity; slope of the sigmoid becomes infinitely steep. All positive examples take a posterior of p (Ck x)=1. Not resolved by having more data. But is resolved using (a)regularization or (b) a Bayesian approach involving an appropriate prior over w. (4.3.4) Multiclass logistic regression is an easy extension of the two-class formulation. Replace logistic sigmoid with softmax and use maximum likelihood. (4.3.5) Probit regression is of theoretical interest but is in practice similar to logistic regression. (4.3.6) Canonical link functions provide a principled framework for matching activation functions and error functions. Based on assumptions about the conditional distribution for the target variable are not discussed further. Please read them. 28

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Linear Models for Classification

Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Steven J Zeil Old Dominion Univ. Fall 200 Discriminant-Based Classification Linearly Separable Systems Pairwise Separation 2 Posteriors 3 Logistic Discrimination 2 Discriminant-Based Classification Likelihood-based:

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### Lecture 8 February 4

ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

### Machine Learning Logistic Regression

Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### CSC 411: Lecture 07: Multiclass Classification

CSC 411: Lecture 07: Multiclass Classification Class based on Raquel Urtasun & Rich Zemel s lectures Sanja Fidler University of Toronto Feb 1, 2016 Urtasun, Zemel, Fidler (UofT) CSC 411: 07-Multiclass

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons

University of Cambridge Engineering Part IIB Module 4F0: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons x y (x) Inputs x 2 y (x) 2 Outputs x d First layer Second Output layer layer y

### Lecture 6: Logistic Regression

Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

### CSI:FLORIDA. Section 4.4: Logistic Regression

SI:FLORIDA Section 4.4: Logistic Regression SI:FLORIDA Reisit Masked lass Problem.5.5 2 -.5 - -.5 -.5 - -.5.5.5 We can generalize this roblem to two class roblem as well! SI:FLORIDA Reisit Masked lass

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### Logistic Regression (1/24/13)

STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

### 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 le-tex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### Feed-Forward mapping networks KAIST 바이오및뇌공학과 정재승

Feed-Forward mapping networks KAIST 바이오및뇌공학과 정재승 How much energy do we need for brain functions? Information processing: Trade-off between energy consumption and wiring cost Trade-off between energy consumption

### Wes, Delaram, and Emily MA751. Exercise 4.5. 1 p(x; β) = [1 p(xi ; β)] = 1 p(x. y i [βx i ] log [1 + exp {βx i }].

Wes, Delaram, and Emily MA75 Exercise 4.5 Consider a two-class logistic regression problem with x R. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample for

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### Local classification and local likelihoods

Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

### IFT3395/6390. Machine Learning from linear regression to Neural Networks. Machine Learning. Training Set. t (3.5, -2,..., 127, 0,...

IFT3395/6390 Historical perspective: back to 1957 (Prof. Pascal Vincent) (Rosenblatt, Perceptron ) Machine Learning from linear regression to Neural Networks Computer Science Artificial Intelligence Symbolic

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### Notes on Support Vector Machines

Notes on Support Vector Machines Fernando Mira da Silva Fernando.Silva@inesc.pt Neural Network Group I N E S C November 1998 Abstract This report describes an empirical study of Support Vector Machines

### MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr

Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Department of Applied Mathematics Ecole Centrale Paris Galen

### Support Vector Machine (SVM)

Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

### Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

### Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### degrees of freedom and are able to adapt to the task they are supposed to do [Gupta].

1.3 Neural Networks 19 Neural Networks are large structured systems of equations. These systems have many degrees of freedom and are able to adapt to the task they are supposed to do [Gupta]. Two very

### Introduction to Machine Learning

Introduction to Machine Learning Prof. Alexander Ihler Prof. Max Welling icamp Tutorial July 22 What is machine learning? The ability of a machine to improve its performance based on previous results:

### Lecture 2: The SVM classifier

Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

### Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### Classification by Pairwise Coupling

Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating

### Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

### Nonlinear Optimization: Algorithms 3: Interior-point methods

Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,

### Probability Theory. Elementary rules of probability Sum rule. Product rule. p. 23

Probability Theory Uncertainty is key concept in machine learning. Probability provides consistent framework for the quantification and manipulation of uncertainty. Probability of an event is the fraction

### Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. huang@cs.qc.cuny.edu

Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning Logistics Lectures M 9:30-11:30 am Room 4419 Personnel

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Machine Learning in Spam Filtering

Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### LCs for Binary Classification

Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it

### The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

### Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation

### Efficient Streaming Classification Methods

1/44 Efficient Streaming Classification Methods Niall M. Adams 1, Nicos G. Pavlidis 2, Christoforos Anagnostopoulos 3, Dimitris K. Tasoulis 1 1 Department of Mathematics 2 Institute for Mathematical Sciences

### ELEC-E8104 Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems

Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems Minimum Mean Square Error (MMSE) MMSE estimation of Gaussian random vectors Linear MMSE estimator for arbitrarily distributed

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### 3F3: Signal and Pattern Processing

3F3: Signal and Pattern Processing Lecture 3: Classification Zoubin Ghahramani zoubin@eng.cam.ac.uk Department of Engineering University of Cambridge Lent Term Classification We will represent data by

### CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### Introduction to Artificial Neural Networks. Introduction to Artificial Neural Networks

Introduction to Artificial Neural Networks v.3 August Michel Verleysen Introduction - Introduction to Artificial Neural Networks p Why ANNs? p Biological inspiration p Some examples of problems p Historical

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

### Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

### Loss Functions for Preference Levels: Regression with Discrete Ordered Labels

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,

### Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

### Math for Machine Learning

Math for Machine Learning Author: Parameswaran Raman January 5, 015 Abstract In this post, I want to discuss the connections between Machine Learning and various other fields (especially Mathematics),

### Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

### The Probit Link Function in Generalized Linear Models for Data Mining Applications

Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/\$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### One-Class Classifiers: A Review and Analysis of Suitability in the Context of Mobile-Masquerader Detection

Joint Special Issue Advances in end-user data-mining techniques 29 One-Class Classifiers: A Review and Analysis of Suitability in the Context of Mobile-Masquerader Detection O Mazhelis Department of Computer

### Logistic Regression for Spam Filtering

Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used

### Support Vector Machines for Classification and Regression

UNIVERSITY OF SOUTHAMPTON Support Vector Machines for Classification and Regression by Steve R. Gunn Technical Report Faculty of Engineering, Science and Mathematics School of Electronics and Computer

### 2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came

### Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### Neural Networks. Introduction to Artificial Intelligence CSE 150 May 29, 2007

Neural Networks Introduction to Artificial Intelligence CSE 150 May 29, 2007 Administration Last programming assignment has been posted! Final Exam: Tuesday, June 12, 11:30-2:30 Last Lecture Naïve Bayes

### INTRODUCTION TO NEURAL NETWORKS

INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables - logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit