# Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 1

2 Probabilistic Classification Given: N labeled training examples {x n, y n } N, x n R D, y n {0, 1} X : N D feature matrix, y : N 1 label vector y n = 1: positive example, y n = 0: negative example Goal: Learn a classifier that predicts the binary label y for a new input x Want a probabilistic model to be able to also predict the label probabilities p(y n = 1 x n, w) = µ n p(y n = 0 x n, w) = 1 µ n µ n (0, 1) is the probability of y n being 1 Note: Features x n assumed fixed (given). Only labels y n being modeled w is the model parameter (to be learned) How do we define µ n (want it to be a function of w and input x n)? Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 2

3 Logistic Regression Logistic regression defines µ using the sigmoid function µ = σ(w 1 x) = 1 + exp( w x) = exp(w x) 1 + exp(w x) Sigmoid computes a real-valued score (w x) for input x and squashes it between (0,1) to turn this score into a probability (of x s label being 1) Thus we have p(y = 1 x, w) = µ = σ(w 1 x) = 1 + exp( w x) = exp(w x) 1 + exp(w x) p(y = 0 x, w) = 1 µ = 1 σ(w x) = exp(w x) Note: If we assume y { 1, +1} instead of y {0, 1} then p(y x, w) = 1 1+exp( y w x) Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 3

4 Logistic Regression: A Closer Look.. What s the underlying decision rule in Logistic Regression? At the decision boundary, both classes are equiprobable. Thus: p(y = 1 x, w) = p(y = 0 x, w) exp(w x) 1 + exp(w x) = exp(w x) = 1 w x = exp(w x) Thus the decision boundary of LR is nothing but a linear hyperplane, just like Perceptron, Support Vector Machine (SVM), etc. Therefore y = 1 if w x 0, otherwise y = 0 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 4

5 Interpreting the probabilities.. Recall that p(y = 1 x, w) = µ = exp( w x) Note that the score w x is also a measure of distance of x from the hyperplane (score is positive for pos. examples, negative for neg. examples) High positive score w x: High probability of label 1 High negative score w x: Low prob. of label 1 (high prob. of label 0) Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 5

6 Logistic Regression: Parameter Estimation Recall, each label y n is binary with prob. µ n. Assume Bernoulli likelihood: where µ n = N N p(y X, w) = p(y n x n, w) = µ yn n (1 µn)1 yn exp(w xn) 1+exp(w xn) Negative log-likelihood Plugging in µ n = NLL(w) = log p(y X, w) = N (y n log µ n + (1 y n) log(1 µ n)) exp(w xn) and chugging, we get (verify yourself) 1+exp(w xn) N NLL(w) = (y nw x n log(1 + exp(w x n))) To do MLE for w, we ll minimize negative log-likelihood NLL(w) w.r.t. w Important note: NLL(w) is convex in w, so global minima can be found Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 6

7 MLE Estimation for Logistic Regression We have NLL(w) = N (ynw x n log(1 + exp(w x n))) Taking the derivative of NLL(w) w.r.t. w NLL(w) w N = [ (y nw x n log(1 + exp(w x n)))] w ( ) N = y nx n exp(w x n) (1 + exp(w x n)) x n Can t get a closed form estimate for w by setting the derivative to zero One solution: Iterative minimization via gradient descent. Gradient is: g = NLL(w) N = (y n µ n)x n = X (µ y) w Intuitively, a large error on x n (y n µ n ) will be large large contribution (positive/negative) of x n to the gradient Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 7

8 MLE Estimation via Gradient Descent Gradient descent (GD) or steepest descent w t+1 = w t η t g t where η t is the learning rate (or step size), and g t is gradient at step t GD can converge slowly and is also sensitive to the step size Several ways to remedy this 1. E.g., Choose the optimal step size η t by line-search Add a momentum term to the updates w t+1 = w t η tg t + α t(w t w t 1) Use methods such as conjugate gradient Use second-order methods (e.g., Newton s method) to exploit the curvature of the objective function NLL(w): Require the Hessian matrix 1 Also see: A comparison of numerical optimizers for logistic regression by Tom Minka Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 8

9 MLE Estimation via Newton s Method Update via Newton s method: w t+1 = w t H 1 t g t where H t is the Hessian matrix at step t Hessian: double derivative of the objective function (NLL(w) in this case) H = 2 NLL(w) w w Recall that the gradient is: g = N Thus H = g w = w Using the fact that µn w = g w (yn µn)x n = X (µ y) N (yn µn)x n = N µn x w n ( ) = exp(w x n) w 1+exp(w x n) = µ n (1 µ n )x n, we have N H = µ n(1 µ n)x nx n = X SX where S is a diagonal matrix with its n th diagonal element = µ n (1 µ n ) Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 9

10 MLE Estimation via Newton s Method Update via Newton s method: w t+1 = w t H 1 t g t = w t (X S tx) 1 X (µ t y) = w t + (X S tx) 1 X (y µ t ) = (X S tx) 1 [(X S tx)w t + X (y µ t )] = (X S tx) 1 X [S txw t + y µ t ] = (X S tx) 1 X S t[xw t + S 1 (y µ t )] = (X S tx) 1 X S t ŷ t Interpreting the solution found by Newton s method: It basically solves an Iteratively Reweighted Least Squares (IRLS) problem N arg min S w tn(ŷ tn w x n) 2 Note that the (redefined) response vector ŷ t changes in each iteration Each term in the objective has weight S tn (changes in each iteration) The weight S tn is the n th diagonal element of S t Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 10

11 MAP Estimation for Logisic Regression MLE estimate of w can lead to overfitting. Solution: use a prior on w Just like the linear regression case, let s put a Gausian prior on w p(w) = N (0, λ 1 I D ) exp( λ 2 w w) MAP objective: MLE objective + log p(w) Leads to the objective (negative of log posterior, ignoring constants): NLL(w) + λ 2 w w Estimation of w proceeds the same way as MLE excepet that now we have Gradient: g = X (µ y) + λw Hessian: H = X SX + λi D Can now apply iterative optimization (gradient des., Newton s method, etc.) Note: MAP estimation for log. reg. is equivalent to regularized log. reg. Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 11

12 Fully Bayesian Estimation for Logistic Regression What about the full posterior on w? Not as easy to estimate as in the linear regression case! Reason: likelihood (logistic-bernoulli) and prior (Gaussian) not conjugate Need to approximate the posterior in this case A crude approximation: Laplace approximation: Approximate a posterior by a Gaussian with mean = MAP estimate and covariance = inverse hessian p(w X, y) = N (w MAP, H 1 ) Will see other ways of approximating the posterior later during the semester Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 12

13 Derivation of the Laplace Approximation The posterior p(w X, y) = p(y X,w)p(w). Let s approximate it as p(y X) p(w X, y) = exp( E(w)) Z where E(w) = log p(y X, w)p(w) and Z is the normalizer Expand E(w) around its minima (w = w MAP) using 2 nd order Taylor exp. Thus the posterior E(w) E(w ) + (w w ) g (w w ) H(w w ) = E(w ) (w w ) H(w w ) (because g = 0 at w )) p(w X, y) exp( E(w )) exp( 1 2 (w w ) H(w w ))) Z Using w p(w X, y)d w = 1, we get Z = exp( E(w ))(2π)D/2 H 1/2. Thus p(w X, y) = N (w, H 1 ) Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 13

14 Multinomial Logistic Regression Logistic reg. can be extended to handle K > 2 classes) In this case, y n {0, 1, 2,..., K 1} and label probabilities are defined as p(y n = k x n, W) = exp(w k x n) K l=1 exp(w l x n) = µ nk µ nk : probability that example n belongs to class k. Also, K l=1 µ nl = 1 W = [w 1 w 2... w K ] is D K weight matrix (column k for class k) Likelihood for the multinomial (or multinoulli) logistic regression model p(y X, W) = N K µ y nl nl l=1 where y nl = 1 if true class of example n is l and y nl = 0 for all other l l Can do MLE/MAP/fully Bayesian estimation for W similar to the binary case Decision rule: y = arg max l=1,...,k w l x, i.e., predict the class whose weight vector gives the largest score (or, equivalently, the largest probability) Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 14

15 Next class: Generalized Linear Models Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification: Logistic Regression 15

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### Lecture 6: Logistic Regression

Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### Lecture 2: The SVM classifier

Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### 3F3: Signal and Pattern Processing

3F3: Signal and Pattern Processing Lecture 3: Classification Zoubin Ghahramani zoubin@eng.cam.ac.uk Department of Engineering University of Cambridge Lent Term Classification We will represent data by

### Machine Learning Logistic Regression

Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

### Linear Models for Classification

Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

### Natural Language Processing. Today. Logistic Regression Models. Lecture 13 10/6/2015. Jim Martin. Multinomial Logistic Regression

Natural Language Processing Lecture 13 10/6/2015 Jim Martin Today Multinomial Logistic Regression Aka log-linear models or maximum entropy (maxent) Components of the model Learning the parameters 10/1/15

### Bayes and Naïve Bayes. cs534-machine Learning

Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

### Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### Introduction to Machine Learning

Introduction to Machine Learning Prof. Alexander Ihler Prof. Max Welling icamp Tutorial July 22 What is machine learning? The ability of a machine to improve its performance based on previous results:

### CSC 411: Lecture 07: Multiclass Classification

CSC 411: Lecture 07: Multiclass Classification Class based on Raquel Urtasun & Rich Zemel s lectures Sanja Fidler University of Toronto Feb 1, 2016 Urtasun, Zemel, Fidler (UofT) CSC 411: 07-Multiclass

### Logistic Regression for Data Mining and High-Dimensional Classification

Logistic Regression for Data Mining and High-Dimensional Classification Paul Komarek Dept. of Math Sciences Carnegie Mellon University komarek@cmu.edu Advised by Andrew Moore School of Computer Science

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### Lecture 8 February 4

ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

### The Exponential Family

The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### Introduction to Logistic Regression

OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

### Wes, Delaram, and Emily MA751. Exercise 4.5. 1 p(x; β) = [1 p(xi ; β)] = 1 p(x. y i [βx i ] log [1 + exp {βx i }].

Wes, Delaram, and Emily MA75 Exercise 4.5 Consider a two-class logistic regression problem with x R. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample for

### Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Steven J Zeil Old Dominion Univ. Fall 200 Discriminant-Based Classification Linearly Separable Systems Pairwise Separation 2 Posteriors 3 Logistic Discrimination 2 Discriminant-Based Classification Likelihood-based:

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### Logistic Regression (1/24/13)

STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr

Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Department of Applied Mathematics Ecole Centrale Paris Galen

### Probabilistic Discriminative Kernel Classifiers for Multi-class Problems

c Springer-Verlag Probabilistic Discriminative Kernel Classifiers for Multi-class Problems Volker Roth University of Bonn Department of Computer Science III Roemerstr. 164 D-53117 Bonn Germany roth@cs.uni-bonn.de

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. huang@cs.qc.cuny.edu

Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning Logistics Lectures M 9:30-11:30 am Room 4419 Personnel

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### IFT3395/6390. Machine Learning from linear regression to Neural Networks. Machine Learning. Training Set. t (3.5, -2,..., 127, 0,...

IFT3395/6390 Historical perspective: back to 1957 (Prof. Pascal Vincent) (Rosenblatt, Perceptron ) Machine Learning from linear regression to Neural Networks Computer Science Artificial Intelligence Symbolic

### CSI:FLORIDA. Section 4.4: Logistic Regression

SI:FLORIDA Section 4.4: Logistic Regression SI:FLORIDA Reisit Masked lass Problem.5.5 2 -.5 - -.5 -.5 - -.5.5.5 We can generalize this roblem to two class roblem as well! SI:FLORIDA Reisit Masked lass

### Logistic Regression for Spam Filtering

Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used

### Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

### Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

### 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 le-tex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

### CS229 Lecture notes. Andrew Ng

CS229 Lecture notes Andrew Ng Supervised learning Let s start by talking about a few examples of supervised learning problems Suppose we have a dataset giving the living areas and prices of 47 houses from

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### CSE 473: Artificial Intelligence Autumn 2010

CSE 473: Artificial Intelligence Autumn 2010 Machine Learning: Naive Bayes and Perceptron Luke Zettlemoyer Many slides over the course adapted from Dan Klein. 1 Outline Learning: Naive Bayes and Perceptron

### ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

### Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables - logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit

### Bayesian Multinomial Logistic Regression for Author Identification

Bayesian Multinomial Logistic Regression for Author Identification David Madigan,, Alexander Genkin, David D. Lewis and Dmitriy Fradkin, DIMACS, Rutgers University Department of Statistics, Rutgers University

### Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

### Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### MACHINE LEARNING. Introduction. Alessandro Moschitti

MACHINE LEARNING Introduction Alessandro Moschitti Department of Computer Science and Information Engineering University of Trento Email: moschitti@disi.unitn.it Course Schedule Lectures Tuesday, 14:00-16:00

### KERNEL LOGISTIC REGRESSION-LINEAR FOR LEUKEMIA CLASSIFICATION USING HIGH DIMENSIONAL DATA

Rahayu, Kernel Logistic Regression-Linear for Leukemia Classification using High Dimensional Data KERNEL LOGISTIC REGRESSION-LINEAR FOR LEUKEMIA CLASSIFICATION USING HIGH DIMENSIONAL DATA S.P. Rahayu 1,2

### Loss Functions for Preference Levels: Regression with Discrete Ordered Labels

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

### University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons

University of Cambridge Engineering Part IIB Module 4F0: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons x y (x) Inputs x 2 y (x) 2 Outputs x d First layer Second Output layer layer y

### Big Data Analytics CSCI 4030

High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

### Multilayer Perceptrons

Made wi t h OpenOf f i ce. or g 1 Multilayer Perceptrons 2 nd Order Learning Algorithms Made wi t h OpenOf f i ce. or g 2 Why 2 nd Order? Gradient descent Back-propagation used to obtain first derivatives

### INTRODUCTION TO NEURAL NETWORKS

INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Classification. Chapter 3

Chapter 3 Classification In chapter we have considered regression problems, where the targets are real valued. Another important class of problems is classification problems, where we wish to assign an

### Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

### Content-Based Recommendation

Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

### Gaussian Classifiers CS498

Gaussian Classifiers CS498 Today s lecture The Gaussian Gaussian classifiers A slightly more sophisticated classifier Nearest Neighbors We can classify with nearest neighbors x m 1 m 2 Decision boundary

### Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

### Section 5. Stan for Big Data. Bob Carpenter. Columbia University

Section 5. Stan for Big Data Bob Carpenter Columbia University Part I Overview Scaling and Evaluation data size (bytes) 1e18 1e15 1e12 1e9 1e6 Big Model and Big Data approach state of the art big model

### Designing a learning system

Lecture Designing a learning system Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x4-8845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### Machine Learning: Multi Layer Perceptrons

Machine Learning: Multi Layer Perceptrons Prof. Dr. Martin Riedmiller Albert-Ludwigs-University Freiburg AG Maschinelles Lernen Machine Learning: Multi Layer Perceptrons p.1/61 Outline multi layer perceptrons

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

### Classification using Logistic Regression

Classification using Logistic Regression Ingmar Schuster Patrick Jähnichen using slides by Andrew Ng Institut für Informatik This lecture covers Logistic regression hypothesis Decision Boundary Cost function

### Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Tianbao Yang, Qihang Lin, Rong Jin Tutorial@SIGKDD 2015 Sydney, Australia Department of Computer Science, The University of Iowa, IA, USA Department of

### Classifying Chess Positions

Classifying Chess Positions Christopher De Sa December 14, 2012 Chess was one of the first problems studied by the AI community. While currently, chessplaying programs perform very well using primarily

### Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

### Advice for applying Machine Learning

Advice for applying Machine Learning Andrew Ng Stanford University Today s Lecture Advice on how getting learning algorithms to different applications. Most of today s material is not very mathematical.

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

### L3: Statistical Modeling with Hadoop

L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...

### Lecture 6. Artificial Neural Networks

Lecture 6 Artificial Neural Networks 1 1 Artificial Neural Networks In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm

### A Potential-based Framework for Online Multi-class Learning with Partial Feedback

A Potential-based Framework for Online Multi-class Learning with Partial Feedback Shijun Wang Rong Jin Hamed Valizadegan Radiology and Imaging Sciences Computer Science and Engineering Computer Science

### Multiclass Bounded Logistic Regression Efficient Regularization with Interior Point Method

Multiclass Bounded Logistic Regression Efficient Regularization with Interior Point Method Ribana Roscher, Wolfgang Förstner rroscher@uni-bonn.de, wf@ipb.uni-bonn.de TR-IGG-P-2009-02 July 23, 2009 Technical

### LCs for Binary Classification

Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it

### Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

### Chapter 4: Artificial Neural Networks

Chapter 4: Artificial Neural Networks CS 536: Machine Learning Littman (Wu, TA) Administration icml-03: instructional Conference on Machine Learning http://www.cs.rutgers.edu/~mlittman/courses/ml03/icml03/

### Learning Kernel Logistic Regression in the Presence of Class Label Noise

Learning Kernel Logistic Regression in the Presence of Class Label Noise Jakramate Bootkrajang and Ata Kabán School of Computer Science, The University of Birmingham, Birmingham, B15 2TT, UK Abstract The