# Logistic Regression

Jia Li, Department of Statistics, The Pennsylvania State University


## Logistic Regression

Logistic regression preserves linear classification boundaries. By the Bayes rule,

$$\hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x).$$

The decision boundary between classes $k$ and $l$ is determined by the equation

$$\Pr(G = k \mid X = x) = \Pr(G = l \mid X = x).$$

Dividing both sides by $\Pr(G = l \mid X = x)$ and taking the log, the above equation is equivalent to

$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = 0.$$

Since we enforce a linear boundary, we can assume

$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = a_0^{(k,l)} + \sum_{j=1}^p a_j^{(k,l)} x_j.$$

For logistic regression, there are restrictive relations between the $a^{(k,l)}$ for different pairs $(k, l)$.

## Assumptions

$$\begin{aligned}
\log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} &= \beta_{10} + \beta_1^T x \\
\log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} &= \beta_{20} + \beta_2^T x \\
&\;\;\vdots \\
\log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} &= \beta_{(K-1)0} + \beta_{K-1}^T x
\end{aligned}$$

For any pair $(k, l)$:

$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \beta_{k0} - \beta_{l0} + (\beta_k - \beta_l)^T x.$$

The number of parameters is $(K-1)(p+1)$. Denote the entire parameter set by $\theta = \{\beta_{10}, \beta_1, \beta_{20}, \beta_2, \ldots, \beta_{(K-1)0}, \beta_{K-1}\}$. The log ratios of posterior probabilities are called log-odds or logit transformations.

Under the assumptions, the posterior probabilities are given by

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} \qquad \text{for } k = 1, \ldots, K-1,$$

$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}.$$

For $\Pr(G = k \mid X = x)$ given above, the probabilities obviously sum to 1: $\sum_{k=1}^K \Pr(G = k \mid X = x) = 1$. A simple calculation shows that the assumptions are satisfied.
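The two display equations above can be computed directly; here is a minimal NumPy sketch (the names `posteriors`, `b0`, `B`, and the toy numbers are illustrative, not from the lecture):

```python
import numpy as np

def posteriors(x, b0, B):
    """Class posteriors Pr(G = k | X = x) for k = 1..K.

    b0 : (K-1,) intercepts beta_{k0}; B : (K-1, p) slopes beta_k.
    Class K is the reference class, whose linear score is 0.
    """
    scores = b0 + B @ x                      # beta_{k0} + beta_k^T x, k = 1..K-1
    denom = 1.0 + np.sum(np.exp(scores))     # shared normalizer
    return np.append(np.exp(scores), 1.0) / denom   # sums to 1 by construction

# Hypothetical example: K = 3 classes, p = 2 inputs
p = posteriors(np.array([1.0, -0.5]),
               b0=np.array([0.2, -0.1]),
               B=np.array([[0.5, 1.0], [-0.3, 0.2]]))
```

Because all K probabilities share one denominator, the normalization to 1 is automatic, which is the point of the parameterization.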

## Comparison with Linear Regression on Indicators

Similarities:

- Both attempt to estimate $\Pr(G = k \mid X = x)$.
- Both have linear classification boundaries.

Differences:

- Linear regression on the indicator matrix approximates $\Pr(G = k \mid X = x)$ by a linear function of $x$; the estimate is not guaranteed to fall between 0 and 1 or to sum to 1.
- Under logistic regression, $\Pr(G = k \mid X = x)$ is a nonlinear function of $x$; it is guaranteed to range between 0 and 1 and to sum to 1.

## Fitting Logistic Regression Models

Criterion: find the parameters that maximize the conditional likelihood of $G$ given $X$ using the training data. Denote $p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta)$. Given the first input $x_1$, the posterior probability of its class being $g_1$ is $\Pr(G = g_1 \mid X = x_1)$. Since the samples in the training data set are independent, the posterior probability of the $N$ samples each having class $g_i$, $i = 1, 2, \ldots, N$, given their inputs $x_1, x_2, \ldots, x_N$, is

$$\prod_{i=1}^N \Pr(G = g_i \mid X = x_i).$$

The conditional log-likelihood of the class labels in the training data set is

$$L(\theta) = \sum_{i=1}^N \log \Pr(G = g_i \mid X = x_i) = \sum_{i=1}^N \log p_{g_i}(x_i; \theta).$$

## Binary Classification

For binary classification, if $g_i = 1$, denote $y_i = 1$; if $g_i = 2$, denote $y_i = 0$. Let $p_1(x; \theta) = p(x; \theta)$; then $p_2(x; \theta) = 1 - p_1(x; \theta) = 1 - p(x; \theta)$. Since $K = 2$, the parameter set is $\theta = \{\beta_{10}, \beta_1\}$; we denote $\beta = (\beta_{10}, \beta_1)^T$.

If $y_i = 1$, i.e., $g_i = 1$,

$$\log p_{g_i}(x; \beta) = \log p_1(x; \beta) = 1 \cdot \log p(x; \beta) = y_i \log p(x; \beta).$$

If $y_i = 0$, i.e., $g_i = 2$,

$$\log p_{g_i}(x; \beta) = \log p_2(x; \beta) = 1 \cdot \log(1 - p(x; \beta)) = (1 - y_i) \log(1 - p(x; \beta)).$$

Since either $y_i = 0$ or $1 - y_i = 0$, we have

$$\log p_{g_i}(x; \beta) = y_i \log p(x; \beta) + (1 - y_i) \log(1 - p(x; \beta)).$$

The conditional log-likelihood is

$$L(\beta) = \sum_{i=1}^N \log p_{g_i}(x_i; \beta) = \sum_{i=1}^N \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right].$$

There are $p + 1$ parameters in $\beta = (\beta_{10}, \beta_1)^T$. Assume a column vector form for $\beta$:

$$\beta = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \beta_{12} \\ \vdots \\ \beta_{1p} \end{pmatrix}.$$

Here we add the constant 1 to $x$ to accommodate the intercept:

$$x = \begin{pmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}.$$

By the assumption of the logistic regression model,

$$p(x; \beta) = \Pr(G = 1 \mid X = x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)},$$

$$1 - p(x; \beta) = \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta^T x)}.$$

Substituting the above into $L(\beta)$:

$$L(\beta) = \sum_{i=1}^N \left[ y_i \beta^T x_i - \log\left(1 + e^{\beta^T x_i}\right) \right].$$
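The simplified log-likelihood can be evaluated numerically; a small sketch (the helper name and toy data are assumptions), using `logaddexp` for a numerically stable $\log(1 + e^s)$:

```python
import numpy as np

def log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i beta^T x_i - log(1 + e^{beta^T x_i}) ].

    X : (N, p+1) design matrix with a leading column of ones; y : (N,) in {0, 1}.
    """
    s = X @ beta
    # log(1 + e^s) computed stably as logaddexp(0, s)
    return np.sum(y * s - np.logaddexp(0.0, s))

# Toy check on hypothetical data: at beta = 0, each term is -log 2
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
ll = log_likelihood(np.zeros(2), X, y)
```

At $\beta = 0$ every fitted probability is $\tfrac{1}{2}$, so the sum is $-N \log 2$, which makes a convenient sanity check.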

To maximize $L(\beta)$, we set the first-order partial derivatives of $L(\beta)$ to zero:

$$\begin{aligned}
\frac{\partial L(\beta)}{\partial \beta_{1j}} &= \sum_{i=1}^N y_i x_{ij} - \sum_{i=1}^N \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} x_{ij} \\
&= \sum_{i=1}^N y_i x_{ij} - \sum_{i=1}^N p(x_i; \beta) x_{ij} \\
&= \sum_{i=1}^N x_{ij} \left( y_i - p(x_i; \beta) \right)
\end{aligned}$$

for all $j = 0, 1, \ldots, p$.

In matrix form, we write

$$\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^N x_i \left( y_i - p(x_i; \beta) \right).$$

To solve the set of $p + 1$ nonlinear equations $\partial L(\beta) / \partial \beta_{1j} = 0$, $j = 0, 1, \ldots, p$, we use the Newton-Raphson algorithm, which requires the second derivatives, i.e., the Hessian matrix:

$$\frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\sum_{i=1}^N x_i x_i^T \, p(x_i; \beta)\left(1 - p(x_i; \beta)\right).$$
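The gradient and Hessian formulas translate directly into code; a sketch with hypothetical names, forming $-X^T W X$ by row scaling rather than building the diagonal matrix:

```python
import numpy as np

def grad_hess(beta, X, y):
    """Gradient sum_i x_i (y_i - p_i) and Hessian -sum_i x_i x_i^T p_i (1 - p_i)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))        # p(x_i; beta)
    g = X.T @ (y - p)                            # first-order derivatives
    H = -(X * (p * (1.0 - p))[:, None]).T @ X    # second-order derivatives
    return g, H

# Example at beta = 0, where every p_i = 0.5 (hypothetical data)
X = np.array([[1.0, 0.3], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
g, H = grad_hess(np.zeros(2), X, y)
```

At $\beta = 0$ the gradient reduces to $X^T(y - \tfrac{1}{2})$ and the Hessian to $-\tfrac{1}{4} X^T X$, both easy to verify by hand.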

The element in the $j$th row and $n$th column (counting from 0) is

$$\begin{aligned}
\frac{\partial^2 L(\beta)}{\partial \beta_{1j} \, \partial \beta_{1n}} &= -\sum_{i=1}^N \frac{(1 + e^{\beta^T x_i}) e^{\beta^T x_i} x_{ij} x_{in} - (e^{\beta^T x_i})^2 x_{ij} x_{in}}{(1 + e^{\beta^T x_i})^2} \\
&= -\sum_{i=1}^N \left( x_{ij} x_{in} \, p(x_i; \beta) - x_{ij} x_{in} \, p(x_i; \beta)^2 \right) \\
&= -\sum_{i=1}^N x_{ij} x_{in} \, p(x_i; \beta)\left(1 - p(x_i; \beta)\right).
\end{aligned}$$

Starting with $\beta^{\text{old}}$, a single Newton-Raphson update is

$$\beta^{\text{new}} = \beta^{\text{old}} - \left( \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial L(\beta)}{\partial \beta},$$

where the derivatives are evaluated at $\beta^{\text{old}}$.

The iteration can be expressed compactly in matrix form:

- let $y$ be the column vector of the $y_i$;
- let $X$ be the $N \times (p+1)$ input matrix;
- let $p$ be the $N$-vector of fitted probabilities, with $i$th element $p(x_i; \beta^{\text{old}})$;
- let $W$ be an $N \times N$ diagonal matrix of weights, with $i$th diagonal element $p(x_i; \beta^{\text{old}})(1 - p(x_i; \beta^{\text{old}}))$.

Then

$$\frac{\partial L(\beta)}{\partial \beta} = X^T (y - p), \qquad \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -X^T W X.$$

The Newton-Raphson step is

$$\begin{aligned}
\beta^{\text{new}} &= \beta^{\text{old}} + (X^T W X)^{-1} X^T (y - p) \\
&= (X^T W X)^{-1} X^T W \left( X \beta^{\text{old}} + W^{-1}(y - p) \right) \\
&= (X^T W X)^{-1} X^T W z,
\end{aligned}$$

where $z \triangleq X \beta^{\text{old}} + W^{-1}(y - p)$. If $z$ is viewed as a response and $X$ as the input matrix, $\beta^{\text{new}}$ is the solution to the weighted least squares problem

$$\beta^{\text{new}} \leftarrow \arg\min_\beta \, (z - X\beta)^T W (z - X\beta).$$

Recall that linear regression by least squares solves

$$\arg\min_\beta \, (z - X\beta)^T (z - X\beta).$$

$z$ is referred to as the adjusted response, and the algorithm as iteratively reweighted least squares, or IRLS.

## Pseudo Code

1. $\beta \leftarrow 0$.
2. Compute $y$ by setting its elements to
   $$y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases} \qquad i = 1, 2, \ldots, N.$$
3. Compute $p$ by setting its elements to
   $$p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}, \qquad i = 1, 2, \ldots, N.$$
4. Compute the diagonal matrix $W$; its $i$th diagonal element is $p(x_i; \beta)(1 - p(x_i; \beta))$, $i = 1, 2, \ldots, N$.
5. $z \leftarrow X\beta + W^{-1}(y - p)$.
6. $\beta \leftarrow (X^T W X)^{-1} X^T W z$.
7. If the stopping criterion is met, stop; otherwise go back to step 3.
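The steps above can be sketched as follows (a minimal NumPy version; the tolerance, iteration cap, and toy data are assumptions, not part of the lecture):

```python
import numpy as np

def irls(X, y, max_iter=25, tol=1e-8):
    """Fit binary logistic regression by IRLS, steps numbered as in the pseudo code.

    X : (N, p+1) design matrix with a leading column of ones.
    y : (N,) labels, 1 for class 1 and 0 for class 2.
    """
    beta = np.zeros(X.shape[1])                       # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))         # step 3: fitted probabilities
        w = p * (1.0 - p)                             # step 4: diagonal of W
        z = X @ beta + (y - p) / w                    # step 5: adjusted response
        Xw = X * w[:, None]                           # rows of X scaled by W
        beta_new = np.linalg.solve(Xw.T @ X, Xw.T @ z)  # step 6: weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:     # step 7: stopping criterion
            beta = beta_new
            break
        beta = beta_new
    return beta

# Toy, non-separable 1-D example (hypothetical data)
X = np.column_stack([np.ones(6), [-2.0, -1.0, 0.0, 1.0, 2.0, 0.5]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = irls(X, y)
```

Note that on perfectly separable data the maximum-likelihood estimate is unbounded and the iteration diverges, which is why the toy data above overlaps.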

## Computational Efficiency

Since $W$ is an $N \times N$ diagonal matrix, direct matrix operations with it may be very inefficient. A modified pseudo code is provided next.

1. $\beta \leftarrow 0$.
2. Compute $y$ by setting its elements to
   $$y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases} \qquad i = 1, 2, \ldots, N.$$
3. Compute $p$ by setting its elements to
   $$p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}, \qquad i = 1, 2, \ldots, N.$$
4. Compute the $N \times (p+1)$ matrix $\tilde{X}$ by multiplying the $i$th row of $X$ by $p(x_i; \beta)(1 - p(x_i; \beta))$, $i = 1, 2, \ldots, N$:
   $$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}, \qquad \tilde{X} = \begin{pmatrix} p(x_1; \beta)(1 - p(x_1; \beta)) \, x_1^T \\ p(x_2; \beta)(1 - p(x_2; \beta)) \, x_2^T \\ \vdots \\ p(x_N; \beta)(1 - p(x_N; \beta)) \, x_N^T \end{pmatrix}$$
5. $\beta \leftarrow \beta + (X^T \tilde{X})^{-1} X^T (y - p)$.
6. If the stopping criterion is met, stop; otherwise go back to step 3.
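The modified algorithm avoids forming the $N \times N$ matrix $W$ entirely; a one-step sketch with hypothetical data:

```python
import numpy as np

def irls_step(X, y, beta):
    """One update of the modified algorithm.

    Row i of X is scaled by p_i(1 - p_i) to form Xt; the update
    beta + (X^T Xt)^{-1} X^T (y - p) never builds the N x N matrix W.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    Xt = X * (p * (1.0 - p))[:, None]                        # row-scaled X
    return beta + np.linalg.solve(X.T @ Xt, X.T @ (y - p))   # Newton update

# Hypothetical example data
X = np.array([[1.0, -1.0], [1.0, 0.2], [1.0, 1.5], [1.0, -0.4]])
y = np.array([0.0, 1.0, 1.0, 0.0])
beta1 = irls_step(X, y, np.zeros(2))
```

Since $W$ is diagonal, $X^T \tilde{X} = X^T W X$ exactly, so this step agrees with the explicit-$W$ update while using only $O(N(p+1))$ extra storage.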

## Example: Diabetes Data Set

The input $X$ is two-dimensional: $X_1$ and $X_2$ are the two principal components of the original 8 variables. Class 1: without diabetes; Class 2: with diabetes. Applying logistic regression, we obtain $\hat{\beta} = (0.7679, \ldots, \ldots)^T$.

The posterior probabilities are

$$\Pr(G = 1 \mid X = x) = \frac{e^{\hat{\beta}^T x}}{1 + e^{\hat{\beta}^T x}}, \qquad \Pr(G = 2 \mid X = x) = \frac{1}{1 + e^{\hat{\beta}^T x}},$$

where $\hat{\beta}^T x = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$. The classification rule is

$$\hat{G}(x) = \begin{cases} 1 & \hat{\beta}^T x \geq 0 \\ 2 & \hat{\beta}^T x < 0. \end{cases}$$

Solid line: decision boundary obtained by logistic regression. Dashed line: decision boundary obtained by LDA. Within the training data set, the classification error rate is 28.12%; sensitivity: 45.9%; specificity: 85.8%.

## Multiclass Case ($K \geq 3$)

When $K \geq 3$, $\beta$ is a $(K-1)(p+1)$-vector:

$$\beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix} = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \vdots \\ \beta_{1p} \\ \beta_{20} \\ \vdots \\ \beta_{2p} \\ \vdots \\ \beta_{(K-1)0} \\ \vdots \\ \beta_{(K-1)p} \end{pmatrix}$$

Let $\bar{\beta}_l = \begin{pmatrix} \beta_{l0} \\ \beta_l \end{pmatrix}$, with the convention $\bar{\beta}_K \equiv 0$ for the reference class $K$. The likelihood function becomes

$$L(\beta) = \sum_{i=1}^N \log p_{g_i}(x_i; \beta) = \sum_{i=1}^N \log \frac{e^{\bar{\beta}_{g_i}^T x_i}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} = \sum_{i=1}^N \left[ \bar{\beta}_{g_i}^T x_i - \log\left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) \right]$$

Note: the indicator function $I(\cdot)$ equals 1 when the argument is true and 0 otherwise. First-order derivatives:

$$\begin{aligned}
\frac{\partial L(\beta)}{\partial \beta_{kj}} &= \sum_{i=1}^N \left[ I(g_i = k)\, x_{ij} - \frac{e^{\bar{\beta}_k^T x_i} x_{ij}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right] \\
&= \sum_{i=1}^N x_{ij} \left( I(g_i = k) - p_k(x_i; \beta) \right)
\end{aligned}$$

Second-order derivatives:

$$\begin{aligned}
\frac{\partial^2 L(\beta)}{\partial \beta_{kj} \, \partial \beta_{mn}} &= \sum_{i=1}^N \frac{x_{ij}}{\left(1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}\right)^2} \left[ -e^{\bar{\beta}_k^T x_i} I(k = m)\, x_{in} \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) + e^{\bar{\beta}_k^T x_i} e^{\bar{\beta}_m^T x_i} x_{in} \right] \\
&= \sum_{i=1}^N x_{ij} x_{in} \left( -p_k(x_i; \beta)\, I(k = m) + p_k(x_i; \beta)\, p_m(x_i; \beta) \right) \\
&= -\sum_{i=1}^N x_{ij} x_{in} \, p_k(x_i; \beta) \left[ I(k = m) - p_m(x_i; \beta) \right].
\end{aligned}$$

Matrix form. $y$ is the concatenated indicator vector of dimension $N(K-1)$:

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{K-1} \end{pmatrix}, \qquad y_k = \begin{pmatrix} I(g_1 = k) \\ I(g_2 = k) \\ \vdots \\ I(g_N = k) \end{pmatrix}, \quad 1 \leq k \leq K-1.$$

$p$ is the concatenated vector of fitted probabilities of dimension $N(K-1)$:

$$p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_{K-1} \end{pmatrix}, \qquad p_k = \begin{pmatrix} p_k(x_1; \beta) \\ p_k(x_2; \beta) \\ \vdots \\ p_k(x_N; \beta) \end{pmatrix}, \quad 1 \leq k \leq K-1.$$

$\tilde{X}$ is an $N(K-1) \times (p+1)(K-1)$ block-diagonal matrix:

$$\tilde{X} = \begin{pmatrix} X & 0 & \cdots & 0 \\ 0 & X & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X \end{pmatrix}$$

Matrix $W$ is an $N(K-1) \times N(K-1)$ square matrix:

$$W = \begin{pmatrix} W_{1,1} & W_{1,2} & \cdots & W_{1,K-1} \\ W_{2,1} & W_{2,2} & \cdots & W_{2,K-1} \\ \vdots & & & \vdots \\ W_{K-1,1} & W_{K-1,2} & \cdots & W_{K-1,K-1} \end{pmatrix}$$

Each submatrix $W_{k,m}$, $1 \leq k, m \leq K-1$, is an $N \times N$ diagonal matrix. When $k = m$, the $i$th diagonal element of $W_{k,k}$ is $p_k(x_i; \beta^{\text{old}})(1 - p_k(x_i; \beta^{\text{old}}))$. When $k \neq m$, the $i$th diagonal element of $W_{k,m}$ is $-p_k(x_i; \beta^{\text{old}}) \, p_m(x_i; \beta^{\text{old}})$.

As in binary classification,

$$\frac{\partial L(\beta)}{\partial \beta} = \tilde{X}^T (y - p), \qquad \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\tilde{X}^T W \tilde{X}.$$

The formula for updating $\beta^{\text{new}}$ in the binary classification case carries over to the multiclass case:

$$\beta^{\text{new}} = (\tilde{X}^T W \tilde{X})^{-1} \tilde{X}^T W z, \qquad z \triangleq \tilde{X} \beta^{\text{old}} + W^{-1}(y - p),$$

or simply

$$\beta^{\text{new}} = \beta^{\text{old}} + (\tilde{X}^T W \tilde{X})^{-1} \tilde{X}^T (y - p).$$
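The multiclass gradient and Hessian can be assembled block by block from the derivative formulas above; a sketch (function name and toy values are assumptions), which reduces to the binary formulas when $K = 2$:

```python
import numpy as np

def multiclass_grad_hess(X, g, beta):
    """Gradient and Hessian of L(beta) for the K-class model.

    X : (N, p+1) design matrix; g : (N,) labels in {1, ..., K};
    beta : stacked ((K-1)(p+1),) vector (beta_10..beta_1p, beta_20.., ...).
    """
    N, d = X.shape
    K = beta.size // d + 1
    B = beta.reshape(K - 1, d)                         # row k holds (beta_{k0}, beta_k)
    E = np.exp(X @ B.T)                                # e^{beta_k^T x_i}, shape (N, K-1)
    P = E / (1.0 + E.sum(axis=1, keepdims=True))       # p_k(x_i; beta)
    Y = (g[:, None] == np.arange(1, K)).astype(float)  # I(g_i = k)
    grad = (X.T @ (Y - P)).T.ravel()                   # sum_i x_ij (I(g_i = k) - p_k)
    H = np.zeros((d * (K - 1), d * (K - 1)))
    # Block (k, m) of the Hessian: -sum_i x_i x_i^T p_k [I(k = m) - p_m]
    for k in range(K - 1):
        for m in range(K - 1):
            w = P[:, k] * ((1.0 if k == m else 0.0) - P[:, m])
            H[k*d:(k+1)*d, m*d:(m+1)*d] = -(X * w[:, None]).T @ X
    return grad, H
```

Only the small $(p+1)(K-1)$-dimensional blocks are ever formed, mirroring the block structure of $W$ without materializing the $N(K-1) \times N(K-1)$ matrix.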

## Computation Issues

Initialization: one option is $\beta = 0$. Convergence is not guaranteed, but the algorithm usually does converge: the log-likelihood typically increases after each iteration, although overshooting can occur. In the rare cases where the log-likelihood decreases, cut the step size by half.

## Connection with LDA

Under the model of LDA,

$$\begin{aligned}
\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} &= \log \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K) \\
&= a_{k0} + a_k^T x.
\end{aligned}$$

The model of LDA thus satisfies the assumption of the linear logistic model. The linear logistic model only specifies the conditional distribution $\Pr(G = k \mid X = x)$; no assumption is made about $\Pr(X)$.
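The coefficients $(a_{k0}, a_k)$ of the LDA log-odds can be computed directly from the formula; a small sketch (function name and parameter values are hypothetical):

```python
import numpy as np

def lda_log_odds(mu_k, mu_K, Sigma, pi_k, pi_K):
    """Coefficients (a_k0, a_k) of log Pr(G=k|x)/Pr(G=K|x) under the LDA model."""
    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu_k - mu_K)                        # linear term x^T Sigma^{-1}(mu_k - mu_K)
    a0 = np.log(pi_k / pi_K) - 0.5 * (mu_k + mu_K) @ Sinv @ (mu_k - mu_K)
    return a0, a
```

As a check, $a_{k0} + a_k^T x$ should equal the log ratio of $\pi_k \phi(x; \mu_k, \Sigma)$ to $\pi_K \phi(x; \mu_K, \Sigma)$ at any $x$, since the shared normalizing constants and the quadratic terms in $x$ cancel.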

The LDA model specifies the joint distribution of $X$ and $G$. $\Pr(X)$ is a mixture of Gaussians:

$$\Pr(X) = \sum_{k=1}^K \pi_k \, \phi(X; \mu_k, \Sigma),$$

where $\phi$ is the Gaussian density function. Linear logistic regression maximizes the conditional likelihood of $G$ given $X$, $\Pr(G = k \mid X = x)$; LDA maximizes the joint likelihood of $G$ and $X$, $\Pr(X = x, G = k)$.

If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently, since it uses more information about the data. Samples without class labels can also be used under the LDA model. On the other hand, LDA is not robust to gross outliers; because logistic regression relies on fewer assumptions, it tends to be more robust. In practice, logistic regression and LDA often give similar results.

## Simulation

Assume the input $X$ is one-dimensional. The two classes have equal priors, and the class-conditional densities of $X$ are shifted versions of each other. Each conditional density is a mixture of two normals:

- Class 1 (red): $0.6\,N(-2, \tfrac{1}{4}) + 0.4\,N(0, 1)$.
- Class 2 (blue): $0.6\,N(0, \tfrac{1}{4}) + 0.4\,N(2, 1)$.

The class-conditional densities are shown below.
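Sampling from these two mixtures can be sketched as follows (the helper name and seed are assumptions; a variance of $\tfrac{1}{4}$ corresponds to a standard deviation of 0.5):

```python
import numpy as np

def sample_mixture(n, means, sds, weights, rng):
    """Draw n points from a two-component normal mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)   # pick a component per draw
    return rng.normal(np.asarray(means)[comp], np.asarray(sds)[comp])

rng = np.random.default_rng(0)                           # seed is arbitrary
x1 = sample_mixture(2000, [-2.0, 0.0], [0.5, 1.0], [0.6, 0.4], rng)  # class 1
x2 = sample_mixture(2000, [0.0, 2.0], [0.5, 1.0], [0.6, 0.4], rng)   # class 2
```

The mixture means are $0.6(-2) + 0.4(0) = -1.2$ for class 1 and $0.6(0) + 0.4(2) = 0.8$ for class 2, which is a quick way to sanity-check the samples.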

[Figure: class-conditional densities of the two classes.]

## LDA Result

Training data set: 2000 samples for each class. Test data set: 1000 samples for each class. The estimates by LDA: $\hat{\mu}_1 = \ldots$, $\hat{\mu}_2 = \ldots$, $\hat{\sigma}^2 = \ldots$. The boundary value between the two classes is $(\hat{\mu}_1 + \hat{\mu}_2)/2 = \ldots$, and the classification error rate on the test data is $\ldots$. Based on the true distribution, the Bayes (optimal) boundary value between the two classes is $\ldots$ and its error rate is $\ldots$.
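The 1-D LDA estimates described above (class means, pooled variance, midpoint boundary under equal priors) can be reproduced on freshly simulated data; a sketch, with the seed and sampling helper as assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, means, sds, weights):
    """Draw n points from a two-component normal mixture (hypothetical helper)."""
    comp = rng.choice(2, size=n, p=weights)
    return rng.normal(np.asarray(means)[comp], np.asarray(sds)[comp])

# 2000 training samples per class from the simulation's mixtures (sd 0.5 <-> variance 1/4)
x1 = sample_mixture(2000, [-2.0, 0.0], [0.5, 1.0], [0.6, 0.4])   # class 1
x2 = sample_mixture(2000, [0.0, 2.0], [0.5, 1.0], [0.6, 0.4])    # class 2

# 1-D LDA estimates with equal priors: class means and pooled variance
mu1, mu2 = x1.mean(), x2.mean()
sigma2 = (((x1 - mu1) ** 2).sum() + ((x2 - mu2) ** 2).sum()) / (len(x1) + len(x2) - 2)
boundary = (mu1 + mu2) / 2       # classify to class 1 when x < boundary
```

With equal priors and a shared variance, the LDA boundary is simply the midpoint of the two estimated class means, which is what step-by-step estimation above computes.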


## Logistic Regression Result

Linear logistic regression obtains $\hat{\beta} = (\ldots, \ldots)^T$. The boundary value satisfies $\hat{\beta}_0 + \hat{\beta}_1 x = 0$, hence equals $-\hat{\beta}_0 / \hat{\beta}_1 = \ldots$. The error rate on the test data set is $\ldots$. The estimated posterior probability is

$$\Pr(G = 1 \mid X = x) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x}}.$$

The estimated posterior probability $\Pr(G = 1 \mid X = x)$ and its true value under the true distribution are compared in the graph below.


### L10: Probability, statistics, and estimation theory

L10: Probability, statistics, and estimation theory Review of probability theory Bayes theorem Statistics and the Normal distribution Least Squares Error estimation Maximum Likelihood estimation Bayesian

### The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

### Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

### Linear Algebra Notes for Marsden and Tromba Vector Calculus

Linear Algebra Notes for Marsden and Tromba Vector Calculus n-dimensional Euclidean Space and Matrices Definition of n space As was learned in Math b, a point in Euclidean three space can be thought of

### Lecture 8 February 4

ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

### Ordinal Regression. Chapter

Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

### Multiple Choice Models II

Multiple Choice Models II Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini Laura Magazzini (@univr.it) Multiple Choice Models II 1 / 28 Categorical data Categorical

### 4. Joint Distributions of Two Random Variables

4. Joint Distributions of Two Random Variables 4.1 Joint Distributions of Two Discrete Random Variables Suppose the discrete random variables X and Y have supports S X and S Y, respectively. The joint

### ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

### Chapter 4: Non-Parametric Classification

Chapter 4: Non-Parametric Classification Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination

### Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

### Local classification and local likelihoods

Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation