Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression"

Transcription

1 Logistic Regression Department of Statistics The Pennsylvania State University

2 Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max Pr(G = k X = x). k Decision boundary between class k and l is determined by the equation: Pr(G = k X = x) = Pr(G = l X = x). Divide both sides by Pr(G = l X = x) and take log. The above equation is equivalent to log Pr(G = k X = x) Pr(G = l X = x) = 0.

3 Since we enforce linear boundary, we can assume log Pr(G = k X = x) Pr(G = l X = x) = a(k,l) 0 + p j=1 a (k,l) j x j. For logistic regression, there are restrictive relations between a (k,l) for different pairs of (k, l).

4 Assumptions log Pr(G = 1 X = x) log Pr(G = K X = x) Pr(G = 2 X = x) log Pr(G = K X = x) Pr(G = K 1 X = x) Pr(G = K X = x) = β 10 + β1 T x = β 20 + β2 T x. = β (K 1)0 + βk 1 T x

5 For any pair (k, l): log Pr(G = k X = x) Pr(G = l X = x) = β k0 β l0 + (β k β l ) T x. Number of parameters: (K 1)(p + 1). Denote the entire parameter set by θ = {β 10, β 1, β 20, β 2,..., β (K 1)0, β K 1 }. The log ratio of posterior probabilities are called log-odds or logit transformations.

6 Under the assumptions, the posterior probabilities are given by: Pr(G = k X = x) = exp(β k0 + β T k x) 1 + K 1 l=1 exp(β l0 + β T l x) for k = 1,..., K 1 Pr(G = K X = x) = K 1 l=1 exp(β l0 + βl T x). For Pr(G = k X = x) given above, obviously Sum up to 1: K k=1 Pr(G = k X = x) = 1. A simple calculation shows that the assumptions are satisfied.

7 Comparison with LR on Indicators Similarities: Both attempt to estimate Pr(G = k X = x). Both have linear classification boundaries. Difference: Linear regression on indicator matrix: approximate Pr(G = k X = x) by a linear function of x. Pr(G = k X = x) is not guaranteed to fall between 0 and 1 and to sum up to 1. Logistic regression: Pr(G = k X = x) is a nonlinear function of x. It is guaranteed to range from 0 to 1 and to sum up to 1.

8 Fitting Logistic Regression Models Criteria: find parameters that maximize the conditional likelihood of G given X using the training data. Denote p k (x i ; θ) = Pr(G = k X = x i ; θ). Given the first input x 1, the posterior probability of its class being g 1 is Pr(G = g 1 X = x 1 ). Since samples in the training data set are independent, the posterior probability for the N samples each having class g i, i = 1, 2,..., N, given their inputs x 1, x 2,..., x N is: N Pr(G = g i X = x i ).

9 The conditional log-likelihood of the class labels in the training data set is L(θ) = = N log Pr(G = g i X = x i ) N log p gi (x i ; θ).

10 Binary Classification For binary classification, if g i = 1, denote y i = 1; if g i = 2, denote y i = 0. Let p 1 (x; θ) = p(x; θ), then p 2 (x; θ) = 1 p 1 (x; θ) = 1 p(x; θ). Since K = 2, the parameters θ = {β 10, β 1 }. We denote β = (β 10, β 1 ) T.

11 If y i = 1, i.e., g i = 1, log p gi (x; β) = log p 1 (x; β) = 1 log p(x; β) = y i log p(x; β). If y i = 0, i.e., g i = 2, log p gi (x; β) = log p 2 (x; β) = 1 log(1 p(x; β)) = (1 y i ) log(1 p(x; β)). Since either y i = 0 or 1 y i = 0, we have log p gi (x; β) = y i log p(x; β) + (1 y i ) log(1 p(x; β)).

12 The conditional likelihood L(β) = = N log p gi (x i ; β) N [y i log p(x i ; β) + (1 y i ) log(1 p(x i ; β))] There p + 1 parameters in β = (β 10, β 1 ) T. Assume a column vector form for β: β 10 β 11 β = β 12.. β 1,p

13 Here we add the constant term 1 to x to accommodate the intercept. 1 x,1 x = x,2.. x,p

14 By the assumption of logistic regression model: p(x; β) = Pr(G = 1 X = x) = exp(βt x) 1 + exp(β T x) 1 p(x; β) = Pr(G = 2 X = x) = exp(β T x) Substitute the above in L(β): L(β) = N [ ] y i β T x i log(1 + e βt x i ).

15 To maximize L(β), we set the first order partial derivatives of L(β) to zero. L(β) β 1j = = = N N x ij e βt x i y i x ij 1 + e βt x i N y i x ij N p(x; β)x ij N x ij (y i p(x i ; β)) for all j = 0, 1,..., p.

16 In matrix form, we write L(β) β N = x i (y i p(x i ; β)). To solve the set of p + 1 nonlinear equations L(β) β 1j = 0, j = 0, 1,..., p, use the Newton-Raphson algorithm. The Newton-Raphson algorithm requires the second-derivatives or Hessian matrix: 2 L(β) N β β T = x i xi T p(x i ; β)(1 p(x i ; β)).

17 The element on the jth row and nth column is (counting from 0): L(β) β 1j β 1n N (1 + e βt x i )e βt x i x ij x in (e βt x i ) 2 x ij x in = (1 + e βt x i ) 2 = = N x ij x in p(x i ; β) x ij x in p(x i ; β) 2 N x ij x in p(x i ; β)(1 p(x i ; β)).

18 Starting with β old, a single Newton-Raphson update is β new = β old ( 2 ) 1 L(β) L(β) β β T β, where the derivatives are evaluated at β old.

19 The iteration can be expressed compactly in matrix form. Let y be the column vector of yi. Let X be the N (p + 1) input matrix. Let p be the N-vector of fitted probabilities with ith element p(x i ; β old ). Let W be an N N diagonal matrix of weights with ith element p(x i ; β old )(1 p(x i ; β old )). L(β) β = X T (y p) 2 L(β) β β T = X T WX.

20 The Newton-Raphson step is β new = β old + (X T WX) 1 X T (y p) = (X T WX) 1 X T W(Xβ old + W 1 (y p)) = (X T WX) 1 X T Wz, where z Xβ old + W 1 (y p). If z is viewed as a response and X is the input matrix, β new is the solution to a weighted least square problem: β new arg min β (z Xβ) T W(z Xβ). Recall that linear regression by least square is to solve arg min β (z Xβ) T (z Xβ). z is referred to as the adjusted response. The algorithm is referred to as iteratively reweighted least squares or IRLS.

21 Pseudo Code 1. 0 β 2. Compute y by setting its elements to { 1 if gi = 1 y i = 0 if g i = 2 i = 1, 2,..., N. 3. Compute p by setting its elements to, eβt x i p(x i ; β) = i = 1, 2,..., N. 1 + e βt x i 4. Compute the diagonal matrix W. The ith diagonal element is p(x i ; β)(1 p(x i ; β)), i = 1, 2,..., N. 5. z Xβ + W 1 (y p). 6. β (X T WX) 1 X T Wz. 7. If the stopping criteria is met, stop; otherwise go back to step 3.

22 Computational Efficiency Since W is an N N diagonal matrix, direct matrix operations with it may be very inefficient. A modified pseudo code is provided next.

23 1. 0 β 2. Compute y by setting its elements to { 1 if gi = 1 y i =, i = 1, 2,..., N. 0 if g i = 2 3. Compute p by setting its elements to p(x i ; β) = eβt x i 1 + e βt x i i = 1, 2,..., N. 4. Compute the N (p + 1) matrix X by multiplying the ith row of matrix X by p(x i ; β)(1 p(x i ; β)), i = 1, 2,..., N: x1 T p(x 1 ; β)(1 p(x 1 ; β))x X = x2 T 1 T X = p(x 2 ; β)(1 p(x 2 ; β))x T 2 xn T p(x N ; β)(1 p(x N ; β))xn T 5. β β + (X T X) 1 X T (y p). 6. If the stopping criteria is met, stop; otherwise go back to step 3.

24 Example Diabetes data set Input X is two dimensional. X 1 and X 2 are the two principal components of the original 8 variables. Class 1: without diabetes; Class 2: with diabetes. Applying logistic regression, we obtain β = (0.7679, , ) T.

25 The posterior probabilities are: Pr(G = 1 X = x) = Pr(G = 2 X = x) = The classification rule is: e X X e X X e X X 2 Ĝ(x) = { X X X X 2 < 0

26 Solid line: decision boundary obtained by logistic regression. Dash line: decision boundary obtained by LDA. Within training data set classification error rate: 28.12%. Sensitivity: 45.9%. Specificity: 85.8%.

27 Multiclass Case (K 3) When K 3, β is a (K-1)(p+1)-vector: β 10 β 11 β 10. β 1 β 1p β 20 β 20 β = β 2 =.. β 2p β (K 1)0 β K 1. β (K 1)0. β (K 1)p

28 Let β l = ( βl0 β l ). The likelihood function becomes L(β) = = = N log p gi (x i ; β) ( N e β g T x i i log 1 + K 1 [ ( N β g T i x i log 1 + l=1 e β T l x i K 1 l=1 ) e β T l x i )]

29 Note: the indicator function I ( ) equals 1 when the argument is true and 0 otherwise. First order derivatives: [ ] L(β) N e β k T x i x ij = I (g i = k)x ij β kj 1 + K 1 l=1 e β l T x i N = x ij (I (g i = k) p k (x i ; β))

30 Second order derivatives: = = 2 L(β) β kj β mn N 1 x ij (1 + K 1 l=1 e β l T x i ) 2 [ e β T k x i I (k = m)x in (1 + K 1 l=1 e β T l x i ) + e β T k x i e β T mx i x in ] N x ij x in ( p k (x i ; β)i (k = m) + p k (x i ; β)p m (x i ; β)) = N x ij x in p k (x i ; β)[i (k = m) p m (x i ; β)].

31 Matrix form. y is the concatenated indicator vector of dimension N (K 1). y 1 I (g 1 = k) y 2 I (g 2 = k) y =. y K 1 y k =. I (g N = k) p = 1 k K 1 p is the concatenated vector of fitted probabilities of dimension N (K 1). p 1 p k (x 1 ; β) p 2 p k (x 2 ; β). p K 1 p k =. p k (x N ; β) 1 k K 1

32 X is an N(K 1) (p + 1)(K 1) matrix: X = X X X

33 Matrix W is an N(K 1) N(K 1) square matrix: W = W 11 W 12 W 1(K 1) W 21 W 22 W 2(K 1) W (K 1),1 W (K 1),2 W (K 1),(K 1) Each submatrix W km, 1 k, m K 1, is an N N diagonal matrix. When k = m, the ith diagonal element in W kk is p k (x i ; β old )(1 p k (x i ; β old )). When k m, the ith diagonal element in W km is p k (x i ; β old )p m (x i ; β old ).

34 Similarly as with binary classification L(β) β = X T (y p) 2 L(β) β β T = X T W X. The formula for updating β new in the binary classification case holds for multiclass. β new = ( X T W X) 1 X T Wz, where z Xβ old + W 1 (y p). Or simply: β new = β old + ( X T W X) 1 X T (y p).

35 Computation Issues Initialization: one option is to use β = 0. Convergence is not guaranteed, but usually is the case. Usually, the log-likelihood increases after each iteration, but overshooting can occur. In the rare cases that the log-likelihood decreases, cut step size by half.

36 Connection with LDA Under the model of LDA: log Pr(G = k X = x) Pr(G = K X = x) = log π k π K 1 2 (µ k + µ K ) T Σ 1 (µ k µ K ) +x T Σ 1 (µ k µ K ) = a k0 + a T k x. The model of LDA satisfies the assumption of the linear logistic model. The linear logistic model only specifies the conditional distribution Pr(G = k X = x). No assumption is made about Pr(X ).

37 The LDA model specifies the joint distribution of X and G. Pr(X ) is a mixture of Gaussians: Pr(X ) = K π k φ(x ; µ k, Σ). k=1 where φ is the Gaussian density function. Linear logistic regression maximizes the conditional likelihood of G given X : Pr(G = k X = x). LDA maximizes the joint likelihood of G and X : Pr(X = x, G = k).

38 If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently by using more information about the data. Samples without class labels can be used under the model of LDA. LDA is not robust to gross outliers. As logistic regression relies on fewer assumptions, it seems to be more robust. In practice, logistic regression and LDA often give similar results.

39 Simulation Assume input X is 1-D. Two classes have equal priors and the class-conditional densities of X are shifted versions of each other. Each conditional density is a mixture of two normals: Class 1 (red): 0.6N( 2, 1 4 ) + 0.4N(0, 1). Class 2 (blue): 0.6N(0, 1 4 ) + 0.4N(2, 1). The class-conditional densities are shown below.

40

41 LDA Result Training data set: 2000 samples for each class. Test data set: 1000 samples for each class. The estimation by LDA: ˆµ 1 = , ˆµ 2 = , ˆσ 2 = Boundary value between the two classes is (ˆµ 1 + ˆµ 2 )/2 = The classification error rate on the test data is Based on the true distribution, the Bayes (optimal) boundary value between the two classes is and the error rate is

42

43 Logistic Regression Result Linear logistic regression obtains β = ( , ) T. The boundary value satisfies X = 0, hence equals The error rate on the test data set is The estimated posterior probability is: Pr(G = 1 X = x) = e x 1 + e x.

44 The estimated posterior probability Pr(G = 1 X = x) and its true value based on the true distribution are compared in the graph below.

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

More information

Wes, Delaram, and Emily MA751. Exercise 4.5. 1 p(x; β) = [1 p(xi ; β)] = 1 p(x. y i [βx i ] log [1 + exp {βx i }].

Wes, Delaram, and Emily MA751. Exercise 4.5. 1 p(x; β) = [1 p(xi ; β)] = 1 p(x. y i [βx i ] log [1 + exp {βx i }]. Wes, Delaram, and Emily MA75 Exercise 4.5 Consider a two-class logistic regression problem with x R. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample for

More information

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

CSI:FLORIDA. Section 4.4: Logistic Regression

CSI:FLORIDA. Section 4.4: Logistic Regression SI:FLORIDA Section 4.4: Logistic Regression SI:FLORIDA Reisit Masked lass Problem.5.5 2 -.5 - -.5 -.5 - -.5.5.5 We can generalize this roblem to two class roblem as well! SI:FLORIDA Reisit Masked lass

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

More information

Christfried Webers. Canberra February June 2015

Christfried Webers. Canberra February June 2015 c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

More information

Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classifica6on

Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classifica6on Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classifica6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output

More information

CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

More information

Logistic Regression (1/24/13)

Logistic Regression (1/24/13) STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring. Jie-Men Mok Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

More information

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

More information

Classification Problems

Classification Problems Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

More information

Models for Count Data With Overdispersion

Models for Count Data With Overdispersion Models for Count Data With Overdispersion Germán Rodríguez November 6, 2013 Abstract This addendum to the WWS 509 notes covers extra-poisson variation and the negative binomial model, with brief appearances

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component) Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

More information

Logit Models for Binary Data

Logit Models for Binary Data Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis. These models are appropriate when the response

More information

Poisson Models for Count Data

Poisson Models for Count Data Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

More information

Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation

Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation Abstract This article presents an overview of the logistic regression model for dependent variables having two or

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

The basic unit in matrix algebra is a matrix, generally expressed as: a 11 a 12. a 13 A = a 21 a 22 a 23

The basic unit in matrix algebra is a matrix, generally expressed as: a 11 a 12. a 13 A = a 21 a 22 a 23 (copyright by Scott M Lynch, February 2003) Brief Matrix Algebra Review (Soc 504) Matrix algebra is a form of mathematics that allows compact notation for, and mathematical manipulation of, high-dimensional

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

KERNEL LOGISTIC REGRESSION-LINEAR FOR LEUKEMIA CLASSIFICATION USING HIGH DIMENSIONAL DATA

KERNEL LOGISTIC REGRESSION-LINEAR FOR LEUKEMIA CLASSIFICATION USING HIGH DIMENSIONAL DATA Rahayu, Kernel Logistic Regression-Linear for Leukemia Classification using High Dimensional Data KERNEL LOGISTIC REGRESSION-LINEAR FOR LEUKEMIA CLASSIFICATION USING HIGH DIMENSIONAL DATA S.P. Rahayu 1,2

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil. Steven J Zeil Old Dominion Univ. Fall 200 Discriminant-Based Classification Linearly Separable Systems Pairwise Separation 2 Posteriors 3 Logistic Discrimination 2 Discriminant-Based Classification Likelihood-based:

More information

Factorial experimental designs and generalized linear models

Factorial experimental designs and generalized linear models Statistics & Operations Research Transactions SORT 29 (2) July-December 2005, 249-268 ISSN: 1696-2281 www.idescat.net/sort Statistics & Operations Research c Institut d Estadística de Transactions Catalunya

More information

Inner Product Spaces

Inner Product Spaces Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

More information

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni 1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed

More information

Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

More information

Lecture 6: Logistic Regression

Lecture 6: Logistic Regression Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Nominal and ordinal logistic regression

Nominal and ordinal logistic regression Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome

More information

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

More information

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

More information

11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression

11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 le-tex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

SOLVING LINEAR SYSTEMS

SOLVING LINEAR SYSTEMS SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis

More information

Cofactor Expansion: Cramer s Rule

Cofactor Expansion: Cramer s Rule Cofactor Expansion: Cramer s Rule MATH 322, Linear Algebra I J. Robert Buchanan Department of Mathematics Spring 2015 Introduction Today we will focus on developing: an efficient method for calculating

More information

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

More information

Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

More information

Support Vector Machines Explained

Support Vector Machines Explained March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Fitting Subject-specific Curves to Grouped Longitudinal Data

Fitting Subject-specific Curves to Grouped Longitudinal Data Fitting Subject-specific Curves to Grouped Longitudinal Data Djeundje, Viani Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh, EH14 4AS, UK E-mail: vad5@hw.ac.uk Currie,

More information

Logistic Regression for Data Mining and High-Dimensional Classification

Logistic Regression for Data Mining and High-Dimensional Classification Logistic Regression for Data Mining and High-Dimensional Classification Paul Komarek Dept. of Math Sciences Carnegie Mellon University komarek@cmu.edu Advised by Andrew Moore School of Computer Science

More information

Bayes and Naïve Bayes. cs534-machine Learning

Bayes and Naïve Bayes. cs534-machine Learning Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

More information

Towards running complex models on big data

Towards running complex models on big data Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

More information

Package EstCRM. July 13, 2015

Package EstCRM. July 13, 2015 Version 1.4 Date 2015-7-11 Package EstCRM July 13, 2015 Title Calibrating Parameters for the Samejima's Continuous IRT Model Author Cengiz Zopluoglu Maintainer Cengiz Zopluoglu

More information

Direct Methods for Solving Linear Systems. Matrix Factorization

Direct Methods for Solving Linear Systems. Matrix Factorization Direct Methods for Solving Linear Systems Matrix Factorization Numerical Analysis (9th Edition) R L Burden & J D Faires Beamer Presentation Slides prepared by John Carroll Dublin City University c 2011

More information

MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr

MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Department of Applied Mathematics Ecole Centrale Paris Galen

More information

Regression Analysis. Regression Analysis MIT 18.S096. Dr. Kempthorne. Fall 2013

Regression Analysis. Regression Analysis MIT 18.S096. Dr. Kempthorne. Fall 2013 Lecture 6: Regression Analysis MIT 18.S096 Dr. Kempthorne Fall 2013 MIT 18.S096 Regression Analysis 1 Outline Regression Analysis 1 Regression Analysis MIT 18.S096 Regression Analysis 2 Multiple Linear

More information

Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

More information

Estimating an ARMA Process

Estimating an ARMA Process Statistics 910, #12 1 Overview Estimating an ARMA Process 1. Main ideas 2. Fitting autoregressions 3. Fitting with moving average components 4. Standard errors 5. Examples 6. Appendix: Simple estimators

More information

Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

More information

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan

More information

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like

More information

Statistical machine learning, high dimension and big data

Statistical machine learning, high dimension and big data Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,

More information

5. Linear Regression

5. Linear Regression 5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4

More information

Logit and Probit. Brad Jones 1. April 21, 2009. University of California, Davis. Bradford S. Jones, UC-Davis, Dept. of Political Science

Logit and Probit. Brad Jones 1. April 21, 2009. University of California, Davis. Bradford S. Jones, UC-Davis, Dept. of Political Science Logit and Probit Brad 1 1 Department of Political Science University of California, Davis April 21, 2009 Logit, redux Logit resolves the functional form problem (in terms of the response function in the

More information

(a) The transpose of a lower triangular matrix is upper triangular, and the transpose of an upper triangular matrix is lower triangular.

(a) The transpose of a lower triangular matrix is upper triangular, and the transpose of an upper triangular matrix is lower triangular. Theorem.7.: (Properties of Triangular Matrices) (a) The transpose of a lower triangular matrix is upper triangular, and the transpose of an upper triangular matrix is lower triangular. (b) The product

More information

Lecture 2 Matrix Operations

Lecture 2 Matrix Operations Lecture 2 Matrix Operations transpose, sum & difference, scalar multiplication matrix multiplication, matrix-vector product matrix inverse 2 1 Matrix transpose transpose of m n matrix A, denoted A T or

More information

Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting. Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables - logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

Factorization Theorems

Factorization Theorems Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization

More information

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure Belyaev Mikhail 1,2,3, Burnaev Evgeny 1,2,3, Kapushev Yermek 1,2 1 Institute for Information Transmission

More information

Diagonal, Symmetric and Triangular Matrices

Diagonal, Symmetric and Triangular Matrices Contents 1 Diagonal, Symmetric Triangular Matrices 2 Diagonal Matrices 2.1 Products, Powers Inverses of Diagonal Matrices 2.1.1 Theorem (Powers of Matrices) 2.2 Multiplying Matrices on the Left Right by

More information

Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

More information

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances It is possible to construct a matrix X of Cartesian coordinates of points in Euclidean space when we know the Euclidean

More information

Classification by Pairwise Coupling

Classification by Pairwise Coupling Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating

More information

L10: Probability, statistics, and estimation theory

L10: Probability, statistics, and estimation theory L10: Probability, statistics, and estimation theory Review of probability theory Bayes theorem Statistics and the Normal distribution Least Squares Error estimation Maximum Likelihood estimation Bayesian

More information

The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

More information

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information

Linear Algebra Notes for Marsden and Tromba Vector Calculus

Linear Algebra Notes for Marsden and Tromba Vector Calculus Linear Algebra Notes for Marsden and Tromba Vector Calculus n-dimensional Euclidean Space and Matrices Definition of n space As was learned in Math b, a point in Euclidean three space can be thought of

More information

Lecture 8 February 4

Lecture 8 February 4 ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

More information

Ordinal Regression. Chapter

Ordinal Regression. Chapter Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

More information

Multiple Choice Models II

Multiple Choice Models II Multiple Choice Models II Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini Laura Magazzini (@univr.it) Multiple Choice Models II 1 / 28 Categorical data Categorical

More information

4. Joint Distributions of Two Random Variables

4. Joint Distributions of Two Random Variables 4. Joint Distributions of Two Random Variables 4.1 Joint Distributions of Two Discrete Random Variables Suppose the discrete random variables X and Y have supports S X and S Y, respectively. The joint

More information

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Chapter 4: Non-Parametric Classification

Chapter 4: Non-Parametric Classification Chapter 4: Non-Parametric Classification Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination

More information

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

More information

Local classification and local likelihoods

Local classification and local likelihoods Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

More information

Principle of Data Reduction

Principle of Data Reduction Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information