CSI:FLORIDA. Section 4.4: Logistic Regression




Revisit Masked Class Problem

[Figure: scatter plot of the three-class data in a two-dimensional input space.]

We can generalize this problem to a two-class problem as well! What is the actual problem here?
- No one line can separate the blue class from the other datapoints! Where has this problem been seen before?
- The single-layer perceptron problem: the XOR problem!

Linear Regression in Feature Space

[Figures: linear-regression outputs plotted over the input space, one per class.]

Can classify the green class with no problem!
Can classify the black class with no problem!
Problems when we try to classify the blue class!

Revisit Masked Class Problem

Are linear methods completely useless on this data?
- No, we can perform a non-linear transformation on the data via fixed basis functions!
- Many times when we perform this transformation, features that were not linearly separable in the original feature space become linearly separable in the transformed feature space.

Basis Functions Overview

Basic linear regression models are linear combinations of the input variables:

    y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D

where w_0 is the bias parameter. Models can be extended by using fixed basis functions, which allows for linear combinations of nonlinear functions of the input variables:

    y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})

Gaussian or RBF basis function:

    \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)

Basis vector: \boldsymbol{\phi} = (\phi_0, \ldots, \phi_{M-1})^T. A dummy basis function \phi_0(\mathbf{x}) = 1 is used for the bias parameter. The basis-function center \mu_j governs its location in input space; the scale parameter s determines the spatial scale.
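As a concrete illustration, here is a minimal NumPy sketch of building the design matrix for Gaussian RBF basis functions; the centers and scale below are arbitrary choices for demonstration, not values from the slides.

    import numpy as np

    def rbf_design_matrix(X, centers, s):
        """Design matrix whose columns are Gaussian RBF basis functions
        phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2)), plus the dummy column
        phi_0(x) = 1 for the bias parameter."""
        # Squared distance from every input point to every center
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        Phi = np.exp(-sq_dists / (2.0 * s ** 2))
        # Prepend the constant basis function phi_0(x) = 1
        return np.hstack([np.ones((X.shape[0], 1)), Phi])

    # Example: 2-D inputs, two hypothetical centers, scale s = 0.5
    X = np.array([[0.0, 0.0], [1.0, 1.0]])
    centers = np.array([[0.0, 0.0], [1.0, 0.0]])
    print(rbf_design_matrix(X, centers, s=0.5).shape)  # (2, 3)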

Linear Regression in Transformed Feature Space

[Figures: linear-regression outputs plotted over the transformed feature space, one per class.]

Again, we can classify the green class with no problem!
Again, we can classify the black class with no problem!
Now we can classify the blue class with no problem!

Features in Transformed Space are Linearly Separable

[Figure: the data plotted in the transformed coordinates theta_1 and theta_2, where the classes are linearly separable.]

More on basis functions and kernel space in later sections of the book. Now that we have introduced basis functions and the basis vector, we can discuss logistic regression in these terms!

Logistic Regression Motivations

Desire for a linear model to estimate the posterior probabilities of the K classes; to be probabilities, the model outputs must ensure:
- The posterior probabilities sum to one
- The posterior probabilities lie in [0, 1]

Build a model with the properties desired for a classification task, versus regression:
- No extreme numbers: constrain the model outputs to lie within the [0, 1] interval
- Create a model that is robust to outliers
- Desire a model with fewer parameters

Logistic Regression Model Formulation (The Elements of Statistical Learning)

The model is formulated as K - 1 log-odds, or logit, transformations:

    \ln \frac{p(C_1 \mid \boldsymbol{\phi})}{p(C_K \mid \boldsymbol{\phi})} = w_{1,0} \phi_0 + w_{1,1} \phi_1 + \cdots + w_{1,M} \phi_M = \mathbf{w}_1^T \boldsymbol{\phi}

    \ln \frac{p(C_2 \mid \boldsymbol{\phi})}{p(C_K \mid \boldsymbol{\phi})} = \mathbf{w}_2^T \boldsymbol{\phi}

    \vdots

    \ln \frac{p(C_{K-1} \mid \boldsymbol{\phi})}{p(C_K \mid \boldsymbol{\phi})} = \mathbf{w}_{K-1}^T \boldsymbol{\phi}

A logit function, or log-odds, is the log ratio of the probabilities for two classes; in our model we arbitrarily choose the Kth class for the ratio denominator.

*NOTE: The logits are constructed in linear form but do not require the Gaussian assumptions; we will estimate the weights via IRLS.
**As previously shown, this linear model can be derived from LDA under the assumption of Gaussian-distributed classes with a shared covariance matrix.

Logistic Regression Model Formulation (The Elements of Statistical Learning)

The class posterior estimates are:

    p(C_k \mid \boldsymbol{\phi}) = \frac{e^{\mathbf{w}_k^T \boldsymbol{\phi}}}{1 + \sum_{j=1}^{K-1} e^{\mathbf{w}_j^T \boldsymbol{\phi}}}, \quad k = 1, \ldots, K-1

    p(C_K \mid \boldsymbol{\phi}) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\mathbf{w}_j^T \boldsymbol{\phi}}}

The class posteriors sum to 1 and each lies within [0, 1]; the two-class variant is an even simpler model with only a single linear function. Simple enough, but why do they call it LOGISTIC regression?

Logistic Regression Model Formulation (Pattern Recognition and Machine Learning, the Bishop book)

Instead of starting with the multi-class version, let's start with the two-class case:

    p(C_1 \mid \boldsymbol{\phi}) = \frac{p(\boldsymbol{\phi} \mid C_1) \, p(C_1)}{p(\boldsymbol{\phi} \mid C_1) \, p(C_1) + p(\boldsymbol{\phi} \mid C_2) \, p(C_2)} = \frac{1}{1 + e^{-a}} = \sigma(a)

where we have defined

    a = \ln \frac{p(\boldsymbol{\phi} \mid C_1) \, p(C_1)}{p(\boldsymbol{\phi} \mid C_2) \, p(C_2)}

and \sigma(a) is the logistic sigmoid, defined as \sigma(a) = 1 / (1 + e^{-a}).

Logistic Regression Model Formulation (Pattern Recognition and Machine Learning, the Bishop book)

[Figure: plot of the logistic sigmoid, sigmoid output versus values of 'a'.]

The term sigmoid means S-shaped. The function can also be referred to as a squashing function.

Logistic Regression Model Formulation (Pattern Recognition and Machine Learning, the Bishop book)

The inverse of the logistic sigmoid:

    a = \ln \frac{\sigma}{1 - \sigma}

This function is known as the logit function, or log-odds!

For the case when K > 2 classes are present, we can use a multi-class generalization of the logistic sigmoid known as the normalized exponential, also known as a softmax function:

    p(C_k \mid \boldsymbol{\phi}) = \frac{e^{a_k}}{\sum_j e^{a_j}}, \quad \text{where } a_k = \mathbf{w}_k^T \boldsymbol{\phi}
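A minimal Python sketch of the sigmoid and softmax as defined above; subtracting the maximum activation inside the softmax is a standard numerical-stability trick, not something from the slides.

    import numpy as np

    def sigmoid(a):
        """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):
        """Normalized exponential: p(C_k) = exp(a_k) / sum_j exp(a_j).
        Subtracting max(a) leaves the result unchanged but avoids overflow."""
        e = np.exp(a - np.max(a))
        return e / e.sum()

    a = np.array([2.0, 0.5, -1.0])   # activations a_k = w_k^T phi
    print(softmax(a).sum())          # 1.0: the posteriors sum to one
    print(sigmoid(0.0))              # 0.5: the sigmoid is centered at a = 0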

Logistic Regression Model Formulation (Pattern Recognition and Machine Learning, the Bishop book)

Thus for a two-class logistic regression model we have:

    p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T \boldsymbol{\phi})

Now how do we learn the weights? Use a least-squares method known as Iteratively Reweighted Least Squares (IRLS). Why can we not simply use the standard least-squares solution? Because our log-likelihood function is not quadratic in the weights, and thus its derivative is not linear in the weights. This means we do NOT have a closed-form solution and must perform an iterative method.

Iteratively Reweighted Least Squares (IRLS)

IRLS is derived similarly whether or not the sigmoid function is used. The derivation is made straightforward when using the sigmoid because its derivative can be expressed in terms of itself:

    \frac{d\sigma}{da} = \sigma (1 - \sigma)

NOTE: The remainder of the IRLS discussion follows the class book; however, I will point out some differences between the derivation in the class book and Bishop's book.
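The identity is easy to verify numerically; here is a quick finite-difference check (the test point and step size are arbitrary choices for illustration):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    a, h = 0.7, 1e-6
    numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)  # central difference
    analytic = sigmoid(a) * (1 - sigmoid(a))               # sigma(1 - sigma)
    print(abs(numeric - analytic) < 1e-8)                  # True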

Iteratively Reweighted Least Squares (IRLS)

Since we are using the two-class case, we use the binomial distribution to model the class probability, representing our class labels as y in {0, 1} with

    p(y = 1 \mid x; \beta) = p(x; \beta), \qquad p(y = 0 \mid x; \beta) = 1 - p(x; \beta)

The log-likelihood is then

    l(\beta) = \sum_{i=1}^{N} \left\{ y_i \ln p(x_i; \beta) + (1 - y_i) \ln (1 - p(x_i; \beta)) \right\} = \sum_{i=1}^{N} \left\{ y_i \beta^T x_i - \ln \left( 1 + e^{\beta^T x_i} \right) \right\}

We want to maximize the log-likelihood (Bishop, however, minimizes the error function given by the negative log-likelihood). Setting the derivative to zero:

    \frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - p(x_i; \beta) \right) = 0

As we can see, the equations are nonlinear in \beta.

Iteratively Reweighted Least Squares (IRLS)

Solve the equations for \beta using the Newton-Raphson algorithm:

    \beta^{new} = \beta^{old} - \left( \frac{\partial^2 l(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial l(\beta)}{\partial \beta}

The second derivative, or Hessian, of our log-likelihood is:

    \frac{\partial^2 l(\beta)}{\partial \beta \, \partial \beta^T} = - \sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta) \left( 1 - p(x_i; \beta) \right)

If we express our data and labels by the matrix X and the vector y, our probabilities by the vector p, and the weighting matrix by W, we can write the update in matrix form:

    \beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p)

where W is a diagonal matrix whose ith diagonal entry is p(x_i; \beta^{old}) (1 - p(x_i; \beta^{old})).

Iteratively Reweighted Least Squares (IRLS)

Since the log-likelihood is concave, the algorithm typically converges (though overshooting can occur). We can rearrange the Newton step to express the algorithm as a weighted least-squares step:

    \beta^{new} = (X^T W X)^{-1} X^T W z

with the adjusted response

    z = X \beta^{old} + W^{-1} (y - p)

See section 4.4.3 for more properties of the IRLS adjusted response.
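Putting the pieces together, here is a minimal NumPy sketch of the IRLS update above for the two-class case; the toy dataset, iteration cap, convergence tolerance, and the clipping guard on the weights are assumptions added for the demonstration.

    import numpy as np

    def irls(X, y, max_iter=25, tol=1e-8):
        """Two-class logistic regression via iteratively reweighted least
        squares. X is N x M (include a column of ones for the intercept);
        y holds labels in {0, 1}."""
        beta = np.zeros(X.shape[1])
        for _ in range(max_iter):
            p = 1.0 / (1.0 + np.exp(-X @ beta))      # p(x_i; beta)
            w = np.clip(p * (1 - p), 1e-10, None)    # diag of W; guard saturation
            z = X @ beta + (y - p) / w               # adjusted response
            # Weighted least-squares step: beta = (X^T W X)^{-1} X^T W z
            beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
            if np.max(np.abs(beta_new - beta)) < tol:
                return beta_new
            beta = beta_new
        return beta

    # Toy example: one feature plus an intercept column
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = (x + 0.3 * rng.normal(size=100) > 0).astype(float)
    X = np.column_stack([np.ones_like(x), x])
    print(irls(X, y))   # [intercept, slope]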

L1 Regularized Logistic Regression

As in the LASSO, an L1 penalty can be used for variable selection and shrinkage. This is done by replacing our log-likelihood function with a regularized form and maximizing it:

    \max_{\beta_0, \beta} \; \sum_{i=1}^{N} \left[ y_i (\beta_0 + \beta^T x_i) - \ln \left( 1 + e^{\beta_0 + \beta^T x_i} \right) \right] - \lambda \sum_{j=1}^{P} |\beta_j|

NOTE: As before, we do not penalize the intercept, and so must express it separately. This function is concave and can be solved via nonlinear programming methods or by repeated application of the weighted LASSO algorithm.
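In practice one would usually reach for an existing solver; as one example, scikit-learn's LogisticRegression supports an L1 penalty. Note that its C parameter is the inverse of the lambda above, and the synthetic dataset here is invented purely for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))   # 10 features; only the first two matter
    y = (X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

    # The L1 penalty drives irrelevant coefficients exactly to zero;
    # the intercept is left unpenalized, matching the formulation above.
    clf = LogisticRegression(penalty="l1", C=0.5, solver="saga", max_iter=5000)
    clf.fit(X, y)
    print(np.round(clf.coef_, 2))    # a sparse coefficient vector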

Conclusions

Logistic regression and LDA have similar forms.

LDA:

    \ln \frac{p(G = k \mid X = x)}{p(G = K \mid X = x)} = \ln \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K) = \alpha_{k,0} + \alpha_k^T x

In LDA this linearity results from our Gaussian assumption.

Logistic regression:

    \ln \frac{p(G = k \mid X = x)}{p(G = K \mid X = x)} = \beta_{k,0} + \beta_k^T x

Logistic regression has linear logits by construction. However, the two models' coefficients are estimated differently. The logistic regression model makes fewer assumptions and is therefore more general. To illustrate, look at the joint density of X and G:

    p(X, G = k) = p(X) \, p(G = k \mid X)

Both models have the logit-linear form for the right-hand term. The logistic regression model basically ignores the marginal density of X and fits the parameters by maximizing the conditional likelihood; the LDA model maximizes the full likelihood based on the joint density.

Conclusions

What does this mean for LDA?
- If the Gaussian assumptions are accurate, then we have more information about the model parameters and can therefore estimate them more efficiently.
- In addition, we can use unlabeled points to help estimate the model and distribution parameters.
- LDA is less robust to outliers: datapoints far from the decision boundary play a role in estimating the common covariance.
- Logistic regression requires fewer parameters. Given an M-dimensional feature space and a two-class problem, logistic regression requires M adjustable parameters, while LDA requires M(M+5)/2 + 1 parameters: 2M for the means, M(M+1)/2 for the shared covariance matrix, and 1 for the class prior.
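A quick arithmetic check of those parameter counts for an arbitrarily chosen feature dimension (M = 10):

    M = 10
    logreg_params = M                            # weight vector w
    lda_params = 2 * M + M * (M + 1) // 2 + 1    # means + shared covariance + prior
    print(logreg_params, lda_params)             # 10 76  (= M(M+5)/2 + 1)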