Section 4.4: Logistic Regression
Revisit the Masked Class Problem

[Figure: scatter plot of the three-class data; the blue class is masked between the other two]

We can generalize this problem to the two-class problem as well! What is the actual problem here?
- No one line can separate the blue class from the other datapoints!

Where has this problem been seen before?
- The single-layer perceptron problem: the XOR problem!
Linear Regression in Feature Space

[Figures: linear discriminants fit in the original feature space]

- Can classify the green class with no problem!
- Can classify the black class with no problem!
- Problems when we try to classify the blue class!
Revisit the Masked Class Problem

Are linear methods completely useless on this data?
- No, we can perform a non-linear transformation on the data via fixed basis functions!
- Often, features that were not linearly separable in the original feature space become linearly separable in the transformed feature space.
Basis Functions Overview

Basic linear regression models are linear combinations of the input variables:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D$$

where $w_0$ is the bias parameter. Models can be extended by using fixed basis functions, which allows for linear combinations of nonlinear functions of the input variables:

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$

Gaussian or RBF basis function:

$$\phi_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2s^2}\right)$$

Basis vector: $\boldsymbol{\phi} = (\phi_0, \phi_1, \ldots, \phi_{M-1})^T$, with the dummy basis function $\phi_0(\mathbf{x}) = 1$ used for the bias parameter. The basis function center $\boldsymbol{\mu}_j$ governs its location in input space, and the scale parameter $s$ determines the spatial scale.
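To make the transformation concrete, here is a minimal sketch (Python/NumPy; the centers and scale below are illustrative choices, not values from the slides) of building the Gaussian basis vector:

```python
import numpy as np

def gaussian_basis_vector(x, centers, s):
    """Map an input x to phi(x) = (phi_0, phi_1, ..., phi_M)^T.

    phi_0 = 1 is the dummy basis function for the bias parameter;
    phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2)) for the remaining entries.
    """
    sq_dists = np.sum((centers - x) ** 2, axis=1)   # ||x - mu_j||^2 for each center
    rbf = np.exp(-sq_dists / (2.0 * s ** 2))        # Gaussian/RBF responses
    return np.concatenate(([1.0], rbf))             # prepend the dummy phi_0 = 1

# Illustrative: two hand-picked centers for 2-D inputs, scale s = 0.5
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
phi = gaussian_basis_vector(np.array([0.2, -0.3]), centers, s=0.5)
```

With one center placed on each cluster, XOR-style data like the masked-class example becomes linearly separable in the transformed space, which is what the following plots show.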
Linear Regression in Transformed Feature Space

[Figures: linear discriminants fit in the transformed feature space]

- Again, can classify the green class with no problem!
- Again, can classify the black class with no problem!
- Now we can classify the blue class with no problem!

Features in the Transformed Space are Linearly Separable

[Figure: the data plotted in (theta1, theta2) space, where the classes are linearly separable]
More on basis functions and kernel space in later sections of the book. Now that we have introduced basis functions and the basis vector, we can discuss logistic regression in these terms!
Logistic Regression Motivations

- Desire for a linear model to estimate the posterior probabilities of K classes; to be a probability, the model must ensure:
  - The posterior probabilities sum to one
  - The posterior probabilities lie in [0, 1]
- Build a model with the properties desired for a classification task versus regression:
  - No extreme numbers: constrain the model outputs to lie within the [0, 1] interval
  - Create a model that is robust to outliers
- Desire a model with fewer parameters
Logistic Regression Model Formulation (The Elements of Statistical Learning)

The model is formulated as K-1 log-odds or logit transformations:

$$\ln\frac{p(C_1 \mid \boldsymbol{\phi})}{p(C_K \mid \boldsymbol{\phi})} = w_{1,0}\phi_0 + w_{1,1}\phi_1 + \cdots + w_{1,M}\phi_M = \mathbf{w}_1^T\boldsymbol{\phi}$$

$$\ln\frac{p(C_2 \mid \boldsymbol{\phi})}{p(C_K \mid \boldsymbol{\phi})} = \mathbf{w}_2^T\boldsymbol{\phi}$$

$$\vdots$$

$$\ln\frac{p(C_{K-1} \mid \boldsymbol{\phi})}{p(C_K \mid \boldsymbol{\phi})} = \mathbf{w}_{K-1}^T\boldsymbol{\phi}$$

A logit function, or log-odds, is the log ratio of the probabilities for two classes; in our model we arbitrarily choose the Kth class for our ratio denominator.

*NOTE: The logits are constructed with linear form but do not require the Gaussian assumptions; we will estimate the weights via IRLS.

**As previously shown: this linear model can be derived from LDA under the assumption of Gaussian distributed classes with a shared covariance matrix.
Logistic Regression Model Formulation (The Elements of Statistical Learning)

The class posterior estimates are:

$$p(C_k \mid \boldsymbol{\phi}) = \frac{e^{\mathbf{w}_k^T\boldsymbol{\phi}}}{1 + \sum_{j=1}^{K-1} e^{\mathbf{w}_j^T\boldsymbol{\phi}}}, \quad k = 1, \ldots, K-1$$

$$p(C_K \mid \boldsymbol{\phi}) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\mathbf{w}_j^T\boldsymbol{\phi}}}$$

The class distributions will sum to 1 and produce an output within [0, 1]; the two-class variant is an even simpler model with only a single linear function. Simple enough, but why do they call it LOGISTIC regression?
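As a sanity check on these formulas, a small illustrative sketch (a hypothetical Python/NumPy helper, not code from the text) computing the posteriors from the K-1 weight vectors, with class K as the reference:

```python
import numpy as np

def class_posteriors(W, phi):
    """p(C_k | phi) for k = 1..K, from the K-1 weight vectors in the rows of W.

    Class K is the reference class: its logit is implicitly zero,
    contributing the '1 +' term in the denominator.
    """
    e = np.exp(W @ phi)              # exp(w_k^T phi) for k = 1..K-1
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)

# Illustrative: K = 3 classes, basis vector of length 4 (phi_0 = 1 included)
W = np.array([[0.5, -1.0, 0.2, 0.0],
              [0.1,  0.3, -0.7, 1.0]])
p = class_posteriors(W, np.array([1.0, 0.2, -0.4, 0.3]))
assert np.isclose(p.sum(), 1.0)      # the posteriors sum to one
```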
Logistic Regression Model Formulation (Pattern Recognition and Machine Learning, the Bishop book)

Instead of starting with the multi-class version, let's start with the two-class case:

$$p(C_1 \mid \boldsymbol{\phi}) = \frac{p(\boldsymbol{\phi} \mid C_1)\,p(C_1)}{p(\boldsymbol{\phi} \mid C_1)\,p(C_1) + p(\boldsymbol{\phi} \mid C_2)\,p(C_2)} = \frac{1}{1 + e^{-a}} = \sigma(a)$$

where we have defined

$$a = \ln\frac{p(\boldsymbol{\phi} \mid C_1)\,p(C_1)}{p(\boldsymbol{\phi} \mid C_2)\,p(C_2)}$$

and $\sigma(a)$ is the logistic sigmoid, defined as

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$
Logistic Regression Model Formulation (Pattern Recognition and Machine Learning, the Bishop book)

[Figure: sigmoid output versus values of 'a', an S-shaped curve rising from 0 to 1]

The term sigmoid means S-shaped. It can also be referred to as a squashing function.
Logistic Regression Model Formulation (Pattern Recognition and Machine Learning, the Bishop book)

The inverse of the logistic sigmoid:

$$a = \ln\frac{\sigma}{1 - \sigma}$$

This function is known as the logit function, or log-odds!

For the case when K > 2 classes are present, we can use a multi-class generalization of the logistic sigmoid known as the normalized exponential, also known as a softmax function:

$$p(C_k \mid \boldsymbol{\phi}) = \frac{e^{a_k}}{\sum_{j=1}^{K} e^{a_j}}, \quad \text{where } a_k = \mathbf{w}_k^T\boldsymbol{\phi}$$
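These definitions are easy to verify numerically; a brief illustrative sketch (assuming NumPy):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(p):
    """Inverse of the sigmoid: the log-odds ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def softmax(a):
    e = np.exp(a - np.max(a))     # shift by max(a) for numerical stability
    return e / e.sum()

a = 0.7
assert np.isclose(logit(sigmoid(a)), a)    # the logit undoes the sigmoid
# With K = 2 the softmax reduces to a sigmoid of the logit difference a_1 - a_2:
assert np.isclose(softmax(np.array([a, 0.0]))[0], sigmoid(a))
```

The K = 2 collapse to a single sigmoid is exactly why the two-class model needs only one linear function.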
Logistic Regression Model Formulation (Pattern Recognition and Machine Learning, the Bishop book)

Thus for a two-class logistic regression model we have:

$$p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T\boldsymbol{\phi})$$

Now how do we learn the weights? Use a least squares method known as Iterative Reweighted Least Squares (IRLS). Why can we not simply use the standard least squares solution? Because our log-likelihood function is not quadratic in the weights, and thus its derivative is not linear in the weights. This means we do NOT have a closed form solution and must perform an iterative method.
Iterative Reweighted Least Squares (IRLS)

IRLS is derived similarly whether or not the sigmoid function is used. The derivation is made straightforward when using the sigmoid because its derivative can be expressed in terms of itself:

$$\frac{d\sigma}{da} = \sigma(1 - \sigma)$$

NOTE: The remainder of the IRLS discussion will follow the class book; however, I will point out some differences between the derivation in the class book and Bishop's book.
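A quick finite-difference check of this identity (an illustrative sketch, assuming NumPy):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, h = 0.3, 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2.0 * h)   # central difference
analytic = sigmoid(a) * (1.0 - sigmoid(a))                # sigma(a) * (1 - sigma(a))
assert np.isclose(numeric, analytic)
```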
Iterative Reweighted Least Squares (IRLS)

Since we are using the two-class case, we use the binomial distribution to model the class probability. We represent our class labels as:

$$\Pr(G = 1 \mid X = x; \theta) = p(x; \theta), \qquad \Pr(G = 2 \mid X = x; \theta) = 1 - p(x; \theta)$$

We want to maximize the log-likelihood (note, however, that Bishop minimizes the error function given by the negative log-likelihood):

$$\ell(\beta) = \sum_{i=1}^{N} \left\{ y_i \ln p(x_i; \beta) + (1 - y_i)\ln\left(1 - p(x_i; \beta)\right) \right\} = \sum_{i=1}^{N} \left\{ y_i \beta^T x_i - \ln\left(1 + e^{\beta^T x_i}\right) \right\}$$

Setting the derivative to zero:

$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - p(x_i; \beta) \right) = 0$$

As we can see, the equations are nonlinear in $\beta$.
Iterative Reweighted Least Squares (IRLS)

Solve the equations for $\beta$ using the Newton-Raphson algorithm:

$$\beta^{new} = \beta^{old} - \left(\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T}\right)^{-1} \frac{\partial \ell(\beta)}{\partial \beta}$$

The second derivative, or Hessian, of our log-likelihood is:

$$\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} = -\sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta)\left(1 - p(x_i; \beta)\right)$$

If we express our data and labels by the matrix $X$ and vector $y$, our probabilities by the vector $p$, and the weighting matrix by $W$, we can write the above in matrix form:

$$\beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p)$$

The weighting matrix is a diagonal matrix with ith diagonal entry:

$$W_{ii} = p(x_i; \beta^{old})\left(1 - p(x_i; \beta^{old})\right)$$
Iterative Reweighted Least Squares (IRLS)

Since the log-likelihood is concave, the algorithm typically converges. We can rearrange the Newton step to express the algorithm as a weighted least squares step:

$$\beta^{new} = (X^T W X)^{-1} X^T W z$$

with the adjusted response:

$$z = X\beta^{old} + W^{-1}(y - p)$$

See section 4.4.3 for more properties inherent in the IRLS adjusted response.
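Putting the pieces together, a minimal IRLS sketch (illustrative Python/NumPy; it assumes X already carries a leading column of ones and labels coded 0/1, and the floor on W is a practical safeguard rather than part of the textbook derivation):

```python
import numpy as np

def irls(X, y, n_iter=25, tol=1e-8):
    """Fit two-class logistic regression by iteratively reweighted least squares.

    X : (N, M+1) design matrix whose first column is all ones (intercept)
    y : (N,) class labels coded as 0 or 1
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # p(x_i; beta)
        w = np.clip(p * (1.0 - p), 1e-10, None)   # diag(W), floored so W^{-1} stays finite
        z = X @ beta + (y - p) / w                # adjusted response z = X beta + W^{-1}(y - p)
        # Weighted least squares step: beta <- (X^T W X)^{-1} X^T W z
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        step = np.max(np.abs(beta_new - beta))
        beta = beta_new
        if step < tol:
            break
    return beta
```

On perfectly separable data the weights diverge and p saturates at 0 or 1; this is where the floor on W (and, in practice, regularization) matters.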
L1 Regularized Logistic Regression

As in the LASSO, an L1 penalty can be used for variable selection and shrinkage. This is done by replacing our log-likelihood function with a regularized form and maximizing it:

$$\max_{\beta_0,\, \beta} \; \sum_{i=1}^{N} \left[ y_i(\beta_0 + \beta^T x_i) - \ln\left(1 + e^{\beta_0 + \beta^T x_i}\right) \right] - \lambda \sum_{j=1}^{P} |\beta_j|$$

NOTE: As before, we do not penalize the intercept, and so must express it separately.

This function is concave and can be solved via nonlinear programming methods or by repeated application of the weighted LASSO algorithm.
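In practice this optimization is usually delegated to a library solver; a sketch using scikit-learn (an assumed dependency, not part of the text; its C parameter plays the role of 1/lambda in the objective above, and the synthetic data is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: 200 points, 10 features, only features 0 and 3 informative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - 2.0 * X[:, 3] + 0.5 * rng.normal(size=200) > 0).astype(int)

# penalty="l1" with the saga solver gives L1-regularized logistic regression
model = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
model.fit(X, y)
print(model.coef_)   # many coefficients driven exactly to zero: variable selection
```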
Conclusions

Logistic Regression and LDA have similar forms.

LDA:

$$\ln\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \ln\frac{\pi_k}{\pi_K} - \frac{1}{2}(\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^T x$$

In LDA this linearity results from our Gaussian assumption.

Logistic Regression:

$$\ln\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x$$

Logistic regression has linear logits by construction. However, their coefficients are estimated differently. The logistic regression model has fewer assumptions and is therefore more general. To illustrate, look at the joint density of X and G:

$$\Pr(X, G = k) = \Pr(X)\Pr(G = k \mid X)$$

Both models have the logit-linear form for the right term. The logistic regression model basically ignores the marginal density $\Pr(X)$ and fits the parameters by maximizing the conditional likelihood; the LDA model maximizes the full likelihood based on the joint density $\Pr(X, G = k) = \phi(X; \mu_k, \Sigma)\,\pi_k$.
Conclusions

What does this mean for LDA?
- If the Gaussian assumptions are accurate, then we have more information about the model parameters and thus can estimate them more efficiently.
- In addition, we can use unlabeled points to help us in estimating model and distribution parameters.
- LDA is less robust to outliers: datapoints far from the decision boundary play a role in estimating the common covariance.

Logistic regression requires fewer parameters. Given an M-dimensional feature space and a two-class problem:
- Logistic regression requires M + 1 adjustable parameters.
- LDA requires M(M+5)/2 + 1 parameters: 2M parameters for the means, M(M+1)/2 parameters for the shared covariance matrix, and 1 for the class prior.
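A quick arithmetic check of these two-class parameter counts (illustrative):

```python
def logistic_params(M):
    return M + 1                             # M weights plus the intercept

def lda_params(M):
    return 2 * M + M * (M + 1) // 2 + 1      # two class means + shared covariance + prior

for M in (2, 10, 100):
    print(M, logistic_params(M), lda_params(M))   # e.g. M = 10 gives 11 vs. 76
```

The gap grows quadratically in M, which is the practical cost of LDA's covariance estimate.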