Lecture 5,6: Linear Methods for Classification
Rice ELEC 697, Farnaz Koushanfar, Fall 2006
Summary:
- Bayes classifiers
- Linear classifiers
- Linear regression of an indicator matrix
- Linear discriminant analysis (LDA)
- Logistic regression
- Separating hyperplanes
Reading: (ch. 4, ESL)
Bayes Classifier
The marginal distribution of G is specified as the PMF p_G(g), g = 1, 2, ..., K
f_{X|G}(x | G=g) is the conditional distribution of X for G = g
The training set (x_i, g_i), i = 1, ..., N has independent samples from the joint distribution f_{X,G}(x, g)
f_{X,G}(x, g) = p_G(g) f_{X|G}(x | G=g)
The loss of predicting G* for G is L(G*, G)
Classification goal: minimize the expected loss E_{X,G} L(G(X), G) = E_X (E_{G|X} L(G(X), G))

Bayes Classifier (cont'd)
It suffices to minimize E_{G|X} L(G(X), G) for each X. The optimal classifier is:
G(x) = argmin_g E_{G|X=x} L(g, G)
The Bayes rule is also known as the rule of maximum a posteriori probability (the Bayes classification rule):
G(x) = argmax_g Pr(G=g | X=x)
Many classification algorithms estimate Pr(G=g | X=x) and then apply the Bayes rule
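The Bayes rule above can be sketched in a few lines. Under 0-1 loss, the posterior is proportional to the prior times the class-conditional density, and the rule picks the argmax. The two 1-D Gaussian classes and their priors below are illustrative assumptions, not data from the lecture.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Class-conditional density f_{X|G}(x | G=g) for a 1-D Gaussian
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = np.array([0.6, 0.4])                  # p_G(g), assumed
means = np.array([0.0, 2.0])                   # assumed class means
sigmas = np.array([1.0, 1.0])

def bayes_classify(x):
    # Posterior is proportional to prior * density; argmax is the Bayes rule
    post = priors * gaussian_pdf(x, means, sigmas)
    return int(np.argmax(post))

print(bayes_classify(-1.0))  # well left of both means -> class 0
print(bayes_classify(3.0))   # near the class-1 mean -> class 1
```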
More About Linear Classification
Since the predictor G(x) takes values in a discrete set G, we can divide the input space into a collection of regions labeled according to the classification
For K classes (1, 2, ..., K), the fitted linear model for the k-th indicator response variable is f̂_k(x) = β̂_k0 + β̂_k^T x
The decision boundary b/w classes k and l is: f̂_k(x) = f̂_l(x)
An affine set or hyperplane: {x : (β̂_k0 - β̂_l0) + (β̂_k - β̂_l)^T x = 0}
Model a discriminant function δ_k(x) for each class, then classify x to the class with the largest value of δ_k(x)

Linear Decision Boundary
We require that a monotone transformation of δ_k or Pr(G=k | X=x) be linear
Decision boundaries are the set of points with log-odds = 0
Prob. of class 1: π, prob. of class 2: 1-π
Apply a transformation: log[π/(1-π)] = β_0 + β^T x
Two popular methods that use log-odds: linear discriminant analysis, linear logistic regression
Explicitly model the boundary b/w two classes as linear. For a two-class problem with a p-dimensional input space, this is modeling the decision boundary as a hyperplane
Two methods using separating hyperplanes: the perceptron (Rosenblatt), optimally separating hyperplanes (Vapnik)
Generalizing Linear Decision Boundaries
Expand the variable set X_1, ..., X_p by including squares and cross products, adding up to p(p+1)/2 additional variables

Linear Regression of an Indicator Matrix
For K classes, K indicators Y_k, k=1,...,K, with Y_k = 1 if G = k, else 0
Indicator response matrix
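The quadratic basis expansion above can be sketched as follows: augment x_1, ..., x_p with all squares and pairwise cross products, which adds exactly p(p+1)/2 variables.

```python
from itertools import combinations_with_replacement

def expand_quadratic(x):
    # Append x_i * x_j for all i <= j: the p(p+1)/2 extra variables
    x = list(x)
    pairs = combinations_with_replacement(range(len(x)), 2)
    return x + [x[i] * x[j] for i, j in pairs]

# For p = 3: 3 original variables + 3*4/2 = 6 added ones
z = expand_quadratic([1.0, 2.0, 3.0])
print(len(z))  # 9
print(z)       # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```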
Linear Regression of an Indicator Matrix (Cont'd)
For N training data points, form the N × K indicator response matrix Y, a matrix of 0's and 1's
Ŷ = X(X^T X)^{-1} X^T Y
A new observation is classified as follows:
- Compute the fitted output (a K-vector): f̂(x) = [(1, x) B̂]^T
- Identify the largest component and classify accordingly: Ĝ(x) = argmax_{k∈G} f̂_k(x)
But how good is the fit?
- Verify that Σ_{k∈G} f̂_k(x) = 1 for any x
- f̂_k(x) can be negative or larger than 1
We can also apply linear regression to a basis expansion h(x)
As the size of the training set increases, adaptively add more basis functions

Linear Regression - Drawback
For K ≥ 3, especially for large K
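The indicator-matrix procedure above can be sketched directly: one-hot encode the classes into Y, fit B̂ by least squares on X with an intercept column, and classify a new x by the largest fitted component. The two-cluster toy data is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
g = np.array([0] * 50 + [1] * 50)

K = 2
Y = np.eye(K)[g]                            # N x K indicator response matrix
X1 = np.hstack([np.ones((len(X), 1)), X])   # prepend the intercept column
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)  # (p+1) x K coefficient matrix B_hat

def classify(x):
    f = np.concatenate([[1.0], x]) @ B      # fitted K-vector f_hat(x)
    return int(np.argmax(f))                # largest component wins

print(classify(np.array([0.0, 0.0])))  # near the class-0 cluster
print(classify(np.array([4.0, 4.0])))  # near the class-1 cluster
```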
Linear Regression - Drawback
For large K and small p, masking can naturally occur
E.g., vowel recognition data in a 2D subspace; K=11 classes, p=10 dimensions

Linear Regression and Projection*
A linear regression function (here in 2D) projects each point x = [x_1 x_2]^T to a line parallel to w_1
We can study how well the projected points {z_1, z_2, ..., z_n}, viewed as functions of w_1, are separated across the classes
* Slides courtesy of Tommi S. Jaakkola, MIT CSAIL
Projection and Classification
By varying w_1 we get different levels of separation between the projected points
Optimizing the Projection
We would like to find the w_1 that somehow maximizes the separation of the projected points across classes
We can quantify the separation (overlap) in terms of the means and variations of the resulting 1-D class distributions

Fisher Linear Discriminant: Preliminaries
Class descriptions in R^d:
- Class 0: n_0 samples, mean μ_0, covariance Σ_0
- Class 1: n_1 samples, mean μ_1, covariance Σ_1
Projected class descriptions in R:
- Class 0: n_0 samples, mean μ_0^T w_1, variance w_1^T Σ_0 w_1
- Class 1: n_1 samples, mean μ_1^T w_1, variance w_1^T Σ_1 w_1
Fisher Linear Discriminant
Estimation criterion: find the w_1 that maximizes the ratio of the squared separation of the projected means to the total projected variance:
J(w_1) = (μ_1^T w_1 - μ_0^T w_1)^2 / (w_1^T Σ_0 w_1 + w_1^T Σ_1 w_1)
The solution (class separation) is decision-theoretically optimal for two normal populations with equal covariances (Σ_1 = Σ_0)

Linear Discriminant Analysis (LDA)
π_k: class prior Pr(G=k)
f_k(x): density of X in class G=k
Bayes theorem: Pr(G=k | X=x) = π_k f_k(x) / Σ_{l=1}^K π_l f_l(x)
Leads to LDA, QDA, MDA (mixture DA), kernel DA, naive Bayes
Suppose that we model each density as a multivariate Gaussian (MVG)
LDA arises when we assume the classes have a common covariance matrix: Σ_k = Σ. It is then sufficient to look at the log-odds
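The Fisher criterion on the previous slide has the well-known closed-form maximizer w_1 ∝ (Σ_0 + Σ_1)^{-1}(μ_1 - μ_0), which the sketch below computes from sample estimates. The two Gaussian clusters are toy data assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], 1.0, (200, 2))   # class 0 samples
X1 = rng.normal([3, 1], 1.0, (200, 2))   # class 1 samples

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter (sum of the two class covariances)
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
w = np.linalg.solve(Sw, mu1 - mu0)       # Fisher direction
w /= np.linalg.norm(w)

# Project and check separation: the gap between projected class means
# should dominate the projected spreads
z0, z1 = X0 @ w, X1 @ w
print(abs(z1.mean() - z0.mean()) > 2 * max(z0.std(), z1.std()))
```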
LDA
The log-odds function implies that the decision boundary b/w classes k and l, {x : Pr(G=k | X=x) = Pr(G=l | X=x)}, is linear in x; in p dimensions, a hyperplane
Example: three classes and p=2

LDA (Cont'd)
LDA (Cont'd)
In practice, we do not know the parameters of the Gaussian distributions; we estimate them with the training set
N_k is the number of class-k data points:
π̂_k = N_k / N
μ̂_k = Σ_{g_i=k} x_i / N_k
Σ̂ = Σ_{k=1}^K Σ_{g_i=k} (x_i - μ̂_k)(x_i - μ̂_k)^T / (N - K)
For two classes, this is like linear regression

QDA
If the Σ_k's are not equal, the quadratic terms in x remain; we get quadratic discriminant functions (QDA)
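The plug-in LDA estimates above can be sketched end to end: estimate the priors, class means, and pooled covariance, then score each class with the linear discriminant δ_k(x) = x^T Σ̂^{-1} μ̂_k - ½ μ̂_k^T Σ̂^{-1} μ̂_k + log π̂_k. The toy training set is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1, (60, 2)), rng.normal([3, 3], 1, (40, 2))])
g = np.array([0] * 60 + [1] * 40)
N, K = len(X), 2

pis = np.array([(g == k).mean() for k in range(K)])          # pi_hat_k = N_k / N
mus = np.array([X[g == k].mean(axis=0) for k in range(K)])   # mu_hat_k
S = sum(((X[g == k] - mus[k]).T @ (X[g == k] - mus[k]))      # pooled covariance,
        for k in range(K)) / (N - K)                         # divided by N - K
Sinv = np.linalg.inv(S)

def lda_predict(x):
    # Linear discriminant delta_k(x); classify to the largest
    deltas = [x @ Sinv @ mus[k] - 0.5 * mus[k] @ Sinv @ mus[k] + np.log(pis[k])
              for k in range(K)]
    return int(np.argmax(deltas))

print(lda_predict(np.array([0.0, 0.0])), lda_predict(np.array([3.0, 3.0])))
```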
QDA (Cont'd)
The estimates are similar to LDA, but each class has a separate covariance matrix
For large p, a dramatic increase in the number of parameters:
- In LDA, there are (K-1)(p+1) parameters
- For QDA, there are (K-1){1 + p(p+3)/2}
LDA and QDA both work really well
This is not because the data is Gaussian; rather, for simple decision boundaries, the Gaussian estimates are stable
Bias-variance trade-off

Regularized Discriminant Analysis
A compromise b/w LDA and QDA: shrink the separate covariances of QDA towards a common covariance (similar to ridge regression):
Σ̂_k(α) = α Σ̂_k + (1-α) Σ̂, α ∈ [0,1]
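The RDA shrinkage above is a one-line interpolation between the class-specific (QDA) and pooled (LDA) covariance estimates. The two covariance matrices below are made-up placeholders standing in for fitted estimates.

```python
import numpy as np

Sigma_k = np.array([[2.0, 0.5], [0.5, 1.0]])       # class-specific (QDA) estimate
Sigma_pooled = np.array([[1.0, 0.0], [0.0, 1.0]])  # common (LDA) estimate

def rda_cov(alpha):
    # Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled
    return alpha * Sigma_k + (1 - alpha) * Sigma_pooled

print(np.allclose(rda_cov(1.0), Sigma_k))       # alpha = 1 recovers QDA
print(np.allclose(rda_cov(0.0), Sigma_pooled))  # alpha = 0 recovers LDA
```

In between, α trades variance (QDA's many parameters) against bias (LDA's shared covariance), and is typically chosen by cross-validation.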
Example - RDA

Computations for LDA
Suppose we compute the eigendecomposition for each Σ̂_k = U_k D_k U_k^T, i.e., U_k is a p × p orthonormal matrix and D_k a diagonal matrix of positive eigenvalues d_kl. Then,
(x - μ̂_k)^T Σ̂_k^{-1} (x - μ̂_k) = [U_k^T (x - μ̂_k)]^T D_k^{-1} [U_k^T (x - μ̂_k)]
log|Σ̂_k| = Σ_l log d_kl
The LDA classifier is implemented as:
- Sphere the data with respect to the common covariance estimate Σ̂ = UDU^T: X* ← D^{-1/2} U^T X. The common covariance estimate of X* is then the identity
- Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities π_k
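The two-step implementation above can be sketched as follows: eigendecompose the pooled covariance, sphere with D^{-1/2} U^T, and classify to the nearest sphered centroid. Equal priors and a toy two-class data set are assumed here, so the prior correction drops out.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.multivariate_normal([0, 0], [[2, 1], [1, 2]], 100),
               rng.multivariate_normal([4, 0], [[2, 1], [1, 2]], 100)])
g = np.array([0] * 100 + [1] * 100)

mus = np.array([X[g == k].mean(axis=0) for k in range(2)])
# Pooled covariance (simple average is fine here since n_0 = n_1)
S = sum(np.cov(X[g == k], rowvar=False) for k in range(2)) / 2
vals, U = np.linalg.eigh(S)              # S = U D U^T
sphere = np.diag(vals ** -0.5) @ U.T     # X* = D^{-1/2} U^T X

def predict(x):
    z = sphere @ x
    centroids = (sphere @ mus.T).T       # class centroids in the sphered space
    return int(np.argmin(np.linalg.norm(centroids - z, axis=1)))

print(predict(np.array([0.0, 0.0])), predict(np.array([4.0, 0.0])))
```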
Background: Simple Decision Theory*
Suppose we know the class-conditional densities p(x|y) for y=0,1 as well as the overall class frequencies P(y)
How do we decide which class a new example x belongs to so as to minimize the overall probability of error?
* Courtesy of Tommi S. Jaakkola, MIT CSAIL
2-Class Logistic Regression
The optimal decisions are based on the posterior class probabilities P(y|x). For binary classification problems, we can write these decisions as ŷ = argmax_y P(y|x)
We generally don't know P(y|x), but we can parameterize the possible decisions

2-Class Logistic Regression (Cont'd)
Our log-odds model, log[P(y=1|x)/P(y=0|x)] = β_0 + β^T x, gives rise to a specific form for the conditional probability over the labels (the logistic model):
P(y=1|x) = g(β_0 + β^T x)
where g(z) = 1/(1 + e^{-z}) is a logistic "squashing" function that turns linear predictions into probabilities
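The logistic model above is tiny in code: the squashing function maps the linear log-odds to a probability, and the decision boundary sits where the log-odds are zero. The parameter values are illustrative, not fitted.

```python
import math

def sigmoid(z):
    # Logistic squashing function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

beta0, beta1 = -1.0, 2.0            # illustrative (assumed) parameters

def p_y1(x):
    # P(y=1 | x) under the logistic model with a single input
    return sigmoid(beta0 + beta1 * x)

print(sigmoid(0.0))   # 0.5: log-odds 0 is the decision boundary
print(p_y1(0.5))      # 0.5: here beta0 + beta1 * x = 0
```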
2-Class Logistic Regression: Decisions
Logistic regression models imply a linear decision boundary

K-Class Logistic Regression
The model is specified in terms of K-1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one)
The choice of denominator is arbitrary, typically the last class:
log[Pr(G=1 | X=x) / Pr(G=K | X=x)] = β_10 + β_1^T x
log[Pr(G=2 | X=x) / Pr(G=K | X=x)] = β_20 + β_2^T x
...
log[Pr(G=K-1 | X=x) / Pr(G=K | X=x)] = β_(K-1)0 + β_(K-1)^T x
K-Class Logistic Regression (Cont'd)
A simple calculation shows that
Pr(G=k | X=x) = exp(β_k0 + β_k^T x) / (1 + Σ_{l=1}^{K-1} exp(β_l0 + β_l^T x)), k = 1, ..., K-1
Pr(G=K | X=x) = 1 / (1 + Σ_{l=1}^{K-1} exp(β_l0 + β_l^T x))
To emphasize the dependence on the entire parameter set θ = {β_10, β_1, ..., β_(K-1)0, β_(K-1)}, we denote the probabilities as Pr(G=k | X=x) = p_k(x; θ)

Fitting Logistic Regression Models
logit P(x) = log[P(x)/(1 - P(x))] = η(x) = β^T x
log-likelihood = Σ_{i=1}^N {y_i log p_i + (1 - y_i) log(1 - p_i)} = Σ_{i=1}^N {y_i β^T x_i - log(1 + e^{β^T x_i})}
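The K-class probabilities above can be computed directly from the K-1 linear scores against the reference class K: exponentiate, then normalize by 1 plus their sum. The parameters below are illustrative placeholders (K = 3, p = 2), not fitted values.

```python
import numpy as np

betas0 = np.array([0.5, -0.5])              # beta_k0 for k = 1..K-1
betas = np.array([[1.0, 0.0], [0.0, 1.0]])  # beta_k, one row per class

def probs(x):
    # exp(beta_k0 + beta_k^T x) for the K-1 non-reference classes
    scores = np.exp(betas0 + betas @ x)
    denom = 1.0 + scores.sum()
    # Classes 1..K-1 followed by the reference class K
    return np.append(scores / denom, 1.0 / denom)

p = probs(np.array([0.0, 0.0]))
print(len(p))                    # K = 3 probabilities
print(np.isclose(p.sum(), 1.0))  # they sum to one by construction
```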
Fitting Logistic Regression Models (Cont'd)
IRLS algorithm (equivalent to the Newton-Raphson procedure):
1. Initialize β
2. Form the linearized response: z_i = x_i^T β + (y_i - p_i)/(p_i(1 - p_i))
3. Form the weights w_i = p_i(1 - p_i)
4. Update β by weighted least squares of z on x with weights w
Steps 2-4 are repeated until convergence
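The IRLS steps above can be sketched for the two-class case on toy data: repeatedly form the linearized response and weights, then solve a weighted least-squares problem for β. The overlapping 1-D classes are assumed for illustration, and a fixed small iteration count stands in for a proper convergence check.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.hstack([np.ones((100, 1)),               # intercept column
               np.vstack([rng.normal(-1, 1, (50, 1)),
                          rng.normal(1, 1, (50, 1))])])
y = np.array([0] * 50 + [1] * 50)

beta = np.zeros(2)                               # step 1: initialize beta
for _ in range(10):                              # a few Newton steps suffice here
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1 - p)                              # step 3: weights
    z = X @ beta + (y - p) / w                   # step 2: linearized response
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)  # step 4: weighted LS

p_hat = 1.0 / (1.0 + np.exp(-X @ beta))
print(((p_hat > 0.5) == y).mean())  # training accuracy on the toy data
```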
Example - Logistic Regression
South African Heart Disease: Coronary risk-factor study (CORIS) baseline survey, carried out in three rural areas
White males b/w 15 and 64
Response: presence or absence of myocardial infarction
Maximum likelihood fit
Logistic Regression or LDA?
LDA: the linearity of the log-odds is a consequence of the Gaussian assumption for the class densities, as well as the assumption of a common covariance matrix
Logistic model: the linear logit is assumed directly
They use the same form for the logit function

Logistic Regression or LDA?
Discriminative vs. informative learning: logistic regression uses the conditional distribution of Y given x to estimate the parameters, while LDA uses the full joint distribution (assuming normality)
If normality holds, LDA is up to 30% more efficient; o/w logistic regression can be more robust
But the methods are similar in practice
Separating Hyperplanes
Perceptrons: compute a linear combination of the input features and return the sign
For x_1, x_2 in L: β^T (x_1 - x_2) = 0, so β* = β/||β|| is the normal to the surface L
For x_0 in L: β^T x_0 = -β_0
The signed distance of any point x to L is given by
β*^T (x - x_0) = (1/||β||)(β^T x + β_0) = f(x)/||f'(x)||
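The signed-distance formula above is easy to check numerically: for f(x) = β^T x + β_0, the signed distance of x to the hyperplane f(x) = 0 is f(x)/||β||. The coefficients below are illustrative.

```python
import math

beta = [3.0, 4.0]     # ||beta|| = 5
beta0 = -5.0

def signed_distance(x):
    # f(x) / ||beta||; positive on beta's side of the plane, negative opposite
    f = sum(b * xi for b, xi in zip(beta, x)) + beta0
    return f / math.hypot(*beta)

print(signed_distance([1.0, 0.5]))  # (3 + 2 - 5)/5 = 0.0: on the plane
print(signed_distance([2.0, 1.0]))  # (6 + 4 - 5)/5 = 1.0: one unit away
```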
Rosenblatt's Perceptron Learning Algorithm
Finds a separating hyperplane by minimizing the distance of misclassified points to the decision boundary
If a response y_i = 1 is misclassified, then x_i^T β + β_0 < 0, and the opposite for a misclassified point with y_i = -1
The goal is to minimize D(β, β_0) = -Σ_{i∈M} y_i (x_i^T β + β_0), where M indexes the misclassified points

Rosenblatt's Perceptron Learning Algorithm (Cont'd)
Stochastic gradient descent: the misclassified observations are visited in some sequence and the parameters updated as (β, β_0) ← (β, β_0) + ρ(y_i x_i, y_i)
ρ is the learning rate; it can be taken as 1 w/o loss of generality
It can be shown that the algorithm converges to a separating hyperplane in a finite number of steps (when the classes are linearly separable)
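The perceptron update above can be sketched in pure Python: cycle through the data and, for each misclassified point (y_i (x_i^T β + β_0) ≤ 0), apply β ← β + y_i x_i, β_0 ← β_0 + y_i with learning rate ρ = 1. The tiny linearly separable data set is assumed for illustration, so the loop terminates.

```python
X = [(2.0, 1.0), (1.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]                            # labels in {-1, +1}

beta, beta0 = [0.0, 0.0], 0.0
for _ in range(100):                          # finite for separable data
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (xi[0] * beta[0] + xi[1] * beta[1] + beta0) <= 0:
            # Misclassified: nudge the hyperplane toward this point
            beta[0] += yi * xi[0]
            beta[1] += yi * xi[1]
            beta0 += yi
            mistakes += 1
    if mistakes == 0:                         # a full clean pass: converged
        break

# Every training point now sits strictly on its own side
print(all(yi * (xi[0] * beta[0] + xi[1] * beta[1] + beta0) > 0
          for xi, yi in zip(X, y)))
```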
Optimal Separating Hyperplanes - Problem

Example - Optimal Separating Hyperplanes