Lecture 4: More classifiers and classes. Logistic regression. AdaBoost. Optimization. Multiple class classification




Lecture 4: More classifiers and classes. C4B Machine Learning, Hilary 2011, A. Zisserman. Topics: logistic regression and loss functions revisited; AdaBoost and loss functions revisited; optimization; multiple class classification.

Logistic Regression

Overview. Logistic regression is actually a classification method. LR introduces an extra non-linearity over a linear classifier, f(x) = w^T x + b, by using a logistic (or sigmoid) function σ(·). The LR classifier is defined as

y_i = +1 if σ(f(x_i)) ≥ 0.5,   y_i = −1 if σ(f(x_i)) < 0.5,   where σ(f(x)) = 1 / (1 + e^(−f(x)))

The logistic or sigmoid function: σ(z) = 1 / (1 + e^(−z)). [Plot of σ(z) for z from −20 to 20.] As z goes from −∞ to ∞, σ(z) goes from 0 to 1: a squashing function. It has a sigmoid shape (i.e. an S-like shape), σ(0) = 0.5, and if z = w^T x + b then dσ(z)/dx |_{z=0} = (1/4) w.
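As a concrete illustration (my own sketch, not part of the lecture notes), the sigmoid and the resulting LR decision rule in NumPy; the weights w and bias b are assumed to have been learned already:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lr_predict(X, w, b):
    """LR classifier: y = +1 if sigma(f(x)) >= 0.5, else y = -1.

    X is an (N, d) array of feature vectors; w (a d-vector) and b
    are assumed to have been learned already.
    """
    p = sigmoid(X @ w + b)            # posterior P(y = 1 | x)
    return np.where(p >= 0.5, 1, -1)
```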

Intuition: why use a sigmoid? Here, choose binary classification to be represented by y_i ∈ {0, 1}, rather than y_i ∈ {−1, 1}. [Plots: a least-squares fit of σ(wx + b) to toy 1D data, compared with a fit of wx + b.] The fit of wx + b is dominated by the more distant points, which causes misclassification; instead, LR regresses the sigmoid to the class data. [Similar plots in 2D: a σ(w_1 x_1 + w_2 x_2 + b) fit (LR) vs a w_1 x_1 + w_2 x_2 + b fit (linear).]

Learning. In logistic regression, fit a sigmoid function to the data {x_i, y_i} by minimizing the classification errors y_i − σ(w^T x_i). [Plot: sigmoid fitted to 1D class data.]

Margin property: a sigmoid favours a larger margin compared with a step classifier. [Plots: sigmoid fits with different margins vs a step classifier fit.]

Probabilistic interpretation. Think of σ(f(x)) as the posterior probability that y = 1, i.e. P(y = 1 | x) = σ(f(x)). Hence, if σ(f(x)) > 0.5 then class y = 1 is selected. Then, after a rearrangement,

f(x) = log [ P(y = 1 | x) / (1 − P(y = 1 | x)) ] = log [ P(y = 1 | x) / P(y = 0 | x) ]

which is the log odds ratio.

Maximum Likelihood Estimation. Assume

p(y = 1 | x; w) = σ(w^T x)
p(y = 0 | x; w) = 1 − σ(w^T x)

and write this more compactly as

p(y | x; w) = σ(w^T x)^y (1 − σ(w^T x))^(1 − y)

Then the likelihood (assuming data independence) is

∏_{i=1}^{N} p(y_i | x_i; w) = ∏_{i=1}^{N} σ(w^T x_i)^{y_i} (1 − σ(w^T x_i))^{(1 − y_i)}

and the negative log likelihood is

L(w) = − Σ_i [ y_i log σ(w^T x_i) + (1 − y_i) log(1 − σ(w^T x_i)) ]
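A minimal NumPy sketch of this negative log likelihood (my own illustration, with labels y_i ∈ {0, 1}; the small epsilon is an assumption added to avoid log(0)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(w, X, y):
    """L(w) = -sum_i [ y_i log s_i + (1 - y_i) log(1 - s_i) ],
    where s_i = sigma(w.x_i) and the labels y_i are in {0, 1}."""
    s = sigmoid(X @ w)
    eps = 1e-12   # guards against log(0) when the sigmoid saturates
    return -np.sum(y * np.log(s + eps) + (1 - y) * np.log(1 - s + eps))
```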

Logistic Regression: loss function. Use the notation y_i ∈ {−1, 1}. Then

P(y = 1 | x) = σ(f(x)) = 1 / (1 + e^(−f(x)))
P(y = −1 | x) = 1 − σ(f(x)) = 1 / (1 + e^(+f(x)))

so in both cases P(y_i | x_i) = 1 / (1 + e^(−y_i f(x_i))). Assuming independence, the likelihood is

∏_{i=1}^{N} 1 / (1 + e^(−y_i f(x_i)))

and the negative log likelihood is

Σ_i log(1 + e^(−y_i f(x_i)))

which defines the loss function.

Logistic Regression: learning. Learning is formulated as the optimization problem

min_{w ∈ R^d}  Σ_i log(1 + e^(−y_i f(x_i)))  +  λ ||w||²
               (loss function)                  (regularization)

For correctly classified points, −y_i f(x_i) is negative and log(1 + e^(−y_i f(x_i))) is near zero. For incorrectly classified points, −y_i f(x_i) is positive and log(1 + e^(−y_i f(x_i))) can be large. Hence the optimization penalizes parameters which lead to such misclassifications.

Comparison of SVM and LR cost functions.

SVM:                  min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||²
Logistic regression:  min_{w ∈ R^d}  Σ_i log(1 + e^(−y_i f(x_i))) + λ ||w||²

Note: both approximate the 0-1 loss; they have very similar asymptotic behaviour; the main differences are smoothness, and non-zero values outside the margin; the SVM gives a sparse solution for α. [Plot: the hinge and logistic losses against y_i f(x_i).]
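To make the comparison concrete, a small sketch (my own illustration, with an arbitrary grid of margins) that evaluates the two losses as a function of the margin y_i f(x_i):

```python
import numpy as np

yf = np.linspace(-3, 3, 7)                 # margins y_i * f(x_i)

hinge = np.maximum(0.0, 1.0 - yf)          # SVM hinge loss
logistic = np.log(1.0 + np.exp(-yf))       # LR loss

# Both are near zero for confidently correct points (large positive margin)
# and grow roughly linearly for badly misclassified points.
for m, h, l in zip(yf, hinge, logistic):
    print(f"margin {m:+.1f}: hinge {h:.3f}, logistic {l:.3f}")
```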

AdaBoost

Overview. AdaBoost is an algorithm for constructing a strong classifier out of a linear combination

Σ_{t=1}^{T} α_t h_t(x)

of simple weak classifiers h_t(x). It provides a method of choosing the weak classifiers and setting the weights α_t.

Terminology: for a data vector x, a weak classifier is h_t(x) ∈ {−1, 1}, and the strong classifier is H(x) = sgn( Σ_{t=1}^{T} α_t h_t(x) ).

Example: combination of linear classifiers h_t(x) ∈ {−1, 1}. [Figure: weak classifiers h_1(x), h_2(x), h_3(x) and the resulting strong classifier H(x).]

H(x) = sgn( α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) )

Note, this linear combination is not a simple majority vote (it would be if all the α_t were equal). The α_t need to be computed as well as selecting the weak classifiers.

AdaBoost algorithm: building a strong classifier. Start with equal weights on each x_i, and a set of weak classifiers. For t = 1, ..., T:

1. Select the weak classifier h_t(x) with minimum error ε_t = Σ_i ω_i [h_t(x_i) ≠ y_i], where the ω_i are weights.
2. Set α_t = (1/2) ln((1 − ε_t) / ε_t).
3. Reweight the examples (boosting) to give misclassified examples more weight: ω_{t+1,i} = ω_{t,i} e^(−α_t y_i h_t(x_i)).
4. Add the weak classifier with weight α_t.

The strong classifier is H(x) = sgn( Σ_{t=1}^{T} α_t h_t(x) ).

Example: start with equal weights on each data point. [Figure: weak classifier 1 chosen with error ε_j = Σ_i ω_i [h_j(x_i) ≠ y_i]; weights increased on misclassified points; weak classifier 2; weak classifier 3; the final classifier is a linear combination of the weak classifiers.]

The AdaBoost algorithm (Freund & Schapire, 1995). Given example data (x_1, y_1), ..., (x_n, y_n), where y_i = −1, 1 for negative and positive examples respectively. Initialize weights ω_{1,i} = 1/(2m), 1/(2l) for y_i = −1, 1 respectively, where m and l are the number of negatives and positives respectively. For t = 1, ..., T:

1. Normalize the weights, ω_{t,i} ← ω_{t,i} / Σ_{j=1}^{n} ω_{t,j}, so that ω_t is a probability distribution.
2. For each j, train a weak classifier h_j with error evaluated with respect to ω_{t,i}: ε_j = Σ_i ω_{t,i} [h_j(x_i) ≠ y_i].
3. Choose the classifier h_t with the lowest error ε_t.
4. Set α_t = (1/2) ln((1 − ε_t) / ε_t).
5. Update the weights: ω_{t+1,i} = ω_{t,i} e^(−α_t y_i h_t(x_i)).

The final strong classifier is H(x) = sgn( Σ_{t=1}^{T} α_t h_t(x) ).

Why does it work? The AdaBoost algorithm carries out a greedy optimization of a loss function:

AdaBoost:             min_{α,h} Σ_i e^(−y_i H(x_i))
SVM loss function:    max(0, 1 − y_i f(x_i))   (hinge loss)
Logistic regression:  log(1 + e^(−y_i f(x_i)))

[Plot: the exponential, hinge, and logistic losses against y_i f(x_i).]
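Putting the numbered steps above into code: a compact sketch (my own illustration, using exhaustively searched axis-aligned decision stumps as the weak classifiers and uniform initial weights; any other weak learner would do):

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Axis-aligned decision stump: returns labels in {-1, +1}."""
    return np.where(polarity * (X[:, feature] - threshold) > 0, 1, -1)

def fit_stump(X, y, weights):
    """Steps 2-3: pick the stump with the lowest weighted error."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = np.sum(weights * (pred != y))
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost(X, y, T=20):
    """Labels y in {-1, +1}. Returns a list of weighted weak classifiers."""
    n = len(y)
    weights = np.full(n, 1.0 / n)              # uniform start (cf. 1/(2m), 1/(2l))
    ensemble = []
    for _ in range(T):
        weights = weights / weights.sum()      # step 1: normalize
        err, feature, threshold, polarity = fit_stump(X, y, weights)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero / log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # step 4
        pred = stump_predict(X, feature, threshold, polarity)
        weights = weights * np.exp(-alpha * y * pred)   # step 5
        ensemble.append((alpha, feature, threshold, polarity))
    return ensemble

def strong_classify(X, ensemble):
    """H(x) = sgn( sum_t alpha_t h_t(x) )."""
    score = sum(a * stump_predict(X, f, th, p) for a, f, th, p in ensemble)
    return np.where(score >= 0, 1, -1)
```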

Sketch derivation (non-examinable). The objective function used by AdaBoost is

J(H) = Σ_i e^(−y_i H(x_i))

For a correctly classified point the penalty is exp(−|H|) and for an incorrectly classified point the penalty is exp(+|H|). The AdaBoost algorithm incrementally decreases the cost by adding simple functions to

H(x) = Σ_t α_t h_t(x)

Suppose that we have a function B and we propose to add the function α h(x), where the scalar α is to be determined and h(x) is some function that takes values in +1 or −1 only. The new function is B(x) + α h(x), and the new cost is

J(B + αh) = Σ_i e^(−y_i B(x_i)) e^(−α y_i h(x_i))

Differentiating with respect to α and setting the result to zero gives

−e^(−α) Σ_{y_i = h(x_i)} e^(−y_i B(x_i)) + e^(+α) Σ_{y_i ≠ h(x_i)} e^(−y_i B(x_i)) = 0

Rearranging, the optimal value of α is therefore determined to be

α = (1/2) log [ Σ_{y_i = h(x_i)} e^(−y_i B(x_i)) / Σ_{y_i ≠ h(x_i)} e^(−y_i B(x_i)) ]

The classification error is defined as

ε = Σ_i ω_i [h(x_i) ≠ y_i],   where   ω_i = e^(−y_i B(x_i)) / Σ_j e^(−y_j B(x_j))

Then it can be shown that

α = (1/2) log((1 − ε) / ε)

The update from B to H therefore involves evaluating the weighted performance ε of the weak classifier h (with the weights ω_i given above). If the current function B is B(x) = 0, then the weights will be uniform. This is a common starting point for the minimization. As a numerical convenience, note that at the next round of boosting the required weights are obtained by multiplying the old weights by exp(−α y_i h(x_i)) and then normalizing. This gives the update formula

ω_{t+1,i} = (1/Z_t) ω_{t,i} e^(−α_t y_i h_t(x_i))

where Z_t is a normalizing factor.

Choosing h: the function h is not chosen arbitrarily but is chosen to give good performance (a low value of ε) on the training data weighted by ω_i.

Optimization. We have seen many cost functions, e.g.

SVM:                  min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||²
Logistic regression:  min_{w ∈ R^d}  Σ_i log(1 + e^(−y_i f(x_i))) + λ ||w||²

[Figure: a function with both a local minimum and a global minimum.] Do these have a unique solution? Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)? If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case).

Convex functions. A function is convex if the line segment joining any two points on its graph lies on or above the graph. [Figure: examples of convex and non-convex functions.] A non-negative sum of convex functions is convex.

Logistic regression: min_{w ∈ R^d} Σ_i log(1 + e^(−y_i f(x_i))) + λ ||w||² is convex (a non-negative sum of convex terms).

SVM: min_{w ∈ R^d} C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||² is convex (a non-negative sum of convex terms).

Gradient (or steepest) descent algorithms. To minimize a cost function C(w), use the iterative update

w_{t+1} ← w_t − η_t ∇_w C(w_t)

where η is the learning rate. In our case the loss function is a sum over the training data. For example, for LR

min_{w ∈ R^d} C(w) = Σ_{i=1}^{N} log(1 + e^(−y_i f(x_i))) + λ ||w||² = Σ_i L(x_i, y_i; w) + λ ||w||²

This means that one iterative update consists of a pass through the training data with an update for each point:

w_{t+1} ← w_t − η_t ( ∇_w L(x_i, y_i; w_t) + 2λ w_t )

The advantage is that for large amounts of data, this can be carried out point by point.

Gradient descent algorithm for LR. Minimizing L(w) using gradient descent gives the update rule [exercise]

w ← w + η (y_i − σ(w^T x_i)) x_i,   where y_i ∈ {0, 1}

Note: this is similar, but not identical, to the perceptron update rule,

w ← w − η sgn(w^T x_i) x_i   (applied to misclassified points)

There is a unique solution for w; in practice more efficient Newton methods are used to minimize L; and there can be problems with ||w|| becoming infinite for linearly separable data.
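A per-point sketch of this LR update (my own illustration; the learning rate, number of passes, and the absence of regularization are arbitrary choices, and a bias can be handled by appending a constant 1 feature):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_gradient_descent(X, y, eta=0.1, epochs=100):
    """Point-by-point gradient descent for LR with labels y in {0, 1}.
    Update: w <- w + eta * (y_i - sigma(w.x_i)) * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w += eta * (yi - sigmoid(w @ xi)) * xi
    return w
```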

Gradient descent algorithm for SVM. First, rewrite the optimization problem as an average:

min_w C(w) = (λ/2) ||w||² + (1/N) Σ_i max(0, 1 − y_i f(x_i))
           = (1/N) Σ_i ( (λ/2) ||w||² + max(0, 1 − y_i f(x_i)) )

(with λ = 2/(NC), up to an overall scale of the problem) and f(x) = w^T x + b. Because the hinge loss is not differentiable, a sub-gradient is computed.

Sub-gradient for the hinge loss. For L(x_i, y_i; w) = max(0, 1 − y_i f(x_i)) with f(x_i) = w^T x_i + b:

∂L/∂w = −y_i x_i   if y_i f(x_i) < 1
∂L/∂w = 0          otherwise

Sub-gradient descent algorithm for SVM. With

C(w) = (1/N) Σ_i ( (λ/2) ||w||² + L(x_i, y_i; w) )

the iterative update is

w_{t+1} ← w_t − η ∇_{w_t} C(w_t) = w_t − (η/N) Σ_i ( λ w_t + ∇_w L(x_i, y_i; w_t) )

where η is the learning rate. Each iteration t then involves cycling through the training data with the updates:

w_{t+1} ← (1 − ηλ) w_t + η y_i x_i   if y_i (w^T x_i + b) < 1
w_{t+1} ← (1 − ηλ) w_t               otherwise
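A sketch of this cycle of updates (my own illustration; η and λ are placeholder values, and the bias b is assumed to be folded into w via a constant 1 feature):

```python
import numpy as np

def svm_subgradient_descent(X, y, lam=0.01, eta=0.1, epochs=50):
    """Sub-gradient descent for the regularized hinge loss, labels y in {-1, +1}.

    Per-point update:
      w <- (1 - eta*lam) w + eta * y_i * x_i   if y_i * (w.x_i) < 1
      w <- (1 - eta*lam) w                     otherwise
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) < 1:
                w = (1 - eta * lam) * w + eta * yi * xi
            else:
                w = (1 - eta * lam) * w
    return w
```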

Multi-class Classification. What we would like: assign an input vector x to one of K classes C_k. Goal: a decision rule that divides input space into K decision regions separated by decision boundaries.

Reminder: K Nearest Neighbour (K-NN) classifier. Algorithm: for each test point, x, to be classified, find the K nearest samples in the training data, and classify the point, x, according to the majority vote of their class labels (e.g. K = 3). This is naturally applicable to the multi-class case.
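A minimal sketch of this rule (my own illustration, using Euclidean distance):

```python
import numpy as np

def knn_classify(X_train, y_train, x, K=3):
    """Classify x by the majority vote of its K nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:K]               # indices of the K nearest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```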

Build from binary classifiers. Learn K two-class "1 vs the rest" classifiers f_k(x). [Figure: classes C_1, C_2, C_3 with the classifiers 1 vs 2 & 3, 2 vs 1 & 3, 3 vs 1 & 2; on their own, the binary decisions leave ambiguous regions.] Classification: choose the class with the most positive score, max_k f_k(x).

Application: handwritten digit recognition. Feature vectors: each image is 28 x 28 pixels; rearrange it as a 784-vector x. Training: learn 10 two-class "1 vs the rest" SVM classifiers f_k(x), k = 1, ..., 10. Classification: choose the class with the most positive score, f(x) = max_k f_k(x).
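A sketch of this one-vs-rest scheme (my own illustration; train_binary stands for any two-class learner that returns a scoring function f_k, e.g. an SVM or LR trained as above):

```python
import numpy as np

def one_vs_rest_train(X, y, train_binary, classes):
    """Train one 'class k vs the rest' scorer f_k per class."""
    scorers = {}
    for k in classes:
        yk = np.where(y == k, 1, -1)       # relabel: class k -> +1, the rest -> -1
        scorers[k] = train_binary(X, yk)   # returns a function f_k(X) -> scores
    return scorers

def one_vs_rest_predict(X, scorers):
    """Assign each point to the class whose scorer gives the most positive score."""
    classes = list(scorers)
    scores = np.column_stack([scorers[k](X) for k in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```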

Example: a hand-drawn digit 5. [Figure: the input image, the scores f_k(x) for the 10 classifiers, and the resulting classification as a 5.]

Background reading and more. Other multiple-class classifiers (not covered here): neural networks, random forests. Bishop, chapters 4.1 - 4.1.3 and 4.3. Hastie et al., chapters 10.1 - 10.6. More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml