# Logistic Regression. Steve Kroon


Course notes sections:

Disclaimer: these notes do not explicitly indicate whether values are vectors or scalars, but expect the reader to discern this from the context.

## Scenario: supervised classification

We are given training data $\{(x_i, y_i) : i = 1, \dots, n\}$ from some (mixture) distribution, where $y_i$ indicates class membership. Aim: given a new $x$, predict the corresponding $y$. This situation is often not deterministic (e.g. given height and weight info, predicting gender).

## Using class membership probabilities

Given prior probabilities for each class, and a generative model for each class, we can use the maximum likelihood estimate. This is the class the point has the highest probability of being in. To do this, we only need to know which class has the highest $P(y|x)$ at each $x$. More generally, we might not want to pick the class with the highest probability (e.g. spam classification, cancer diagnosis, extreme sports). Deciding when this is the case, and what to do then, is the subject of decision theory. The theory makes use of a so-called loss function. The key insight is that the actual probabilities of each class are useful beyond just the maximum. However, we still only need to know $P(y|x)$ at each $x$.

## Generative vs discriminative models

We can get class probabilities if we have generative models. (A generative model is a full specification of $P(x, y)$.) The key issue: the more parameters you have to estimate from data, the less sure you are of each estimate.

Since we usually don't actually know the model for each class, we must estimate it from the class data. Two phases: certain assumptions/prior knowledge, such as normality; followed by estimating parameters from the data. If we need the model, there is no problem with this approach. However, if we only want to classify, we don't need to know the marginal distribution $P(x)$, even though generative models provide this information. Discriminative models are specifications of the conditional distribution $P(y|x)$. Since generative models usually have more parameters than discriminative ones, discriminative models often outperform generative models for classification. Note that generative models can be used for tasks discriminative models can't perform.

## What should a discriminative model look like?

We don't know a model for $P(y|x)$, and have no intuition yet. To develop an intuition, let us look at what $P(y|x)$ looks like when we do know the models generating the data. Assume we have two classes $C_1$ and $C_2$, with prior probabilities $P(C_1)$ and $P(C_2)$. Then

$$P(C_1|x) = \frac{P(C_1, x)}{P(x)} = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)} = \frac{1}{1 + \frac{P(x|C_2)P(C_2)}{P(x|C_1)P(C_1)}} = \frac{1}{1 + \exp(-a(x))} = \sigma(a(x))$$

where we conveniently define

$$a(x) = \ln \frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$$

and the logistic function $\sigma(y) = \frac{1}{1 + \exp(-y)}$. Note that $\sigma(y)$ lies in $(0, 1)$, and that $a(x)$ is the so-called log-odds for class membership of $x$. (You should see a similar expression turning up in assignment 2.) Also, it is worth verifying that the derivative of $\sigma(y)$ is $\sigma(y)(1 - \sigma(y))$.

Next, we will assume the classes each have Gaussian distributions, with means $\mu_1$ and $\mu_2$ and covariance matrices $\Sigma_1$ and $\Sigma_2$. What is $a(x)$ then?

$$a(x) = \ln P(x|C_1) - \ln P(x|C_2) + \ln \frac{P(C_1)}{P(C_2)} = -\tfrac{1}{2} \ln |2\pi\Sigma_1| - \tfrac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + \tfrac{1}{2} \ln |2\pi\Sigma_2| + \tfrac{1}{2}(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) + \ln \frac{P(C_1)}{P(C_2)}$$
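The claimed derivative $\sigma'(y) = \sigma(y)(1 - \sigma(y))$ is easy to confirm numerically; a minimal sketch (the function name and test point are illustrative, not from the notes):

```python
import numpy as np

def sigma(y):
    """Logistic function: sigma(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + np.exp(-y))

# Check the identity sigma'(y) = sigma(y) * (1 - sigma(y)) with a
# central difference at an arbitrary point.
y, h = 0.7, 1e-6
numeric = (sigma(y + h) - sigma(y - h)) / (2 * h)
analytic = sigma(y) * (1 - sigma(y))
print(abs(numeric - analytic) < 1e-8)
```

The central difference agrees with the closed form to well below the tolerance shown.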

If we further assume that $\Sigma_1 = \Sigma_2 = \Sigma$, we get some cancellation, yielding

$$\ln \frac{P(C_1)}{P(C_2)} - \tfrac{1}{2}\left[(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) - (x - \mu_2)^T \Sigma^{-1} (x - \mu_2)\right]$$

Multiplying out, we get:

$$-\tfrac{1}{2}\left[(\mu_2 - \mu_1)^T \Sigma^{-1} x + x^T \Sigma^{-1} (\mu_2 - \mu_1) + (\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2)\right] + \ln \frac{P(C_1)}{P(C_2)}$$

$$= [\Sigma^{-1}(\mu_1 - \mu_2)]^T x + \left[-\tfrac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{P(C_1)}{P(C_2)}\right] = w^T x + w_0$$

where these equations define $w$ and $w_0$. Thus, we find that for 2 classes with equal covariances, but different means, the log-odds is a linear function of the observations. It follows that in this case, if we used the data to directly estimate the means and covariance matrix, we would estimate $2d + d(d+1)/2$ parameters, while if we could directly estimate $(w, w_0)$, we would only be estimating $d + 1$ parameters.

## The multivariate normal case

Let us now consider the same problem, but with $k$ classes. Then

$$P(C_i|x) = P(x|C_i)P(C_i) \Big/ \sum_j P(x|C_j)P(C_j)$$

We could go the same route as before (dividing the numerator and denominator by $P(x|C_i)P(C_i)$), but that leads to complications with more than 2 classes. Instead, we shall write $a_i(x) = \ln[P(x|C_i)P(C_i)]$, so that $P(C_i|x) = \exp(a_i(x)) / \sum_j \exp(a_j(x))$.[^2] Again assuming Gaussians with shared covariance, we eventually conclude that $a_i(x) = w_i^T x + w_{i0}$, where

$$w_i = \Sigma^{-1}\mu_i \quad \text{and} \quad w_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln P(C_i)$$

Comparing the number of parameters, we have $kd + d(d+1)/2$ for a generative approach versus $k(d+1)$ for the discriminative approach. If we restrict ourselves to using a diagonal covariance matrix in the generative approach, à la Naive Bayes, the number of parameters is reduced to $k(d+1)$. But now there is a higher chance the model is wrong.

[^1]: If we assume different covariance matrices, we get a quadratic function of the observations.
[^2]: Thus $a(x)$ in the two-class case is $a_1(x) - a_2(x)$.

## Finding w

These examples motivate modelling $P(y|x)$ by a logistic function of the log-odds of the observation, which we model using linear functions (for 2 classes);[^1] or a softmax function, using linear functions of the observations as exponents

(for multi-class problems). More generally, we could use quadratic functions, or, even more generally, a linear function of some transformation of the observation. The extension to transformations of the data is in the textbook; we will stick to the linear case here. However, we add a 1 to the feature vector for each observation to get rid of the inconvenient $w_0$.

Let us try to select $w$ using maximum likelihood on a training set. Thus, we try to identify which selection of $w$ was most likely to generate the labels in the training set! We begin by writing down the likelihood of the training data as a function of $w$ (binary case in notes).[^3] We have $P(X, Y|w) = P(Y|X, w)P(X|w)$, but since $P(X|w) = P(X)$, this equals

$$P(Y|X, w)P(X) = P(X) \prod_i P(y_i|x_i, w)$$

This factorization assumes that the label of $x_i$ is conditionally independent of the other observations and labels, given the observation $x_i$.[^4] For mathematical convenience, define $t_{ij} = 1$ if $y_i = C_j$, and $0$ for the other $k - 1$ classes.[^5] Then the likelihood becomes

$$P(X) \prod_i \prod_j P(y_i = C_j|x_i, w)^{t_{ij}}$$

In order to maximize this, we minimize the negative log-likelihood w.r.t. $w$. This equals

$$-\ln P(X) - \sum_i \sum_j t_{ij} \ln P(y_i = C_j|x_i, w)$$

Writing $P(y_i = C_j|x_i, w) = \frac{\exp(a_j(x_i))}{\sum_r \exp(a_r(x_i))}$, we get

$$-\ln P(X) - \sum_i \sum_j t_{ij}\left[a_j(x_i) - \ln \sum_r \exp(a_r(x_i))\right]$$

where the $a_r$ are linear functions of $x_i$: $a_r(x_i) = w_r^T x_i$. To minimize, we take the gradient w.r.t. $w_v$:[^6]

$$\nabla_{w_v} = -\sum_i \left[t_{iv} x_i - \frac{\exp(a_v(x_i))}{\sum_r \exp(a_r(x_i))} x_i\right] = \sum_i \left[\frac{\exp(a_v(x_i))}{\sum_r \exp(a_r(x_i))} - t_{iv}\right] x_i$$

For an optimum, all $k$ of these gradients must simultaneously be zero. This is a non-linear system of $k(d+1)$ equations in $k(d+1)$ unknowns, so we will make use of a numerical optimization technique, Newton-Raphson optimization.

[^3]: Here $X$ is the observation matrix and $Y$ the vector of labels.
[^4]: A common setting for supervised learning is assuming IID data, which satisfies this. This assumption keeps things simple, even if often not quite true.
[^5]: This is known as a 1-of-$k$ encoding.
[^6]: Note that $-\ln P(X)$ is constant w.r.t. $w$, allowing this term to be removed from the optimization problem. With generative models, this is not the case.
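As a sanity check, the closed-form gradient above can be compared against a finite-difference approximation of the negative log-likelihood. Everything below (toy data, function names) is an illustrative sketch, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 3, 3
X = rng.normal(size=(n, d))           # rows x_i (a 1-feature could be appended)
T = np.eye(k)[rng.integers(0, k, n)]  # 1-of-k targets t_ij
W = rng.normal(size=(k, d))           # weight vectors w_r as rows

def nll(W):
    """Negative log-likelihood, ignoring the constant -ln P(X)."""
    A = X @ W.T                                   # a_r(x_i)
    logZ = np.log(np.exp(A).sum(axis=1))
    return -(T * (A - logZ[:, None])).sum()

def grad(W):
    """Closed-form gradient: sum_i (y_iv - t_iv) x_i, one row per class v."""
    A = X @ W.T
    Y = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)
    return (Y - T).T @ X

# central-difference check of a single weight entry
G = grad(W)
h = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[1, 2] += h
Wm[1, 2] -= h
fd = (nll(Wp) - nll(Wm)) / (2 * h)
print(abs(fd - G[1, 2]) < 1e-6)
```

The finite difference matches the analytic entry, which is a quick way to catch sign errors in derivations like this one.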

## Newton-Raphson for multi-class logistic regression

Recall that we wanted to find the elements of $w$ minimizing the negative log-likelihood

$$l(w) = -\ln P(X) - \sum_i \sum_j t_{ij}\left[a_j(x_i) - \ln \sum_r \exp(a_r(x_i))\right]$$

with $a_r(x_i) = w_r^T x_i$. Setting the gradient $\nabla l(w)$ to zero directly yielded a large non-linear system of equations, which we could not solve analytically. To apply Newton-Raphson, we need not only $l$ and $\nabla l$, but also the Hessian $H_l$, so we must do further differentiation. To simplify this, let us define $y_v(b) = \exp(b_v) / \sum_r \exp(b_r)$, with $y_{iv} = y_v(a(x_i))$. In this notation, the gradient of the negative log-likelihood (w.r.t. $w_v$) turns out to simply be $\sum_i [y_{iv} - t_{iv}] x_i$. Another advantage of this definition is to simplify the calculus. Let us first calculate $\nabla_b y_v$. We have

$$\frac{\partial y_v}{\partial b_v} = \frac{\exp(b_v)\left(\sum_r \exp(b_r) - \exp(b_v)\right)}{\left(\sum_r \exp(b_r)\right)^2} = y_v(b)(1 - y_v(b))$$

similar to the derivative of the logistic function, while for $j \neq v$ we have

$$\frac{\partial y_v}{\partial b_j} = \frac{-\exp(b_v)\exp(b_j)}{\left(\sum_r \exp(b_r)\right)^2} = -y_v(b)\,y_j(b)$$

Note that these results can be pooled as $\frac{\partial y_v}{\partial b_j} = y_v(b)(I_{vj} - y_j(b))$, so that we do not need to handle the case $j = v$ separately. Using this, we can find the entries of the Hessian as follows (where $x_i^{(k)}$ denotes the $k$-th component of $x_i$):[^7]

$$\frac{\partial^2 l}{\partial w_{v_1,d_1} \partial w_{v_2,d_2}} = \frac{\partial}{\partial w_{v_1,d_1}} \sum_i [y_{iv_2} - t_{iv_2}] x_i^{(d_2)} = \sum_i x_i^{(d_2)} \frac{\partial y_{v_2}(a(x_i))}{\partial w_{v_1,d_1}} = \sum_i x_i^{(d_2)} \sum_j y_{v_2}(a(x_i))(I_{v_2 j} - y_j(a(x_i))) \frac{\partial a_j(x_i)}{\partial w_{v_1,d_1}}$$

where the last step follows from the chain rule. Now, $\frac{\partial a_j(x_i)}{\partial w_{v_1,d_1}}$ is zero for $j \neq v_1$, and $x_i^{(d_1)}$ for $j = v_1$, so that the above expression equals

$$\sum_i y_{v_2}(a(x_i))(I_{v_2 v_1} - y_{v_1}(a(x_i)))\, x_i^{(d_2)} x_i^{(d_1)}$$

so that we can write the block of the Hessian corresponding to $w_{v_1}$ and $w_{v_2}$ as

$$\sum_i y_{iv_2}(I_{v_2 v_1} - y_{iv_1})\, x_i x_i^T$$

Now that we can calculate the Hessian and the gradient, we can start with an initial guess (for example, setting all the $w$'s to zero initially), and then apply Newton-Raphson updates. We leave showing that the Hessian is positive semi-definite to the interested reader.

[^7]: Here $w_{v,d_j}$ refers to the $d_j$-th component of $w_v$.
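A few Newton-Raphson steps using these Hessian blocks can be sketched as follows. The toy data and all names are assumptions for illustration; a tiny ridge is added to the Hessian because, without pinning one class's weights, the shift symmetry of the softmax makes the exact Hessian singular:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 30, 3, 3
X = rng.normal(size=(n, d))            # observations (1-feature could be appended)
T = np.eye(k)[rng.integers(0, k, n)]   # 1-of-k targets t_ij

def probs(w):
    """Softmax class probabilities y_iv for stacked weights w."""
    A = X @ w.reshape(k, d).T
    E = np.exp(A - A.max(axis=1, keepdims=True))   # stabilized exponentials
    return E / E.sum(axis=1, keepdims=True)

def nll(w):
    """Negative log-likelihood, up to the constant -ln P(X)."""
    return -(T * np.log(probs(w))).sum()

w = np.zeros(k * d)                    # initial guess: all weights zero
start = nll(w)
for _ in range(10):
    Y = probs(w)
    g = ((Y - T).T @ X).ravel()        # stacked gradient: sum_i (y_iv - t_iv) x_i
    H = np.zeros((k * d, k * d))
    for v1 in range(k):
        for v2 in range(k):
            # Hessian block: sum_i y_iv2 (I_{v2 v1} - y_iv1) x_i x_i^T
            c = Y[:, v2] * ((v1 == v2) - Y[:, v1])
            H[v2*d:(v2+1)*d, v1*d:(v1+1)*d] = (X * c[:, None]).T @ X
    w -= np.linalg.solve(H + 1e-8 * np.eye(k * d), g)   # Newton-Raphson update

print(nll(w) < start)
```

The gradient has no component along the singular (shift) direction, so the tiny ridge changes the step negligibly while keeping the linear solve well posed.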

## Two-class logistic regression

It is worth noticing that the solution to the multi-class problem above is not unique: adding a constant to any component of all the $w$ vectors yields the same solution. Thus, we can assume that the solution vector for one of the classes is the zero vector. This means that we only need to find the $w$ vectors for $k - 1$ classes, rather than $k$. In the binary case, this simplifies things considerably. It is left to the reader to verify that, after the adjustment mentioned in the previous paragraph (setting $w_1 = 0$), the softmax function for the class probability of the first class reduces to the logistic function discussed earlier, where $w$ now represents the adjusted weight $w_2$.[^8]

The negative log-likelihood of the observations, as obtained earlier, is

$$l = -\ln P(X) - \sum_i \sum_j t_{ij} \ln P(y_i = C_j|x_i, w)$$

In this case, we have $P(y_i = C_1|x_i, w) = \sigma(a(x_i))$ and $P(y_i = C_2|x_i, w) = 1 - \sigma(a(x_i))$, with $a(x) = w^T x$ (again, an extra 1 feature has been added to the observations to cater for the bias term). For the binary case, it is more convenient to replace the 1-of-$k$ encoding $t_{ij}$ with a binary encoding: $t_i = 1$ if $x_i$ is in class 1, and $0$ otherwise. Then, the negative log-likelihood becomes

$$-\ln P(X) - \sum_i \left(t_i \ln \sigma(a(x_i)) + (1 - t_i) \ln(1 - \sigma(a(x_i)))\right)$$

Next we derive the gradient and Hessian of $l$:

$$\frac{\partial l}{\partial w_d} = -\sum_i t_i [\sigma(a(x_i))]^{-1} \sigma(a(x_i))(1 - \sigma(a(x_i)))\, x_i^{(d)} + \sum_i (1 - t_i)[1 - \sigma(a(x_i))]^{-1} \sigma(a(x_i))(1 - \sigma(a(x_i)))\, x_i^{(d)}$$

$$= -\sum_i \left[t_i(1 - \sigma(a(x_i))) - (1 - t_i)\sigma(a(x_i))\right] x_i^{(d)} = \sum_i (\sigma(a(x_i)) - t_i)\, x_i^{(d)}$$

so that $\nabla_w l = \sum_i (\sigma(a(x_i)) - t_i)\, x_i$. Next,

$$\frac{\partial^2 l}{\partial w_{d_1} \partial w_{d_2}} = \sum_i x_i^{(d_2)} \sigma(a(x_i))[1 - \sigma(a(x_i))]\, x_i^{(d_1)}$$

so that $H_l(w) = \sum_i \sigma(a(x_i))[1 - \sigma(a(x_i))]\, x_i x_i^T$. In order to ensure that our optimization finds a minimum, we show that the Hessian matrix is positive semi-definite. First note that for any $i$, $x_i x_i^T$ is positive semi-definite, since for any $u$, $u^T x_i x_i^T u = (x_i^T u)^T (x_i^T u) = (x_i^T u)^2 \geq 0$.

[^8]: It should also be easy to verify that the new weight vector $w$ equals the difference of the weight vectors obtained using the multi-class approach, i.e. $w_2 - w_1$.
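The binary gradient and Hessian can be transcribed almost directly, together with a numerical probe of positive semi-definiteness; the toy data and names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 25, 4
X = rng.normal(size=(n, d))          # observations (1-feature could be appended)
t = rng.integers(0, 2, n)            # binary targets t_i
w = rng.normal(size=d)

def sigma(y):
    return 1.0 / (1.0 + np.exp(-y))

s = sigma(X @ w)                         # sigma(a(x_i)) for each point
grad = X.T @ (s - t)                     # sum_i (sigma(a(x_i)) - t_i) x_i
H = (X * (s * (1 - s))[:, None]).T @ X   # sum_i sigma(1 - sigma) x_i x_i^T

# positive semi-definiteness probe: u^T H u >= 0 for random u
ok = all(u @ H @ u >= 0 for u in rng.normal(size=(100, d)))
print(ok)
```

Each quadratic form $u^T H u$ is a sum of the non-negative terms $\sigma(1-\sigma)(x_i^T u)^2$, which is exactly the argument made in the text.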

Next, we note that since the range of $\sigma$ is $(0, 1)$, the coefficient of $x_i x_i^T$ is always positive, so that the Hessian is a sum of positive semi-definite matrices, and is thus positive semi-definite.[^9] Finally, we can apply Newton-Raphson optimization[^10] to the negative log-likelihood to obtain the weight vector $w$.

[^9]: Under fairly general conditions, the Hessian can in fact be shown to be positive definite, by noting that it is a weighted sum of the rank-1 matrices $x_i x_i^T$. However, we do not go into that here, since the regularization we apply later will easily lead to a positive definite Hessian in any case.
[^10]: Because each quadratic approximation step in the Newton-Raphson optimization is effectively a weighted least squares fit to the data (a common approach for estimating parameters in statistics), this procedure is sometimes called iteratively reweighted least squares (IRLS).

## A complication with logistic regression: overfitting

Suppose that a weight vector $w$ leads to a perfect classification of the data set in the binary case, using classification by the class with the highest probability, and where we consider all classes equally likely. In such a case, we say the data set is linearly separable. For this classification, the classification boundary (or decision surface) lies where $P(C_1|x) = 0.5$. Since $P(C_1|x) = \frac{1}{1 + \exp(-w^T x)}$, we must have that $w^T x = 0$, i.e. the decision surface is a line passing through the origin. Now, consider what happens if we rather classify with $w' = 2w$. In such a case, the decision boundary and all the point classifications remain the same. However, the likelihood associated with each point now becomes greater, yielding a higher-likelihood solution than the original $w$. We can continue doubling $w$ repeatedly in this way, leading in the limit to a situation where the predicted probabilities become step functions at the decision boundary. (To clarify this, draw the logistic function as its argument increases.) Although the maths is different, this behaviour manifests to varying degrees even when there are multiple classes, the prior probabilities for the classes are unequal, and the classes are not linearly separable. This is a common problem with many machine learning approaches that estimate parameters by optimization, and is known as overfitting. To see why this is a problem, note that your model is now very confident about its classification of future points close to the decision boundary, even though it has never observed data there! Essentially, the only guide available to the algorithm is the data it has been given, and the linearity constraint on the decision surface. However, we usually do not expect the probabilities of the classes to change abruptly between 0 and 1.

Thus, we must find some way of taking this into account in our calculations. There are two major approaches to doing this, which are closely related: first, one can use a prior distribution on the weight vector $w$, which is then updated by the likelihood calculation to obtain a maximum a posteriori (MAP) estimate; second, one can penalize choices of $w$ leading to undesirable behaviour in our classifier by adding an extra term to the likelihood function: this is known as regularization. The relationship between these approaches is that the size of the penalty, or regularization, term should depend on how likely you think certain values of $w$ are in advance. Thus, the choice of regularization function effectively encodes a prior distribution on the parameters under investigation into the likelihood.
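The doubling argument for separable data can be seen numerically: on a small linearly separable set, the log-likelihood strictly increases every time $w$ is doubled. The data below is an illustrative sketch:

```python
import numpy as np

def sigma(y):
    return 1.0 / (1.0 + np.exp(-y))

# a trivially separable 1-d data set: class 1 for x > 0, class 2 for x < 0
x = np.array([-2.0, -1.0, 1.0, 2.0])
t = np.array([0, 0, 1, 1])
w = 1.0                                   # already classifies everything perfectly

def log_likelihood(w):
    p = sigma(w * x)
    return np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

lls = [log_likelihood(w * 2**j) for j in range(5)]
print(all(lls[j] < lls[j + 1] for j in range(4)))  # doubling w always helps
```

Every point's margin $w^T x$ is positive for its true class, so scaling $w$ up pushes each predicted probability toward 1 and the likelihood toward its supremum, which is never attained: maximum likelihood diverges on separable data.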

The optimum of this regularized likelihood function is then actually the MAP estimate for the corresponding prior.[^11]

## Priors and regularization

Let us assume a normal prior distribution on $w$. We would prefer smaller $w$, so let us set the mean of the prior to 0. Also, we have no reason to expect that certain components of $w$ must be larger than others, or that they should be correlated, so let us assume a diagonal covariance matrix, with equal entries on the diagonal (i.e. $\Sigma = \lambda I$ for some $\lambda > 0$). Given the data set $(X, Y)$, what is the posterior distribution for $w$? We have

$$p(w|X, Y) = \frac{p(w)\,p(X, Y|w)}{p(X, Y)}$$

where the denominator does not depend on $w$. We can maximize this by minimizing the negative logarithm of the numerator,

$$-\log p(w) - \log p(X, Y|w) = \frac{w^T w}{2\lambda} + l(w) + C$$

for a constant $C$, and in this formulation we see that the prior distribution on $w$ has led to the regularization penalty $\frac{1}{\lambda} J(w) = \frac{w^T w}{2\lambda}$. This particular form is very convenient from a calculus point of view, since $\nabla J(w) = w$, and thus $H_J(w) = I$. The choice of $\lambda$ determines how strong one wishes the penalty term to be, and must usually be determined empirically. Applying regularization in this context is a straightforward modification of the earlier approach: one still uses Newton-Raphson optimization, but rather than optimizing $l(w)$, one optimizes $l(w) + \frac{1}{\lambda} J(w)$, which has a slightly modified gradient and Hessian.[^12]

## Choice of λ

Let us next discuss the selection of $\lambda$: $\lambda$ is an example of an algorithm parameter which we can adjust, or tune, in the hope of obtaining good performance for our classifier, although we have no guidance for our selection.[^13] One way to get an indication of a good choice is to keep some of our training data aside (let us call this part the validation set), and do parameter estimation on the remaining training set for various choices of $\lambda$. Our final choice of $\lambda$ can then be obtained by comparing the performance of the various classifiers built with differing choices of $\lambda$ on the validation set. Finally, we might re-estimate the parameters for this choice of $\lambda$ using the whole original training set for our final classifier.

[^11]: Reviewing the math, we see a general rule of thumb: the regularization penalty corresponds roughly to the negative logarithm of the prior distribution, since for MAP estimates we typically minimize the negative log posterior, which equals the negative log-likelihood plus the negative log-prior.
[^12]: Note that the modified Hessian is now positive definite, rather than positive semi-definite.
[^13]: Many other approaches, such as cross-validation, are possible, and finding good approaches to handling parameter tuning is somewhat of an art. Much research has been done in this area, but it is fraught with difficulties.
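Putting the pieces together, here is a sketch of regularized binary logistic regression by Newton-Raphson, with penalized gradient $\nabla l + w/\lambda$ and Hessian $H_l + I/\lambda$, plus the crude validation-set selection of $\lambda$ described above. The data, function names, and candidate $\lambda$ grid are all illustrative assumptions:

```python
import numpy as np

def sigma(y):
    return 1.0 / (1.0 + np.exp(-y))

def fit_logistic(X, t, lam, steps=15):
    """Newton-Raphson on the penalized negative log-likelihood."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        s = sigma(X @ w)
        grad = X.T @ (s - t) + w / lam                        # penalized gradient
        H = (X * (s * (1 - s))[:, None]).T @ X + np.eye(d) / lam
        w = w - np.linalg.solve(H, grad)                      # Newton update
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
t = (X[:, 0] + 0.3 * rng.normal(size=40) > 0).astype(float)   # noisy linear labels

def accuracy(w, X, t):
    return np.mean((sigma(X @ w) > 0.5) == t)

# validation-set selection of lambda, then refit on all the data
X_tr, t_tr, X_va, t_va = X[:30], t[:30], X[30:], t[30:]
best = max([0.01, 0.1, 1.0, 10.0],
           key=lambda lam: accuracy(fit_logistic(X_tr, t_tr, lam), X_va, t_va))
w = fit_logistic(X, t, best)
print(accuracy(w, X, t))
```

The penalty term makes the Hessian positive definite, so each Newton solve is well posed even if the training data happens to be separable.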

## An alternative view

If we consider $l(w)$ for the binary case, and ignore the constant $-\ln P(X)$, we have

$$-\sum_i \left(t_i \ln \sigma(a(x_i)) + (1 - t_i) \ln(1 - \sigma(a(x_i)))\right)$$

For each data point, this function calculates the predicted class probability $p$ for the actual class of that point, and adds $-\ln p$ to a total. If the probability is close to one, the amount added is small, but for points which are badly misclassified, the amount added can be much larger. This interpretation helps us understand why maximum likelihood approaches overfit: there is pressure to reduce these penalties. However, when regularization is performed, an extra $\frac{1}{\lambda} J(w)$ is added to this function, in such a way that overfitting to reduce the loss function is prevented by a compensating increase in this regularization term. Many other classification techniques can also be formulated in terms of a regularization function combined with a penalty function on the classification of points.
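The per-point reading of the loss can be made concrete by tabulating $-\ln p$ for a few predicted probabilities (the values are illustrative):

```python
import numpy as np

# -ln p for the probability p the model assigns to a point's actual class:
# confident correct predictions contribute little, badly misclassified
# points contribute a lot.
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:4.2f}  contribution = {-np.log(p):6.3f}")
```

The contribution grows without bound as $p \to 0$, which is the "pressure" on badly misclassified points mentioned above.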


### Number of Levels Cumulative Annual operating Income per year construction costs costs (\$) (\$) (\$) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000

Problem Set 5 Solutons 1 MIT s consderng buldng a new car park near Kendall Square. o unversty funds are avalable (overhead rates are under pressure and the new faclty would have to pay for tself from

### Trade Adjustment and Productivity in Large Crises. Online Appendix May 2013. Appendix A: Derivation of Equations for Productivity

Trade Adjustment Productvty n Large Crses Gta Gopnath Department of Economcs Harvard Unversty NBER Brent Neman Booth School of Busness Unversty of Chcago NBER Onlne Appendx May 2013 Appendx A: Dervaton

### An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services

An Evaluaton of the Extended Logstc, Smple Logstc, and Gompertz Models for Forecastng Short Lfecycle Products and Servces Charles V. Trappey a,1, Hsn-yng Wu b a Professor (Management Scence), Natonal Chao

### 1 De nitions and Censoring

De ntons and Censorng. Survval Analyss We begn by consderng smple analyses but we wll lead up to and take a look at regresson on explanatory factors., as n lnear regresson part A. The mportant d erence

### Lecture 3: Annuity. Study annuities whose payments form a geometric progression or a arithmetic progression.

Lecture 3: Annuty Goals: Learn contnuous annuty and perpetuty. Study annutes whose payments form a geometrc progresson or a arthmetc progresson. Dscuss yeld rates. Introduce Amortzaton Suggested Textbook

### greatest common divisor

4. GCD 1 The greatest common dvsor of two ntegers a and b (not both zero) s the largest nteger whch s a common factor of both a and b. We denote ths number by gcd(a, b), or smply (a, b) when there s no

### Statistical Methods to Develop Rating Models

Statstcal Methods to Develop Ratng Models [Evelyn Hayden and Danel Porath, Österrechsche Natonalbank and Unversty of Appled Scences at Manz] Source: The Basel II Rsk Parameters Estmaton, Valdaton, and

### On Mean Squared Error of Hierarchical Estimator

S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta

### Fisher Markets and Convex Programs

Fsher Markets and Convex Programs Nkhl R. Devanur 1 Introducton Convex programmng dualty s usually stated n ts most general form, wth convex objectve functons and convex constrants. (The book by Boyd and

### Luby s Alg. for Maximal Independent Sets using Pairwise Independence

Lecture Notes for Randomzed Algorthms Luby s Alg. for Maxmal Independent Sets usng Parwse Independence Last Updated by Erc Vgoda on February, 006 8. Maxmal Independent Sets For a graph G = (V, E), an ndependent

### THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

HE DISRIBUION OF LOAN PORFOLIO VALUE * Oldrch Alfons Vascek he amount of captal necessary to support a portfolo of debt securtes depends on the probablty dstrbuton of the portfolo loss. Consder a portfolo

### Credit Limit Optimization (CLO) for Credit Cards

Credt Lmt Optmzaton (CLO) for Credt Cards Vay S. Desa CSCC IX, Ednburgh September 8, 2005 Copyrght 2003, SAS Insttute Inc. All rghts reserved. SAS Propretary Agenda Background Tradtonal approaches to credt

### Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits

Lnear Crcuts Analyss. Superposton, Theenn /Norton Equalent crcuts So far we hae explored tmendependent (resste) elements that are also lnear. A tmendependent elements s one for whch we can plot an / cure.

### Nonlinear data mapping by neural networks

Nonlnear data mappng by neural networks R.P.W. Dun Delft Unversty of Technology, Netherlands Abstract A revew s gven of the use of neural networks for nonlnear mappng of hgh dmensonal data on lower dmensonal

### Lecture 9: Logit/Probit. Prof. Sharyn O Halloran Sustainable Development U9611 Econometrics II

Lecture 9: Logt/Probt Prof. Sharyn O Halloran Sustanable Development U96 Econometrcs II Revew of Lnear Estmaton So far, we know how to handle lnear estmaton models of the type: Y = β 0 + β *X + β 2 *X

### Simon Acomb NAG Financial Mathematics Day

1 Why People Who Prce Dervatves Are Interested In Correlaton mon Acomb NAG Fnancal Mathematcs Day Correlaton Rsk What Is Correlaton No lnear relatonshp between ponts Co-movement between the ponts Postve

### Solution: Let i = 10% and d = 5%. By definition, the respective forces of interest on funds A and B are. i 1 + it. S A (t) = d (1 dt) 2 1. = d 1 dt.

Chapter 9 Revew problems 9.1 Interest rate measurement Example 9.1. Fund A accumulates at a smple nterest rate of 10%. Fund B accumulates at a smple dscount rate of 5%. Fnd the pont n tme at whch the forces

### Lecture 18: Clustering & classification

O CPS260/BGT204. Algorthms n Computatonal Bology October 30, 2003 Lecturer: Pana K. Agarwal Lecture 8: Clusterng & classfcaton Scrbe: Daun Hou Open Problem In HomeWor 2, problem 5 has an open problem whch

### FINANCIAL MATHEMATICS. A Practical Guide for Actuaries. and other Business Professionals

FINANCIAL MATHEMATICS A Practcal Gude for Actuares and other Busness Professonals Second Edton CHRIS RUCKMAN, FSA, MAAA JOE FRANCIS, FSA, MAAA, CFA Study Notes Prepared by Kevn Shand, FSA, FCIA Assstant

### + + + - - This circuit than can be reduced to a planar circuit

MeshCurrent Method The meshcurrent s analog of the nodeoltage method. We sole for a new set of arables, mesh currents, that automatcally satsfy KCLs. As such, meshcurrent method reduces crcut soluton to

### Extending Probabilistic Dynamic Epistemic Logic

Extendng Probablstc Dynamc Epstemc Logc Joshua Sack May 29, 2008 Probablty Space Defnton A probablty space s a tuple (S, A, µ), where 1 S s a set called the sample space. 2 A P(S) s a σ-algebra: a set

### Bag-of-Words models. Lecture 9. Slides from: S. Lazebnik, A. Torralba, L. Fei-Fei, D. Lowe, C. Szurka

Bag-of-Words models Lecture 9 Sldes from: S. Lazebnk, A. Torralba, L. Fe-Fe, D. Lowe, C. Szurka Bag-of-features models Overvew: Bag-of-features models Orgns and motvaton Image representaton Dscrmnatve

### Lecture 2: Single Layer Perceptrons Kevin Swingler

Lecture 2: Sngle Layer Perceptrons Kevn Sngler kms@cs.str.ac.uk Recap: McCulloch-Ptts Neuron Ths vastly smplfed model of real neurons s also knon as a Threshold Logc Unt: W 2 A Y 3 n W n. A set of synapses

### Clustering Gene Expression Data. (Slides thanks to Dr. Mark Craven)

Clusterng Gene Epresson Data Sldes thanks to Dr. Mark Craven Gene Epresson Proles we ll assume we have a D matr o gene epresson measurements rows represent genes columns represent derent eperments tme

### Transition Matrix Models of Consumer Credit Ratings

Transton Matrx Models of Consumer Credt Ratngs Abstract Although the corporate credt rsk lterature has many studes modellng the change n the credt rsk of corporate bonds over tme, there s far less analyss

### Can Auto Liability Insurance Purchases Signal Risk Attitude?

Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang

### EE201 Circuit Theory I 2015 Spring. Dr. Yılmaz KALKAN

EE201 Crcut Theory I 2015 Sprng Dr. Yılmaz KALKAN 1. Basc Concepts (Chapter 1 of Nlsson - 3 Hrs.) Introducton, Current and Voltage, Power and Energy 2. Basc Laws (Chapter 2&3 of Nlsson - 6 Hrs.) Voltage

### Abteilung für Stadt- und Regionalentwicklung Department of Urban and Regional Development

Abtelung für Stadt- und Regonalentwcklung Department of Urban and Regonal Development Gunther Maer, Alexander Kaufmann The Development of Computer Networks Frst Results from a Mcroeconomc Model SRE-Dscusson

### Fast Fuzzy Clustering of Web Page Collections

Fast Fuzzy Clusterng of Web Page Collectons Chrstan Borgelt and Andreas Nürnberger Dept. of Knowledge Processng and Language Engneerng Otto-von-Guercke-Unversty of Magdeburg Unverstätsplatz, D-396 Magdeburg,

### General Iteration Algorithm for Classification Ratemaking

General Iteraton Algorthm for Classfcaton Ratemakng by Luyang Fu and Cheng-sheng eter Wu ABSTRACT In ths study, we propose a flexble and comprehensve teraton algorthm called general teraton algorthm (GIA)

### Learning from Multiple Outlooks

Learnng from Multple Outlooks Maayan Harel Department of Electrcal Engneerng, Technon, Hafa, Israel She Mannor Department of Electrcal Engneerng, Technon, Hafa, Israel maayanga@tx.technon.ac.l she@ee.technon.ac.l

### Learning to Classify Ordinal Data: The Data Replication Method

Journal of Machne Learnng Research 8 (7) 393-49 Submtted /6; Revsed 9/6; Publshed 7/7 Learnng to Classfy Ordnal Data: The Data Replcaton Method Jame S. Cardoso INESC Porto, Faculdade de Engenhara, Unversdade

### SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background:

SPEE Recommended Evaluaton Practce #6 efnton of eclne Curve Parameters Background: The producton hstores of ol and gas wells can be analyzed to estmate reserves and future ol and gas producton rates and

### Influence and Correlation in Social Networks

Influence and Correlaton n Socal Networks Ars Anagnostopoulos Rav Kumar Mohammad Mahdan Yahoo! Research 701 Frst Ave. Sunnyvale, CA 94089. {ars,ravkumar,mahdan}@yahoo-nc.com ABSTRACT In many onlne socal

### PERRON FROBENIUS THEOREM

PERRON FROBENIUS THEOREM R. CLARK ROBINSON Defnton. A n n matrx M wth real entres m, s called a stochastc matrx provded () all the entres m satsfy 0 m, () each of the columns sum to one, m = for all, ()

### On the Solution of Indefinite Systems Arising in Nonlinear Optimization

On the Soluton of Indefnte Systems Arsng n Nonlnear Optmzaton Slva Bonettn, Valera Ruggero and Federca Tnt Dpartmento d Matematca, Unverstà d Ferrara Abstract We consder the applcaton of the precondtoned

### Logical Development Of Vogel s Approximation Method (LD-VAM): An Approach To Find Basic Feasible Solution Of Transportation Problem

INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME, ISSUE, FEBRUARY ISSN 77-866 Logcal Development Of Vogel s Approxmaton Method (LD- An Approach To Fnd Basc Feasble Soluton Of Transportaton