CS 2750 Machine Learning. Lecture 3. Density estimation.


Lecture 3: Density estimation
Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square
Next lecture: Matlab tutorial

Announcements
- Rules for attending the class:
  - Registered for credit
  - Registered for audit (only if there are available seats)
- Rules for audit:
  - Homework assignments

Review: Design cycle
Data -> Feature selection -> Model selection -> Learning -> Evaluation
Feature selection and model selection require prior knowledge.

Data
Data may need a lot of:
- Cleaning
- Preprocessing (conversions)
Cleaning:
- Get rid of errors, noise
- Removal of redundancies
Preprocessing:
- Renaming
- Rescaling (normalization)
- Discretization
- Abstraction
- Aggregation
- New attributes

Data biases
Watch out for data biases:
- Try to understand the data source.
- It is very easy to derive unexpected results when the data used for analysis and learning are biased (pre-selected).
- Results (conclusions) derived from pre-selected data do not hold in general!

Data biases
Example 1: Risks in pregnancy study
- Sponsored by DARPA at military hospitals
- Study of a large sample of pregnant women who visited military hospitals
- Conclusion: the factor with the largest (statistically significant) impact on reducing risks during pregnancy is the pregnant woman being single
- Single woman -> the smallest risk
- What is wrong?

Data
Example 2: Stock market trading (example by Andrew Lo)
- Data on the stock performance of companies traded on the stock market over the past 25 years
- Investment goal: pick a stock to hold long term
- Proposed strategy: invest in a company stock with an IPO corresponding to a Carmichael number
- Evaluation result: excellent return over 25 years
- Where does the magic come from?

Feature selection
- The size (dimensionality) of a sample can be enormous: $x = (x^1, x^2, \ldots, x^d)$, with $d$ very large
- Example: document classification
  - 10,000 different words
  - Inputs: counts of occurrences of different words
  - Too many parameters to learn (not enough samples to justify the estimates of the parameters of the model)
- Dimensionality reduction: replace inputs with features
  - Extract relevant inputs (e.g. mutual information measure)
  - PCA (principal component analysis)
  - Group (cluster) similar words (uses a similarity measure) and replace them with the group label
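To make the "counts of occurrences of different words" input concrete, here is a minimal sketch of how a document could be turned into a count vector. The vocabulary, the example sentence, and the function name `count_vector` are made up for illustration; a real document classifier would use a vocabulary of thousands of words, which is exactly why the dimensionality becomes a problem.

```python
from collections import Counter

def count_vector(document, vocabulary):
    """Return the number of occurrences of each vocabulary word in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical four-word vocabulary and document, for illustration only.
vocabulary = ["density", "estimation", "coin", "likelihood"]
x = count_vector("Density estimation maximum likelihood estimation", vocabulary)
print(x)  # one count per vocabulary word, here [1, 2, 0, 1]
```

With one input per vocabulary word, the dimension d of x equals the vocabulary size, so grouping similar words under one label directly shrinks d.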

Model selection
- What is the right model to learn?
  - E.g. what polynomial to use
  - Prior knowledge helps a lot, but there is still a lot of guessing
- Initial data analysis and visualization
  - We can make a good guess about the form of the distribution or the shape of the function
- Overfitting problem
  - Take into account the bias and variance of error estimates
  - Simpler (more biased) model: parameters can be estimated more reliably (smaller variance of estimates)
  - Complex model with many parameters: parameter estimates are less reliable (large variance of the estimate)

Solutions for overfitting
How to make the learner avoid overfitting?
- Assure a sufficient number of samples in the training set
  - May not be possible (small number of examples)
- Hold some data out of the training set = validation set
  - Train (fit) on the training set (without the held-out data)
  - Check the generalization error on the validation set and choose the model based on the validation set error (random resampling validation techniques)
- Regularization (Occam's Razor)
  - Penalize for the model complexity (number of parameters)
  - Explicit preference towards simple models
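The validation-set idea can be sketched in a few lines. The example below is only an illustration under assumed conditions: the data are synthetic (a noisy line), the two candidate models (a constant and a straight line) and all function names are hypothetical, and the split sizes are arbitrary. Each candidate is fit on the training part only, and the model with the smaller error on the held-out part is chosen.

```python
import random

def fit_constant(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Candidate 2: least-squares line through the training points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def mean_square_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]

# Hold 10 of the 40 examples out of the training set as a validation set.
idx = list(range(len(xs)))
rng.shuffle(idx)
train, valid = idx[:30], idx[30:]

candidates = {"constant": fit_constant, "linear": fit_linear}
errors = {}
for name, fit in candidates.items():
    model = fit([xs[i] for i in train], [ys[i] for i in train])
    errors[name] = mean_square_error(
        model, [xs[i] for i in valid], [ys[i] for i in valid]
    )

# Choose the model with the smallest validation error.
best = min(errors, key=errors.get)
print(best, errors)
```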

Learning
Learning = optimization problem. Various criteria:
- Mean square error:
  $w^* = \arg\min_w \mathrm{Error}(w)$, where $\mathrm{Error}(w) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i, w))^2$
- Maximum likelihood (ML) criterion:
  $\Theta^* = \arg\max_\Theta P(D \mid \Theta)$, with $\mathrm{Error}(\Theta) = -\log P(D \mid \Theta)$
- Maximum posterior probability (MAP):
  $\Theta^* = \arg\max_\Theta P(\Theta \mid D)$, where $P(\Theta \mid D) = \frac{P(D \mid \Theta)\, P(\Theta)}{P(D)}$

Learning
Learning = optimization problem
- Optimization problems can be hard to solve. The right choice of a model and an error function makes a difference.
- Parameter optimizations
  - Gradient descent, conjugate gradient (1st order method)
  - Newton-Raphson (2nd order method)
  - Levenberg-Marquardt
  - Some can be carried out on-line, on a sample-by-sample basis
- Combinatorial optimizations (over discrete spaces):
  - Hill-climbing
  - Simulated annealing
  - Genetic algorithms
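As an illustration of learning as optimization, the sketch below minimizes the mean square error with plain batch gradient descent. It assumes a linear model f(x, w) = w[0] + w[1] * x, a fixed learning rate, and a tiny noise-free data set; all of these are choices made for the example, not part of the lecture.

```python
def mse(w, xs, ys):
    """Mean square error of the assumed linear model f(x, w) = w[0] + w[1] * x."""
    n = len(xs)
    return sum((y - (w[0] + w[1] * x)) ** 2 for x, y in zip(xs, ys)) / n

def gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Minimize the mean square error by batch gradient descent (1st order method)."""
    w = [0.0, 0.0]
    n = len(xs)
    for _ in range(steps):
        residuals = [y - (w[0] + w[1] * x) for x, y in zip(xs, ys)]
        # Partial derivatives of the mean square error with respect to w[0] and w[1].
        g0 = -2.0 / n * sum(residuals)
        g1 = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        # Step against the gradient.
        w = [w[0] - lr * g0, w[1] - lr * g1]
    return w

# Hypothetical data generated from y = 1 + 2x without noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w_star = gradient_descent(xs, ys)
print(w_star, mse(w_star, xs, ys))  # weights close to [1, 2], error close to 0
```

The learning rate matters: too large a step makes the iteration diverge, too small a step makes it slow, which is one reason second-order methods such as Newton-Raphson are attractive when they are affordable.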

Evaluation
- Simple holdout method
  - Divide the data into training and test data
- Other, more complex methods
  - Based on random re-sampling validation schemes: cross-validation, random sub-sampling
- What if we want to compare the predictive performance on a classification or a regression problem for two different learning methods?
  - Solution: compare the error results on the test data set
  - The method with the better (smaller) testing error gives a better generalization error
  - But we need statistics to show that the difference is significant
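A rough sketch of cross-validation, one of the re-sampling schemes mentioned above: the data are split into k folds, each fold serves once as the test set, and the test errors are averaged. The function names and the trivial "predict the training mean" learner are placeholders for illustration; in practice the examples would usually be shuffled before splitting.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validation_error(ys, k):
    """Average test mean square error of a mean predictor over k folds."""
    errors = []
    for test in k_fold_indices(len(ys), k):
        test_set = set(test)
        # Fit on everything outside the current fold.
        train_y = [y for i, y in enumerate(ys) if i not in test_set]
        mean_y = sum(train_y) / len(train_y)
        # Evaluate on the held-out fold only.
        errors.append(sum((ys[i] - mean_y) ** 2 for i in test) / len(test))
    return sum(errors) / k

print(k_fold_indices(10, 3))
print(cross_validation_error([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3))
```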

Density estimation

Outline:
- Density estimation:
  - Maximum likelihood (ML)
  - Bayesian parameter estimates
  - MAP
- Bernoulli distribution
- Binomial distribution
- Multinomial distribution
- Normal distribution

Density estimation
Data: $D = \{D_1, D_2, \ldots, D_n\}$, where $D_i = x_i$ is a vector of attribute values
Attributes:
- modeled by random variables $X = \{X_1, X_2, \ldots, X_d\}$ with:
  - Continuous values
  - Discrete values
- E.g. blood pressure with numerical values, or chest pain with discrete values [no-pain, mild, moderate, strong]
Underlying true probability distribution: $p(X)$

Density estimation
Data: $D = \{D_1, D_2, \ldots, D_n\}$, where $D_i = x_i$ is a vector of attribute values
Objective: try to estimate the underlying true probability distribution over variables $X$, $p(X)$, using examples in $D$
true distribution $p(X)$ -> $n$ samples $D = \{D_1, D_2, \ldots, D_n\}$ -> estimate $\hat{p}(X)$
Standard (iid) assumptions: samples
- are independent of each other
- come from the same (identical) distribution (fixed $p(X)$)

Density estimation
Types of density estimation:
- Parametric
  - The distribution is modeled using a set of parameters $\Theta$: $p(X \mid \Theta)$
  - Example: mean and covariances of a multivariate normal
  - Estimation: find parameters $\Theta$ describing data $D$
- Non-parametric
  - The model of the distribution utilizes all examples in $D$
  - As if all examples were parameters of the distribution
  - Examples: nearest-neighbor
- Semi-parametric

Learning via parameter estimation
In this lecture we consider parametric density estimation.
Basic settings:
- A set of random variables $X = \{X_1, X_2, \ldots, X_d\}$
- A model of the distribution over variables in $X$ with parameters $\Theta$: $\hat{p}(X \mid \Theta)$
- Data $D = \{D_1, D_2, \ldots, D_n\}$
Objective: find parameters $\Theta$ such that $p(X \mid \Theta)$ describes data $D$ the best

Parameter estimation
- Maximum likelihood (ML)
  - maximize $p(D \mid \Theta, \xi)$
  - yields: one set of parameters $\Theta_{ML}$
  - the target distribution is approximated as: $\hat{p}(X) = p(X \mid \Theta_{ML})$
- Bayesian parameter estimation
  - uses the posterior distribution over possible parameters:
    $p(\Theta \mid D, \xi) = \frac{p(D \mid \Theta, \xi)\, p(\Theta \mid \xi)}{p(D \mid \xi)}$
  - yields: all possible settings of $\Theta$ (and their weights)
  - the target distribution is approximated as:
    $\hat{p}(X) = p(X \mid D) = \int_{\Theta} p(X \mid \Theta)\, p(\Theta \mid D, \xi)\, d\Theta$

Parameter estimation
Other possible criteria:
- Maximum a posteriori probability (MAP)
  - maximize $p(\Theta \mid D, \xi)$ (mode of the posterior)
  - yields: one set of parameters $\Theta_{MAP}$
  - approximation: $\hat{p}(X) = p(X \mid \Theta_{MAP})$
- Expected value of the parameter
  - $\hat{\Theta} = E(\Theta)$ (mean of the posterior)
  - expectation taken with regard to the posterior $p(\Theta \mid D, \xi)$
  - yields: one set of parameters
  - approximation: $\hat{p}(X) = p(X \mid \hat{\Theta})$
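The difference between these criteria is easier to see on a one-parameter model. The sketch below is an illustration under assumptions that are not part of the slide: a coin with unknown head probability theta, 15 heads and 10 tails observed, and a made-up prior proportional to theta * (1 - theta). The parameter is discretized on a grid so that the integral over the parameter becomes a sum.

```python
# Hypothetical setup: coin with unknown head probability theta,
# 15 heads and 10 tails, assumed prior proportional to theta * (1 - theta).
HEADS, TAILS = 15, 10

def likelihood(theta):
    """p(D | theta) for a sequence of independent coin flips."""
    return theta ** HEADS * (1.0 - theta) ** TAILS

def prior(theta):
    """Assumed (unnormalized) prior that mildly favors a fair coin."""
    return theta * (1.0 - theta)

# Discretize theta so that integrals over theta become sums.
grid = [i / 1000 for i in range(1, 1000)]

# ML: the single theta that maximizes the likelihood.
theta_ml = max(grid, key=likelihood)

# Posterior on the grid: likelihood times prior, normalized to sum to 1.
unnormalized = [likelihood(t) * prior(t) for t in grid]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

# MAP: the mode of the posterior.
theta_map = grid[max(range(len(grid)), key=lambda i: posterior[i])]

# Expected value of the parameter: the mean of the posterior. For this model
# it also equals the Bayesian predictive probability of a head, i.e. the sum
# over theta of p(head | theta) * p(theta | D).
theta_mean = sum(t * p for t, p in zip(grid, posterior))

print(theta_ml, theta_map, theta_mean)
```

Under these assumptions the three estimates differ slightly (roughly 0.600, 0.593 and 0.586): the prior pulls the MAP estimate and the posterior mean towards 0.5, and the pull weakens as more data are observed.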

Parameter estimation. Coin example.
Coin example: we have a coin that can be biased
- Outcomes: two possible values, head or tail
- Data: $D$, a sequence of outcomes $x_i$ such that
  - head: $x_i = 1$
  - tail: $x_i = 0$
- Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$
- Objective: we would like to estimate the probability of a head $\hat{\theta}$ from data

Parameter estimation. Example.
- Assume an unknown and possibly biased coin
- Probability of the head is $\theta$
- Data: H H T T H H T H T H T T T H T H H H H T H H H H T
  - Heads: 15
  - Tails: 10
- What would be your estimate of the probability of a head? $\tilde{\theta} = ?$

Parameter estimation. Example
- Assume an unknown and possibly biased coin
- Probability of the head is $\theta$
- Data: H H T T H H T H T H T T T H T H H H H T H H H H T
  - Heads: 15
  - Tails: 10
- What would be your choice of the probability of a head?
- Solution: use frequencies of occurrences to do the estimate:
  $\tilde{\theta} = \frac{15}{25} = 0.6$
- This is the maximum likelihood estimate of the parameter $\theta$

Probability of an outcome
- Data: $D$, a sequence of outcomes $x_i$ such that head: $x_i = 1$, tail: $x_i = 0$
- Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$
- Assume: we know the probability $\theta$
- Probability of an outcome of a coin flip $x_i$ (Bernoulli distribution):
  $P(x_i \mid \theta) = \theta^{x_i} (1 - \theta)^{(1 - x_i)}$
  - Combines the probability of a head and a tail
  - So that $x_i$ is going to pick its correct probability
  - Gives $\theta$ for $x_i = 1$
  - Gives $(1 - \theta)$ for $x_i = 0$
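A short check of how the Bernoulli formula "picks" the right probability, writing theta for the probability of a head. The function name `bernoulli` and the value theta = 0.6 are chosen only for this illustration.

```python
def bernoulli(x, theta):
    """P(x | theta) = theta**x * (1 - theta)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (tail) or 1 (head)")
    return theta ** x * (1.0 - theta) ** (1 - x)

theta = 0.6  # assumed probability of a head
p_head = bernoulli(1, theta)  # exponent of (1 - theta) is 0, so only theta remains
p_tail = bernoulli(0, theta)  # exponent of theta is 0, so only (1 - theta) remains
print(p_head, p_tail)
```

The two probabilities sum to one for any theta between 0 and 1, as they must for a distribution over two outcomes.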

Probability of a sequence of outcomes
- Data: $D$, a sequence of outcomes $x_i$ such that head: $x_i = 1$, tail: $x_i = 0$
- Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$
- Assume: a sequence of independent coin flips D = H H T H T H (encoded as D = 110101)
- What is the probability of observing the data sequence $D$: $P(D \mid \theta) = ?$

Probability of a sequence of outcomes
- For the same sequence D = H H T H T H, independence lets us multiply the probabilities of the individual outcomes:
  $P(D \mid \theta) = \theta\, \theta\, (1 - \theta)\, \theta\, (1 - \theta)\, \theta$

Probability of a sequence of outcomes
- Data: $D$, a sequence of outcomes $x_i$ such that head: $x_i = 1$, tail: $x_i = 0$
- Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$
- Assume: a sequence of coin flips D = H H T H T H, encoded as D = 110101
- The probability of observing the data sequence $D$ is the likelihood of the data:
  $P(D \mid \theta) = \theta\, \theta\, (1 - \theta)\, \theta\, (1 - \theta)\, \theta$
- It can be rewritten using the Bernoulli distribution:
  $P(D \mid \theta) = \prod_{i=1}^{6} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$
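The product form can be evaluated directly. The sketch below encodes D = H H T H T H as 1 1 0 1 0 1 (head = 1, tail = 0) and multiplies the Bernoulli probabilities of the individual flips; the function name and the trial values of theta (the probability of a head) are chosen only for illustration.

```python
def sequence_likelihood(data, theta):
    """P(D | theta) for independent coin flips encoded as 1 (head) / 0 (tail)."""
    p = 1.0
    for x in data:
        p *= theta ** x * (1.0 - theta) ** (1 - x)
    return p

D = [1, 1, 0, 1, 0, 1]  # H H T H T H

for theta in (0.3, 0.5, 2 / 3, 0.9):
    print(theta, sequence_likelihood(D, theta))
```

With four heads and two tails the product collapses to theta**4 * (1 - theta)**2, so different values of theta make the same observed sequence more or less likely.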

The goodness of fit to the data
- Learning: we do not know the value of the parameter $\theta$
- Our learning goal: find the parameter $\theta$ that fits the data $D$ the best
- One solution to the "best": maximize the likelihood
  $P(D \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$
- Intuition: the more likely the data are given the model, the better the fit
- Note: instead of an error function that measures how badly the data fit the model, we have a measure that tells us how well the data fit:
  $\mathrm{Error}(D, \theta) = -P(D \mid \theta)$

Example: Bernoulli distribution
Coin example: we have a coin that can be biased
- Outcomes: two possible values, head or tail
- Data: $D$, a sequence of outcomes $x_i$ such that head: $x_i = 1$, tail: $x_i = 0$
- Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$
- Objective: we would like to estimate the probability of a head $\hat{\theta}$
- Probability of an outcome $x_i$ (Bernoulli distribution):
  $P(x_i \mid \theta) = \theta^{x_i} (1 - \theta)^{(1 - x_i)}$

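Maximum likelihood (ML) estimate
Likelihood of data:
  $P(D \mid \theta, \xi) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$
Maximum likelihood estimate:
  $\theta_{ML} = \arg\max_{\theta} P(D \mid \theta, \xi)$
Optimize the log-likelihood (the same as maximizing the likelihood):
  $l(D, \theta) = \log P(D \mid \theta, \xi) = \log \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$
  $= \sum_{i=1}^{n} \left[ x_i \log \theta + (1 - x_i) \log (1 - \theta) \right]$
  $= \log \theta \sum_{i=1}^{n} x_i + \log (1 - \theta) \sum_{i=1}^{n} (1 - x_i)$
where $N_1 = \sum_{i=1}^{n} x_i$ is the number of heads seen and $N_2 = \sum_{i=1}^{n} (1 - x_i)$ is the number of tails seen.

Maximum likelihood (ML) estimate
Optimize the log-likelihood:
  $l(D, \theta) = N_1 \log \theta + N_2 \log (1 - \theta)$
Set the derivative to zero:
  $\frac{\partial l(D, \theta)}{\partial \theta} = \frac{N_1}{\theta} - \frac{N_2}{1 - \theta} = 0$
Solving:
  $\theta = \frac{N_1}{N_1 + N_2}$
ML solution:
  $\theta_{ML} = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$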
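The closed-form result can be sanity-checked numerically. The sketch below evaluates the log-likelihood N1 log(theta) + N2 log(1 - theta), with theta the probability of a head, on a grid of candidate values and compares the best grid point with N1 / (N1 + N2). The counts N1 = 7, N2 = 3 and the grid resolution are hypothetical values chosen for the check.

```python
import math

def log_likelihood(theta, n1, n2):
    """l(D, theta) = N1 * log(theta) + N2 * log(1 - theta)."""
    return n1 * math.log(theta) + n2 * math.log(1.0 - theta)

def ml_estimate(n1, n2):
    """Closed-form maximizer obtained by setting the derivative to zero."""
    return n1 / (n1 + n2)

n1, n2 = 7, 3  # hypothetical counts of heads and tails

# Brute-force search over theta in (0, 1); the end points are excluded
# because log(0) is undefined.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, n1, n2))

print(theta_grid, ml_estimate(n1, n2))
```

The grid search is only a check; the derivation is what shows that the maximizer is the relative frequency of heads for any counts.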

Maximum likelihood estimate. Example
- Assume an unknown and possibly biased coin
- Probability of the head is $\theta$
- Data: H H T T H H T H T H T T T H T H H H H T H H H H T
  - Heads: 15
  - Tails: 10
- What is the ML estimate of the probability of a head and a tail?
  - Head: $\theta_{ML} = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} = \frac{15}{25} = 0.6$
  - Tail: $(1 - \theta_{ML}) = \frac{N_2}{N} = \frac{N_2}{N_1 + N_2} = \frac{10}{25} = 0.4$
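The same numbers can be reproduced by counting the outcomes in the sequence from the example. This is a small sketch; the variable names are chosen for illustration.

```python
# The 25 coin flips from the example, written without spaces.
flips = "HHTTHHTHTHTTTHTHHHHTHHHHT"

n1 = flips.count("H")  # number of heads seen
n2 = flips.count("T")  # number of tails seen

# ML estimates: relative frequencies of the two outcomes.
theta_ml = n1 / (n1 + n2)
tail_ml = n2 / (n1 + n2)

print(n1, n2, theta_ml, tail_ml)
```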