Lecture 3: Density estimation
Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square
Next lecture: Matlab tutorial

Announcements
Rules for attending the class:
- Registered for credit
- Registered for audit (only if there are available seats)
Rules for audit:
- Homework assignments
Review: Design cycle
Data -> Feature selection -> Model selection -> Learning -> Evaluation
(Feature selection and model selection require prior knowledge.)
Data
Data may need a lot of:
- Cleaning
- Preprocessing (conversions)
Cleaning:
- Get rid of errors, noise
- Removal of redundancies
Preprocessing:
- Renaming
- Rescaling (normalization)
- Discretizations
- Abstraction
- Aggregation
- New attributes

Data biases
Watch out for data biases:
- Try to understand the data source
- It is very easy to derive "unexpected" results when the data used for analysis and learning are biased (pre-selected)
- Results (conclusions) derived for pre-selected data do not hold in general!
Data biases
Example 1: Risks-in-pregnancy study
- Sponsored by DARPA at military hospitals
- Study of a large sample of pregnant women who visited military hospitals
- Conclusion: the factor with the largest impact on reducing risks during pregnancy (statistically significant) is a pregnant woman being single
- Single women have the smallest risk
- What is wrong?
Example 2: Stock market trading (example by Andrew Lo)
- Data on the stock performance of companies traded on the stock market over the past 25 years
- Investment goal: pick a stock to hold long term
- Proposed strategy: invest in a company whose stock has an IPO corresponding to a Carmichael number
- Evaluation result: excellent return over 25 years
- Where does the magic come from?
Design cycle
Data -> Feature selection -> Model selection -> Learning -> Evaluation

Feature selection
The size (dimensionality) of a sample can be enormous:
x = (x_1, x_2, ..., x_d), with d very large
Example: document classification
- 10,000 different words
- Inputs: counts of occurrences of different words
- Too many parameters to learn (not enough samples to justify the estimates of the model parameters)
Dimensionality reduction: replace inputs with "features"
- Extract relevant inputs (e.g., via a mutual information measure)
- PCA (principal component analysis)
- Group (cluster) similar words (using a similarity measure) and replace them with the group label
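The PCA option above can be sketched in a few lines of NumPy. Everything below (the synthetic data, the choice of k = 2 components) is an illustrative assumption, not part of the lecture:

```python
import numpy as np

# Minimal PCA sketch: project d-dimensional inputs onto the top-k
# principal components (the directions of largest variance) to reduce
# dimensionality. The data here is synthetic and purely illustrative.
def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # n x k reduced features

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                   # 100 samples, d = 10 inputs
Z = pca_reduce(X, k=2)
print(Z.shape)                                   # (100, 2)
```

The reduced columns are uncorrelated by construction, since they are projections onto orthogonal singular directions.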
Design cycle
Data -> Feature selection -> Model selection -> Learning -> Evaluation

Model selection
What is the right model to learn?
- E.g., what degree of polynomial to use?
- Prior knowledge helps a lot, but there is still a lot of guessing
- Initial data analysis and visualization: we can make a good guess about the form of the distribution or the shape of the function
Overfitting problem
Take into account the bias and variance of error estimates:
- Simpler (more biased) model: parameters can be estimated more reliably (smaller variance of the estimates)
- Complex model with many parameters: parameter estimates are less reliable (larger variance of the estimates)
Solutions for overfitting
How can the learner avoid overfitting?
- Assure a sufficient number of samples in the training set
  - May not be possible (small number of examples)
- Hold some data out of the training set = validation set
  - Train (fit) on the training set (without the held-out data)
  - Check the generalization error on the validation set and choose the model based on the validation-set error
  - (random resampling validation techniques)
- Regularization (Occam's razor)
  - Penalize the model complexity (number of parameters)
  - Explicit preference towards simpler models

Design cycle
Data -> Feature selection -> Model selection -> Learning -> Evaluation
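The held-out validation idea can be made concrete on a small synthetic regression task. Everything below (the data, the noise level, the candidate polynomial degrees) is an assumed illustrative setup:

```python
import numpy as np

# Hold-out validation sketch: fit polynomials of increasing degree on the
# training split only, then pick the degree with the smallest error on the
# held-out validation split. The data is synthetic and illustrative.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, 60)   # noisy target function

x_tr, y_tr = x[:40], y[:40]                      # training set
x_va, y_va = x[40:], y[40:]                      # validation set (held out)

def val_error(degree):
    coeffs = np.polyfit(x_tr, y_tr, degree)      # fit on training data only
    pred = np.polyval(coeffs, x_va)
    return float(np.mean((y_va - pred) ** 2))    # validation MSE

errors = {d: val_error(d) for d in range(1, 10)}
best = min(errors, key=errors.get)               # model chosen by validation
print(best)
```

Note that the model is never fit on the held-out points; they are used only to compare candidate models, which is exactly what protects against picking an overfit high-degree polynomial.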
Learning
Learning = optimization problem. Various criteria:
- Mean squared error:
  w* = argmin_w Error(w),  where  Error(w) = (1/N) Σ_{i=1..N} (y_i - f(x_i, w))^2
- Maximum likelihood (ML) criterion:
  Θ* = argmax_Θ P(D | Θ),  equivalently minimize  Error(Θ) = -log P(D | Θ)
- Maximum a posteriori probability (MAP):
  Θ* = argmax_Θ P(Θ | D),  where  P(Θ | D) = P(D | Θ) P(Θ) / P(D)

Learning
Learning = optimization problem. Optimization problems can be hard to solve, and the right choice of a model and an error function makes a difference.
Parameter optimizations (continuous spaces):
- Gradient descent, conjugate gradient (1st-order methods)
- Newton-Raphson (2nd-order method)
- Levenberg-Marquardt
- Some can be carried out on-line, on a sample-by-sample basis
Combinatorial optimizations (over discrete spaces):
- Hill-climbing
- Simulated annealing
- Genetic algorithms
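As a concrete instance of a first-order method, here is a gradient-descent sketch for the mean squared error criterion with a one-parameter linear model f(x, w) = w·x. The data, the learning rate, and the true slope 3.0 are all assumptions made for illustration:

```python
import numpy as np

# Gradient descent on Error(w) = (1/N) * sum_i (y_i - w * x_i)^2.
# The gradient is dError/dw = -(2/N) * sum_i (y_i - w * x_i) * x_i.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(0.0, 0.1, 200)      # synthetic data, true slope 3.0

w, lr = 0.0, 0.1                             # initial guess and step size
for _ in range(100):
    grad = -2.0 * np.mean((y - w * x) * x)   # first-order information only
    w -= lr * grad                           # gradient-descent update

print(round(w, 2))                           # recovers a slope close to 3.0
```

Because the gradient averages over all samples, this is the batch version; replacing the mean with a single randomly chosen sample gives the on-line, sample-by-sample variant mentioned above.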
Design cycle
Data -> Feature selection -> Model selection -> Learning -> Evaluation

Evaluation
- Simple holdout method: divide the data into training and test sets.
- Other, more complex methods are based on random re-sampling validation schemes: cross-validation, random sub-sampling.
What if we want to compare the predictive performance of two different learning methods on a classification or regression problem?
- Solution: compare the error results on the test data set.
- The method with the better (smaller) test error has the better generalization error, but we need statistics to show that the difference is significant.
Density estimation: Outline
Outline:
- Density estimation:
  - Maximum likelihood (ML)
  - Bayesian parameter estimates
  - MAP
- Bernoulli distribution
- Binomial distribution
- Multinomial distribution
- Normal distribution
Density estimation
Data: D = {D_1, D_2, ..., D_n}, where D_i = x_i is a vector of attribute values.
Attributes: modeled by random variables X = {X_1, X_2, ..., X_d} with:
- continuous values
- discrete values
E.g., blood pressure with numerical values, or chest pain with discrete values [no-pain, mild, moderate, strong].
Underlying true probability distribution: p(X).

Density estimation
Data: D = {D_1, D_2, ..., D_n}, where D_i = x_i is a vector of attribute values.
Objective: try to estimate the underlying true probability distribution over the variables X, p(X), using the examples in D:
true distribution p(X) -> n samples D = {D_1, D_2, ..., D_n} -> estimate p̂(X)
Standard (i.i.d.) assumptions: the samples
- are independent of each other,
- come from the same (identical) distribution (a fixed p(X)).
Density estimation
Types of density estimation:
Parametric:
- the distribution is modeled using a set of parameters Θ: p(X | Θ)
- Example: mean and covariances of a multivariate normal
- Estimation: find the parameters Θ describing the data D
Non-parametric:
- the model of the distribution utilizes all examples in D
- as if all examples were parameters of the distribution
- Example: nearest-neighbor methods
Semi-parametric

Learning via parameter estimation
In this lecture we consider parametric density estimation.
Basic settings:
- A set of random variables X = {X_1, X_2, ..., X_d}
- A model of the distribution over the variables in X with parameters Θ: p̂(X | Θ)
- Data D = {D_1, D_2, ..., D_n}
Objective: find the parameters Θ such that p(X | Θ) describes the data D best.
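To make the non-parametric option concrete, here is a small k-nearest-neighbor density estimate in one dimension. The formula p̂(x) = k / (n · V(x)), where V(x) is the volume containing the k samples nearest to x, is a standard construction; the data and the choice of k below are illustrative assumptions, not from these slides:

```python
import numpy as np

# Non-parametric sketch: a k-nearest-neighbor density estimate in 1-D.
# All n samples participate in every query, as if they were the parameters.
def knn_density(x, data, k):
    dists = np.sort(np.abs(data - x))
    r = dists[k - 1]                    # distance to the k-th nearest sample
    return k / (len(data) * 2 * r)      # 1-D "volume" of radius r is 2r

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, 1000)       # samples from a standard normal
est = knn_density(0.0, data, k=50)
print(round(est, 2))                    # should land near the true density 0.40
```

Note the contrast with the parametric route: no Θ is ever fit; the estimate at each query point is computed directly from the stored examples.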
Parameter estimation.
Maximum likelihood (ML):
- maximize p(D | Θ, ξ)
- Yields one set of parameters Θ_ML; the target distribution is approximated as p̂(X) = p(X | Θ_ML)
Bayesian parameter estimation:
- uses the posterior distribution over the possible parameters:
  p(Θ | D, ξ) = p(D | Θ, ξ) p(Θ | ξ) / p(D | ξ)
- Yields all possible settings of Θ (and their weights); the target distribution is approximated as
  p̂(X) = p(X | D) = ∫ p(X | Θ) p(Θ | D, ξ) dΘ

Parameter estimation.
Other possible criteria:
Maximum a posteriori probability (MAP):
- maximize p(Θ | D, ξ)  (the mode of the posterior)
- Yields one set of parameters Θ_MAP; approximation: p̂(X) = p(X | Θ_MAP)
Expected value of the parameter:
- Θ̂ = E(Θ)  (the mean of the posterior), with the expectation taken with respect to the posterior p(Θ | D, ξ)
- Yields one set of parameters Θ̂; approximation: p̂(X) = p(X | Θ̂)
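For the coin (Bernoulli) model with N1 heads and N2 tails, these estimates have simple closed forms when the prior over the head probability θ is a Beta(a, b) distribution, because the posterior is then Beta(a + N1, b + N2). The hyperparameters a = b = 2 below are an assumed illustrative prior:

```python
# ML, MAP (posterior mode), and expected-value (posterior mean) estimates
# for the Bernoulli parameter theta, with a Beta(a, b) prior so that the
# posterior is Beta(a + N1, b + N2).
N1, N2 = 15, 10            # heads and tails, as in the lecture's coin data
a, b = 2, 2                # assumed prior hyperparameters (illustrative)

theta_ml = N1 / (N1 + N2)                          # maximum likelihood
theta_map = (a + N1 - 1) / (a + b + N1 + N2 - 2)   # mode of the posterior
theta_mean = (a + N1) / (a + b + N1 + N2)          # mean of the posterior

print(theta_ml, round(theta_map, 4), round(theta_mean, 4))
# 0.6 0.5926 0.5862
```

The prior pulls both Bayesian point estimates slightly toward 0.5; with more data (larger N1 + N2) all three estimates converge.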
Parameter estimation. Coin example.
Coin example: we have a coin that can be biased.
- Outcomes: two possible values, head or tail
- Data: D, a sequence of outcomes x_i such that x_i = 1 for a head and x_i = 0 for a tail
- Model: probability of a head: θ; probability of a tail: (1 - θ)
- Objective: we would like to estimate the probability of a head, θ̂, from the data

Parameter estimation. Example.
Assume an unknown and possibly biased coin; the probability of a head is θ.
Data: H H T T H H T H T H T T T H T H H H H T H H H H T
Heads: 15, Tails: 10
What would be your estimate of the probability of a head, θ̃?
Parameter estimation. Example.
Assume an unknown and possibly biased coin; the probability of a head is θ.
Data: H H T T H H T H T H T T T H T H H H H T H H H H T
Heads: 15, Tails: 10
What would be your choice of the probability of a head?
Solution: use the frequencies of occurrences to make the estimate:
θ̃ = 15/25 = 0.6
This is the maximum likelihood estimate of the parameter θ.

Probability of an outcome
Data: D, a sequence of outcomes x_i such that x_i = 1 for a head and x_i = 0 for a tail.
Model: probability of a head: θ; probability of a tail: (1 - θ).
Assume we know the probability θ. The probability of an outcome x_i of a single coin flip is
P(x_i) = θ^{x_i} (1 - θ)^{1 - x_i}   (Bernoulli distribution)
This combines the probabilities of a head and a tail so that x_i picks out its correct probability:
- it gives θ for x_i = 1
- it gives (1 - θ) for x_i = 0
Probability of a sequence of outcomes.
Data: D, a sequence of outcomes x_i such that x_i = 1 for a head and x_i = 0 for a tail.
Model: probability of a head: θ; probability of a tail: (1 - θ).
Assume a sequence of independent coin flips D = H H T H T H (encoded as D = 110101).
What is the probability of observing the data sequence D?
P(D | θ) = θ · θ · (1 - θ) · θ · (1 - θ) · θ = θ^4 (1 - θ)^2
Probability of a sequence of outcomes.
Data: D = H H T H T H, encoded as D = 110101.
Model: probability of a head: θ; probability of a tail: (1 - θ).
The probability of observing the data sequence D,
P(D | θ) = θ · θ · (1 - θ) · θ · (1 - θ) · θ = θ^4 (1 - θ)^2,
is called the likelihood of the data. It can be rewritten using the Bernoulli distribution:
P(D | θ) = Π_{i=1..6} θ^{x_i} (1 - θ)^{1 - x_i}
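The two forms of the likelihood agree, which is easy to check numerically. The helper function below is just an illustration of the product formula:

```python
# Likelihood of the sequence D = 110101 under the Bernoulli model:
# P(D | theta) = prod_i theta^{x_i} * (1 - theta)^{1 - x_i}
#              = theta^4 * (1 - theta)^2 for this particular sequence.
def likelihood(D, theta):
    p = 1.0
    for x in D:
        p *= theta ** x * (1.0 - theta) ** (1 - x)   # one Bernoulli term per flip
    return p

D = [1, 1, 0, 1, 0, 1]                               # H H T H T H
theta = 0.5
print(likelihood(D, theta))                          # 0.5**6 = 0.015625
```

For a fair coin every sequence of six flips has the same probability 0.5^6; for a biased coin the likelihood depends only on the counts of heads and tails, not on their order.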
The goodness of fit to the data.
Learning: we do not know the value of the parameter θ.
Our learning goal: find the parameter θ that fits the data D best.
One notion of "best": maximize the likelihood
P(D | θ) = Π_{i=1..n} θ^{x_i} (1 - θ)^{1 - x_i}
Intuition: the more likely the data are given the model, the better the fit.
Note: instead of an error function that measures how badly the data fit the model, we have a measure that tells us how well the data fit:
Error(D, θ) = -P(D | θ)

Example: Bernoulli distribution.
Coin example: we have a coin that can be biased.
- Outcomes: two possible values, head or tail
- Data: D, a sequence of outcomes x_i such that x_i = 1 for a head and x_i = 0 for a tail
- Model: probability of a head: θ; probability of a tail: (1 - θ)
- Objective: we would like to estimate the probability of a head, θ̂
Probability of an outcome:
P(x_i) = θ^{x_i} (1 - θ)^{1 - x_i}   (Bernoulli distribution)
Maximum likelihood (ML) estimate.
Likelihood of the data:
P(D | θ, ξ) = Π_{i=1..n} θ^{x_i} (1 - θ)^{1 - x_i}
Maximum likelihood estimate:
θ_ML = argmax_θ P(D | θ, ξ)
Let N1 be the number of heads seen and N2 the number of tails seen.
Optimize the log-likelihood (the same as maximizing the likelihood):
l(D, θ) = log P(D | θ, ξ)
        = Σ_{i=1..n} [ x_i log θ + (1 - x_i) log(1 - θ) ]
        = log θ · Σ_{i=1..n} x_i + log(1 - θ) · Σ_{i=1..n} (1 - x_i)
        = N1 log θ + N2 log(1 - θ)

Maximum likelihood (ML) estimate.
Optimize the log-likelihood
l(D, θ) = N1 log θ + N2 log(1 - θ)
Set the derivative to zero:
∂l(D, θ)/∂θ = N1/θ - N2/(1 - θ) = 0
Solving for θ gives:
θ_ML = N1/N = N1/(N1 + N2)
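The closed-form solution can be sanity-checked numerically: on a grid of θ values, the log-likelihood N1 log θ + N2 log(1 - θ) peaks at N1/(N1 + N2). This is a small illustrative check, not part of the slides:

```python
import math

# Numerical check of theta_ML = N1 / (N1 + N2) for the lecture's coin data:
# l(theta) = N1 * log(theta) + N2 * log(1 - theta) should peak at 15/25 = 0.6.
N1, N2 = 15, 10

def log_lik(theta):
    return N1 * math.log(theta) + N2 * math.log(1.0 - theta)

grid = [i / 1000 for i in range(1, 1000)]    # theta in (0, 1), endpoints excluded
best = max(grid, key=log_lik)
print(best)                                  # 0.6
```

Because the log-likelihood is concave in θ, the grid maximum coincides with the unique stationary point found by setting the derivative to zero.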
Maximum likelihood estimate. Example.
Assume an unknown and possibly biased coin; the probability of a head is θ.
Data: H H T T H H T H T H T T T H T H H H H T H H H H T
Heads: 15, Tails: 10
What is the ML estimate of the probability of a head and of a tail?
Head: θ_ML = N1/N = N1/(N1 + N2) = 15/25 = 0.6
Tail: (1 - θ_ML) = N2/N = N2/(N1 + N2) = 10/25 = 0.4