Probabilistic and Bayesian Analytics



Similar documents
Online Bagging and Boosting

How To Get A Loan From A Bank For Free

Machine Learning Applications in Grid Computing

CRM FACTORS ASSESSMENT USING ANALYTIC HIERARCHY PROCESS

Software Quality Characteristics Tested For Mobile Application Development

SOME APPLICATIONS OF FORECASTING Prof. Thomas B. Fomby Department of Economics Southern Methodist University May 2008

arxiv: v1 [math.pr] 9 May 2008

How To Attract Ore Traffic On A Network With A Daoi (Orca) On A Gpa Network

FUTURE LIFE-TABLES BASED ON THE LEE-CARTER METHODOLOGY AND THEIR APPLICATION TO CALCULATING THE PENSION ANNUITIES 1

Analyzing Methods Study of Outer Loop Current Sharing Control for Paralleled DC/DC Converters

Extended-Horizon Analysis of Pressure Sensitivities for Leak Detection in Water Distribution Networks: Application to the Barcelona Network

Searching strategy for multi-target discovery in wireless networks

Managing Complex Network Operation with Predictive Analytics

Decision Trees. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University.

Quality evaluation of the model-based forecasts of implied volatility index

6. Time (or Space) Series Analysis

Kinetic Molecular Theory of Ideal Gases

Project Evaluation Roadmap. Capital Budgeting Process. Capital Expenditure. Major Cash Flow Components. Cash Flows... COMM2501 Financial Management

RECURSIVE DYNAMIC PROGRAMMING: HEURISTIC RULES, BOUNDING AND STATE SPACE REDUCTION. Henrik Kure

Doppler effect, moving sources/receivers

Media Adaptation Framework in Biofeedback System for Stroke Patient Rehabilitation

Applying Multiple Neural Networks on Large Scale Data

This paper studies a rental firm that offers reusable products to price- and quality-of-service sensitive

Information Processing Letters

Calculating the Return on Investment (ROI) for DMSMS Management. The Problem with Cost Avoidance

Factored Models for Probabilistic Modal Logic

Preference-based Search and Multi-criteria Optimization

Cross-validation for detecting and preventing overfitting

( C) CLASS 10. TEMPERATURE AND ATOMS

OpenGamma Documentation Bond Pricing

An Approach to Combating Free-riding in Peer-to-Peer Networks

Reliability Constrained Packet-sizing for Linear Multi-hop Wireless Networks

Investing in corporate bonds?

Fuzzy Sets in HR Management

AUC Optimization vs. Error Rate Minimization

Insurance Spirals and the Lloyd s Market

Don t Run With Your Retirement Money

Investing in corporate bonds?

and that of the outgoing water is mv f

Use of extrapolation to forecast the working capital in the mechanical engineering companies

Pricing Asian Options using Monte Carlo Methods

Energy Proportionality for Disk Storage Using Replication

CLOSED-LOOP SUPPLY CHAIN NETWORK OPTIMIZATION FOR HONG KONG CARTRIDGE RECYCLING INDUSTRY

INTEGRATED ENVIRONMENT FOR STORING AND HANDLING INFORMATION IN TASKS OF INDUCTIVE MODELLING FOR BUSINESS INTELLIGENCE SYSTEMS

Generating Certification Authority Authenticated Public Keys in Ad Hoc Networks

Comment on On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes

REQUIREMENTS FOR A COMPUTER SCIENCE CURRICULUM EMPHASIZING INFORMATION TECHNOLOGY SUBJECT AREA: CURRICULUM ISSUES

ADJUSTING FOR QUALITY CHANGE

Endogenous Credit-Card Acceptance in a Model of Precautionary Demand for Money

Evaluating Software Quality of Vendors using Fuzzy Analytic Hierarchy Process

Performance Evaluation of Machine Learning Techniques using Software Cost Drivers

Data Streaming Algorithms for Estimating Entropy of Network Traffic

Halloween Costume Ideas for the Wii Game

Expected Value. 24 February Expected Value 24 February /19

Invention of NFV Technique and Its Relationship with NPV

The Research of Measuring Approach and Energy Efficiency for Hadoop Periodic Jobs

ON SELF-ROUTING IN CLOS CONNECTION NETWORKS. BARRY G. DOUGLASS Electrical Engineering Department Texas A&M University College Station, TX

A Study on the Chain Restaurants Dynamic Negotiation Games of the Optimization of Joint Procurement of Food Materials

Modeling operational risk data reported above a time-varying threshold

PERFORMANCE METRICS FOR THE IT SERVICES PORTFOLIO

Research Article Performance Evaluation of Human Resource Outsourcing in Food Processing Enterprises

Position Auctions and Non-uniform Conversion Rates

SAMPLING METHODS LEARNING OBJECTIVES

Data Set Generation for Rectangular Placement Problems

AutoHelp. An 'Intelligent' Case-Based Help Desk Providing. Web-Based Support for EOSDIS Customers. A Concept and Proof-of-Concept Implementation

HOW CLOSE ARE THE OPTION PRICING FORMULAS OF BACHELIER AND BLACK-MERTON-SCHOLES?

Introduction to Learning & Decision Trees

An Optimal Task Allocation Model for System Cost Analysis in Heterogeneous Distributed Computing Systems: A Heuristic Approach

AN ALGORITHM FOR REDUCING THE DIMENSION AND SIZE OF A SAMPLE FOR DATA EXPLORATION PROCEDURES

Adaptive Modulation and Coding for Unmanned Aerial Vehicle (UAV) Radio Channel

Reconnect 04 Solving Integer Programs with Branch and Bound (and Branch and Cut)

Boolean Algebra Part 1

On the Mutual Coefficient of Restitution in Two Car Collinear Collisions

YOUNG CHILDREN CONTINUE TO REINVENT ARITHMETIC 3rd Grade IMPLICATIONS OF PIAGET'S THEORY

Mathematical Model for Glucose-Insulin Regulatory System of Diabetes Mellitus

Standards and Protocols for the Collection and Dissemination of Graduating Student Initial Career Outcomes Information For Undergraduates

Implementation of Active Queue Management in a Combined Input and Output Queued Switch

An Innovate Dynamic Load Balancing Algorithm Based on Task

COMBINING CRASH RECORDER AND PAIRED COMPARISON TECHNIQUE: INJURY RISK FUNCTIONS IN FRONTAL AND REAR IMPACTS WITH SPECIAL REFERENCE TO NECK INJURIES

A WISER Guide. Financial Steps for Caregivers: What You Need to Know About Money and Retirement

DELTA-V AS A MEASURE OF TRAFFIC CONFLICT SEVERITY

Airline Yield Management with Overbooking, Cancellations, and No-Shows JANAKIRAM SUBRAMANIAN

Analyzing Spatiotemporal Characteristics of Education Network Traffic with Flexible Multiscale Entropy

Resource Allocation in Wireless Networks with Multiple Relays

An Integrated Approach for Monitoring Service Level Parameters of Software-Defined Networking

The United States was in the midst of a

Red Hat Enterprise Linux: Creating a Scalable Open Source Storage Infrastructure

Calculus-Based Physics I by Jeffrey W. Schnick

An improved TF-IDF approach for text classification *

Markov Models and Their Use for Calculations of Important Traffic Parameters of Contact Center

Image restoration for a rectangular poor-pixels detector

Transcription:

Probabilistic and Bayesian Analytics Note to other teachers and users of these slides. Andrew would be delighted if you found this source aterial useful in giing your own lectures. Feel free to use these slides erbati, or to odify the to fit your own needs. PowerPoint originals are aailable. If you ake use of a significant portion of these slides in your own lecture, please include this essage, or the following link to the source repository of Andrew s tutorials: http://www.cs.cu.edu/~aw/tutorials. Coents and corrections gratefully receied. Andrew W. Moore Professor School of Coputer Science Carnegie Mellon Uniersity www.cs.cu.edu/~aw aw@cs.cu.edu 42-268-7599 Copyright Andrew W. Moore Slide Probability The world is a ery uncertain place 3 years of Artificial Intelligence and Database research danced around this fact And then a few AI researchers decided to use soe ideas fro the eighteenth century Copyright Andrew W. Moore Slide 2

What we re going to do We will reiew the fundaentals of probability. It s really going to be worth it In this lecture, you ll see an exaple of probabilistic analytics in action: Bayes Classifiers Copyright Andrew W. Moore Slide 3 Discrete Rando Variables A is a Boolean-alued rando ariable if A denotes an eent, and there is soe degree of uncertainty as to whether A occurs. Exaples A The US president in 223 will be ale A ou wake up toorrow with a headache A ou hae Ebola Copyright Andrew W. Moore Slide 4 2

Probabilities We write A as the fraction of possible worlds in which A is true We could at this point spend 2 hours on the philosophy of this. But we won t. Copyright Andrew W. Moore Slide 5 Visualizing A Eent space of all possible worlds Worlds in which A is true A Area of reddish oal Its area is Worlds in which A is False Copyright Andrew W. Moore Slide 6 3

The Axios Of Probabi lity Copyright Andrew W. Moore Slide 7 The Axios of Probability < A < True False A or B A + B - A and B Where do these axios coe fro? Were they discoered? Answers coing up later. Copyright Andrew W. Moore Slide 8 4

Interpreting the axios < A < True False A or B A + B - A and B The area of A can t get any saller than And a zero area would ean no world could eer hae A true Copyright Andrew W. Moore Slide 9 Interpreting the axios < A < True False A or B A + B - A and B The area of A can t get any bigger than And an area of would ean all worlds will hae A true Copyright Andrew W. Moore Slide 5

Interpreting the axios < A < True False A or B A + B - A and B A B Copyright Andrew W. Moore Slide Interpreting the axios < A < True False A or B A + B - A and B A B A or B A and B B Siple addition and subtraction Copyright Andrew W. Moore Slide 2 6

These Axios are Not to be Trifled With There hae been attepts to do different ethodologies for uncertainty Fuzzy Logic Three-alued logic Depster-Shafer Non-onotonic reasoning But the axios of probability are the only syste with this property: If you gable using the you can t be unfairly exploited by an opponent using soe other syste [di Finetti 93] Copyright Andrew W. Moore Slide 3 Theores fro the Axios < A <, True, False A or B A + B - A and B Fro these we can proe: not A ~A -A How? Copyright Andrew W. Moore Slide 4 7

Side Note I a inflicting these proofs on you for two reasons:. These kind of anipulations will need to be second nature to you if you use probabilistic analytics in depth 2. Suffering is good for you Copyright Andrew W. Moore Slide 5 Another iportant theore < A <, True, False A or B A + B - A and B Fro these we can proe: A A ^ B + A ^ ~B How? Copyright Andrew W. Moore Slide 6 8

Multialued Rando Variables Suppose A can take on ore than 2 alues A is a rando ariable with arity k if it can take on exactly one alue out of {, 2,.. k } Thus A P A A A if i A i j ( 2 k j Copyright Andrew W. Moore Slide 7 An easy fact about Multialued Rando Variables: Using the axios of probability < A <, True, False A or B A + B - A and B And assuing that A obeys A P A It s easy to proe that A A if i A i j ( 2 k j A A 2 A i A i j j Copyright Andrew W. Moore Slide 8 9

An easy fact about Multialued Rando Variables: Using the axios of probability < A <, True, False A or B A + B - A and B And assuing that A obeys A P A It s easy to proe that A A if i A i j ( 2 k j A A 2 A And thus we can proe k j A j i A i j j Copyright Andrew W. Moore Slide 9 Another fact about Multialued Rando Variables: Using the axios of probability < A <, True, False A or B A + B - A and B And assuing that A obeys A P A It s easy to proe that A A if i A i j ( 2 k j B [ A A 2 A ] i B A i j j Copyright Andrew W. Moore Slide 2

Another fact about Multialued Rando Variables: Using the axios of probability < A <, True, False A or B A + B - A and B And assuing that A obeys A P A It s easy to proe that A A if i A i j ( 2 k j B [ A A 2 A ] And thus we can proe B k j B A i B A i j j j Copyright Andrew W. Moore Slide 2 Eleentary Probability in Pictures ~A + A Copyright Andrew W. Moore Slide 22

Eleentary Probability in Pictures B B ^ A + B ^ ~A Copyright Andrew W. Moore Slide 23 Eleentary Probability in Pictures k j A j Copyright Andrew W. Moore Slide 24 2

Eleentary Probability in Pictures B k j B A j Copyright Andrew W. Moore Slide 25 Conditional Probability A B Fraction of worlds in which B is true that also hae A true H Hae a headache F Coing down with Flu F H / F /4 H F /2 H Headaches are rare and flu is rarer, but if you re coing down with flu there s a 5-5 chance you ll hae a headache. Copyright Andrew W. Moore Slide 26 3

Conditional Probability F H F Fraction of flu-inflicted worlds in which you hae a headache H Hae a headache F Coing down with Flu H / F /4 H F /2 H #worlds with flu and headache ------------------------------------ #worlds with flu Area of H and F region ------------------------------ Area of F region H ^ F ----------- F Copyright Andrew W. Moore Slide 27 Definition of Conditional Probability A ^ B A B ----------- B Corollary: The Chain Rule A ^ B A B B Copyright Andrew W. Moore Slide 28 4

Probabilistic Inference F H Hae a headache F Coing down with Flu H H / F /4 H F /2 One day you wake up with a headache. ou think: Drat! 5% of flus are associated with headaches so I ust hae a 5-5 chance of coing down with flu Is this reasoning good? Copyright Andrew W. Moore Slide 29 Probabilistic Inference F H Hae a headache F Coing down with Flu H H / F /4 H F /2 F ^ H F H Copyright Andrew W. Moore Slide 3 5

Another way to understand the intuition Thanks to Jahanzeb Sherwani for contributing this explanation: Copyright Andrew W. Moore Slide 3 What we just did A ^ B A B B B A ----------- --------------- A A This is Bayes Rule Bayes, Thoas (763 An essay towards soling a proble in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:37-48 Copyright Andrew W. Moore Slide 32 6

Using Bayes Rule to Gable $. The Win enelope has a dollar and four beads in it The Lose enelope has three beads and no oney Triial question: soeone draws an enelope at rando and offers to sell it to you. How uch should you pay? Copyright Andrew W. Moore Slide 33 Using Bayes Rule to Gable $. The Win enelope has a dollar and four beads in it The Lose enelope has three beads and no oney Interesting question: before deciding, you are allowed to see one bead drawn fro the enelope. Suppose it s black: How uch should you pay? Suppose it s red: How uch should you pay? Copyright Andrew W. Moore Slide 34 7

Calculation $. Copyright Andrew W. Moore Slide 35 More General Fors of Bayes Rule B A A A B B A A + B ~ A ~ A B A X A X A B X B X Copyright Andrew W. Moore Slide 36 8

9 Copyright Andrew W. Moore Slide 37 More General Fors of Bayes Rule n A k k k i i i P A A P B P A A P B B P A ( ( ( ( ( Copyright Andrew W. Moore Slide 38 Useful Easy-to-proe facts ( ( + B A P B A P ( n A k k B A P

The Joint Distribution Recipe for aking a joint distribution of M ariables: Exaple: Boolean ariables A, B, C Copyright Andrew W. Moore Slide 39 The Joint Distribution Recipe for aking a joint distribution of M ariables:. Make a truth table listing all cobinations of alues of your ariables (if there are M Boolean ariables then the table will hae 2 M rows. A Exaple: Boolean ariables A, B, C B C Copyright Andrew W. Moore Slide 4 2

The Joint Distribution Recipe for aking a joint distribution of M ariables:. Make a truth table listing all cobinations of alues of your ariables (if there are M Boolean ariables then the table will hae 2 M rows. 2. For each cobination of alues, say how probable it is. A Exaple: Boolean ariables A, B, C B C Prob.3.5..5.5..25. Copyright Andrew W. Moore Slide 4 The Joint Distribution Recipe for aking a joint distribution of M ariables:. Make a truth table listing all cobinations of alues of your ariables (if there are M Boolean ariables then the table will hae 2 M rows. 2. For each cobination of alues, say how probable it is. 3. If you subscribe to the axios of probability, those nubers ust su to. A A Exaple: Boolean ariables A, B, C B.5 C..25 Prob.3.5..5.5..25...5.5 C.3 B. Copyright Andrew W. Moore Slide 42 2

Using the Joint One you hae the JD you can ask for the probability of any logical expression inoling your attribute E rows atching E row Copyright Andrew W. Moore Slide 43 Using the Joint Poor Male.4654 E rows atching E row Copyright Andrew W. Moore Slide 44 22

Using the Joint Poor.764 E rows atching E row Copyright Andrew W. Moore Slide 45 Inference with the Joint E E 2 E E2 E 2 rows atching E and E rows atching E row 2 2 row Copyright Andrew W. Moore Slide 46 23

Inference with the Joint E E 2 E E2 E 2 rows atching E and E rows atching E row 2 2 row Male Poor.4654 /.764.62 Copyright Andrew W. Moore Slide 47 Inference is a big deal I e got this eidence. What s the chance that this conclusion is true? I e got a sore neck: how likely a I to hae eningitis? I see y lights are out and it s 9p. What s the chance y spouse is already asleep? Copyright Andrew W. Moore Slide 48 24

Inference is a big deal I e got this eidence. What s the chance that this conclusion is true? I e got a sore neck: how likely a I to hae eningitis? I see y lights are out and it s 9p. What s the chance y spouse is already asleep? Copyright Andrew W. Moore Slide 49 Inference is a big deal I e got this eidence. What s the chance that this conclusion is true? I e got a sore neck: how likely a I to hae eningitis? I see y lights are out and it s 9p. What s the chance y spouse is already asleep? There s a thriing set of industries growing based around Bayesian Inference. Highlights are: Medicine, Phara, Help Desk Support, Engine Fault Diagnosis Copyright Andrew W. Moore Slide 5 25

Where do Joint Distributions coe fro? Idea One: Expert Huans Idea Two: Sipler probabilistic facts and soe algebra Exaple: Suppose you knew A.7 B A.2 B ~A. C A^B. C A^~B.8 C ~A^B.3 C ~A^~B. Then you can autoatically copute the JD using the chain rule Ax ^ By ^ Cz Cz Ax^ By By Ax Ax In another lecture: Bayes Nets, a systeatic way to do this. Copyright Andrew W. Moore Slide 5 Where do Joint Distributions coe fro? Idea Three: Learn the fro data! Prepare to see one of the ost ipressie learning algoriths you ll coe across in the entire course. Copyright Andrew W. Moore Slide 52 26

Learning a joint distribution Build a JD table for your attributes in which the probabilities are unspecified A B C Prob???????? Fraction of all records in which A and B are True but C is False The fill in each row with records atching row Pˆ (row total nuber of records A B C Prob.3.5..5.5..25. Copyright Andrew W. Moore Slide 53 Exaple of Learning a Joint This Joint was obtained by learning fro three attributes in the UCI Adult Census Database [Kohai 995] Copyright Andrew W. Moore Slide 54 27

Where are we? We hae recalled the fundaentals of probability We hae becoe content with what JDs are and how to use the And we een know how to learn JDs fro data. Copyright Andrew W. Moore Slide 55 Density Estiation Our Joint Distribution learner is our first exaple of soething called Density Estiation A Density Estiator learns a apping fro a set of attributes to a Probability Input Attributes Density Estiator Probability Copyright Andrew W. Moore Slide 56 28

Density Estiation Copare it against the two other ajor kinds of odels: Input Attributes Classifier Prediction of categorical output Input Attributes Density Estiator Probability Input Attributes Regressor Prediction of real-alued output Copyright Andrew W. Moore Slide 57 Ealuating Density Estiation Test-set criterion for estiating perforance on future data* * See the Decision Tree or Cross Validation lecture for ore detail Input Attributes Classifier Prediction of categorical output Test set Accuracy Input Attributes Density Estiator Probability? Input Attributes Regressor Prediction of real-alued output Test set Accuracy Copyright Andrew W. Moore Slide 58 29

Ealuating a density estiator Gien a record x, a density estiator M can tell you how likely the record is: Pˆ ( x M Gien a dataset with R records, a density estiator can tell you how likely the dataset is: (Under the assuption that all records were independently generated fro the Density Estiator s JD Pˆ (dataset M Pˆ( x R x2 K xr M Pˆ( xk M k Copyright Andrew W. Moore Slide 59 A sall dataset: Miles Per Gallon pg odelyear aker 92 Training Set Records good 75to78 asia bad 7to74 aerica bad 75to78 europe bad 7to74 aerica bad 7to74 aerica bad 7to74 asia bad 7to74 asia bad 75to78 aerica : : : : : : : : : bad 7to74 aerica good 79to83 aerica bad 75to78 aerica good 79to83 aerica bad 75to78 aerica good 79to83 aerica good 79to83 aerica bad 7to74 aerica good 75to78 europe bad 75to78 europe Fro the UCI repository (thanks to Ross Quinlan Copyright Andrew W. Moore Slide 6 3

A sall dataset: Miles Per Gallon pg odelyear aker 92 Training Set Records good 75to78 asia bad 7to74 aerica bad 75to78 europe bad 7to74 aerica bad 7to74 aerica bad 7to74 asia bad 7to74 asia bad 75to78 aerica : : : : : : : : : bad 7to74 aerica good 79to83 aerica bad 75to78 aerica good 79to83 aerica bad 75to78 aerica good 79to83 aerica good 79to83 aerica bad 7to74 aerica good 75to78 europe bad 75to78 europe Copyright Andrew W. Moore Slide 6 A sall dataset: Miles Per Gallon pg odelyear aker 92 Training Set Records Pˆ(dataset M good 75to78 asia bad 7to74 aerica bad 75to78 europe bad 7to74 aerica bad 7to74 aerica bad 7to74 asia bad 7to74 asia bad 75to78 aerica : : : : : : : : : bad 7to74 aerica good 79to83 aerica bad 75to78P ˆ( aerica x good 79to83 aerica bad 75to78 aerica good 79to83 aerica good 79to83 (in this aerica bad 7to74 aerica good 75to78 europe bad 75to78 europe R x2 K x M k -23 case 3.4 R Pˆ( xk M Copyright Andrew W. Moore Slide 62 3

Log Probabilities Since probabilities of datasets get so sall we usually use log probabilities log Pˆ(dataset M log R k R Pˆ( xk M log Pˆ( xk M k Copyright Andrew W. Moore Slide 63 A sall dataset: Miles Per Gallon pg odelyear aker 92 Training Set Records log Pˆ(dataset good 75to78 asia bad 7to74 aerica bad 75to78 europe bad 7to74 aerica bad 7to74 aerica bad 7to74 asia bad 7to74 asia bad 75to78 aerica : : : : : : : : : bad 7to74 aerica good 79to83 aerica bad M75to78 aerica log good 79to83 aerica bad 75to78 aerica 79to83 aerica good 79to83 (in this aerica bad 7to74 aerica good 75to78 europe bad 75to78 europe R Pˆ( xk M log Pˆ( xk M k k case 466.9 R Copyright Andrew W. Moore Slide 64 32

Suary: The Good News We hae a way to learn a Density Estiator fro data. Density estiators can do any good things Can sort the records by probability, and thus spot weird records (anoaly detection Can do inference: E E2 Autoatic Doctor / Help Desk etc Ingredient for Bayes Classifiers (see later Copyright Andrew W. Moore Slide 65 Suary: The Bad News Density estiation by directly learning the joint is triial, indless and dangerous Copyright Andrew W. Moore Slide 66 33

Using a test set An independent test set with 96 cars has a worse log likelihood (actually it s a billion quintillion quintillion quintillion quintillion ties less likely.density estiators can oerfit. And the full joint density estiator is the oerfittiest of the all! Copyright Andrew W. Moore Slide 67 Oerfitting Density Estiators If this eer happens, it eans there are certain cobinations that we learn are ipossible R log Pˆ(testset M log Pˆ( xk M log Pˆ( xk M k k if for any k Pˆ( x M k R Copyright Andrew W. Moore Slide 68 34

Using a test set The only reason that our test set didn t score -infinity is that y code is hard-wired to always predict a probability of at least one in 2 We need Density Estiators that are less prone to oerfitting Copyright Andrew W. Moore Slide 69 Naïe Density Estiation The proble with the Joint Estiator is that it just irrors the training data. We need soething which generalizes ore usefully. The naïe odel generalizes strongly: Assue that each attribute is distributed independently of any of the other attributes. Copyright Andrew W. Moore Slide 7 35

Independently Distributed Data Let x[i] denote the i th field of record x. The independently distributed assuption says that for any i,, u u 2 u i- u i+ u M x[ i] x[] u, x[2] u2, Kx[ i ] ui, x[ i + ] ui+, Kx[ M ] u x[ i] Or in other words, x[i] is independent of {x[],x[2],..x[i-], x[i+], x[m]} This is often written as x[ i] { x[], x[2], K x[ i ], x[ i + ], Kx[ M ]} M Copyright Andrew W. Moore Slide 7 A note about independence Assue A and B are Boolean Rando Variables. Then A and B are independent if and only if A B A A and B are independent is often notated as A B Copyright Andrew W. Moore Slide 72 36

Independence Theores Assue A B A Then A^B Assue A B A Then B A A B B Copyright Andrew W. Moore Slide 73 Independence Theores Assue A B A Then ~A B Assue A B A Then A ~B ~A A Copyright Andrew W. Moore Slide 74 37

Multialued Independence For ultialued Rando Variables A and B, A B if and only if u, : A u B A u fro which you can then proe things like u, : A u B A u B u, : B A B Copyright Andrew W. Moore Slide 75 Back to Naïe Density Estiation Let x[i] denote the i th field of record x: Naïe DE assues x[i] is independent of {x[],x[2],..x[i-], x[i+], x[m]} Exaple: Suppose that each record is generated by randoly shaking a green dice and a red dice Dataset : A red alue, B green alue Dataset 2: A red alue, B su of alues Dataset 3: A su of alues, B difference of alues Which of these datasets iolates the naïe assuption? Copyright Andrew W. Moore Slide 76 38

Using the Naïe Distribution Once you hae a Naïe Distribution you can easily copute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is A^~B^C^~D? Copyright Andrew W. Moore Slide 77 Using the Naïe Distribution Once you hae a Naïe Distribution you can easily copute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is A^~B^C^~D? A ~B^C^~D ~B^C^~D A ~B^C^~D A ~B C^~D C^~D A ~B C^~D A ~B C ~D ~D A ~B C ~D Copyright Andrew W. Moore Slide 78 39

Naïe Distribution General Case Suppose x[], x[2], x[m] are independently distributed. M, x[2] u2, x[ M ] um x[ k] u k k x[] u K So if we hae a Naïe Distribution we can construct any row of the iplied Joint Distribution on deand. So we can do any inference But how do we learn a Naïe Density Estiator? Copyright Andrew W. Moore Slide 79 Learning a Naïe Density Estiator Pˆ ( x[ i] u # records in which x[ i] u total nuber of records Another triial learning algorith! Copyright Andrew W. Moore Slide 8 4

Joint DE Contrast Naïe DE Can odel anything No proble to odel C is a noisy copy of A Gien records and ore than 6 Boolean attributes will screw up badly Can odel only ery boring distributions Outside Naïe s scope Gien records and, ultialued attributes will be fine Copyright Andrew W. Moore Slide 8 Epirical Results: Hopeless The hopeless dataset consists of 4, records and 2 Boolean attributes called a,b,c, u. Each attribute in each record is generated 5-5 randoly as or. Aerage test set log probability during folds of k-fold cross-alidation* Described in a future Andrew lecture Despite the ast aount of data, Joint oerfits hopelessly and does uch worse Copyright Andrew W. Moore Slide 82 4

Epirical Results: Logical The logical dataset consists of 4, records and 4 Boolean attributes called a,b,c,d where a,b,c are generated 5-5 randoly as or. D A^~C, except that in % of records it is flipped The DE learned by Joint The DE learned by Naie Copyright Andrew W. Moore Slide 83 Epirical Results: Logical The logical dataset consists of 4, records and 4 Boolean attributes called a,b,c,d where a,b,c are generated 5-5 randoly as or. D A^~C, except that in % of records it is flipped The DE learned by Joint The DE learned by Naie Copyright Andrew W. Moore Slide 84 42

Epirical Results: MPG The MPG dataset consists of 392 records and 8 attributes A tiny part of the DE learned by Joint The DE learned by Naie Copyright Andrew W. Moore Slide 85 Epirical Results: MPG The MPG dataset consists of 392 records and 8 attributes A tiny part of the DE learned by Joint The DE learned by Naie Copyright Andrew W. Moore Slide 86 43

Epirical Results: Weight s. MPG Suppose we train only fro the Weight and MPG attributes The DE learned by Joint The DE learned by Naie Copyright Andrew W. Moore Slide 87 Epirical Results: Weight s. MPG Suppose we train only fro the Weight and MPG attributes The DE learned by Joint The DE learned by Naie Copyright Andrew W. Moore Slide 88 44

Weight s. MPG : The best that Naïe can do The DE learned by Joint The DE learned by Naie Copyright Andrew W. Moore Slide 89 Reinder: The Good News We hae two ways to learn a Density Estiator fro data. *In other lectures we ll see astly ore ipressie Density Estiators (Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and any ore Density estiators can do any good things Anoaly detection Can do inference: E E2 Autoatic Doctor / Help Desk etc Ingredient for Bayes Classifiers Copyright Andrew W. Moore Slide 9 45

Bayes Classifiers A foridable and sworn eney of decision trees Input Attributes Classifier Prediction of categorical output DT BC Copyright Andrew W. Moore Slide 9 How to build a Bayes Classifier Assue you want to predict output which has arity n and alues, 2, ny. Assue there are input attributes called X, X 2, X Break dataset into n saller datasets called DS, DS 2, DS ny. Define DS i Records in which i For each DS i, learn Density Estiator M i to odel the input distribution aong the i records. Copyright Andrew W. Moore Slide 92 46

How to build a Bayes Classifier Assue you want to predict output which has arity n and alues, 2, ny. Assue there are input attributes called X, X 2, X Break dataset into n saller datasets called DS, DS 2, DS ny. Define DS i Records in which i For each DS i, learn Density Estiator M i to odel the input distribution aong the i records. M i estiates X, X 2, X i Copyright Andrew W. Moore Slide 93 How to build a Bayes Classifier Assue you want to predict output which has arity n and alues, 2, ny. Assue there are input attributes called X, X 2, X Break dataset into n saller datasets called DS, DS 2, DS ny. Define DS i Records in which i For each DS i, learn Density Estiator M i to odel the input distribution aong the i records. M i estiates X, X 2, X i Idea: When a new set of input alues (X u, X 2 u 2,. X u coe along to be ealuated predict the alue of that akes X, X 2, X i ost likely predict Is this a good idea? argax X u L X u Copyright Andrew W. Moore Slide 94 47

How to build a Bayes Classifier Assue you want to predict output which has arity n and alues, 2, ny. Assue there are input attributes This called is a Maxiu X, X 2, X Likelihood classifier. Break dataset into n saller datasets called DS, DS 2, DS ny. Define DS i Records in which It i can get silly if soe s are For each DS i, learn Density Estiator M i to odel the input ery unlikely distribution aong the i records. M i estiates X, X 2, X i Idea: When a new set of input alues (X u, X 2 u 2,. X u coe along to be ealuated predict the alue of that akes X, X 2, X i ost likely predict Is this a good idea? argax X u L X u Copyright Andrew W. Moore Slide 95 How to build a Bayes Classifier Assue you want to predict output which has arity n and alues, 2, ny. Assue there are input attributes called X, X 2, X Break dataset into n saller datasets called DS, DS 2, DS ny. Define DS i Records in which i Much Better Idea For each DS i, learn Density Estiator M i to odel the input distribution aong the i records. M i estiates X, X 2, X i Idea: When a new set of input alues (X u, X 2 u 2,. X u coe along to be ealuated predict the alue of that akes i X, X 2, X ost likely predict argax X u L X u Is this a good idea? Copyright Andrew W. Moore Slide 96 48

Terinology MLE (Maxiu Likelihood Estiator: predict argax X u L X u MAP (Maxiu A-Posteriori Estiator: predict argax X u L X u Copyright Andrew W. Moore Slide 97 Getting what we need predict argax X u L X u Copyright Andrew W. Moore Slide 98 49

5 Copyright Andrew W. Moore Slide 99 Getting a posterior probability n j j j P u X u X P P u X u X P u X u X P P u X u X P u X u X P ( ( ( ( ( ( ( ( L L L L L Copyright Andrew W. Moore Slide Bayes Classifiers in a nutshell ( ( argax ( argax predict P u X u X P u X u X P L L. Learn the distribution oer inputs for each alue. 2. This gies X, X 2, X i. 3. Estiate i. as fraction of records with i. 4. For a new prediction:

Bayes Classifiers in a nutshell. Learn the distribution oer inputs for each alue. 2. This gies X, X 2, X i. 3. Estiate i. as fraction of records with i. 4. For a new prediction: We can use our faorite Density Estiator here. predict argax argax X u L X X u u L X Right now we hae two options: u Joint Density Estiator Naïe Density Estiator Copyright Andrew W. Moore Slide Joint Density Bayes Classifier predict argax X u L X u In the case of the joint Bayes Classifier this degenerates to a ery siple rule: predict the ost coon alue of aong records in which X u, X 2 u 2,. X u. Note that if no records hae the exact set of inputs X u, X 2 u 2,. X u, then X, X 2, X i for all alues of. In that case we just hae to guess s alue Copyright Andrew W. Moore Slide 2 5

Joint BC Results: Logical The logical dataset consists of 4, records and 4 Boolean attributes called a,b,c,d where a,b,c are generated 5-5 randoly as or. D A^~C, except that in % of records it is flipped The Classifier learned by Joint BC Copyright Andrew W. Moore Slide 3 Joint BC Results: All Irreleant The all irreleant dataset consists of 4, records and 5 Boolean attributes called a,b,c,d..o where a,b,c are generated 5-5 randoly as or. (output with probability.75, with prob.25 Copyright Andrew W. Moore Slide 4 52

53 Copyright Andrew W. Moore Slide 5 Naïe Bayes Classifier ( ( argax predict P u X u X P L In the case of the naie Bayes Classifier this can be siplified: n j j j u X P P predict ( ( argax Copyright Andrew W. Moore Slide 6 Naïe Bayes Classifier ( ( argax predict P u X u X P L In the case of the naie Bayes Classifier this can be siplified: n j j j u X P P predict ( ( argax Technical Hint: If you hae, input attributes that product will underflow in floating point ath. ou should use logs: + n j j j u X P P predict ( log ( log argax

BC Results: XOR The XOR dataset consists of 4, records and 2 Boolean inputs called a and b, generated 5-5 randoly as or. c (output a XOR b The Classifier learned by Joint BC The Classifier learned by Naie BC Copyright Andrew W. Moore Slide 7 Naie BC Results: Logical The logical dataset consists of 4, records and 4 Boolean attributes called a,b,c,d where a,b,c are generated 5-5 randoly as or. D A^~C, except that in % of records it is flipped The Classifier learned by Naie BC Copyright Andrew W. Moore Slide 8 54

Naie BC Results: Logical The logical dataset consists of 4, records and 4 Boolean attributes called a,b,c,d where a,b,c are generated 5-5 randoly as or. D A^~C, except that in % of records it is flipped This result surprised Andrew until he had thought about it a little The Classifier learned by Joint BC Copyright Andrew W. Moore Slide 9 Naïe BC Results: All Irreleant The all irreleant dataset consists of 4, records and 5 Boolean attributes called a,b,c,d..o where a,b,c are generated 5-5 randoly as or. (output with probability.75, with prob.25 The Classifier learned by Naie BC Copyright Andrew W. Moore Slide 55

BC Results: MPG : 392 records The Classifier learned by Naie BC Copyright Andrew W. Moore Slide BC Results: MPG : 4 records Copyright Andrew W. Moore Slide 2 56

More Facts About Bayes Classifiers Many other density estiators can be slotted in*. Density estiation can be perfored with real-alued inputs* Bayes Classifiers can be built with real-alued inputs* Rather Technical Coplaint: Bayes Classifiers don t try to be axially discriinatie---they erely try to honestly odel what s going on* Zero probabilities are painful for Joint and Naïe. A hack (justifiable with the agic words Dirichlet Prior can help*. Naïe Bayes is wonderfully cheap. And suries, attributes cheerfully! *See future Andrew Lectures Copyright Andrew W. Moore Slide 3 Probability What you should know Fundaentals of Probability and Bayes Rule What s a Joint Distribution How to do inference (i.e. E E2 once you hae a JD Density Estiation What is DE and what is it good for How to learn a Joint DE How to learn a naïe DE Copyright Andrew W. Moore Slide 4 57

What you should know Bayes Classifiers How to build one How to predict with a BC Contrast between naïe and joint BCs Copyright Andrew W. Moore Slide 5 Interesting Questions Suppose you were ealuating NaieBC, JointBC, and Decision Trees Inent a proble where only NaieBC would do well Inent a proble where only Dtree would do well Inent a proble where only JointBC would do well Inent a proble where only NaieBC would do poorly Inent a proble where only Dtree would do poorly Inent a proble where only JointBC would do poorly Copyright Andrew W. Moore Slide 6 58