Learning Bayesian Networks: Naïve Bayes

Learning Bayesian Networks: Naïve and non-Naïve Bayes. Hypothesis space: fixed size, stochastic, continuous parameters. Learning algorithm: direct computation, eager, batch.

Multivariate Gaussian Classifier. The multivariate Gaussian classifier is equivalent to a simple Bayesian network y → x. It models the joint distribution P(x, y) under the assumption that the class-conditional distributions P(x | y) are multivariate Gaussians. P(y): multinomial random variable (a K-sided coin). P(x | y = k): multivariate Gaussian with mean µ_k and covariance matrix Σ_k.
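
A minimal sketch of this joint model for one class k, with hypothetical parameter values (the prior, mean, and covariance below are made up for illustration):

```python
import numpy as np

# Hypothetical parameters for one class k of the multivariate Gaussian classifier.
p_y = 0.4                              # P(y = k), one entry of the multinomial
mu_k = np.array([0.0, 1.0])            # class-conditional mean
Sigma_k = np.array([[1.0, 0.3],
                    [0.3, 2.0]])       # class-conditional covariance matrix

def joint_density(x):
    """P(x, y = k) = P(y = k) * N(x; mu_k, Sigma_k)."""
    d = len(mu_k)
    diff = x - mu_k
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma_k))
    return p_y * norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma_k) @ diff)

print(joint_density(np.array([0.5, 0.5])))
```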

Naïve Bayes Model. The network is y → x_1, x_2, x_3, …, x_n. Each node contains a probability table: for y, P(y = k); for x_j, the class-conditional probability P(x_j = v | y = k). Interpret it as a generative model: choose the class k according to P(y = k), then generate each feature independently according to P(x_j = v | y = k). The feature values are conditionally independent given the class: P(x_i, x_j | y) = P(x_i | y) P(x_j | y).
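
A minimal sketch of this generative view, using a hypothetical prior and hypothetical per-feature conditional tables (none of these numbers come from the slides):

```python
import random

# Hypothetical Naive Bayes model: K = 2 classes, 2 binary features.
prior = {0: 0.6, 1: 0.4}                       # P(y = k)
cond = {                                        # P(x_j = 1 | y = k), per feature j
    0: [0.2, 0.7],
    1: [0.8, 0.3],
}

def sample_example():
    """Draw y from P(y), then draw each feature independently from P(x_j | y)."""
    y = random.choices(list(prior), weights=list(prior.values()))[0]
    x = [1 if random.random() < p else 0 for p in cond[y]]
    return x, y

print(sample_example())
```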

Representing P(x_j | y). Many representations are possible. Univariate Gaussian: if x_j is a continuous random variable, then we can use a normal distribution and learn the mean µ and variance σ². Multinomial: if x_j is a discrete random variable, x_j ∈ {v_1, …, v_m}, then we construct the conditional probability table whose entry in row x_j = v_i and column y = k is P(x_j = v_i | y = k), for i = 1, …, m and k = 1, …, K. Discretization: convert continuous x_j into a discrete variable. Kernel density estimates: apply a kind of nearest-neighbor algorithm to compute P(x_j | y) in the neighborhood of the query point.
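
A minimal sketch contrasting two of these representations for a single feature x_j, with hypothetical fitted parameters (the table entries, means, and variances are made up):

```python
import math

# Multinomial representation: conditional probability table P(x_j = v | y = k).
cpt = {
    "low":  {1: 0.5, 2: 0.1},
    "high": {1: 0.5, 2: 0.9},
}

def p_discrete(v, k):
    """Look up P(x_j = v | y = k) in the conditional probability table."""
    return cpt[v][k]

# Univariate Gaussian representation: a mean and variance learned per class.
mu = {1: 0.0, 2: 3.0}
var = {1: 1.0, 2: 2.0}

def p_continuous(x, k):
    """Evaluate the normal density N(x; mu_k, sigma_k^2) for P(x_j = x | y = k)."""
    return math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) / math.sqrt(2 * math.pi * var[k])

print(p_discrete("high", 2), p_continuous(1.5, 1))
```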

Discretization via Mutual Information. Many discretization algorithms have been studied; one of the best is mutual information discretization. To discretize feature x_j, grow a decision tree considering only splits on x_j. Each leaf of the resulting tree will correspond to a single value of the discretized x_j. Stopping rule (applied at each node): stop when I(x_j; y) < log₂(N − 1)/N + Δ/N, where Δ = log₂(3^K − 2) − [K·H(S) − K_l·H(S_l) − K_r·H(S_r)], S is the training data in the parent node, S_l and S_r are the examples in the left and right child, K, K_l, and K_r are the corresponding numbers of classes present in these examples, I is the mutual information, H is the entropy, and N is the number of examples in the node.
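
A minimal sketch of this stopping rule, assuming the candidate split and the class labels in the parent and child nodes are given as plain lists (the example labels below are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(S), in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def stop_splitting(parent, left, right):
    """Return True if the split should be rejected:
    I(x_j; y) < log2(N - 1)/N + Delta/N."""
    N = len(parent)
    H_S, H_l, H_r = entropy(parent), entropy(left), entropy(right)
    # Mutual information between the split and the class (information gain).
    info_gain = H_S - (len(left) / N) * H_l - (len(right) / N) * H_r
    K, K_l, K_r = (len(set(s)) for s in (parent, left, right))
    delta = math.log2(3 ** K - 2) - (K * H_S - K_l * H_l - K_r * H_r)
    return info_gain < math.log2(N - 1) / N + delta / N

# Hypothetical class labels in a parent node and its two candidate children.
parent = [1, 1, 1, 2, 2, 2, 2, 1]
print(stop_splitting(parent, parent[:4], parent[4:]))
```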

Kernel Density Estimators. Define K(x_j, x_{i,j}) = (1/(σ√(2π))) · exp(−(x_j − x_{i,j})² / (2σ²)) to be the Gaussian kernel with parameter σ. Estimate P(x_j | y = k) = (1/N_k) · Σ_{i: y_i = k} K(x_j, x_{i,j}), where N_k is the number of training examples in class k.
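
A minimal sketch of this estimator, assuming a hypothetical bandwidth σ and a small hypothetical set of feature values from class k:

```python
import math

def gaussian_kernel(x, xi, sigma):
    """Gaussian kernel K(x_j, x_{i,j}) with bandwidth sigma."""
    return math.exp(-(x - xi) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def kde_class_conditional(x, xs_in_class, sigma=0.5):
    """P(x_j | y = k): the average of Gaussian bumps centred on the N_k
    training values of feature x_j belonging to class k."""
    return sum(gaussian_kernel(x, xi, sigma) for xi in xs_in_class) / len(xs_in_class)

# Hypothetical class-k feature values; evaluate the density at a query point.
xs_class_k = [0.9, 1.1, 1.4, 2.8, 3.0]
print(kde_class_conditional(1.2, xs_class_k, sigma=0.5))
```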

Kernel Density Estimators (2). This is equivalent to placing a Gaussian bump of height 1/N_k on each training data point from class k and then adding them up. (Figure: the individual Gaussian bumps, P(x_j | y) plotted against x_j.)

Kernel Density Estimators. Resulting probability density. (Figure: the summed density P(x_j | y) plotted against x_j.)

The value chosen for σ is critical. (Figures: kernel density estimates with σ = 0.15 and σ = 0.50.)

Naïve Bayes learns a Linear Threshold Unit. For multinomial and discretized attributes (but not Gaussian), Naïve Bayes gives a linear decision boundary. P(x | Y = y) = P(x_1 = v_1 | Y = y) · P(x_2 = v_2 | Y = y) ⋯ P(x_n = v_n | Y = y). Define a discriminant function for class 1 versus class K: h(x) = P(Y = 1 | X) / P(Y = K | X) = [P(x_1 = v_1 | Y = 1) / P(x_1 = v_1 | Y = K)] ⋯ [P(x_n = v_n | Y = 1) / P(x_n = v_n | Y = K)] · [P(Y = 1) / P(Y = K)].

Log of Odds Ratio. log[P(y = 1 | x) / P(y = K | x)] = log{ [P(x_1 = v_1 | y = 1) / P(x_1 = v_1 | y = K)] ⋯ [P(x_n = v_n | y = 1) / P(x_n = v_n | y = K)] · [P(y = 1) / P(y = K)] } = log[P(x_1 = v_1 | y = 1) / P(x_1 = v_1 | y = K)] + … + log[P(x_n = v_n | y = 1) / P(x_n = v_n | y = K)] + log[P(y = 1) / P(y = K)]. Suppose each x_j is binary and define α_{j,0} = log[P(x_j = 0 | y = 1) / P(x_j = 0 | y = K)] and α_{j,1} = log[P(x_j = 1 | y = 1) / P(x_j = 1 | y = K)].

Log Odds (2). Now rewrite as log[P(y = 1 | x) / P(y = K | x)] = Σ_j (α_{j,1} − α_{j,0}) x_j + Σ_j α_{j,0} + log[P(y = 1) / P(y = K)]. We classify into class 1 if this is ≥ 0 and into class K otherwise.
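
A minimal sketch of this rewrite for binary features, using hypothetical class-conditional probabilities; it forms the weights w_j = α_{j,1} − α_{j,0} and the bias, and checks that the linear form matches the direct log-odds:

```python
import math

# Hypothetical parameters for classes 1 and K over 3 binary features.
prior = {1: 0.3, "K": 0.7}                          # P(y)
p_x1 = {1: [0.9, 0.2, 0.6], "K": [0.4, 0.5, 0.1]}   # P(x_j = 1 | y), per feature j

alpha0 = [math.log((1 - p_x1[1][j]) / (1 - p_x1["K"][j])) for j in range(3)]
alpha1 = [math.log(p_x1[1][j] / p_x1["K"][j]) for j in range(3)]

# Linear threshold unit obtained from the log-odds rewrite.
w = [a1 - a0 for a1, a0 in zip(alpha1, alpha0)]
b = sum(alpha0) + math.log(prior[1] / prior["K"])

def log_odds_linear(x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def log_odds_direct(x):
    total = math.log(prior[1] / prior["K"])
    for j, xj in enumerate(x):
        p1 = p_x1[1][j] if xj else 1 - p_x1[1][j]
        pK = p_x1["K"][j] if xj else 1 - p_x1["K"][j]
        total += math.log(p1 / pK)
    return total

x = [1, 0, 1]
print(log_odds_linear(x), log_odds_direct(x))       # agree up to rounding
print("class 1" if log_odds_linear(x) >= 0 else "class K")
```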

Learning the Probability Distributions by Direct Computation. P(y = k) is just the fraction of training examples belonging to class k. For multinomial variables, P(x_j = v | y = k) is the fraction of training examples in class k where x_j = v. For Gaussian variables, µ̂_{jk} is the average value of x_j for training examples in class k, and σ̂_{jk} is the sample standard deviation of those points: σ̂_{jk} = sqrt( (1/N_k) · Σ_{i: y_i = k} (x_{i,j} − µ̂_{jk})² ).
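
A minimal sketch of these direct estimates on a tiny hypothetical dataset, computing the class priors, a multinomial table for a discrete feature, and the Gaussian mean and standard deviation for a continuous feature:

```python
import math
from collections import Counter

# Hypothetical training data: (discrete feature, continuous feature, class label).
data = [("a", 1.0, 1), ("a", 1.2, 1), ("b", 0.8, 1), ("b", 3.1, 2), ("a", 2.9, 2)]

N = len(data)
N_k = Counter(y for _, _, y in data)
prior = {k: N_k[k] / N for k in N_k}                     # P(y = k)

# Multinomial: P(x_disc = v | y = k) = fraction of class-k examples with value v.
values = {xd for xd, _, _ in data}
p_disc = {k: {v: sum(1 for xd, _, y in data if y == k and xd == v) / N_k[k]
              for v in values}
          for k in N_k}

# Gaussian: per-class mean and standard deviation of the continuous feature.
mu = {k: sum(xc for _, xc, y in data if y == k) / N_k[k] for k in N_k}
sigma = {k: math.sqrt(sum((xc - mu[k]) ** 2 for _, xc, y in data if y == k) / N_k[k])
         for k in N_k}

print(prior, p_disc, mu, sigma)
```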

Improved Probability Estimates via Laplace Corrections. When we have very little training data, direct probability computation can give probabilities of 0 or 1. Such extreme probabilities are too strong and cause problems. Suppose we are estimating a probability P(z) and we have n_0 examples where z is false and n_1 examples where z is true. Our direct estimate is P(z = 1) = n_1 / (n_0 + n_1). Laplace estimate: add 1 to the numerator and 2 to the denominator, giving P(z = 1) = (n_1 + 1) / (n_0 + n_1 + 2). This says that in the absence of any evidence, we expect P(z) = 0.5, but our belief is weak (equivalent to 1 example for each outcome). Generalized Laplace estimate: if z has K different outcomes, then we estimate P(z = 1) = (n_1 + 1) / (n_0 + n_1 + ⋯ + n_{K−1} + K).
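
A minimal sketch of the direct and Laplace-corrected estimates on hypothetical counts, showing how the correction pulls an extreme probability away from 0:

```python
from collections import Counter

def direct_estimate(counts, outcome):
    """P(z = outcome) as the raw fraction of observations."""
    return counts[outcome] / sum(counts.values())

def laplace_estimate(counts, outcome, K):
    """Generalized Laplace correction: add 1 to the count, K to the denominator."""
    return (counts[outcome] + 1) / (sum(counts.values()) + K)

# Hypothetical counts: outcome 1 was never observed among 5 examples.
counts = Counter({0: 5, 1: 0})
print(direct_estimate(counts, 1))         # 0.0, too strong a conclusion
print(laplace_estimate(counts, 1, K=2))   # 1/7, a softened estimate
```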

Naïve Bayes Applied to Diabetes Diagnosis. Bayes nets and causality: Bayes nets work best when arrows follow the direction of causality. Two things with a common cause are likely to be conditionally independent given the cause; arrows in the causal direction capture this independence. In a Naïve Bayes network, arrows are often not in the causal direction: diabetes does not cause pregnancies, and diabetes does not cause age. But some arrows are correct: diabetes does cause the level of blood insulin and blood glucose.

Non-Naïve Bayes. Manually construct a graph in which all arcs are causal. Learning the probability tables is still easy; for example, P(Mass | Age, Preg) involves counting the number of patients of a given age and number of pregnancies that have a given body mass. Classification: P(D = d | A, P, M, I, G) = P(I | D = d) · P(G | I, D = d) · P(D = d | A, M, P) / P(I, G).
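
A minimal sketch of that counting step, using hypothetical discretized patient records; the table entry P(Mass = m | Age = a, Preg = p) is simply a ratio of counts within each (Age, Preg) group:

```python
from collections import Counter

# Hypothetical discretized records: (Age bin, Preg bin, Mass bin).
records = [("30s", "0-2", "high"), ("30s", "0-2", "low"),
           ("30s", "0-2", "high"), ("40s", "3+", "high")]

joint = Counter(records)                              # counts of (a, p, m)
parent = Counter((a, p) for a, p, _ in records)       # counts of (a, p)

def p_mass_given_age_preg(m, a, p):
    """P(Mass = m | Age = a, Preg = p) estimated by counting."""
    return joint[(a, p, m)] / parent[(a, p)]

print(p_mass_given_age_preg("high", "30s", "0-2"))    # 2/3 under these made-up counts
```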

Evaluation of Naïve Bayes. (Table comparing LMS, Logistic, LDA, Trees, Nets, NNbr, SVM, and NB on the criteria: mixed data, missing values, outliers, monotone transformations, scalability, irrelevant inputs, linear combinations, interpretability, and accuracy.) Naïve Bayes is very popular, particularly in natural language processing and information retrieval, where there are many features compared to the number of examples. In applications with lots of data, Naïve Bayes does not usually perform as well as more sophisticated methods.

Naïve Bayes Summary. Advantages of Bayesian networks: produces stochastic classifiers, which can be combined with utility functions to make optimal decisions; easy to incorporate causal knowledge, and the resulting probabilities are easy to interpret; very simple learning algorithms if all variables are observed in the training data. Disadvantages of Bayesian networks: fixed-size hypothesis space, which may underfit or overfit the data and may not contain any good classifiers if prior knowledge is wrong; harder to handle continuous features.