Statistical learning. Chapter 20 of AI: A Modern Approach, Sections 1-3. Presented by: Jicheng Zhao


Goal
Learn probabilistic theories of the world from experience. We focus on learning Bayesian networks. More specifically: given input data (or evidence), learn probabilistic theories of the world (or hypotheses).

Outline
- Bayesian learning
- Approximate Bayesian learning
  - Maximum a posteriori (MAP) learning
  - Maximum likelihood (ML) learning
- Parameter learning with complete data
  - ML parameter learning with complete data in discrete models
  - ML parameter learning with complete data in continuous models (linear regression)
  - Naive Bayes models
  - Bayesian parameter learning
- Learning Bayes net structure with complete data
- (If time allows) Learning with hidden variables or incomplete data (EM algorithm)

Full Bayesian learning
View learning as Bayesian updating of a probability distribution over the hypothesis space. H is the hypothesis variable with values h_1, h_2, ... and prior P(H). The jth observation d_j gives the outcome of the random variable D_j; the training data are d = d_1, ..., d_N.

Given the data so far, each hypothesis has a posterior probability

    P(h_i | d) = α P(d | h_i) P(h_i)

where P(d | h_i) is called the likelihood.

Predictions use a likelihood-weighted average over all hypotheses:

    P(X | d) = Σ_i P(X | d, h_i) P(h_i | d) = Σ_i P(X | h_i) P(h_i | d)

No need to pick one best-guess hypothesis!

Example
Suppose there are five kinds of bags of candies:
- 10% are h_1: 100% cherry candies
- 20% are h_2: 75% cherry candies + 25% lime candies
- 40% are h_3: 50% cherry candies + 50% lime candies
- 20% are h_4: 25% cherry candies + 75% lime candies
- 10% are h_5: 100% lime candies
Then we observe candies drawn from some bag. What kind of bag is it? What flavour will the next candy be?

Posterior probability of hypotheses

[Figure: posterior probabilities P(h_1 | d), ..., P(h_5 | d) plotted against the number of samples in d (0 to 10); the vertical axis runs from 0 to 1.]

Prediction probability

[Figure: P(next candy is lime | d) plotted against the number of samples in d (0 to 10); the vertical axis runs from 0.4 to 1.]
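To make the updating concrete, here is a minimal Python sketch of the computation behind these two curves. It assumes, as the figures suggest, that every observed candy turns out to be lime; the priors and per-hypothesis lime probabilities come from the example slide.

```python
# Full Bayesian learning for the candy example (assumes all observations are lime).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i)

def posterior(num_limes):
    """P(h_i | d) after observing num_limes lime candies."""
    unnorm = [p * (q ** num_limes) for p, q in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

for n in range(11):
    post = posterior(n)
    # Likelihood-weighted prediction: P(next is lime | d) = sum_i P(lime | h_i) P(h_i | d)
    pred = sum(q * p for q, p in zip(p_lime, post))
    print(n, [round(p, 3) for p in post], round(pred, 3))
```

With no data the prediction is 0.5; as lime observations accumulate it climbs toward 1, matching the second figure.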

Properties of full Bayesian learning
1. The true hypothesis eventually dominates the Bayesian prediction, provided that the true hypothesis is in the prior.
2. The Bayesian prediction is optimal, whether the data set is small or large.
On the other hand:
1. The hypothesis space is usually very large or infinite, so summing over the hypothesis space is often intractable (e.g., there are 18,446,744,073,709,551,616 = 2^64 Boolean functions of 6 attributes).

MAP approximation
Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_i | d), instead of calculating P(h_i | d) for all hypotheses h_i. I.e., maximize P(d | h_i) P(h_i), or equivalently log P(d | h_i) + log P(h_i).

Overfitting in MAP and Bayesian learning: overfitting occurs when the hypothesis space is so expressive that some hypotheses fit the data set too well. Use the prior to penalize complexity.

The log terms can be viewed as the (negative of the) number of bits to encode the data given the hypothesis plus the bits to encode the hypothesis. This is the basic idea of minimum description length (MDL) learning.

For deterministic hypotheses (the simplest case), P(d | h_i) is 1 if h_i is consistent with the data and 0 otherwise, so MAP picks the simplest hypothesis that is consistent with the data.
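A small sketch of the MAP choice for the same candy scenario (again assuming all observed candies are lime); it maximizes log P(d | h_i) + log P(h_i), as above, to avoid numerical underflow.

```python
import math

priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h_i) from the example slide
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i)

def h_map(num_limes):
    """Index of h_MAP = argmax_i log P(d | h_i) + log P(h_i)."""
    scores = []
    for prior, q in zip(priors, p_lime):
        if q == 0.0 and num_limes > 0:
            scores.append(float('-inf'))   # likelihood is exactly zero
        else:
            loglik = num_limes * math.log(q) if num_limes > 0 else 0.0
            scores.append(loglik + math.log(prior))
    return max(range(len(scores)), key=scores.__getitem__)

for n in (0, 1, 2, 3, 10):
    print(n, 'h_MAP = h%d' % (h_map(n) + 1))
```

Unlike the full Bayesian prediction, the MAP prediction then commits to the single hypothesis returned here.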

ML approximation
For large data sets, the prior becomes irrelevant.
Maximum likelihood (ML) learning: choose h_ML maximizing P(d | h_i), i.e., simply get the best fit to the data. This is identical to MAP with a uniform prior (which is reasonable if all hypotheses have the same complexity).
ML is the standard (non-Bayesian) statistical learning method:
1. Researchers distrust the subjective nature of hypothesis priors.
2. Hypotheses are of the same complexity.
3. Hypothesis priors matter less when the data set is large.
4. The hypothesis space is huge.

Outline
- Bayesian learning
- Approximate Bayesian learning
  - Maximum a posteriori (MAP) learning
  - Maximum likelihood (ML) learning
- Parameter learning with complete data
  - ML parameter learning with complete data in discrete models
  - ML parameter learning with complete data in continuous models (linear regression)
  - Naive Bayes models
  - Bayesian parameter learning
- Learning Bayes net structure with complete data
- (If time allows) Learning with hidden variables or incomplete data (EM algorithm)

ML parameter learning in Bayes nets
Bag from a new manufacturer; what fraction θ of cherry candies?
- Any θ is possible: a continuum of hypotheses h_θ.
- θ is a parameter for this simple (binomial) family of models.
- We assume all hypotheses are equally likely a priori, which leads to the ML approach.

[Network: a single node Flavor with P(F = cherry) = θ.]

Suppose we unwrap N candies, c cherries and l = N - c limes. These are i.i.d. (independent, identically distributed) observations, so

    P(d | h_θ) = Π_{j=1}^{N} P(d_j | h_θ) = θ^c (1 - θ)^l

Maximize this w.r.t. θ; it is easier to work with the log-likelihood:

    L(d | h_θ) = log P(d | h_θ) = Σ_{j=1}^{N} log P(d_j | h_θ) = c log θ + l log(1 - θ)

    dL(d | h_θ)/dθ = c/θ - l/(1 - θ) = 0   ⇒   θ = c/(c + l) = c/N

Seems sensible, but causes problems with 0 counts! If the data set is small enough that some events have not yet been observed, the ML hypothesis assigns zero probability to those events. Tricks for dealing with this include initializing the counts for each event to 1 instead of 0.

ML approach:
1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
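A minimal sketch of this recipe for the one-parameter candy model; the observation list is invented, and the '+1' variant implements the count-initialization trick mentioned above (sometimes called add-one smoothing).

```python
# ML estimate theta = c / N for the candy model, with and without add-one counts.
observations = ['cherry', 'cherry', 'lime', 'cherry', 'lime', 'cherry', 'cherry']  # made up

c = observations.count('cherry')
l = observations.count('lime')
N = c + l

theta_ml = c / N                    # maximum-likelihood estimate
theta_add_one = (c + 1) / (N + 2)   # counts initialized to 1: never exactly 0 or 1

print('theta_ML      =', round(theta_ml, 3))
print('theta_add_one =', round(theta_add_one, 3))
```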

Multiple parameters
Red/green wrapper depends probabilistically on flavor.

[Network: Flavor → Wrapper, with P(F = cherry) = θ, P(W = red | F = cherry) = θ_1, P(W = red | F = lime) = θ_2.]

Likelihood for, e.g., a cherry candy in a green wrapper:

    P(F = cherry, W = green | h_{θ,θ_1,θ_2})
        = P(F = cherry | h_{θ,θ_1,θ_2}) P(W = green | F = cherry, h_{θ,θ_1,θ_2})
        = θ (1 - θ_1)

For N candies with r_c red-wrapped cherries, g_c green-wrapped cherries, r_l red-wrapped limes and g_l green-wrapped limes:

    P(d | h_{θ,θ_1,θ_2}) = θ^c (1 - θ)^l · θ_1^{r_c} (1 - θ_1)^{g_c} · θ_2^{r_l} (1 - θ_2)^{g_l}

    L = [c log θ + l log(1 - θ)] + [r_c log θ_1 + g_c log(1 - θ_1)] + [r_l log θ_2 + g_l log(1 - θ_2)]

Multiple parameters contd.
Derivatives of L contain only the relevant parameter:

    ∂L/∂θ   = c/θ - l/(1 - θ) = 0          ⇒  θ   = c/(c + l)
    ∂L/∂θ_1 = r_c/θ_1 - g_c/(1 - θ_1) = 0  ⇒  θ_1 = r_c/(r_c + g_c)
    ∂L/∂θ_2 = r_l/θ_2 - g_l/(1 - θ_2) = 0  ⇒  θ_2 = r_l/(r_l + g_l)

With complete data, parameters can be learned separately: the values for a variable's parameters depend only on observations of that variable and its parents.
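A short sketch of the decoupled estimates above, on made-up (flavor, wrapper) observations; note that each parameter is computed from its own counts only.

```python
# Decoupled ML estimation for the Flavor -> Wrapper network (data invented).
data = [('cherry', 'red'), ('cherry', 'red'), ('cherry', 'green'),
        ('lime', 'green'), ('lime', 'green'), ('lime', 'red')]

c   = sum(1 for f, _ in data if f == 'cherry')
l   = len(data) - c
r_c = sum(1 for f, w in data if f == 'cherry' and w == 'red')
g_c = c - r_c
r_l = sum(1 for f, w in data if f == 'lime' and w == 'red')
g_l = l - r_l

theta  = c / (c + l)         # P(F = cherry)
theta1 = r_c / (r_c + g_c)   # P(W = red | F = cherry)
theta2 = r_l / (r_l + g_l)   # P(W = red | F = lime)

print(round(theta, 3), round(theta1, 3), round(theta2, 3))
```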

ML parameter learning (continuous model)
Hypothesis: learn the parameters (μ and σ) of a Gaussian density on a single variable. Data: x_1, ..., x_N generated from this distribution.

    P(x) = (1 / (√(2π) σ)) e^{-(x - μ)^2 / (2σ^2)}

The log likelihood is

    L = Σ_{j=1}^{N} log [ (1 / (√(2π) σ)) e^{-(x_j - μ)^2 / (2σ^2)} ]
      = N(-log √(2π) - log σ) - Σ_{j=1}^{N} (x_j - μ)^2 / (2σ^2)

Setting the derivatives to zero gives

    μ = (Σ_j x_j) / N        σ = √( Σ_j (x_j - μ)^2 / N )
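A minimal sketch of these closed-form estimates on synthetic data; the "true" mean and standard deviation below are made up for illustration.

```python
import math
import random

random.seed(0)
true_mu, true_sigma = 2.0, 0.5                      # assumed generating parameters
xs = [random.gauss(true_mu, true_sigma) for _ in range(1000)]

N = len(xs)
mu_ml = sum(xs) / N                                           # sample mean
sigma_ml = math.sqrt(sum((x - mu_ml) ** 2 for x in xs) / N)   # divide by N, not N - 1

print(round(mu_ml, 3), round(sigma_ml, 3))          # close to 2.0 and 0.5
```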

ML parameter learning (continuous model)

[Figures: the conditional density P(y | x) of a linear Gaussian model over x, y in [0, 1], and a scatter plot of (x, y) data points.]

Maximizing

    P(y | x) = (1 / (√(2π) σ)) e^{-(y - (θ_1 x + θ_2))^2 / (2σ^2)}

w.r.t. θ_1, θ_2 is the same as minimizing

    E = Σ_{j=1}^{N} (y_j - (θ_1 x_j + θ_2))^2

That is, minimizing the sum of squared errors gives the ML solution for a linear fit, assuming Gaussian noise of fixed variance.
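A small sketch of this equivalence: fit the line by ordinary least squares on synthetic data generated with an assumed slope and intercept; under the fixed-variance Gaussian-noise assumption this is exactly the ML fit.

```python
import random

random.seed(0)
xs = [random.random() for _ in range(200)]
# Assumed generating model: y = 0.8 * x + 0.1 plus Gaussian noise of fixed variance.
ys = [0.8 * x + 0.1 + random.gauss(0, 0.05) for x in xs]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = sum((x - x_bar) ** 2 for x in xs)
theta1 = num / den               # slope minimizing the sum of squared errors
theta2 = y_bar - theta1 * x_bar  # intercept

print(round(theta1, 3), round(theta2, 3))   # close to 0.8 and 0.1
```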

Naive Bayes models
Observation: the attributes of one example. Hypothesis: the class the example belongs to. All attributes are conditionally independent of each other, given the class:

    P(C | x_1, ..., x_n) = α P(C) Π_i P(x_i | C)

Choose the most likely class.
- Simple: 2n + 1 parameters, and no need to search for h_ML.
- Works surprisingly well in a wide range of applications.
- Can deal with noisy data and can give probabilistic predictions.
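A minimal naive Bayes sketch with categorical attributes. The training examples, attribute names, and the add-one smoothing inside predict() are illustrative choices, not part of the slide.

```python
from collections import defaultdict

# Tiny made-up training set: attribute dict -> class label.
train = [({'wrapper': 'red',   'size': 'big'},   'cherry'),
         ({'wrapper': 'red',   'size': 'small'}, 'cherry'),
         ({'wrapper': 'green', 'size': 'big'},   'lime'),
         ({'wrapper': 'green', 'size': 'small'}, 'lime'),
         ({'wrapper': 'red',   'size': 'big'},   'lime')]

class_counts = defaultdict(int)   # class -> count
attr_counts = defaultdict(int)    # (class, attribute, value) -> count
for attrs, cls in train:
    class_counts[cls] += 1
    for a, v in attrs.items():
        attr_counts[(cls, a, v)] += 1

def predict(attrs):
    """Return argmax_C P(C) * prod_i P(x_i | C), with add-one smoothing."""
    best, best_score = None, -1.0
    for cls, n_c in class_counts.items():
        score = n_c / len(train)                                  # P(C)
        for a, v in attrs.items():
            score *= (attr_counts[(cls, a, v)] + 1) / (n_c + 2)   # assumes 2 values per attribute
        if score > best_score:
            best, best_score = cls, score
    return best

print(predict({'wrapper': 'red', 'size': 'small'}))
```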

Parameter learning in Bayes nets
Full Bayesian learning: assume the prior over θ is a beta distribution,

    beta[a, b](θ) = α θ^(a-1) (1 - θ)^(b-1)

where α is the normalization constant.

[Figure: the beta distribution for several values of a and b.]

The mean value of the distribution is a/(a + b).
- Larger values of a suggest a belief that Θ is closer to 1 than to 0.
- Larger values of a + b make the distribution more peaked (greater certainty about Θ).
- If the prior of Θ is a beta distribution, then after a data point is observed the posterior distribution of Θ is also a beta distribution:

    P(θ | D_1 = cherry) = α P(D_1 = cherry | θ) P(θ)
                        = α θ · beta[a, b](θ)
                        = α θ · θ^(a-1) (1 - θ)^(b-1)
                        = α θ^a (1 - θ)^(b-1)
                        = beta[a + 1, b](θ)

The distribution converges to a narrow peak around the true value of Θ as data come in. For large data sets, Bayesian learning converges to the same results as ML learning.
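A tiny sketch of the conjugate update described above: observing c cherries and l limes takes beta[a, b] to beta[a + c, b + l]. The prior hyperparameters and the observation sequence are made up.

```python
a, b = 2, 2                                   # prior beta[a, b]
observations = ['cherry', 'lime', 'cherry', 'cherry']

for obs in observations:
    if obs == 'cherry':
        a += 1    # each cherry observation: beta[a, b] -> beta[a + 1, b]
    else:
        b += 1    # each lime observation:   beta[a, b] -> beta[a, b + 1]

print('posterior = beta[%d, %d], mean = %.3f' % (a, b, a / (a + b)))
```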

Parameter learning in Bayes nets (contd.)
The full prior over the parameters is P(Θ, Θ_1, Θ_2). Usually we assume parameter independence, i.e., P(Θ, Θ_1, Θ_2) = P(Θ) P(Θ_1) P(Θ_2), so each parameter has its own beta distribution.
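A short sketch of parameter independence in practice: Θ, Θ_1 and Θ_2 each keep their own beta distribution, updated only from their own counts. The priors and counts below are invented for illustration.

```python
# name -> (prior a, prior b, positive count, negative count)
params = {'theta':  (1, 1, 6, 4),   # cherry vs lime
          'theta1': (1, 1, 5, 1),   # red vs green, among cherry candies
          'theta2': (1, 1, 1, 3)}   # red vs green, among lime candies

for name, (a, b, pos, neg) in params.items():
    a_post, b_post = a + pos, b + neg
    print('%s ~ beta[%d, %d], mean %.3f' % (name, a_post, b_post, a_post / (a_post + b_post)))
```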

Outline
- Bayesian learning
- Approximate Bayesian learning
  - Maximum a posteriori (MAP) learning
  - Maximum likelihood (ML) learning
- Parameter learning with complete data
  - ML parameter learning with complete data in discrete models
  - ML parameter learning with complete data in continuous models (linear regression)
  - Naive Bayes models
  - Bayesian parameter learning
- Learning Bayes net structure with complete data
- (If time allows) Learning with hidden variables or incomplete data (EM algorithm)

Learning Bayes net structures

Outline
- Bayesian learning
- Approximate Bayesian learning
  - Maximum a posteriori (MAP) learning
  - Maximum likelihood (ML) learning
- Parameter learning with complete data
  - ML parameter learning with complete data in discrete models
  - ML parameter learning with complete data in continuous models (linear regression)
  - Naive Bayes models
  - Bayesian parameter learning
- Learning Bayes net structure with complete data
- (If time allows) Learning with hidden variables or incomplete data (EM algorithm)

Learning with hidden variables or incomplete data

Summary
Full Bayesian learning gives the best possible predictions but is intractable.
MAP learning balances complexity with accuracy on the training data.
Maximum likelihood assumes a uniform prior, which is OK for large data sets.
1. Choose a parameterized family of models to describe the data (requires substantial insight and sometimes new models).
2. Write down the likelihood of the data as a function of the parameters (may require summing over hidden variables, i.e., inference).
3. Write down the derivative of the log likelihood w.r.t. each parameter.
4. Find the parameter values such that the derivatives are zero (may be hard/impossible; modern optimization techniques help).
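When step 4 has no closed-form solution, a numerical optimizer can minimize the negative log likelihood directly. A minimal sketch for the single-variable Gaussian from earlier (it assumes SciPy is available; the result should match the closed-form μ and σ):

```python
import math
import random
from scipy.optimize import minimize   # assumes SciPy is installed

random.seed(0)
data = [random.gauss(2.0, 0.5) for _ in range(500)]   # synthetic data, made-up parameters

def neg_log_likelihood(params):
    # Constant terms in 2*pi are dropped; they do not affect the minimizer.
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = math.exp(log_sigma)
    return len(data) * math.log(sigma) + sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method='Nelder-Mead')
mu_ml, sigma_ml = result.x[0], math.exp(result.x[1])
print(round(mu_ml, 3), round(sigma_ml, 3))            # close to 2.0 and 0.5
```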