# CHAPTER 2 Estimating Probabilities

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 CHAPTER 2 Estimating Probabilities Machine Learning Copyright c Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a rough draft chapter intended for inclusion in the upcoming second edition of the textbook Machine Learning, T.M. Mitchell, McGraw Hill. You are welcome to use this for educational purposes, but do not duplicate or repost it on the internet. For online copies of this and other materials related to this book, visit the web site tom/mlbook.html. Please send suggestions for improvements, or suggested exercises, to Many machine learning methods depend on probabilistic approaches. The reason is simple: when we are interested in learning some target function f : X Y, we can more generally learn the probabilistic function P(Y X). By using a probabilistic approach, we can design algorithms that learn functions with uncertain outcomes (e.g., predicting tomorrow s stock price) and that incorporate prior knowledge to guide learning (e.g., a bias that tomorrow s stock price is likely to be similar to today s price). This chapter describes joint probability distributions over many variables, and shows how they can be used to calculate a target P(Y X). It also considers the problem of learning, or estimating, probability distributions from training data, presenting the two most common approaches: maximum likelihood estimation and maximum a posteriori estimation. 1 Joint Probability Distributions The key to building probabilistic models is to define a set of random variables, and to consider the joint probability distribution over them. For example, Table 1 defines a joint probability distribution over three random variables: a person s 1

2 Copyright c 2016, Tom M. Mitchell. 2 Gender HoursWorked Wealth probability female < 40.5 poor female < 40.5 rich female 40.5 poor female 40.5 rich male < 40.5 poor male < 40.5 rich male 40.5 poor male 40.5 rich Table 1: A Joint Probability Distribution. This table defines a joint probability distribution over three random variables: Gender, HoursWorked, and Wealth. Gender, the number of HoursWorked each week, and their Wealth. In general, defining a joint probability distribution over a set of discrete-valued variables involves three simple steps: 1. Define the random variables, and the set of values each variable can take on. For example, in Table 1 the variable Gender can take on the value male or female, the variable HoursWorked can take on the value < 40.5 or 40.5, and Wealth can take on values rich or poor. 2. Create a table containing one row for each possible joint assignment of values to the variables. For example, Table 1 has 8 rows, corresponding to the 8 possible ways of jointly assigning values to three boolean-valued variables. More generally, if we have n boolean valued variables, there will be 2 n rows in the table. 3. Define a probability for each possible joint assignment of values to the variables. Because the rows cover every possible joint assignment of values, their probabilities must sum to 1. The joint probability distribution is central to probabilistic inference. We can calculate conditional or joint probabilities over any subset of variables, given their joint distribution. This is accomplished by operating on the probabilities for the relevant rows in the table. For example, we can calculate: The probability that any single variable will take on any specific value. For example, we can calculate that the probability P(Gender = male) = for the joint distribution in Table 1, by summing the four rows for which Gender = male. Similarly, we can calculate the probability P(Wealth = rich) = by adding together the probabilities for the four rows covering the cases for which Wealth=rich. The probability that any subset of the variables will take on a particular joint assignment. For example, we can calculate that the probability P(Wealth=rich

3 Copyright c 2016, Tom M. Mitchell. 3 Gender=female) = , by summing the two table rows that satisfy this joint assignment. Any conditional probability defined over subsets of the variables. Recall the definition of conditional probability P(Y X) = P(X Y )/P(X). We can calculate both the numerator and denominator in this definition by summing appropriate rows, to obtain the conditional probability. For example, according to Table 1, P(Wealth=rich Gender=female) = / = To summarize, if we know the joint probability distribution over an arbitrary set of random variables {X 1...X n }, then we can calculate the conditional and joint probability distributions for arbitrary subsets of these variables (e.g., P(X n X 1...X n 1 )). In theory, we can in this way solve any classification, regression, or other function approximation problem defined over these variables, and furthermore produce probabilistic rather than deterministic predictions for any given input to the target function. 1 For example, if we wish to learn to predict which people are rich or poor based on their gender and hours worked, we can use the above approach to simply calculate the probability distribution P(Wealth Gender, HoursWorked). 1.1 Learning the Joint Distribution How can we learn joint distributions from observed training data? In the example of Table 1 it will be easy if we begin with a large database containing, say, descriptions of a million people in terms of their values for our three variables. Given a large data set such as this, one can easily estimate a probability for each row in the table by calculating the fraction of database entries (people) that satisfy the joint assignment specified for that row. If thousands of database entries fall into each row, we will obtain highly reliable probability estimates using this strategy. In other cases, however, it can be difficult to learn the joint distribution due to the very large amount of training data required. To see the point, consider how our learning problem would change if we were to add additional variables to describe a total of 100 boolean features for each person in Table 1 (e.g., we could add do they have a college degree?, are they healthy? ). Given 100 boolean features, the number of rows in the table would now expand to 2 100, which is greater than Unfortunately, even if our database describes every single person on earth we would not have enough data to obtain reliable probability estimates for most rows. There are only approximately people on earth, which means that for most of the rows in our table, we would have zero training examples! This is a significant problem given that real-world machine learning applications often 1 Of course if our random variables have continuous values instead of discrete, we would need an infinitely large table. In such cases we represent the joint distribution by a function instead of a table, but the principles for using the joint distribution remain unchanged.

6 Copyright c 2016, Tom M. Mitchell. 6 Figure 1: MLE and MAP estimates of as the number of coin flips grows. Data was generated by a random number generator that output a value of 1 with probability = 0.3, and a value of 0 with probability of (1 ) = 0.7. Each plot shows the two estimates of as the number of observed coin flips grows. Plots on the left correspond to values of γ 1 and γ 0 that reflect the correct prior assumption about the value of, plots on the right reflect the incorrect prior assumption that is most probably 0.4. Plots in the top row reflect lower confidence in the prior assumption, by including only 60 = γ 1 + γ 0 imaginary data points, whereas bottom plots assume 120. Note as the size of the data grows, the MLE and MAP estimates converge toward each other, and toward the correct estimate for. the estimate of that is most probable, given the observed data, plus background assumptions about its value. Thus, the difference between these two principles is that Algorithm 2 assumes background knowledge is available, whereas Algorithm 1 does not. Both principles have been widely used to derive and to justify a vast range of machine learning algorithms, from Bayesian networks, to linear regression, to neural network learning. Our coin flip example represents just one of many such learning problems. The experimental behavior of these two algorithms is shown in Figure 1. Here the learning task is to estimate the unknown value of = P(X = 1) for a booleanvalued random variable X, based on a sample of n values of X drawn independently (e.g., n independent flips of a coin with probability of heads). In this figure, the true value of is 0.3, and the same sequence of training examples is

7 Copyright c 2016, Tom M. Mitchell. 7 used in each plot. Consider first the plot in the upper left. The blue line shows the estimates of produced by Algorithm 1 (MLE) as the number n of training examples grows. The red line shows the estimates produced by Algorithm 2, using the same training examples and using priors γ 0 = 42 and γ 1 = 18. This prior assumption aligns with the correct value of (i.e., [γ 1 /(γ 1 + γ 0 )] = 0.3). Note that as the number of training example coin flips grows, both algorithms converge toward the correct estimate of, though Algorithm 2 provides much better estimates than Algorithm 1 when little data is available. The bottom left plot shows the estimates if Algorithm 2 uses even more confident priors, captured by twice as many hallucinated examples (γ 0 = 84 and γ 1 = 36). The two plots on the right side of the figure show the estimates produced when Algorithm 2 (MAP) uses incorrect priors (where [γ 1 /(γ 1 + γ 0 )] = 0.4). The difference between the top right and bottom right plots is again only a difference in the number of hallucinated examples, reflecting the difference in confidence that should be close to Maximum Likelihood Estimation (MLE) Maximum Likelihood Estimation, often abbreviated MLE, estimates one or more probability parameters based on the principle that if we observe training data D, we should choose the value of that makes D most probable. When applied to the coin flipping problem discussed above, it yields Algorithm 1. The definition of the MLE in general is ˆ MLE = argmaxp(d ) (1) The intuition underlying this principle is simple: we are more likely to observe data D if we are in a world where the appearance of this data is highly probable. Therefore, we should estimate by assigning it whatever value maximizes the probability of having observed D. Beginning with this principle for choosing among possible estimates of, it is possible to mathematically derive a formula for the value of that provably maximizes P(D ). Many machine learning algorithms are defined so that they provably learn a collection of parameter values that follow this maximum likelihood principle. Below we derive Algorithm 1 for our above coin flip example, beginning with the maximum likelihood principle. To precisely define our coin flipping example, let X be a random variable which can take on either value 1 or 0, and let = P(X = 1) refer to the true, but possibly unknown, probability that a random draw of X will take on the value 1. 2 Assume we flip the coin X a number of times to produce training data D, in which we observe X = 1 a total of α 1 times, and X = 0 a total of α 0 times. We further assume that the outcomes of the flips are independent (i.e., the result of one coin flip has no influence on other coin flips), and identically distributed (i.e., the same value of governs each coin flip). Taken together, these assumptions are that the 2 A random variable defined in this way is called a Bernoulli random variable, and the probability distribution it follows, defined by, is called a Bernoulli distribution.

8 Copyright c 2016, Tom M. Mitchell. 8 coin flips are independent, identically distributed (which is often abbreviated to i.i.d. ). The maximum likelihood principle involves choosing to maximize P(D ). Therefore, we must begin by writing an expression for P(D ), or equivalently P(α 1,α 0 ) in terms of, then find an algorithm that chooses a value for that maximizes this quantify. To begin, note that if data D consists of just one coin flip, then P(D ) = if that one coin flip results in X = 1, and P(D ) = (1 ) if the result is instead X = 0. Furthermore, if we observe a set of i.i.d. coin flips such as D = 1,1,0,1,0, then we can easily calculate P(D ) by multiplying together the probabilities of each individual coin flip: P(D = 1,1,0,1,0 ) = (1 ) (1 ) = 3 (1 ) 2 In other words, if we summarize D by the total number of observed times α 1 when X =1 and α 0 when X =0, we have in general P(D = α 1,α 0 ) = α 1 (1 ) α 0 (2) The above expression gives us a formula for P(D = α 1,α 0 ). The quantity P(D ) is often called the likelihood function because it expresses the probability of the observed data D as a function of. This likelihood function is often written L() = P(D ). Our final step in this derivation is to determine the value of that maximizes the likelihood function P(D = α 1,α 0 ). Notice that maximizing P(D ) with respect to is equivalent to maximizing its logarithm, lnp(d ) with respect to, because ln(x) increases monotonically with x: arg maxp(d ) = argmax lnp(d ) It often simplifies the mathematics to maximize ln P(D ) rather than P(D ), as is the case in our current example. In fact, this log likelihood is so common that it has its own notation, l() = lnp(d ). To find the value of that maximizes ln P(D ), and therefore also maximizes P(D ), we can calculate the derivative of lnp(d = α 1,α 0 ) with respect to, then solve for the value of that makes this derivative equal to zero. First, we calculate the derivative of the log of the likelihood function of Eq. (2): l() l() = lnp(d ) = ln[α 1(1 ) α 0] = [α 1 ln + α 0 ln(1 )] ln = α 1 + α ln(1 ) 0 = α 1 ln + α 0 ln(1 ) (1 ) (1 ) 1 = α 1 + α 1 0 ( 1) (3) (1 )

9 Copyright c 2016, Tom M. Mitchell. 9 where the last step follows from the equality lnx x step follows from the chain rule f (x) x = f (x) g(x) g(x) x. = 1 x, and where the next to last Finally, to calculate the value of that maximizes l(), we set the derivative in equation (3) to zero, and solve for. 1 0 = α 1 α α 0 1 = α 1 1 α 0 = α 1 (1 ) (α 1 + α 0 ) = α 1 = α 1 α 1 + α 0 (4) Thus, we have derived in equation (4) the intuitive Algorithm 1 for estimating, starting from the principle that we want to choose the value of that maximizes P(D ). ˆ MLE = argmax P(D ) = argmax lnp(d ) = α 1 α 1 + α 0 This same maximum likelihood principle is used as the basis for deriving many machine learning algorithms for more complex problems where the solution is not so intuitively obvious. 2.2 Maximum a Posteriori Probability Estimation (MAP) Maximum a Posteriori Estimation, often abbreviated to MAP estimation, estimates one or more probability parameters based on the principle that we should choose the value of that is most probable, given the observed data D and our prior assumptions summarized by P(). ˆ MAP = argmax P( D) When applied to the coin flipping problem discussed above, it yields Algorithm 2. Using Bayes rule, we can rewrite the MAP principle as: ˆ MAP = argmax P( D) = argmax P(D )P() P(D) and given that P(D) does not depend on, we can simplify this by ignoring the denominator: ˆ MAP = argmax P( D) = argmaxp(d )P() (5) Comparing this to the MLE principle described in equation (1), we see that whereas the MLE principle is to choose to maximize P(D ), the MAP principle instead maximizes P(D )P(). The only difference is the extra P().

10 Copyright c 2016, Tom M. Mitchell. 10 To produce a MAP estimate for we must specify a prior distribution P() that summarizes our a priori assumptions about the value of. In the case where data is generated by multiple i.i.d. draws of a Bernoulli random variable, as in our coin flip example, the most common form of prior is a Beta distribution: P() = Beta(β 0,β 1 ) = β 1 1 (1 ) β 0 1 B(β 0,β 1 ) (6) Here β 0 and β 1 are parameters whose values we must specify in advance to define a specific P(). As we shall see, choosing values for β 0 and β 1 corresponds to choosing the number of imaginary examples γ 0 and γ 1 in the above Algorithm 2. The denominator B(β 0,β 1 ) is a normalization term defined by the function B, which assures the probability integrates to one, but which is independent of. As defined in Eq. (5), the MAP estimate involves choosing the value of that maximizes P(D )P(). Recall we already have an expression for P(D ) in Eq. (2). Combining this with the above expression for P() we have: ˆ MAP = argmax theta P(D )P() = arg max = arg max = arg max α 1 (1 ) α β1 1 (1 ) β B(β 0,β 1 ) α 1+β 1 1 (1 ) α 0+β 0 1 B(β 0,β 1 ) α 1+β 1 1 (1 ) α 0+β 0 1 (7) where the final line follows from the previous line because B(β 0,β 1 ) is independent of. How can we solve for the value of that maximizes the expression in Eq. (7)? Fortunately, we have already answered this question! Notice that the quantity we seek to maximize in Eq. (7) can be made identical to the likelihood function in Eq. (2) if we substitute (α 1 + β 1 1) for α 1 in Eq. (2), and substitute (α 0 + β 0 1) for α 0. We can therefore reuse the derivation of ˆ MLE beginning from Eq. (2) and ending with Eq. (4), simply by carrying through this substitution. Applying this same substitution to Eq. (4) implies the solution to Eq. (7) is therefore ˆ MAP = argmax P(D )P() = (α 1 + β 1 1) (α 1 + β 1 1) + (α 0 + β 0 1) (8) Thus, we have derived in Eq. (8) the intuitive Algorithm 2 for estimating, starting from the principle that we want to choose the value of that maximizes P( D). The number γ 1 of imaginary heads in Algorithm 2 is equal to β 1 1, and the number γ 0 of imaginary tails is equal to β 0 1. This same maximum a posteriori probability principle is used as the basis for deriving many machine learning algorithms for more complex problems where the solution is not so intuitively obvious as it is in our coin flipping example.

11 Copyright c 2016, Tom M. Mitchell Notes on Terminology A boolean valued random variable X {0,1}, governed by the probability distribution P(X = 1) = ; P(X = 0) = (1 ) is called a Bernoulli random variable, and this probability distribution is called a Bernoulli distribution. A convenient mathematical expression for a Bernoulli distribution P(X) is: P(X = x) = x (1 ) 1 x The Beta(β 0,β 1 ) distribution defined in Eq. (6) is called the conjugate prior for the binomial likelihood function α 1(1 ) α 0, because the posterior distribution P(D )P() is also a Beta distribution. More generally, any P() is called the conjugate prior for a likelihood function L() = P(D ) if the posterior P( D) is of the same form as P(). 4 What You Should Know The main points of this chapter include: Joint probability distributions lie at the core of probabilistic machine learning approaches. Given the joint probability distribution P(X 1...X n ) over a set of random variables, it is possible in principle to compute any joint or conditional probability defined over any subset of these variables. Learning, or estimating, the joint probability distribution from training data can be easy if the data set is large compared to the number of distinct probability terms we must estimate. But in many practical problems the data is more sparse, requiring methods that rely on prior knowledge or assumptions, in addition to observed data. Maximum likelihood estimation (MLE) is one of two widely used principles for estimating the parameters that define a probability distribution. This principle is to choose the set of parameter values ˆ MLE that makes the observed training data most probable (over all the possible choices of ): ˆ MLE = argmax P(data ) In many cases, maximum likelihood estimates correspond to the intuitive notion that we should base probability estimates on observed ratios. For example, given the problem of estimating the probability that a coin will turn up heads, given α 1 observed flips resulting in heads, and α 0 observed flips resulting in tails, the maximum likelihood estimate corresponds exactly to taking the fraction of flips that turn up heads: ˆ MLE = argmax P(data ) = α 1 α 1 + α 0

12 Copyright c 2016, Tom M. Mitchell. 12 Maximium a posteriori probability (MAP) estimation is the other of the two widely used principles. This principle is to choose the most probable value of, given the observed training data plus a prior probability distribution P() which captures prior knowledge or assumptions about the value of : ˆ MAP = argmax P( data) = argmax P(data )P() In many cases, MAP estimates correspond to the intuitive notion that we can represent prior assumptions by making up imaginary data which reflects these assumptions. For example, the MAP estimate for the above coin flip example, assuming a prior P() = Beta(γ 0 + 1,γ 1 + 1), yields a MAP estimate which is equivalent to the MLE estimate if we simply add in an imaginary γ 1 heads and γ 0 tails to the actual observed α 1 heads and α 0 tails: EXERCISES ˆ MAP = argmax P(data )P() = (α 1 + γ 1 ) (α 1 + γ 1 ) + (α 0 + γ 0 ) 1. In the MAP estimation of for our Bernoulli random variable X in this chapter, we used a Beta(β 0,β 1 ) prior probability distribution to capture our prior beliefs about the prior probability of different values of, before seeing the observed data. Plot this prior probability distribution over, corresponding to the number of hallucinated examples used in the top left plot of Figure 1 (i.e., γ 0 = 42,γ 1 = 18). Specifically create a plot showing the prior probability (vertical axis) for each possible value of between 0 and 1 (horizontal axis), as represented by the prior distribution Beta(β 0,β 1 ). Recall the correspondence β i = γ i + 1. Note you will want to write a simple computer program to create this plot. Above, you plotted the prior probability over possible values of. Now plot the posterior probability distribution over given that prior, plus observed data in which 6 heads (X = 1) were observed, along with 9 tails (X = 0). View the plot you created above to visually determine the approximate Maximum a Posterior probability estimate MAP. What is it? What is the exact value of the MAP estimate? What is the exact value of the Maximum Likelihood Estimate MLE? 5 Acknowledgements I very much appreciate receiving helpful comments on earlier drafts of this chapter from Akshay Mishra.

13 Copyright c 2016, Tom M. Mitchell. 13 REFERENCES Mitchell, T (1997). Machine Learning, McGraw Hill. Wasserman, L. (2004). All of Statistics, Springer-Verlag.

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### Senior Secondary Australian Curriculum

Senior Secondary Australian Curriculum Mathematical Methods Glossary Unit 1 Functions and graphs Asymptote A line is an asymptote to a curve if the distance between the line and the curve approaches zero

### Likelihood: Frequentist vs Bayesian Reasoning

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B University of California, Berkeley Spring 2009 N Hallinan Likelihood: Frequentist vs Bayesian Reasoning Stochastic odels and

### Bayesian Updating with Discrete Priors Class 11, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

1 Learning Goals Bayesian Updating with Discrete Priors Class 11, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1. Be able to apply Bayes theorem to compute probabilities. 2. Be able to identify

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### Part 2: One-parameter models

Part 2: One-parameter models Bernoilli/binomial models Return to iid Y 1,...,Y n Bin(1, θ). The sampling model/likelihood is p(y 1,...,y n θ) =θ P y i (1 θ) n P y i When combined with a prior p(θ), Bayes

### Basic Probability Theory I

A Probability puzzler!! Basic Probability Theory I Dr. Tom Ilvento FREC 408 Our Strategy with Probability Generally, we want to get to an inference from a sample to a population. In this case the population

### Linear Programming. Solving LP Models Using MS Excel, 18

SUPPLEMENT TO CHAPTER SIX Linear Programming SUPPLEMENT OUTLINE Introduction, 2 Linear Programming Models, 2 Model Formulation, 4 Graphical Linear Programming, 5 Outline of Graphical Procedure, 5 Plotting

### Review of Fundamental Mathematics

Review of Fundamental Mathematics As explained in the Preface and in Chapter 1 of your textbook, managerial economics applies microeconomic theory to business decision making. The decision-making tools

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### Test of Hypotheses. Since the Neyman-Pearson approach involves two statistical hypotheses, one has to decide which one

Test of Hypotheses Hypothesis, Test Statistic, and Rejection Region Imagine that you play a repeated Bernoulli game: you win \$1 if head and lose \$1 if tail. After 10 plays, you lost \$2 in net (4 heads

### What Is Probability?

1 What Is Probability? The idea: Uncertainty can often be "quantified" i.e., we can talk about degrees of certainty or uncertainty. This is the idea of probability: a higher probability expresses a higher

Section 5.4 The Quadratic Formula 481 5.4 The Quadratic Formula Consider the general quadratic function f(x) = ax + bx + c. In the previous section, we learned that we can find the zeros of this function

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### Principle of Data Reduction

Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

### CHAPTER FIVE. Solutions for Section 5.1. Skill Refresher. Exercises

CHAPTER FIVE 5.1 SOLUTIONS 265 Solutions for Section 5.1 Skill Refresher S1. Since 1,000,000 = 10 6, we have x = 6. S2. Since 0.01 = 10 2, we have t = 2. S3. Since e 3 = ( e 3) 1/2 = e 3/2, we have z =

### The Basics of Graphical Models

The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

### The Exponential Family

The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

### 4. Introduction to Statistics

Statistics for Engineers 4-1 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Inference of Probability Distributions for Trust and Security applications

Inference of Probability Distributions for Trust and Security applications Vladimiro Sassone Based on joint work with Mogens Nielsen & Catuscia Palamidessi Outline 2 Outline Motivations 2 Outline Motivations

### Tossing a Biased Coin

Tossing a Biased Coin Michael Mitzenmacher When we talk about a coin toss, we think of it as unbiased: with probability one-half it comes up heads, and with probability one-half it comes up tails. An ideal

### Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Be able to explain the difference between the p-value and a posterior

### Unit 4 The Bernoulli and Binomial Distributions

PubHlth 540 4. Bernoulli and Binomial Page 1 of 19 Unit 4 The Bernoulli and Binomial Distributions Topic 1. Review What is a Discrete Probability Distribution... 2. Statistical Expectation.. 3. The Population

### People have thought about, and defined, probability in different ways. important to note the consequences of the definition:

PROBABILITY AND LIKELIHOOD, A BRIEF INTRODUCTION IN SUPPORT OF A COURSE ON MOLECULAR EVOLUTION (BIOL 3046) Probability The subject of PROBABILITY is a branch of mathematics dedicated to building models

### Language Modeling. Chapter 1. 1.1 Introduction

Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set

### 1 Prior Probability and Posterior Probability

Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which

### Introduction to Machine Learning

Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 5: Decision Theory & ROC Curves Gaussian ML Estimation Many figures courtesy Kevin Murphy s textbook,

### Bayesian probability theory

Bayesian probability theory Bruno A. Olshausen arch 1, 2004 Abstract Bayesian probability theory provides a mathematical framework for peforming inference, or reasoning, using probability. The foundations

### Chapter 4: Non-Parametric Classification

Chapter 4: Non-Parametric Classification Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination

### MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

### Maximum Likelihood Estimation

Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

### Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

### IEOR 6711: Stochastic Models I Fall 2012, Professor Whitt, Tuesday, September 11 Normal Approximations and the Central Limit Theorem

IEOR 6711: Stochastic Models I Fall 2012, Professor Whitt, Tuesday, September 11 Normal Approximations and the Central Limit Theorem Time on my hands: Coin tosses. Problem Formulation: Suppose that I have

### Simple Regression Theory II 2010 Samuel L. Baker

SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

### Discrete Mathematics and Probability Theory Fall 2009 Satish Rao,David Tse Note 11

CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao,David Tse Note Conditional Probability A pharmaceutical company is marketing a new test for a certain medical condition. According

### 6.4 Normal Distribution

Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

### Binomial lattice model for stock prices

Copyright c 2007 by Karl Sigman Binomial lattice model for stock prices Here we model the price of a stock in discrete time by a Markov chain of the recursive form S n+ S n Y n+, n 0, where the {Y i }

### The data is more curved than linear, but not so much that linear regression is not an option. The line of best fit for linear regression is

CHAPTER 5 Nonlinear Models 1. Quadratic Function Models Consider the following table of the number of Americans that are over 100 in various years. Year Number (in 1000 s) 1994 50 1996 56 1998 65 2000

### Bayes and Naïve Bayes. cs534-machine Learning

Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

### Introduction to Hypothesis Testing. Point estimation and confidence intervals are useful statistical inference procedures.

Introduction to Hypothesis Testing Point estimation and confidence intervals are useful statistical inference procedures. Another type of inference is used frequently used concerns tests of hypotheses.

### 1.7 Graphs of Functions

64 Relations and Functions 1.7 Graphs of Functions In Section 1.4 we defined a function as a special type of relation; one in which each x-coordinate was matched with only one y-coordinate. We spent most

### Chapter 4 Online Appendix: The Mathematics of Utility Functions

Chapter 4 Online Appendix: The Mathematics of Utility Functions We saw in the text that utility functions and indifference curves are different ways to represent a consumer s preferences. Calculus can

### Notes 10: An Equation Based Model of the Macroeconomy

Notes 10: An Equation Based Model of the Macroeconomy In this note, I am going to provide a simple mathematical framework for 8 of the 9 major curves in our class (excluding only the labor supply curve).

### Combinatorics: The Fine Art of Counting

Combinatorics: The Fine Art of Counting Week 7 Lecture Notes Discrete Probability Continued Note Binomial coefficients are written horizontally. The symbol ~ is used to mean approximately equal. The Bernoulli

### Note on growth and growth accounting

CHAPTER 0 Note on growth and growth accounting 1. Growth and the growth rate In this section aspects of the mathematical concept of the rate of growth used in growth models and in the empirical analysis

### Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance

2 Making Connections: The Two-Sample t-test, Regression, and ANOVA In theory, there s no difference between theory and practice. In practice, there is. Yogi Berra 1 Statistics courses often teach the two-sample

### Rational Polynomial Functions

Rational Polynomial Functions Rational Polynomial Functions and Their Domains Today we discuss rational polynomial functions. A function f(x) is a rational polynomial function if it is the quotient of

### THE MULTINOMIAL DISTRIBUTION. Throwing Dice and the Multinomial Distribution

THE MULTINOMIAL DISTRIBUTION Discrete distribution -- The Outcomes Are Discrete. A generalization of the binomial distribution from only 2 outcomes to k outcomes. Typical Multinomial Outcomes: red A area1

### Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

### 2. Simple Linear Regression

Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

### Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

### PS 271B: Quantitative Methods II. Lecture Notes

PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.

### Characteristics of Binomial Distributions

Lesson2 Characteristics of Binomial Distributions In the last lesson, you constructed several binomial distributions, observed their shapes, and estimated their means and standard deviations. In Investigation

### Review for Calculus Rational Functions, Logarithms & Exponentials

Definition and Domain of Rational Functions A rational function is defined as the quotient of two polynomial functions. F(x) = P(x) / Q(x) The domain of F is the set of all real numbers except those for

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Bayesian Analysis for the Social Sciences

Bayesian Analysis for the Social Sciences Simon Jackman Stanford University http://jackman.stanford.edu/bass November 9, 2012 Simon Jackman (Stanford) Bayesian Analysis for the Social Sciences November

### Zeros of a Polynomial Function

Zeros of a Polynomial Function An important consequence of the Factor Theorem is that finding the zeros of a polynomial is really the same thing as factoring it into linear factors. In this section we

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### Section 1.3 P 1 = 1 2. = 1 4 2 8. P n = 1 P 3 = Continuing in this fashion, it should seem reasonable that, for any n = 1, 2, 3,..., = 1 2 4.

Difference Equations to Differential Equations Section. The Sum of a Sequence This section considers the problem of adding together the terms of a sequence. Of course, this is a problem only if more than

### Univariate Regression

Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

### Sampling Distributions and the Central Limit Theorem

135 Part 2 / Basic Tools of Research: Sampling, Measurement, Distributions, and Descriptive Statistics Chapter 10 Sampling Distributions and the Central Limit Theorem In the previous chapter we explained

### Conditional Probability, Independence and Bayes Theorem Class 3, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Conditional Probability, Independence and Bayes Theorem Class 3, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Know the definitions of conditional probability and independence

### 1 Sufficient statistics

1 Sufficient statistics A statistic is a function T = rx 1, X 2,, X n of the random sample X 1, X 2,, X n. Examples are X n = 1 n s 2 = = X i, 1 n 1 the sample mean X i X n 2, the sample variance T 1 =

### EC 6310: Advanced Econometric Theory

EC 6310: Advanced Econometric Theory July 2008 Slides for Lecture on Bayesian Computation in the Nonlinear Regression Model Gary Koop, University of Strathclyde 1 Summary Readings: Chapter 5 of textbook.

### Question: What is the probability that a five-card poker hand contains a flush, that is, five cards of the same suit?

ECS20 Discrete Mathematics Quarter: Spring 2007 Instructor: John Steinberger Assistant: Sophie Engle (prepared by Sophie Engle) Homework 8 Hints Due Wednesday June 6 th 2007 Section 6.1 #16 What is the

### This is a square root. The number under the radical is 9. (An asterisk * means multiply.)

Page of Review of Radical Expressions and Equations Skills involving radicals can be divided into the following groups: Evaluate square roots or higher order roots. Simplify radical expressions. Rationalize

### Hypothesis Testing. 1 Introduction. 2 Hypotheses. 2.1 Null and Alternative Hypotheses. 2.2 Simple vs. Composite. 2.3 One-Sided and Two-Sided Tests

Hypothesis Testing 1 Introduction This document is a simple tutorial on hypothesis testing. It presents the basic concepts and definitions as well as some frequently asked questions associated with hypothesis

### Gamma Distribution Fitting

Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

### Linear Programming Supplement E

Linear Programming Supplement E Linear Programming Linear programming: A technique that is useful for allocating scarce resources among competing demands. Objective function: An expression in linear programming

### Ex. 2.1 (Davide Basilio Bartolini)

ECE 54: Elements of Information Theory, Fall 00 Homework Solutions Ex.. (Davide Basilio Bartolini) Text Coin Flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips

### DO NOT COPY PORTFOLIO SELECTION AND THE CAPITAL ASSET PRICING MODEL

UVA-F-1604 Rev. Mar. 4, 011 PORTFOLIO SELECTION AND THE CAPITAL ASSET PRICING MODEL What portfolio would you recommend to a 8-year-old who has just been promoted to a management position, and what portfolio

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### CHAPTER 5 Round-off errors

CHAPTER 5 Round-off errors In the two previous chapters we have seen how numbers can be represented in the binary numeral system and how this is the basis for representing numbers in computers. Since any

This assignment will help you to prepare for Algebra 1 by reviewing some of the things you learned in Middle School. If you cannot remember how to complete a specific problem, there is an example at the

### HOSPITALITY Math Assessment Preparation Guide. Introduction Operations with Whole Numbers Operations with Integers 9

HOSPITALITY Math Assessment Preparation Guide Please note that the guide is for reference only and that it does not represent an exact match with the assessment content. The Assessment Centre at George

### Click on the links below to jump directly to the relevant section

Click on the links below to jump directly to the relevant section Basic review Writing fractions in simplest form Comparing fractions Converting between Improper fractions and whole/mixed numbers Operations

### 2. THE x-y PLANE 7 C7

2. THE x-y PLANE 2.1. The Real Line When we plot quantities on a graph we can plot not only integer values like 1, 2 and 3 but also fractions, like 3½ or 4¾. In fact we can, in principle, plot any real

### Overview of Math Standards

Algebra 2 Welcome to math curriculum design maps for Manhattan- Ogden USD 383, striving to produce learners who are: Effective Communicators who clearly express ideas and effectively communicate with diverse

### Models for Count Data With Overdispersion

Models for Count Data With Overdispersion Germán Rodríguez November 6, 2013 Abstract This addendum to the WWS 509 notes covers extra-poisson variation and the negative binomial model, with brief appearances

### L10: Probability, statistics, and estimation theory

L10: Probability, statistics, and estimation theory Review of probability theory Bayes theorem Statistics and the Normal distribution Least Squares Error estimation Maximum Likelihood estimation Bayesian

### Chapter 14: Production Possibility Frontiers

Chapter 14: Production Possibility Frontiers 14.1: Introduction In chapter 8 we considered the allocation of a given amount of goods in society. We saw that the final allocation depends upon the initial

### PROBABILITIES AND PROBABILITY DISTRIBUTIONS

Published in "Random Walks in Biology", 1983, Princeton University Press PROBABILITIES AND PROBABILITY DISTRIBUTIONS Howard C. Berg Table of Contents PROBABILITIES PROBABILITY DISTRIBUTIONS THE BINOMIAL

### Confidence Intervals for One Standard Deviation Using Standard Deviation

Chapter 640 Confidence Intervals for One Standard Deviation Using Standard Deviation Introduction This routine calculates the sample size necessary to achieve a specified interval width or distance from

### Exercises with solutions (1)

Exercises with solutions (). Investigate the relationship between independence and correlation. (a) Two random variables X and Y are said to be correlated if and only if their covariance C XY is not equal

### Using Laws of Probability. Sloan Fellows/Management of Technology Summer 2003

Using Laws of Probability Sloan Fellows/Management of Technology Summer 2003 Uncertain events Outline The laws of probability Random variables (discrete and continuous) Probability distribution Histogram

### Mathematics 96 (3581) CA (Class Addendum) 3: Inverse Properties Mt. San Jacinto College Menifee Valley Campus Spring 2013

Mathematics 96 (358) CA (Class Addendum) 3: Inverse Properties Mt. San Jacinto College Menifee Valley Campus Spring 203 Name This class handout is worth a maimum of five (5) points. It is due no later

### 6.3 Conditional Probability and Independence

222 CHAPTER 6. PROBABILITY 6.3 Conditional Probability and Independence Conditional Probability Two cubical dice each have a triangle painted on one side, a circle painted on two sides and a square painted

### Summary of Probability

Summary of Probability Mathematical Physics I Rules of Probability The probability of an event is called P(A), which is a positive number less than or equal to 1. The total probability for all possible

### Prentice Hall Mathematics: Algebra 1 2007 Correlated to: Michigan Merit Curriculum for Algebra 1

STRAND 1: QUANTITATIVE LITERACY AND LOGIC STANDARD L1: REASONING ABOUT NUMBERS, SYSTEMS, AND QUANTITATIVE SITUATIONS Based on their knowledge of the properties of arithmetic, students understand and reason

### Domain and Range. Many problems will ask you to find the domain of a function. What does this mean?

Domain and Range The domain of a function is the set of values that we are allowed to plug into our function. This set is the x values in a function such as f(x). The range of a function is the set of

### Linear Programming Notes V Problem Transformations

Linear Programming Notes V Problem Transformations 1 Introduction Any linear programming problem can be rewritten in either of two standard forms. In the first form, the objective is to maximize, the material

### Statistics 104: Section 6!

Page 1 Statistics 104: Section 6! TF: Deirdre (say: Dear-dra) Bloome Email: dbloome@fas.harvard.edu Section Times Thursday 2pm-3pm in SC 109, Thursday 5pm-6pm in SC 705 Office Hours: Thursday 6pm-7pm SC

### Probability OPRE 6301

Probability OPRE 6301 Random Experiment... Recall that our eventual goal in this course is to go from the random sample to the population. The theory that allows for this transition is the theory of probability.

### Basic Quantum Mechanics Prof. Ajoy Ghatak Department of Physics. Indian Institute of Technology, Delhi

Basic Quantum Mechanics Prof. Ajoy Ghatak Department of Physics. Indian Institute of Technology, Delhi Module No. # 02 Simple Solutions of the 1 Dimensional Schrodinger Equation Lecture No. # 7. The Free

### L4: Bayesian Decision Theory

L4: Bayesian Decision Theory Likelihood ratio test Probability of error Bayes risk Bayes, MAP and ML criteria Multi-class problems Discriminant functions CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna