Lecture 4: Joint probability distributions; covariance; correlation




10 October 2007

In this lecture we'll learn the following:

1. what joint probability distributions are;
2. visualizing multiple variables/joint probability distributions;
3. marginalization;
4. what covariance and correlation are;
5. a bit more about variance.

1 Joint probability distributions

Recall that a basic probability distribution is defined over a random variable, and a random variable maps from the sample space to the real numbers (R). What about when you are interested in the outcome of an event that is not naturally characterizable as a single real-valued number, such as the two formants of a vowel? The answer is really quite simple: probability distributions can be generalized over multiple random variables at once, in which case they are called joint probability distributions (jpd's). If a jpd is over N random variables at once then it maps from the sample space to R^N, which is short-hand for real-valued vectors of dimension N. Notationally, for random variables X_1, X_2, ..., X_N, the joint probability density function is written as

p(x 1 = 1, X 2 = 2,, X N = n ) or simply p( 1, 2,, n ) for short. Whereas for a single r.v., the cumulative distribution function is used to indicate the probability of the outcome falling on a segment of the real number line, the joint cumulative probability distribution function indicates the probability of the outcome falling in a region of N-dimensional space. The joint cpd, which is sometimes notated as F( 1,, n ) is defined as the probability of the set of random variables all falling at or below the specified values of X i : 1 F( 1,, n ) P(X 1 1,, X N n ) The natural thing to do is to use the joint cpd to describe the probabilities of rectangular volumes. For eample, suppose X is the f 1 formant and Y is the f 2 formant of a given utterance of a vowel. The probability that the vowel will lie in the region 480Hz f 1 530Hz, 940Hz f 2 1020Hz is given below: P(480Hz f 1 530Hz, 940Hz f 2 1020Hz) = F(530Hz, 1020Hz) F(530Hz, 940Hz) F(480Hz, 1020Hz)+F(480Hz, 940Hz) and visualized in Figure 1 using the code below. 1 Technically, the definition of the multivariate cpd is then F( 1,, n ) P(X 1 1,,X N n ) = 1,, N p( ) 1 N F( 1,, n ) P(X 1 1,,X N n ) = p( )d N d 1 [Discrete] (1) [Continuous] (2) Linguistics 251 lecture 4 notes, page 2 Roger Levy, Fall 2007

Figure 1: The probability of the formants of a vowel landing in the grey rectangle can be calculated using the joint cumulative distribution function.

plot(c(), c(), xlim=c(200,800), ylim=c(500,2500), xlab="f1", ylab="f2")
rect(480, 940, 530, 1020, col=8)

1.1 Multinomial distributions as jpd's

We touched on the multinomial distribution very briefly in a previous lecture. To refresh your memory, a multinomial can be thought of as a random assignment of n outcomes into r distinct classes (that is, as n rolls of an r-sided die) where the tabulated outcome is the number of rolls that came up in each of the r classes. This outcome can be represented as a set of random variables X_1, ..., X_r, or equivalently an r-dimensional real-valued vector. The r-class multinomial distribution is characterized by r − 1 parameters, p_1, p_2, ..., p_{r−1}, which are the probabilities of each die roll coming out as each class. The probability of the die roll coming out in the rth class is 1 − Σ_{i=1}^{r−1} p_i, which is sometimes called p_r but is not a true parameter of the model. The probability mass function looks like this:

p(n_1, ..., n_r) = [ n! / (n_1! ... n_r!) ] Π_{i=1}^{r} p_i^{n_i}
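The pmf above can be checked numerically. A minimal sketch in Python (the notes use R; the die parameters here are made up for illustration):

```python
from math import factorial
from itertools import product

def multinomial_pmf(counts, probs):
    """p(n_1, ..., n_r) = n!/(n_1! ... n_r!) * prod_i p_i^{n_i}."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)     # multinomial coefficient, exact integer
    p = float(coef)
    for c, pi in zip(counts, probs):
        p *= pi ** c
    return p

# Sanity check: the pmf sums to 1 over all outcomes of n = 4 rolls
# of a 3-sided die with class probabilities (0.2, 0.3, 0.5).
probs = (0.2, 0.3, 0.5)
total = sum(multinomial_pmf((a, b, 4 - a - b), probs)
            for a, b in product(range(5), repeat=2) if a + b <= 4)
assert abs(total - 1.0) < 1e-12
```

Note that only r − 1 = 2 of the class probabilities are free parameters; the third is forced to be 1 − 0.2 − 0.3 = 0.5, just as in the text.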

Figure 2: Distribution of animacy for recipient and theme

2 Visualizing joint probability distributions

We'll look at two examples of visualizing jpd's. The first is a discrete distribution, one out of the dative dataset in languageR. We'll look at the distribution of animacy of theme and recipient in this dataset, shown in Figure 2 using the code below:

> xtabs(~ AnimacyOfRec + AnimacyOfTheme, data=dative)
            AnimacyOfTheme
AnimacyOfRec animate inanimate
   animate        68      2956
   inanimate       6       233
> mosaicplot(xtabs(~ AnimacyOfRec + AnimacyOfTheme, data=dative))

The second example is the joint distribution of frequency and length in the Brown corpus. First we'll put together an estimate of the joint distribution:

persp(kde2d(x$Length, log(x$Count), n=50), theta=210, phi=15,
      xlab="Word Length", ylab="Word Frequency", zlab="p(Length,Freq)")

[R functions used here: xtabs(), mosaicplot(), persp(), kde2d(), histogram()]

This gives a perspective plot of the joint distribution of length and frequency. Another way of visualizing the joint distribution is going along the length axis slice by slice in a coplot:

histogram(~ log(x$Count) | x$Length)

Both resulting plots are shown in Figure 3.

Figure 3: Perspective plot and coplot (i.e., slice-by-slice histogram plot) of the relationship between length and log-frequency in the Brown corpus

3 Marginalization

Often we have direct access to a joint density function but we are more interested in the probability of an outcome of a subset of the random variables in the joint density. Obtaining this probability is called marginalization, and it involves taking a weighted sum [2] over the possible outcomes of the r.v.'s that are not of interest. For two variables X, Y:

P(X = x) = Σ_y P(x, y)
         = Σ_y P(X = x | Y = y) P(y)

In this case P(X) is often called a marginal probability and the process of calculating it from the joint density P(X, Y) is known as marginalization.

[2] Or an integral in the continuous case.
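The marginalization sum can be illustrated directly. A small sketch in Python (the joint table here is made up for illustration; it is chosen to sum to 1):

```python
# Marginalization: P(X = x) = sum_y P(x, y), summing out the
# variable we are not interested in. A made-up joint table over
# X in {0, 1} and Y in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.15, (1, 1): 0.25, (1, 2): 0.20,
}

def marginal(joint, axis):
    """Sum out all variables except the one at position `axis`."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

px = marginal(joint, 0)   # P(X): {0: 0.4, 1: 0.6}
py = marginal(joint, 1)   # P(Y): {0: 0.25, 1: 0.45, 2: 0.30}

# Each marginal is itself a proper probability distribution.
assert abs(sum(px.values()) - 1.0) < 1e-12
assert abs(sum(py.values()) - 1.0) < 1e-12
```

The same helper marginalizes either variable, which mirrors the fact that marginalization is just summation over the unwanted axes of the joint table.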

4 Covariance

The covariance between two random variables X and Y is defined as follows:

Cov(X, Y) = E[(X − E(X))(Y − E(Y))]   (1)

Simple example:

                      Coding for Y:       0            1
  Coding for X                         Pronoun   Not Pronoun
  0   Object Preverbal                  0.224       0.655       0.879
  1   Object Postverbal                 0.014       0.107       0.121
                                        0.238       0.762

Each of X and Y can be treated as a Bernoulli random variable with arbitrary codings of 1 for Postverbal and Not Pronoun, and 0 for the others. As a result, we have µ_X = 0.121, µ_Y = 0.762. The covariance between the two is:

  (0 − 0.121)(0 − 0.762)(0.224)   [cell (0,0)]
+ (1 − 0.121)(0 − 0.762)(0.014)   [cell (1,0)]
+ (0 − 0.121)(1 − 0.762)(0.655)   [cell (0,1)]
+ (1 − 0.121)(1 − 0.762)(0.107)   [cell (1,1)]
= 0.0148

In R, we can use the cov() function to get the covariance between two random variables, such as word length versus frequency across the English lexicon:

> cov(x$Length, x$Count)
[1] -42.44823
> cov(x$Length, log(x$Count))
[1] -0.9333345

The covariance in both cases is negative, indicating that longer words tend to be less frequent. If we shuffle one of the covariates around, it eliminates this covariance:

[order() plus runif() give a nice way of randomizing a vector.]
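As a cross-check of the hand computation above, here is a short Python sketch (not part of the original R session) that recomputes the covariance from the same joint table:

```python
# Recompute Cov(X, Y) = E[(X - E(X))(Y - E(Y))] from the joint
# table in the text: X = 1 for Postverbal, Y = 1 for Not Pronoun.
joint = {
    (0, 0): 0.224, (0, 1): 0.655,
    (1, 0): 0.014, (1, 1): 0.107,
}

mu_x = sum(x * p for (x, y), p in joint.items())   # E(X) = 0.121
mu_y = sum(y * p for (x, y), p in joint.items())   # E(Y) = 0.762

# Expectation of the product of deviations, summed over the 4 cells.
cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())

assert abs(mu_x - 0.121) < 1e-9
assert abs(mu_y - 0.762) < 1e-9
assert abs(cov - 0.0148) < 1e-4
```

The positive covariance reflects the tendency of postverbal objects to be non-pronouns in this table.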

> cov(x$Length, log(x$Count)[order(runif(length(x$Count)))])
[1] 0.006211629

The covariance is essentially zero now.

An important aside: the variance of a random variable X is just its covariance with itself:

Var(X) = Cov(X, X)

4.1 Covariance and scaling random variables

What happens to Cov(X, Y) when you scale X? Let Z = a + bX. It turns out that the covariance with Y is multiplied by b: [3]

Cov(Z, Y) = b Cov(X, Y)

As an important consequence of this, rescaling a random variable by Z = a + bX rescales the variance by b^2: Var(Z) = b^2 Var(X).

4.2 Correlation

We just saw that the covariance of word length with frequency was much higher than with log frequency. However, the covariance cannot be compared directly across different pairs of random variables, because we also saw that random variables on different scales (e.g., those with larger versus smaller ranges) have different covariances due to the scale. For this reason, it is common to use the correlation ρ as a standardized form of covariance:

[3] The reason for this is as follows. By linearity of expectation, E(Z) = a + bE(X). This gives us

Cov(Z, Y) = E[(Z − E(Z))(Y − E(Y))]
          = E[(a + bX − a − bE(X))(Y − E(Y))]
          = E[b(X − E(X))(Y − E(Y))]
          = b E[(X − E(X))(Y − E(Y))]   [by linearity of expectation]
          = b Cov(X, Y)
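Both scaling facts, and the scale-invariance of the correlation ρ just introduced, can be verified on simulated data. A Python sketch (simulated data, not the lexicon data used above):

```python
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
ys = [x + random.gauss(0.0, 1.0) for x in xs]    # correlated with xs

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance (dividing by n)."""
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

def corr(u, v):
    return cov(u, v) / (cov(u, u) * cov(v, v)) ** 0.5

a, b = 3.0, -2.0
zs = [a + b * x for x in xs]                     # Z = a + bX

# Cov(a + bX, Y) = b Cov(X, Y), and Var(a + bX) = b^2 Var(X).
assert abs(cov(zs, ys) - b * cov(xs, ys)) < 1e-9
assert abs(cov(zs, zs) - b**2 * cov(xs, xs)) < 1e-9

# Correlation standardizes by both standard deviations, so rescaling
# X by a positive factor leaves it unchanged.
assert abs(corr([5 + 10 * x for x in xs], ys) - corr(xs, ys)) < 1e-9
```

These identities hold exactly (up to floating-point error) for sample covariances as well, since the derivation in the footnote uses only linearity of expectation.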

ρ_XY = Cov(X, Y) / sqrt(Var(X) Var(Y))

If X and Y are independent, then their covariance (and hence correlation) is zero.

4.3 Multivariate normal distributions

We're now ready to deal with the multivariate normal distribution. Here we'll just work with a 2-dimensional, or bivariate, distribution. Whereas the univariate normal distribution was characterized by two parameters, mean µ and variance σ^2, the bivariate normal distribution is characterized by two mean parameters (µ_X, µ_Y), two variance terms (one for the X axis and one for the Y axis), and one covariance term showing the tendency for X and Y to go together. The three variance and covariance terms are often grouped together into a symmetric covariance matrix as follows:

[ σ^2_XX   σ^2_XY ]
[ σ^2_XY   σ^2_YY ]

Note that the terms σ^2_XX and σ^2_YY are simply the variances in the X and Y axes (the subscripts appear doubled, XX, for notational consistency). The term σ^2_XY is the covariance between the two axes.

library(mvtnorm)
sigma.xx <- 4
sigma.yy <- 1
sigma.xy <- 0 # no covariance
sigma <- matrix(c(sigma.xx, sigma.xy, sigma.xy, sigma.yy), ncol=2)
old.par <- par(mfrow=c(1,2))
x <- seq(-5, 5, by=0.25)
y <- x
f <- function(x, y) {
  xy <- cbind(x, y)
  dmvnorm(xy, c(0,0), sigma)

[R functions used here: cbind(), function(), outer(), dmvnorm()]

}
z <- outer(x, y, f)
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75)
contour(x, y, z, method = "edge", xlab="x", ylab="y")
par(old.par)
# do the same thing again with sigma.xy <- 1.
# Max sigma.xy for this case is 2

5 A bit more about variance, the binomial distribution, and the normal distribution

With covariance in hand, we can now express the variance of a sum of random variables. If Z = X + Y then

Var(Z) = Var(X) + Var(Y) + 2 Cov(X, Y)

As a result, if two random variables are independent then the variance of their sum is the sum of their variances. As an application of this fact, we can now calculate the variance of a binomially distributed random variable X with parameters n, p. X can be expressed as the sum of n identically distributed Bernoulli random variables X_i, each with parameter p. We've already established that the variance of each X_i is p(1 − p). So the variance of X must be np(1 − p).

5.1 Normal approximation to the binomial distribution

Finally, we can combine these facts with the central limit theorem to show how the binomial distribution can be approximated with the normal distribution. The central limit theorem says that the sum of many independent random variables tends toward a normal distribution with appropriate mean and variance. In our case, the binomial distribution has mean np and variance np(1 − p). Thus, when n is large, the binomial distribution should look approximately normal. We can test this:

p <- 0.3

old.par <- par(mfrow=c(4,4))
for(n1 in 1:16) {
  n <- n1*5
  x <- 1:n
  plot(x, dbinom(x,n,p), type="l", lty=1, main=paste("n=",n))
  lines(x, dnorm(x, n*p, sqrt(n*p*(1-p))), lty=2)
  if(n1==1) {
    legend(3, 0.35, c("binomial","normal"), lty=c(1,2), cex=1.5)
  }
}
par(old.par)
# then try with p=0.05

Figure 4: Approximation of binomial distribution with normal distribution for p = 0.4 (left) and p = 0.05 (right).
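The same convergence can be checked without plotting. A Python sketch (Python rather than the R used above), verifying the binomial mean and variance and that the normal approximation improves as n grows:

```python
from math import comb, exp, pi, sqrt

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def norm_pdf(x, mu, sd):
    return exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * sqrt(2 * pi))

p = 0.3
n = 20

# Mean np and variance np(1 - p), exactly as derived from the sum
# of n independent Bernoulli(p) variables.
m = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
v = sum((k - m) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))
assert abs(m - n * p) < 1e-9
assert abs(v - n * p * (1 - p)) < 1e-9

# The normal density with matching mean and variance tracks the
# binomial pmf more closely as n grows.
def max_abs_err(n):
    mu, sd = n * p, sqrt(n * p * (1 - p))
    return max(abs(binom_pmf(k, n, p) - norm_pdf(k, mu, sd))
               for k in range(n + 1))

assert max_abs_err(200) < max_abs_err(20)
```

This is the numerical counterpart of the panels in Figure 4: the dashed normal curve hugs the binomial pmf ever more tightly as n increases.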