Lecture 4: Joint probability distributions; covariance; correlation

10 October 2007

In this lecture we'll learn the following:

1. what joint probability distributions are;
2. visualizing multiple variables/joint probability distributions;
3. marginalization;
4. what covariance and correlation are;
5. a bit more about variance.

1 Joint probability distributions

Recall that a basic probability distribution is defined over a random variable, and that a random variable maps from the sample space to the real numbers (R). What about when you are interested in the outcome of an event that is not naturally characterizable as a single real-valued number, such as the two formants of a vowel? The answer is really quite simple: probability distributions can be generalized over multiple random variables at once, in which case they are called joint probability distributions (jpds). If a jpd is over N random variables at once then it maps from the sample space to R^N, which is shorthand for real-valued vectors of dimension N. Notationally, for random variables X_1, X_2, ..., X_N, the joint probability density function is written as
p(X_1 = x_1, X_2 = x_2, ..., X_N = x_N)

or simply p(x_1, x_2, ..., x_N) for short.

Whereas for a single r.v. the cumulative distribution function is used to indicate the probability of the outcome falling on a segment of the real number line, the joint cumulative probability distribution function indicates the probability of the outcome falling in a region of N-dimensional space. The joint cpd, which is sometimes notated F(x_1, ..., x_N), is defined as the probability of the set of random variables all falling at or below the specified values of X_i:[1]

F(x_1, ..., x_N) ≡ P(X_1 ≤ x_1, ..., X_N ≤ x_N)

The natural thing to do is to use the joint cpd to describe the probabilities of rectangular volumes. For example, suppose X is the f1 formant and Y is the f2 formant of a given utterance of a vowel. The probability that the vowel will lie in the region 480Hz ≤ f1 ≤ 530Hz, 940Hz ≤ f2 ≤ 1020Hz is given below:

P(480Hz ≤ f1 ≤ 530Hz, 940Hz ≤ f2 ≤ 1020Hz)
  = F(530Hz, 1020Hz) - F(530Hz, 940Hz) - F(480Hz, 1020Hz) + F(480Hz, 940Hz)

and visualized in Figure 1 using the code below.

[1] Technically, the definition of the multivariate cpd is

F(x_1, ..., x_N) ≡ P(X_1 ≤ x_1, ..., X_N ≤ x_N) = Σ_{y_1 ≤ x_1} ... Σ_{y_N ≤ x_N} p(y_1, ..., y_N)   [Discrete]
F(x_1, ..., x_N) ≡ P(X_1 ≤ x_1, ..., X_N ≤ x_N) = ∫_{-∞}^{x_1} ... ∫_{-∞}^{x_N} p(y_1, ..., y_N) dy_N ... dy_1   [Continuous]

Linguistics 251 lecture 4 notes, Roger Levy, Fall 2007
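The corner-combination formula above can be sanity-checked on a small discrete joint distribution. Here is a minimal sketch in Python; the uniform pmf over a 4x4 grid is made up for illustration and is not the formant data:

```python
# For a discrete joint distribution, the probability of a rectangle
# (a1, b1] x (a2, b2] equals F(b1,b2) - F(b1,a2) - F(a1,b2) + F(a1,a2).
from itertools import product

# A made-up uniform joint pmf over X in {1..4}, Y in {1..4}.
pmf = {(x, y): 1 / 16 for x, y in product(range(1, 5), range(1, 5))}

def F(x, y):
    """Joint cdf: total mass at or below (x, y)."""
    return sum(p for (a, b), p in pmf.items() if a <= x and b <= y)

direct = sum(p for (a, b), p in pmf.items() if 2 < a <= 4 and 1 < b <= 3)
via_cdf = F(4, 3) - F(4, 1) - F(2, 3) + F(2, 1)
print(direct, via_cdf)  # both 0.25
```

The two inclusion corners added back (the F(a1, a2) term) correct for the region subtracted twice, exactly as in the formant example.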
Figure 1: The probability of the formants of a vowel landing in the grey rectangle can be calculated using the joint cumulative distribution function. (Axes: f1 from 200 to 800, f2 from 500 to 2500.)

plot(c(),c(),xlim=c(200,800),ylim=c(500,2500),xlab="f1",ylab="f2")
rect(480,940,530,1020,col=8)

1.1 Multinomial distributions as jpds

We touched on the multinomial distribution very briefly in a previous lecture. To refresh your memory, a multinomial outcome can be thought of as a random assignment of n outcomes into r distinct classes, that is, as n rolls of an r-sided die where the tabulated outcome is the number of rolls that came up in each of the r classes. This outcome can be represented as a set of random variables X_1, ..., X_r, or equivalently as an r-dimensional real-valued vector. The r-class multinomial distribution is characterized by r - 1 parameters, p_1, p_2, ..., p_{r-1}, which are the probabilities of a die roll coming out as each class. The probability of the die roll coming out in the rth class is 1 - Σ_{i=1}^{r-1} p_i, which is sometimes called p_r but is not a true parameter of the model. The probability mass function looks like this:

p(n_1, ..., n_r) = (n choose n_1 ... n_r) Π_{i=1}^{r} p_i^{n_i} = [n! / (n_1! ... n_r!)] Π_{i=1}^{r} p_i^{n_i}
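This pmf is straightforward to compute directly. Here is a quick numeric check in Python, with made-up parameters (r = 3 classes, p = (0.5, 0.3, 0.2), not from any lecture dataset):

```python
# Multinomial pmf: p(n_1,...,n_r) = n!/(n_1!...n_r!) * prod(p_i^{n_i}).
from math import factorial, prod

def multinomial_pmf(counts, probs):
    n = sum(counts)
    coef = factorial(n) // prod(factorial(k) for k in counts)
    return coef * prod(p ** k for p, k in zip(probs, counts))

# 5 rolls of a 3-sided die with p = (0.5, 0.3, 0.2), outcome (2, 2, 1):
# coefficient 5!/(2!2!1!) = 30, so pmf = 30 * 0.25 * 0.09 * 0.2 = 0.135
print(multinomial_pmf((2, 2, 1), (0.5, 0.3, 0.2)))  # 0.135
```

Note that for r = 2 this reduces to the familiar binomial pmf.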
Figure 2: Distribution of animacy for recipient and theme (mosaic plot of AnimacyOfRec against AnimacyOfTheme)

2 Visualizing joint probability distributions

We'll look at two examples of visualizing jpds. First is a discrete distribution, the one over animacy of theme and recipient in the dative dataset in languageR, shown in Figure 2 using the code below:

> xtabs(~ AnimacyOfRec + AnimacyOfTheme, data=dative)
             AnimacyOfTheme
AnimacyOfRec  animate inanimate
  animate         68      2956
  inanimate        6       233
> mosaicplot(xtabs(~ AnimacyOfRec + AnimacyOfTheme, data=dative))

The second example is of the joint distribution of word length and frequency in the Brown corpus. First we'll put together an estimate of the joint distribution:

persp(kde2d(x$Length, log(x$Count), n=50), theta=210, phi=15,
      xlab="Word Length", ylab="Word Frequency", zlab="p(Length,Freq)")

[New R functions: xtabs(), mosaicplot(), persp(), kde2d()]

This gives a perspective plot of the joint distribution of length and frequency. Another way of visualizing the joint distribution is going along the length axis slice by slice in a coplot:

[New R function: histogram()]
Figure 3: Perspective plot and coplot (i.e., slice-by-slice histogram plot) of the relationship between length and log-frequency in the Brown corpus

histogram(~ log(x$Count) | x$Length)

Both resulting plots are shown in Figure 3.

3 Marginalization

Often we have direct access to a joint density function but we are more interested in the probability of an outcome of a subset of the random variables in the joint density. Obtaining this probability is called marginalization, and it involves taking a weighted sum[2] over the possible outcomes of the r.v.s that are not of interest. For two variables X, Y:

P(X = x) = Σ_y P(x, y) = Σ_y P(X = x | Y = y) P(y)

In this case P(X) is often called a marginal probability and the process of calculating it from the joint density P(X, Y) is known as marginalization.

[2] or integral in the continuous case
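Marginalization can be illustrated with the animacy counts from Section 2, converting the counts to joint probabilities and summing out the theme variable. A sketch in Python:

```python
# Marginalize the joint distribution of (recipient animacy, theme animacy);
# counts come from the dative table shown in Section 2.
joint_counts = {
    ("animate", "animate"): 68,
    ("animate", "inanimate"): 2956,
    ("inanimate", "animate"): 6,
    ("inanimate", "inanimate"): 233,
}
total = sum(joint_counts.values())  # 3263
joint = {k: v / total for k, v in joint_counts.items()}

# P(Rec = animate) = sum over theme outcomes of P(animate, theme)
p_rec_animate = sum(p for (rec, _), p in joint.items() if rec == "animate")
print(round(p_rec_animate, 4))  # (68 + 2956) / 3263, about 0.9268
```

The weighted-sum form P(x) = Σ_y P(X = x | Y = y) P(y) gives the same number, since P(x, y) = P(x | y) P(y).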
4 Covariance

The covariance between two random variables X and Y is defined as follows:

Cov(X, Y) = E[(X - E(X))(Y - E(Y))]   (1)

A simple example, where each cell gives the joint probability of the corresponding pair of outcomes:

                                     Coding for Y:  0          1
                                                    Pronoun    Not Pronoun
  Coding for X:  0  Object Preverbal                0.224      0.655        0.879
                 1  Object Postverbal               0.014      0.107        0.121
                                                    0.238      0.762

Each of X and Y can be treated as a Bernoulli random variable, with arbitrary codings of 1 for Postverbal and Not Pronoun, and 0 for the others. As a result, we have μ_X = 0.121, μ_Y = 0.762. The covariance between the two is:

   (0 - 0.121)(0 - 0.762)(0.224)   [cell (0,0)]
 + (1 - 0.121)(0 - 0.762)(0.014)   [cell (1,0)]
 + (0 - 0.121)(1 - 0.762)(0.655)   [cell (0,1)]
 + (1 - 0.121)(1 - 0.762)(0.107)   [cell (1,1)]
 = 0.0148

In R, we can use the cov() function to get the covariance between two random variables, such as word length versus frequency across the English lexicon:

> cov(x$Length, x$Count)
[1] -42.44823
> cov(x$Length, log(x$Count))
[1] -0.9333345

The covariance in both cases is negative, indicating that longer words tend to be less frequent. If we shuffle one of the covariates around, it eliminates this covariance:

[order() plus runif() give a nice way of randomizing a vector.]
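The hand computation above can be verified mechanically from the joint probabilities in the table. A short Python check:

```python
# Covariance of two Bernoulli-coded variables computed from their joint
# distribution, using the preverbal/postverbal x pronoun/not-pronoun table.
joint = {(0, 0): 0.224, (0, 1): 0.655, (1, 0): 0.014, (1, 1): 0.107}

mu_x = sum(p for (x, _), p in joint.items() if x == 1)  # 0.121
mu_y = sum(p for (_, y), p in joint.items() if y == 1)  # 0.762
cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())
print(round(cov, 4))  # 0.0148
```

For Bernoulli codings this is equivalent to E[XY] - E[X]E[Y] = 0.107 - (0.121)(0.762), which gives the same 0.0148.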
> cov(x$Length, log(x$Count)[order(runif(length(x$Count)))])
[1] 0.006211629

The covariance is essentially zero now. An important aside: the variance of a random variable X is just its covariance with itself:

Var(X) = Cov(X, X)

4.1 Covariance and scaling random variables

What happens to Cov(X, Y) when you scale X? Let Z = a + bX. It turns out that the covariance with Y is multiplied by b:[3]

Cov(Z, Y) = b Cov(X, Y)

As an important consequence of this, rescaling a random variable by Z = a + bX rescales the variance by b²: Var(Z) = b² Var(X).

4.2 Correlation

We just saw that the covariance of word length with raw frequency was much larger in magnitude than with log frequency. However, the covariance cannot be compared directly across different pairs of random variables, because, as we also saw, random variables on different scales (e.g., those with larger versus smaller ranges) have different covariances due to the scale alone. For this reason, it is common to use the correlation ρ as a standardized form of covariance:

[3] The reason for this is as follows. By linearity of expectation, E(Z) = a + bE(X). This gives us

Cov(Z, Y) = E[(Z - E(Z))(Y - E(Y))]
          = E[(a + bX - (a + bE(X)))(Y - E(Y))]
          = E[(bX - bE(X))(Y - E(Y))]
          = E[b(X - E(X))(Y - E(Y))]
          = b E[(X - E(X))(Y - E(Y))]   [by linearity of expectation]
          = b Cov(X, Y)
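The two scaling identities can be checked numerically on a small made-up sample (population moments, dividing by n; the numbers are for illustration only):

```python
# Check Cov(a + bX, Y) = b Cov(X, Y) and Var(a + bX) = b^2 Var(X)
# on a small made-up sample, using population (divide-by-n) moments.
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
a, b = 2.0, 3.0
zs = [a + b * x for x in xs]

print(cov(xs, ys))  # 0.75
print(cov(zs, ys))  # b * Cov(X, Y) = 2.25
print(cov(zs, zs))  # Var(Z) = b^2 * Var(X) = 9 * 1.25 = 11.25
```

The additive constant a drops out entirely, since it shifts X and E(X) by the same amount.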
ρ_XY = Cov(X, Y) / sqrt(Var(X) Var(Y))

If X and Y are independent, then their covariance (and hence correlation) is zero.

4.3 Multivariate normal distributions

We're now ready to deal with the multivariate normal distribution. Here we'll just work with a 2-dimensional, or bivariate, distribution. Whereas the univariate normal distribution was characterized by two parameters, mean μ and variance σ², the bivariate normal distribution is characterized by two mean parameters (μ_X, μ_Y), two variance terms (one for the X axis and one for the Y axis), and one covariance term showing the tendency for X and Y to go together. The three variance and covariance terms are often grouped together into a symmetric covariance matrix as follows:

  [ σ²_XX  σ²_XY ]
  [ σ²_XY  σ²_YY ]

Note that the terms σ²_XX and σ²_YY are simply the variances in the X and Y axes (the subscripts appear doubled, XX, for notational consistency). The term σ²_XY is the covariance between the two axes.

[New R functions: cbind(), library(mvtnorm), function(), outer(), dmvnorm()]

sigma.xx <- 4
sigma.yy <- 1
sigma.xy <- 0 # no covariance
sigma <- matrix(c(sigma.xx,sigma.xy,sigma.xy,sigma.yy),ncol=2)
old.par <- par(mfrow=c(1,2))
x <- seq(-5,5,by=0.25)
y <- x
f <- function(x,y) {
  #cat("x: ",x,"\n")
  #cat("y: ",y,"\n")
  xy <- cbind(x,y)
  #cat("xy: ",xy,"\n")
  dmvnorm(xy,c(0,0),sigma)
}
z <- outer(x,y,f)
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75)
contour(x, y, z, method = "edge", xlab="x", ylab="y")
par(old.par)
# do the same thing again with sigma.xy <- 1.
# Max sigma.xy for this case is 2.

5 A bit more about variance, the binomial distribution, and the normal distribution

With covariance in hand, we can now express the variance of a sum of random variables. If Z = X + Y then

Var(Z) = Var(X) + Var(Y) + 2 Cov(X, Y)

As a result, if two random variables are independent then the variance of their sum is the sum of their variances. As an application of this fact, we can now calculate the variance of a binomially distributed random variable X with parameters n, p. X can be expressed as the sum of n identically distributed Bernoulli random variables X_i, each with parameter p. We've already established that the variance of each X_i is p(1 - p). So the variance of X must be np(1 - p).

5.1 Normal approximation to the binomial distribution

Finally, we can combine these facts with the central limit theorem to show how the binomial distribution can be approximated with the normal distribution. The central limit theorem says that the sum of many independent random variables tends toward a normal distribution with appropriate mean and variance. In our case, the binomial distribution has mean np and variance np(1 - p). Thus, when n is large, the binomial distribution should look approximately normal. We can test this:

p <- 0.3
Figure 4: Approximation of the binomial distribution with the normal distribution, for p = 0.3 (left) and p = 0.05 (right); each grid shows one panel per value of n = 5, 10, ..., 80.

old.par <- par(mfrow=c(4,4))
for(n1 in 1:16) {
  n <- n1*5
  x <- 1:n
  plot(x, dbinom(x,n,p),type="l",lty=1,main=paste("n=",n))
  lines(x, dnorm(x,n*p,sqrt(n*p*(1-p))),lty=2)
  if(n1==1) {
    legend(3,0.35,c("binomial","normal"),lty=c(1,2),cex=1.5)
  }
}
par(old.par)
# then try with p=0.05
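Beyond the visual comparison, the approximation can be checked at a single point. A Python sketch, evaluating the exact binomial pmf and the approximating normal density N(np, np(1-p)) at the mean (n = 100 and k = 30 chosen here for illustration):

```python
# Compare the Binomial(n, p) pmf with the density of its approximating
# normal N(np, np(1-p)), both evaluated at k = np, for n = 100, p = 0.3.
from math import comb, exp, pi, sqrt

n, p = 100, 0.3
k = 30  # the mean, np

binom_pmf = comb(n, k) * p**k * (1 - p) ** (n - k)
mu, sd = n * p, sqrt(n * p * (1 - p))
normal_density = exp(-((k - mu) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))

print(binom_pmf, normal_density)  # both close to 0.087
```

The agreement improves as n grows, and degrades for small p at fixed n, exactly the pattern visible in Figure 4.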