A PRIMER IN PROBABILITY

This handout is intended to refresh you on the elements of probability and statistics that are relevant for econometric analysis. To help you prioritize the information you need to retain, I have marked any essential formula that is universally true.

I. Descriptive statistics

First of all, we should distinguish between two concepts: the population is the entire group that we wish to study, while the sample is the subset of the population for which we have information. Some formulas will differ slightly, depending on whether we are describing a population or a sample.

When faced with a bunch of numbers, in either a population or a sample, we often look for simple ways to summarize the data. For example, look at the results of two groups' systolic blood pressure readings:

Group A: 95, 102, 98, 104, 101, 104, 99
Group B: 131, 129, 167, 103, 142, 126, 153

We might notice two differences between these groups: first, that Group B tends to have much higher blood pressure readings; next, that Group A tends to be clustered closely together, while Group B is more spread out. Those two statements are, in essence, descriptive statistics, though in this case they are very informal observations, and they might invite the question: what exactly do you mean by "more spread out"?

We have a number of standard measurements to express characteristics of samples or populations. Measurements of central tendency express whether the numbers tend to be high or low. The most common of these are:

Mean: the average value.
Median: the middle value.
Mode: the most common value. (In practice, almost nobody uses the mode.)

The mean and median of a population will be different if the distribution is skewed, meaning that there are larger (or smaller) gaps between values at the high end than at the low end.
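As a quick check on these summary measures, here is a small sketch using Python's standard library (the handout itself uses no software; this is purely illustrative) that computes the mean, median, and population standard deviation for the two blood-pressure groups above:

```python
import statistics

# Systolic blood pressure readings from the two groups in the text
group_a = [95, 102, 98, 104, 101, 104, 99]
group_b = [131, 129, 167, 103, 142, 126, 153]

for name, data in [("A", group_a), ("B", group_b)]:
    print(f"Group {name}: mean={statistics.mean(data):.1f}, "
          f"median={statistics.median(data)}, "
          f"pop. std dev={statistics.pstdev(data):.1f}")
```

The output makes the informal observations precise: Group B's center is much higher, and its standard deviation is several times larger than Group A's.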
For example, the distribution of income is very skewed: the income of the wealthiest people differs by billions of dollars, while the income of the poorest people differs by pennies. Because of this, mean income might be a slightly misleading indicator, since a few

wealthy people can pull the average up, so that most people actually have incomes below the average. The median addresses this issue by reporting the income of the person right in the middle of the distribution.¹

The second characteristic that we might wish to describe is the spread of the distribution: whether observations are clustered closely together or spread apart. Most often, we use the variance or the standard deviation to express this concept. The standard deviation in a group measures, roughly, the typical distance between each observation and the mean; the variance is just the standard deviation, squared (or the average squared difference between the observations and their mean).

Skewness refers to whether the gaps at the top of the distribution are larger or smaller than those at the bottom. (Formally, this is calculated as the average cubed difference between the observations and their mean.) Skewness is not synonymous with "biased"; avoid saying that results are skewed unless you are certain that you are using the word correctly.

The maximum and minimum values should be self-evident. Finally, the Xth percentile refers to the value that X% of the group lies below.² For example, the median is exactly the same thing as the 50th percentile.

II. Probability

In probability, an event is something, determined by chance, that either does or does not happen. An event can be described as simple, meaning that there is only one way to achieve the outcome, or complex, meaning that there are a number of simple events that would satisfy the condition. For example, an (American) roulette wheel contains the numbers 1 through 36, plus 0 and 00. Aside from 0 and 00, half of the numbers are red, and half black. The betting board looks something like this:

[Figure: layout of the roulette betting board]

In this context, an event would be anything that you could place a bet on.
A simple event would be a bet on the number 17, since there is only one outcome that

¹ In the March 2005 Current Population Survey, mean household income was $61,905. However, 63% of households earned less than the average. The median income was $46,
² The 90th percentile in household income is $122,324: 90% of households earn less than this, and 10% earn more.

A primer in probability, p. 2

would win this bet. Betting on all odd numbers would be a complex event, since the outcomes 1, 3, 5, ..., 35 would all satisfy this condition.

Formally, let S denote the space of all possible outcomes. Any event is a subset of S. We will use letters like A and B to denote generic events, while ¬A or ¬B will denote the complement of A or B: all the things that are not part of the event. For example, if A is the event "red wins," then ¬A is the event "black or house wins." The union of two events, A ∪ B, consists of all outcomes that satisfy one event or the other (or both); the intersection, A ∩ B, consists of all outcomes satisfying both conditions. For all practical purposes, you can read A ∪ B as "A or B" and A ∩ B as "A and B." We use the following logical rules for combining ands, ors, and nots:

¬(A ∪ B) = (¬A) ∩ (¬B)
¬(A ∩ B) = (¬A) ∪ (¬B)

To put this in words: "neither A nor B happened" is equivalent to saying "A did not happen, and B did not happen"; and "it was not the case that both A and B occurred" is the same as "A did not happen, or B did not happen (or neither happened)."

Let's go back to the roulette example, where A was the event that odd wins and B was "red wins." Then we can write the following:

S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 0, 00}
A = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35}
B = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}
¬A = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 0, 00}
A ∪ B = {1, 3, 5, 7, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 23, 25, 27, 29, 30, 31, 32, 33, 34, 35, 36}
A ∩ B = {1, 3, 5, 7, 9, 19, 21, 23, 25, 27}

A probability measure is a function P[A] that tells us the fraction of times that an event occurs. The probability measure must satisfy three properties:

0 ≤ P[A] ≤ 1
P[S] = 1
P[¬A] = 1 − P[A]
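The set algebra and the three properties can be checked mechanically. Below is a small illustrative sketch (not part of the original handout) that models the roulette events as Python sets and verifies De Morgan's laws and the properties of the probability measure, where each of the 38 outcomes is equally likely:

```python
from fractions import Fraction

# Roulette outcomes; "00" kept as a string so it stays distinct from 0
S = set(range(0, 37)) | {"00"}
A = set(range(1, 37, 2))   # "odd wins"
B = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}  # "red wins"

def P(event):
    # every outcome is equally likely on a fair wheel
    return Fraction(len(event), len(S))

# De Morgan's laws, checked directly on the sets
assert S - (A | B) == (S - A) & (S - B)
assert S - (A & B) == (S - A) | (S - B)

# the three properties of a probability measure
assert 0 <= P(A) <= 1
assert P(S) == 1
assert P(S - A) == 1 - P(A)

print(P(A), P(A & B))   # 18/38 and 10/38, shown in lowest terms
```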

In words: a probability cannot be negative, nor can it exceed one (a completely impossible event has a probability of zero, while a certain event has a probability of one); it is certain that something in the space of all possible outcomes will occur; and finally, if the chance that A happens is X, then the chance that A doesn't happen is 1 − X.

We can calculate the chance of any complex event by adding up the probabilities of the simple events that it contains. Calculating the odds in roulette is fairly simple, since there is a 1/38 chance that the ball lands on any specific number. The probability that odd wins is therefore P[A] = P[1] + P[3] + ... + P[35] = 18/38.

If we already know the probabilities that some complex events occur, and we want to calculate the chance that their union occurs (that one or the other, or both, happens), we cannot simply add the probabilities together. For example, there is an 18/38 chance that odd wins, and there is an 18/38 chance that red wins. The chance that red or odd wins is not 18/38 + 18/38 = 36/38. Look at the roulette board again: only 26 of the 38 outcomes are either red or odd, so this should be the chance that A ∪ B occurs. By simply adding P[A] to P[B], we have double-counted the outcomes that are both red and odd. A correct calculation of the probability is:

P[A ∪ B] = P[A] + P[B] − P[A ∩ B]

This is always the rule for calculating the probability of the union of events. When there is nothing in the intersection of two events, we say that the events are disjoint or mutually exclusive. For example, "even wins" and "odd wins" are mutually exclusive events, since there is no outcome satisfying both conditions. In this special case, the probability that one or the other occurs is simply their sum:

P[A ∪ B] = P[A] + P[B], if A and B are mutually exclusive events.

Finally, we should address conditional probability. Suppose that you are playing roulette, and you have placed a bet on odd.
In general, your chance of winning is 18/38, slightly under half. The wheel spins and the ball stops, but your view is obscured. However, you hear someone call out, "yes, red wins!" Even though you did not bet on red, you should be a bit excited about this news, since your chance of winning has increased: exactly half of the reds (9 of 18) are odd. If we know that event B has occurred, we can use this information to revise our expectations about A. The "probability of A conditional on B" or the "probability of A given B" is always calculated as:

P[A|B] = P[A ∩ B] / P[B]

In this case, there are nine outcomes that are both red and odd, so P[A ∩ B] = 9/38. Eighteen outcomes are red, so P[B] = 18/38. Therefore, given that the outcome is red, the chance that it is odd is P[A|B] = (9/38) / (18/38) = 9/18.

We say that two events are independent if P[A|B] = P[A]; in other words, knowing B does not help us revise our probabilities that A occurred. In this example, "red wins" and "odd wins" are not independent, since P[A] = 18/38, while P[A|B] = 9/18. (They are not independent because red winning indicates that the house has not won; that is, that 0 or 00 did not come up.)

If we want to calculate the probability that an intersection occurs (that both A and B happen), we can get this formula by rearranging the one for conditional probabilities:

P[A ∩ B] = P[A|B] × P[B], in general; and
P[A ∩ B] = P[A] × P[B], if the events are independent.

This last rule is incredibly useful. The heroine of the movie Run Lola Run places a hundred-mark bet on the number 20 on a roulette wheel. When 20 wins, she keeps her earnings on the same number, and 20 wins again. What are the odds that this occurs? The odds of any individual number winning are 1/38,³ so assuming that the spins of the roulette wheel are independent, the chance that 20 wins twice in a row is (1/38) × (1/38) = 1/1444.

When we have repeated, independent trials, this rule is convenient for calculating the probability that A and B occur. If we want to know instead the chance that A or B occurs, we have to combine several of our rules. One roulette strategy is to walk into the casino with $36, and place a $1 bet on a single number (my lucky number would be 19) for 36 spins. Since a winning bet on a single number pays off 36:1, you will come out ahead if your number comes up at least once within these 36 spins. So what is the chance that this happens?

P[(Spin 1 = 19) or (Spin 2 = 19) or ... or (Spin 36 = 19)]
= 1 − P[¬{(Spin 1 = 19) or (Spin 2 = 19) or ... or (Spin 36 = 19)}]
= 1 − P[(Spin 1 ≠ 19) and (Spin 2 ≠ 19) and ... and (Spin 36 ≠ 19)]
= 1 − P[Spin 1 ≠ 19] × P[Spin 2 ≠ 19] × ... × P[Spin 36 ≠ 19]
= 1 − (37/38) × (37/38) × ... × (37/38)
= 1 − (37/38)^36 ≈ 0.617

³ In truth, since Lola plays on a European wheel without 00, the odds are 1/37.
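The two roulette calculations above can be reproduced exactly with rational arithmetic; the sketch below (illustrative only, using the American wheel's 1/38 as in the main text rather than the footnote's 1/37) checks both the independent-spins product rule and the complement trick:

```python
from fractions import Fraction

p_single = Fraction(1, 38)          # chance any one number wins (American wheel)

# Lola's two wins in a row on 20: independent spins, so probabilities multiply
p_double = p_single * p_single
print(p_double)                     # 1/1444

# Chance that 19 comes up at least once in 36 spins:
# 1 - P[miss on every single spin]
p_at_least_once = 1 - Fraction(37, 38) ** 36
print(float(p_at_least_once))       # about 0.617
```

Note that the strategy is favorable in probability terms (you win at least once more often than not) even though the expected payoff is still against you.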

Finally, we should talk about Bayes' Rule. Suppose that you are being tested for some horrible disease. Fortunately, this disease is fairly rare: in the population overall, only 1 in 10,000 people have it, so we'll say that P[D] = 1/10,000, where D is the event "you have the disease." The test for this disease is very accurate, but not absolutely perfect. Among people who have the disease, 99.5% get a positive result, and 0.5% get a false negative; among those without, 99.9% get a correct negative, while 0.1% get a false positive. We can write the probabilities of obtaining a positive test (the event P) for these populations as P[P|D] = 0.995 and P[P|¬D] = 0.001.

You take the test, and you are shocked to obtain a positive result. Given the accuracy of the test, does this mean you are likely to die? The answer is no, in fact. Think of it this way: in a population of 10,000,000 people, we would expect that 1,000 people have the disease, while 9,999,000 do not. If everyone were to take the test, 995 diseased people would get positive results and 9,999 well people would get (false) positives. Among the 10,994 people who get positive results, only 995 have the disease, so the chance of actually having the disease, given a positive test, is only 995/10,994 ≈ 0.09. Most likely, your result was a false positive.

We have informally used Bayes' Rule to calculate the chance of having the disease, given a positive result: P[D|P]. Bayes' Rule is used when you have an unconditional probability (also called a prior probability or an ex-ante probability) that you want to revise after the arrival of some news (the final conditional probability is sometimes called a posterior probability or an ex-post probability). In general, the rule is:

P[D|P] = P[P|D] × P[D] / (P[P|D] × P[D] + P[P|¬D] × P[¬D])

In this case,

P[D|P] = (0.995)(0.0001) / ((0.995)(0.0001) + (0.001)(0.9999)) ≈ 0.09

III. Random variables

A random variable takes on a numerical value that is determined by chance.
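Returning to the disease-testing example for a moment, the Bayes' Rule arithmetic can be sketched in a few lines (an illustrative check, not part of the original handout):

```python
# Bayes' Rule for the disease-testing example: prior 1/10,000,
# sensitivity 99.5%, false-positive rate 0.1%
p_d = 1 / 10_000           # P[D], the prior
p_pos_given_d = 0.995      # P[P | D]
p_pos_given_not_d = 0.001  # P[P | not D]

# denominator: total probability of a positive test
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
p_d_given_pos = p_pos_given_d * p_d / p_pos

print(p_d_given_pos)   # about 0.0905: most positive results are false positives
```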
We will use X or Y to denote a generic random variable, and x or y to indicate a specific value that the variable could take. For example, we could let X be the number of heads obtained from three coin flips. When we write P[X = x], we mean P[the number of heads obtained from three coin flips = x]. X stands for the thing that we are measuring; x is a specific value that it could take.

At times, we will want to distinguish between a discrete random variable, which takes on only a limited number of values, and a continuous random variable, which can take any value within some range. The number of heads obtained from three coin tosses is discrete, since this can take only the values zero, one, two, or three; we could not obtain a non-integer number of heads. In contrast, the length of time until a light bulb fails is a continuous random variable, since this could be any non-negative value (a light bulb could last any fractional number of months).

As before, a probability distribution describes the likelihood of specific outcomes for a random variable. For simplicity, we will often write P[x] to indicate P[X = x] when there is no ambiguity; for example, P[2.718] stands for P[X = 2.718]. We will again let S denote the set of all possible outcomes for a variable.

The expected value of a random variable is its theoretical average value. For a discrete random variable, this is calculated as:

E[X] = Σ_{x∈S} x × P[x]

In other words, we add up all possible outcomes times the chance of obtaining that outcome. When flipping a coin three times, there is a 1/8 chance of obtaining zero heads; a 3/8 chance of obtaining one head; a 3/8 chance of obtaining two heads; and a 1/8 chance of obtaining three heads. Therefore, the expected number of heads is:

E[X] = 0 × (1/8) + 1 × (3/8) + 2 × (3/8) + 3 × (1/8) = 12/8

We can also take the expected value of any function of X. If X is a random variable, then G(X) is one, too. The expected value of G(X) is calculated as:

E[G(X)] = Σ_{x∈S} G(x) × P[x]

With linear functions, like G(X) = a + b × X, we can write E[G(X)] as a function of the expected value of X. The rule is:

E[a + b × X] = a + b × E[X]

This is not true of other functions, however: E[log(X)] ≠ log(E[X]).

At times, we are given additional information that allows us to revise our expectations about the value of X.
For example, someone might have revealed that not all of the coins turned up heads in our coin toss. Since this rules out the possibility that we got three heads, we should lower our expectations about the number of heads that we did get.

If we know that X is actually in T, some subset of S, the conditional expectation of X given T is:

E[X | T] = Σ_{x∈T} x × P[x] / Σ_{x∈T} P[x]

Given that the number of heads is zero, one, or two, the conditional expectation is:

E[X | X ≠ 3] = (0 × (1/8) + 1 × (3/8) + 2 × (3/8)) / ((1/8) + (3/8) + (3/8)) = (9/8) / (7/8) = 9/7

When we take the expectation of the sum of random variables, the expectation can always be broken up at the summation:

E[X + Y] = E[X] + E[Y]

The same is generally not true for the expected value of a product: E[X × Y] ≠ E[X] × E[Y]. The only time that this does work is when the two variables are independent:

E[X × Y] = E[X] × E[Y], if X and Y are independent.

The variance is the theoretical average squared difference between the outcome and its mean. For a discrete random variable, the formula is:

Var(X) = E[(X − E[X])²] = Σ_{x∈S} (x − E[X])² × P[x]

The variance in the number of heads from three coin tosses is:

Var(X) = (0 − 3/2)² × (1/8) + (1 − 3/2)² × (3/8) + (2 − 3/2)² × (3/8) + (3 − 3/2)² × (1/8)
= (9/4) × (1/8) + (1/4) × (3/8) + (1/4) × (3/8) + (9/4) × (1/8) = 24/32

Note that the variance must always be positive (or at least, non-negative), since it requires adding up a bunch of squared terms (which cannot be negative), each multiplied by a probability (which must also be non-negative). We can also calculate the variance in any function of X:

Var[G(X)] = Σ_{x∈S} (G(x) − E[G(X)])² × P[x]

With linear functions, there is again a specific rule:
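The coin-flip expectations above can be verified by brute-force enumeration. This sketch (illustrative, not part of the original handout) builds the distribution of X from the eight equally likely sequences, then computes the mean, the variance, and the conditional expectation given X ≠ 3, all in exact fractions:

```python
from fractions import Fraction
from itertools import product

# X = number of heads in three fair coin flips; build P[x] by counting
# heads across all eight equally likely sequences
seqs = list(product("HT", repeat=3))
pmf = {x: Fraction(sum(1 for s in seqs if s.count("H") == x), len(seqs))
       for x in range(4)}

mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())
print(mean, var)                      # 3/2 3/4

# conditional expectation, given that we learn X != 3
t = [x for x in pmf if x != 3]
cond = sum(x * pmf[x] for x in t) / sum(pmf[x] for x in t)
print(cond)                           # 9/7
```

Note that 24/32 in the handout's variance calculation reduces to 3/4.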

Var[a + b × X] = b² × Var(X)

In other words, adding a constant amount to the variable does not affect its spread, but scaling the variable up or down by a constant b does affect this spread. Finally, the standard deviation in a random variable is the square root of its variance.

When two random variables are observed concurrently, we might want to express whether they tend to move in the same direction, in opposite directions, or have no joint tendencies. The covariance between X and Y is calculated as:

Cov(X, Y) = E[(X − E[X]) × (Y − E[Y])] = Σ_{x∈S} Σ_{y∈T} (x − E[X])(y − E[Y]) × P[X = x, Y = y]

(Fortunately, you need to know covariance more for the concept than for calculations in practice.) If the covariance is positive, the two variables tend to move in the same direction: when X is above average, then Y is generally above average as well. If the covariance is negative, the variables tend to move in opposite directions: when X is above average, Y tends to be below average. When the covariance is zero, the variables essentially move independently.

While the sign of the covariance indicates whether the variables tend to move in the same direction, the magnitude is a bit more difficult to interpret. It tells us the size of the similarity, but it also reflects the size of the random variables themselves. (Simply doubling X will double the covariance between X and any other variable.) To adjust for the scale of the variables, we usually use the correlation:

Corr(X, Y) = Cov(X, Y) / √(Var(X) × Var(Y))

The correlation is an index that ranges between −1 and +1. A correlation of zero indicates that the variables have nothing in common; a correlation of one means that the variables are exactly the same, except that they might be measured in different scales; and a correlation of minus one means that they are exactly opposite.
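Covariance and correlation are easy to illustrate on a toy dataset. The sketch below (the numbers are hypothetical, chosen only for illustration) computes both, and checks two facts discussed here: that Cov(X, X) is the variance, and that correlation is unchanged by a linear rescaling a + b × X (for b > 0) while covariance is not:

```python
import math

# hypothetical equally likely (x, y) outcomes, chosen only to illustrate
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return mean([(x - ma) * (y - mb) for x, y in zip(a, b)])

def corr(a, b):
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))

# Cov(X, X) equals Var(X)
var_x = mean([(x - mean(xs)) ** 2 for x in xs])
assert cov(xs, xs) == var_x

# rescaling X doubles the covariance but leaves the correlation alone
scaled = [3 + 2 * x for x in xs]
print(corr(xs, ys), corr(scaled, ys))   # identical values
```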
More generally, a magnitude close to one indicates that the variables are very similar, while a correlation close to zero indicates a weak relationship.

⁴ For example, temperature in Fahrenheit and temperature in Celsius are variables that would have a correlation of exactly one, since they measure exactly the same thing in different scales. The peculiar old Delisle temperature scale, where water boiled at 0 °D and froze at 150 °D, would have a correlation of negative one with either Fahrenheit or Celsius temperatures.

When taking the covariance or correlation of a linear function of some random variable, the rules are:

Cov(a + b × X, Y) = b × Cov(X, Y)
Corr(a + b × X, Y) = Corr(X, Y)

Finally, we should note that the covariance between a variable and itself is the same as the variance in that variable:

Cov(X, X) = E[(X − E[X]) × (X − E[X])] = E[(X − E[X])²] = Var(X)

These formulas change a bit when dealing with continuous random variables. With continuous random variables, the probability that X takes any specific value (exactly) is infinitesimally small: there is virtually zero chance that a light bulb burns out at exactly one particular instant.

P[X = x] ≈ 0, if X is continuous (generally).

However, there is a non-negligible chance that the variable falls within some range, so it makes sense to talk about P[a ≤ X ≤ b]. The distribution of a continuous random variable X is described by a probability density function f(x), also known as a p.d.f., which has the property:

P[a ≤ X ≤ b] = ∫_a^b f(x) dx, where f(x) is the p.d.f. of X

A familiar p.d.f. is the bell curve of the normal distribution. Formally, this curve is characterized as:

f(x) = (2πσ²)^(−1/2) × exp(−(x − µ)² / (2σ²))

where µ is the mean of the population and σ² is its variance. Graphing this function, we get:

[Figure: the bell-shaped p.d.f. of the normal distribution]

The fraction of the population whose X values lie between two points, a and b for example, is the area under this curve between a and b. Thinking back to calculus: if we are given some function f(x) and we want to know the area under this curve between two points, we integrate the function between those points. So ∫_a^b f(x) dx represents the fraction of the population whose Xs are between these values, and this is the probability that any observation, picked at random, is between these values.

The normal distribution is only one example of a p.d.f.; we will deal with several others. For all distributions, a p.d.f. must satisfy two properties:

f(x) ≥ 0
∫_{−∞}^{+∞} f(x) dx = 1

These ensure that the probability measure will have all of the necessary properties mentioned in the previous section.

For continuous random variables, the cumulative distribution function or c.d.f. is another important function. The c.d.f. gives the probability that the variable is below some specific value:

F(x) = P[X ≤ x], where F(x) is the c.d.f. of the random variable X

Clearly, this is related to the p.d.f.:

F(b) = ∫_{−∞}^b f(x) dx

The c.d.f., evaluated at some point, is the p.d.f. integrated up to that point. This implies that the p.d.f. is the derivative of the c.d.f. Also, the c.d.f. can be used to calculate the probability that X falls in some range (which is sometimes a convenient alternative to integrating the p.d.f.):

F′(x) = f(x)
P[a ≤ X ≤ b] = F(b) − F(a)

The formulas for calculating expected values, variances, and such for continuous random variables are identical to those for discrete random variables, with two changes:

1. We replace the summation (over all possible values) with an integral (over the entire range); and
2. We replace the probability P[x] with the p.d.f. f(x).
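The relationship P[a ≤ X ≤ b] = F(b) − F(a) can be demonstrated numerically. This sketch (illustrative only) integrates the standard normal p.d.f. over [−1, 1] with a crude midpoint rule and compares the result with the exact c.d.f. difference, which Python's standard library exposes through the error function `math.erf`:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def integrate(f, a, b, n=10_000):
    """Simple midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# P[-1 <= Z <= 1] for a standard normal, two ways
approx = integrate(normal_pdf, -1, 1)          # area under the p.d.f.
exact = math.erf(1 / math.sqrt(2))             # Phi(1) - Phi(-1)
print(approx, exact)                           # both about 0.6827
```

This also echoes the point made later in the handout: the normal p.d.f. has no elementary antiderivative, so in practice the integration is always done by a computer or a table.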

This gives us the formula for the expected value:

E[X] = ∫_{−∞}^{+∞} x × f(x) dx

We can also calculate the expected values of functions of X, by integrating those functions instead of just x. The rule that E[a + b × X] = a + b × E[X] holds for continuous random variables, just as it did for discrete ones. The conditional expectation is defined similarly:

E[X | a ≤ X ≤ b] = ∫_a^b x × f(x) dx / ∫_a^b f(x) dx

The formula for variance is:

Var(X) = E[(X − E[X])²] = ∫_{−∞}^{+∞} (x − E[X])² × f(x) dx

Again, the rule that Var(a + b × X) = b² × Var(X) is true for continuous random variables, as it was for discrete random variables. Covariance is calculated analogously, although we have to specify a joint p.d.f. for the two random variables. The correlation coefficient remains the covariance divided by the standard deviations in the two variables.

As a final note, we often don't have to integrate these functions in practice, and much of the time the function can't actually be integrated in closed form (for example, you cannot integrate the p.d.f. of the normal distribution, f(x) = (2πσ²)^(−1/2) × exp(−(x − µ)² / (2σ²)), without a computer approximation). In theory, we integrate things, but in practice, we rarely have to do the dirty work.

IV. Common probability distributions

When modeling probabilistic phenomena, we repeatedly rely on a handful of distributions. The binomial distribution is the most common among the discrete variables. It describes a situation where we have N repeated, independent trials that can each come up positive (with probability p) or negative (with probability 1 − p); the outcome of interest is K, the number of positive results.⁵ The number of heads obtained from N coin tosses of a fair coin (p = 1/2) would be a perfect example of a binomial distribution.

⁵ I will describe the outcomes generically as positive and negative, but this distribution applies whenever the outcome is binary.
There are other ways to describe this stylized situation: some people call the outcomes yes or no, while others will say success or failure.

In principle, we could work out the probability distribution of K. When we flip one coin, two outcomes are equally likely, and one of these is heads and one is tails. When N = 2, four sequences are equally likely: each of HH, HT, TH, and TT occurs with probability 1/4. One of these outcomes gives us zero heads, two outcomes give us one head, and one outcome gives us two heads. We can continue this process to figure out the distribution when N = 3, N = 4, and N = 5:

           N = 1   N = 2   N = 3   N = 4   N = 5
P[K = 0]    1/2     1/4     1/8    1/16    1/32
P[K = 1]    1/2     2/4     3/8    4/16    5/32
P[K = 2]     --     1/4     3/8    6/16   10/32
P[K = 3]     --      --     1/8    4/16   10/32
P[K = 4]     --      --      --    1/16    5/32
P[K = 5]     --      --      --      --    1/32

Unless you recognize a pattern involving Pascal's triangle, this becomes tedious, and even that pattern will fail if p ≠ 1/2. A formula does exist, however:

P[k] = N! / (k!(N − k)!) × p^k × (1 − p)^(N−k)

where X! (read as "X factorial") equals X × (X − 1) × (X − 2) × ... × 3 × 2 × 1, the product of the number with all positive numbers less than itself. (By convention, 0! is set equal to one.) For example: 1! = 1, 2! = 2 × 1 = 2, 3! = 3 × 2 × 1 = 6, 4! = 4 × 3 × 2 × 1 = 24, and so on.

For shorthand, we might write K ~ B(N, p) to indicate that K is a random variable from the binomial distribution with N trials and a probability p of a positive outcome in each trial. For example, suppose that Professor Leach has a class of twenty-five students; the professor assigns grades, with each student having a 0.10 chance of failing the class. The probability that exactly five of the twenty-five students fail is:

P[5] = 25! / (5!(25 − 5)!) × (0.10)^5 × (0.90)^(25−5)

(The rest is fairly straightforward, albeit messy, algebra, so I won't solve it further. However, I will point out that you can often simplify the factorial part of the problem: 25! = 25 × 24 × 23 × 22 × 21 × (20!); the 20! in the numerator cancels with the (25 − 5)! = 20! in the denominator.)

The mean of a binomial distribution is always:

E[K] = Σ_{k=0}^{N} k × N! / (k!(N − k)!) × p^k (1 − p)^(N−k) = N × p
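The messy algebra in Professor Leach's example, and the claim that the mean is N × p, can both be checked directly. In this sketch (illustrative only), `math.comb` handles the factorial ratio N!/(k!(N − k)!):

```python
from math import comb

def binom_pmf(k, n, p):
    # P[k] = C(n, k) * p**k * (1 - p)**(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Professor Leach's class: 25 students, each failing with probability 0.10
p5 = binom_pmf(5, 25, 0.10)
print(round(p5, 4))    # about 0.0646

# the mean N*p, checked against the full sum over k
mean = sum(k * binom_pmf(k, 25, 0.10) for k in range(26))
print(round(mean, 6))  # 2.5
```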

(But it is not simple to show why the expression in the middle collapses to just N × p.) The variance of a binomial random variable is:

Var(K) = N × p × (1 − p)

A second common discrete distribution is the Poisson distribution, which is used to model "count data": outcomes that are non-negative integers, typically smaller ones. (These non-negative integers are, of course, the numbers we could get if we count things: zero, one, two, three, and so on.) The Poisson distribution is especially appropriate when we are counting the number of [some event] per [time period], like the number of customers arriving at a store in an hour or the number of children that a woman has [in her lifetime]. Technically, the Poisson distribution makes two assumptions:

1. For all intervals of the same length within this period, there is the same chance that an event occurs, and
2. The chance that an event occurs is independent of what has happened in the past.

While these assumptions might not hold exactly in all the situations that we model with the Poisson distribution, we often feel that they are close enough to justify it. With the Poisson distribution, the probability of having k events in the period is:

P[k] = e^(−λ) × λ^k / k!

where λ is some positive number, which reflects the average number of events that occur within the period. (In shorthand, we would write K ~ Pois(λ) for this distribution.) For example, if we know that families have 1.8 children on average, a number that comes from the March 2005 Current Population Survey, and that fertility is determined entirely by chance, then the probability of having k children is P[k] = e^(−1.8) × 1.8^k / k!.
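The theoretical Poisson probabilities for family size are easy to generate; this sketch (illustrative only) computes them for k = 0 through 5 with λ = 1.8, and verifies numerically that λ is both the mean and the variance of the distribution:

```python
from math import exp, factorial

def pois_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

# family size with lambda = 1.8 children on average
lam = 1.8
for k in range(6):
    print(k, round(pois_pmf(k, lam), 3))

# lambda is both the mean and the variance of the distribution
mean = sum(k * pois_pmf(k, lam) for k in range(100))
var = sum((k - mean) ** 2 * pois_pmf(k, lam) for k in range(100))
print(round(mean, 6), round(var, 6))  # 1.8 1.8
```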
If we calculate the probabilities from this formula, we would get the theoretical probabilities below:

[Table: number of children, theoretical probability, and observed fraction; the numerical entries did not survive transcription]

These theoretical probabilities are remarkably close (in my opinion) to the observed fractions of families of each size in the dataset, confirming that the Poisson

distribution provides a reasonable description of this random variable. (It doesn't quite capture the observed fondness for having two children. In reality, many couples probably reduce their chance of having more children at that point, which violates the assumption that the probability of having an additional event is independent of the past.) As a final note, λ is the mean number of events occurring in the time period, and it is also the variance in the number of events in the period. (Incidentally, in the CPS, the variance in family size is 1.7, which is another indicator that the variable has approximately a Poisson distribution.)

Let's move on to continuous probability distributions. The simplest is the uniform distribution, which assigns equal likelihood to any outcome between a lower limit of ℓ and an upper limit of u. (For example, people's birthdays are essentially distributed uniformly over the numbers 1 through 365.) The complete p.d.f. of the uniform distribution is:

f(x) = 0, if x < ℓ
f(x) = 1/(u − ℓ), if ℓ ≤ x ≤ u
f(x) = 0, if x > u

The shorthand notation would be X ~ U([ℓ, u]). Calculating the expected value of the uniform distribution is relatively simple:

E[X] = ∫_{−∞}^{ℓ} 0 × x dx + ∫_{ℓ}^{u} (1/(u − ℓ)) × x dx + ∫_{u}^{+∞} 0 × x dx
= (1/(u − ℓ)) × ∫_{ℓ}^{u} x dx
= (1/(u − ℓ)) × [x²/2], evaluated from x = ℓ to x = u
= (1/2) × (1/(u − ℓ)) × (u² − ℓ²)
= (1/2) × (1/(u − ℓ)) × (u − ℓ)(u + ℓ)
= (1/2) × (u + ℓ)

The expected value is the average of the two endpoints of the distribution. We could go through a similar exercise to find the variance in the uniform distribution:

Var(X) = ∫_{−∞}^{ℓ} 0 × (x − E[X])² dx + ∫_{ℓ}^{u} (1/(u − ℓ)) × (x − E[X])² dx + ∫_{u}^{+∞} 0 × (x − E[X])² dx
= (1/(u − ℓ)) × ∫_{ℓ}^{u} (x − (1/2)(u + ℓ))² dx
= (1/12) × (u − ℓ)²

(The actual algebra is somewhat tedious, but I hope you get the idea.)

The single most important continuous distribution is the normal distribution. Many variables in nature have this bell-curve distribution: height, weight, or intelligence in a population; annual rainfall in a region or the fraction of days that are overcast; the age at which a person gives birth or the age at which a person would like to retire. The p.d.f. of the normal distribution is:

f(x) = (1/√(2πσ²)) × exp(−(x − µ)² / (2σ²))

where µ is the mean and σ² the variance of X. For shorthand, we would write X ~ N(µ, σ²) to indicate this normal distribution.

In addition, the central limit theorem tells us that the sum (or average) of a number of independent, identically distributed random variables will tend to be normal, regardless of the distribution of the variables themselves. The amount of rainfall that falls in a region in a day might have some odd distribution: a 90% chance of no rain, a 7% chance of a quarter inch, and a 3% chance of one inch. This is far from normal. However, if you calculate the distribution of total rainfall over 365 days, you'll get an approximately normal distribution.

A consequence of the central limit theorem is that the binomial distribution starts to look normal as N gets large. (The total number of positive outcomes, K, is the sum of a bunch of independent random variables.) Technically, we would write that B(N, p) → N(Np, Np(1 − p)) as N → ∞. The Poisson distribution also approaches the normal distribution, since it is also the total number of events that happen in some period. The more common the event is, the more the Poisson distribution will resemble the normal distribution: technically, as λ → ∞, Pois(λ) → N(λ, λ).
For all practical purposes, I would say that there is essentially no difference between the Poisson and normal distributions when the expected number of events in the period is ten or more. With the binomial distribution, it's a bit harder to establish a rule of thumb for how large N needs to be in order to use the normal approximation, since this will depend in part on p. If N is a hundred or so, I would usually be fairly comfortable with the normality assumption; if N is just a couple of dozen, I would usually favor the binomial distribution.
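The "ten or more" rule of thumb can be given a rough numerical test. This sketch (my addition, standard library only) compares the Poisson p.m.f. with the matching normal density N(λ, λ) at every integer; with λ = 20 the two curves are nearly indistinguishable:

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for K ~ Pois(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def normal_pdf(x, mu, var):
    """Density of N(mu, var)."""
    return math.exp(-(x - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

lam = 20  # expected number of events -- comfortably past "ten or more"
# Largest pointwise gap between Pois(lam) and its N(lam, lam) approximation:
worst = max(abs(poisson_pmf(k, lam) - normal_pdf(k, lam, lam)) for k in range(3 * lam))
print(worst)  # small relative to the peak probability of roughly 0.09
```

Rerunning with smaller λ (say, 2 or 3) makes the gap visibly larger, which is the content of the rule of thumb.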

The standard normal distribution is a normal distribution with mean zero and variance one. We can standardize any normal X by calculating:

    Z = \frac{X - \mu}{\sigma}

We could then write that Z ~ N(0, 1). (There is a tradition of using Z to denote an arbitrary standard normal random variable. Additionally, we often use φ(z) to denote the p.d.f. of the standard normal distribution, and Φ(z) to denote its c.d.f.) We usually standardize random variables in order to look up values of the c.d.f. Normal distributions with different µ's and σ's will have different c.d.f.s, and it is impossible to have a table of values for each of these functions. However, we can standardize any normal random variable, and compare this to a single table with values of the c.d.f. of the standard normal distribution.

To model the time duration until some event occurs, we often use the exponential distribution. Technically, this makes two assumptions:

1. There is a constant probability that the event occurs at any time, and
2. The probability that the event occurs is independent of the past history.

The life of an incandescent light bulb is an almost ideal example of an exponential random variable; this distribution could also be used to model the time that a person works with a particular employer, or the length of time that a patient waits for an organ transplant. To denote that the random variable T has an exponential distribution, we might write T ~ Exp(λ); the p.d.f. of this distribution is:

    f(t) = \begin{cases} 0 & \text{if } t < 0 \\ \lambda e^{-\lambda t} & \text{if } t \ge 0 \end{cases}

where the parameter λ represents the reciprocal of the average duration of the random variable T. For example, if I purchase a light bulb that advertises an (average) lifetime of 2000 hours, the p.d.f. of the duration, in hours, is f(t) = 0.0005·e^{−0.0005t}. The variance in the exponential distribution is λ⁻².
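Both tricks in this passage can be illustrated with the standard library: standardizing so that a single standard normal c.d.f. (here computed from the error function) suffices, and simulating exponential lifetimes for the 2000-hour bulb from the text. This sketch is my addition; the values µ = 100 and σ = 15 are purely illustrative, while the 0.0005 rate comes from the example above.

```python
import math
import random

def phi(z):
    """C.d.f. of the standard normal distribution, Phi(z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Standardizing: for X ~ N(100, 15^2), P(X <= 115) = Phi((115 - 100)/15) = Phi(1)
mu, sigma = 100, 15            # illustrative values only
prob = phi((115 - mu) / sigma)
print(prob)                    # about 0.84

# Exponential durations: an average lifetime of 2000 hours means lambda = 0.0005
random.seed(1)
lifetimes = [random.expovariate(0.0005) for _ in range(100_000)]
avg = sum(lifetimes) / len(lifetimes)
print(avg)                     # close to the advertised 2000 hours
```

Note that `random.expovariate` takes the rate λ directly, matching the handout's parameterization in which λ is the reciprocal of the mean duration.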
Another not-uncommon distribution is the log-normal distribution, which is often used to model outcomes that are always positive and fairly skewed, where many people have fairly low values, but a few people have substantially higher values. Income would be one excellent example: most people earn low-to-moderate

amounts, tens of thousands of dollars per year, but the range extends quite far, up into the hundreds of thousands, millions, and even multiple millions. The distance that a person has to travel to a hospital would be another: most people live fairly close, within five or ten miles of a hospital, but the range extends quite far, and a handful of people might have to travel hundreds of miles. Technically, if some variable X has the log-normal distribution, then its natural logarithm, ln(X), is normally distributed.

While the sums of random variables tend to have the normal distribution, the products of random variables tend to have the log-normal distribution. Imagine that every college graduate starts with the same base salary, maybe $40,000. Each year of his career, he randomly receives some raise: perhaps a 50% chance of a 0% raise, a 30% chance of a 3% raise, and a 20% chance of a 10% raise. Since these increases are multiplicative, we would expect the distribution of incomes after twenty years to be roughly log-normal.

V. Probability distributions for statistics

There are a handful of probability distributions that we frequently use for statistics, but which rarely show up in the real world. The first is the chi-squared distribution. Technically, if we start with a variable that has the standard normal distribution, its square has the chi-squared distribution. If we add N chi-squared variables to each other, the sum is chi-squared with N degrees of freedom:

    Z_i \sim N(0,1) \;\Rightarrow\; \sum_{i=1}^{N} Z_i^2 \sim \chi^2(N)

We will discuss these degrees of freedom later. (Generally, they are the number of observations we have, minus the number of parameters that we estimated.) We use the chi-squared distribution to describe sample variances, which are calculated from the sum of squared values of a variable.

When we have a bunch of normal random variables and we calculate their average, X̄ = Σᵢ Xᵢ/N, then the difference between this sample mean X̄ and the true mean µ, divided by the square root of the sample variance over N (that is, by √(Σᵢ(Xᵢ − X̄)²/(N(N − 1)))), has the t-distribution (with N − 1 degrees of freedom). We use the t-distribution when we want to test hypotheses about the value of an estimated parameter. If we knew the true variance in the parameter, we could standardize it and compare it to the standard normal distribution. However, when we use an estimated variance that we calculate from our sample, we compare the standardized variable to the t-distribution. In many cases, the difference is minor, especially when we have a large

sample. (As the degrees of freedom get large, the t-distribution approaches the normal distribution.) Some textbooks and teachers distinguish between small-sample tests with the t-distribution and large-sample tests with the normal distribution; technically, the small-sample test is always the only correct way to do things when we use an estimated variance, but the difference is minor.

Finally, when we have two chi-squared variables, V₁ ~ χ²(N₁) and V₂ ~ χ²(N₂), the ratio (V₁/N₁)/(V₂/N₂) has an F-distribution with N₁ degrees of freedom in the numerator and N₂ degrees of freedom in the denominator. The F-distribution is most often used to compare two estimated variances, to test if they are the same.

VI. Statistics

When dealing with data, we observe a random sample that is generated from some probability distribution. For example, if X is distributed normally with mean five and variance four, then we might draw ten observations with the values:

1.91, 2.53, 2.81, 5.28, 5.29, 5.53, 5.60, 5.73, 5.96, 7.27

or with the values:

1.60, 3.39, 3.68, 3.74, 4.62, 5.11, 5.32, 7.00, 7.54, 8.24

In statistics, the problem is that we don't know the mean and variance of the population (or other parameters that relate to the population), so we are trying to guess them. We do, of course, know the mean and variance of our samples. We calculate these with the formulas:

    \bar{X} = \hat{E}[X] = \sum_{i=1}^{N} X_i / N

    \hat{Var}(X) = \sum_{i=1}^{N} (X_i - \bar{X})^2 / (N - 1)

(In general, I will use the hat to indicate a sample variance or similar calculation, except when I'm describing the sample mean.) With the two samples above, we would calculate sample means of 4.79 and 5.02, and sample variances of 3.04 and 4.30. These are not the same as the population parameters, but they would be reasonable guesses if we didn't know the true mean and variance, which, of course, we almost never know in practice.

Statistics is all about guessing. An estimator is a formula or technique used to guess the value of some parameter.
An estimate, in contrast, is the number that we get out of that formula, given our sample. Thus, X̄ = Σᵢ Xᵢ/N is an estimator of the population mean; again, this word refers to the formula itself. With this formula, we obtain estimates of 4.79

and 5.02 on the two samples. Similarly, the formula V̂ar(X) = Σᵢ(Xᵢ − X̄)²/(N − 1) is an estimator of the population variance.

There are many ways that we could estimate any particular parameter. Some of these techniques will be silly, and some will be sensible. For example, consider three different estimators of the population mean, µ:

    \hat{\mu}_1 = \sum_{i=1}^{N} X_i / N
    \hat{\mu}_2 = X_1
    \hat{\mu}_3 = 3.14

These are all formulas we could use to guess the population mean from our sample; since they are formulas, they qualify as estimators. However, the second and third methods seem a bit silly: the second uses only one person's value of X to guess the population average, and the third uses a completely arbitrary number.

Now, we need to define some desirable properties of estimators, so we can talk about what is sensible. Let's use θ to denote some parameter that we're trying to estimate, and let θ̂ represent our estimator.

Unbiased: The estimator θ̂ is unbiased if E[θ̂] = θ; that is, we would expect it to be correct, on average.

Consistent: The estimator θ̂ is consistent if θ̂ → θ as N → ∞; that is, the estimator gets the value exactly correct as our sample gets larger and larger.

Efficient: The estimator θ̂ is efficient if it minimizes Var(θ̂); that is, it is the most precise estimator available.

The sample mean is an unbiased estimator of the population mean. This means that, on average, it will be correct. Look at the estimates we obtained from the two samples above: one was a bit below the true value, and one was a bit above. This is exactly what we expect from an unbiased estimator.

We can demonstrate formally that this estimator is unbiased. To do this, we need to calculate the expected value of µ̂₁. µ̂₁ is just a formula, some function of the values of the variable in our sample. We need to figure out the expected value of that function:

    E[\hat{\mu}_1] = E\left[ \sum_{i=1}^{N} X_i / N \right] = E[X_1/N + X_2/N + \dots + X_N/N]

Remember that we can always separate an expectation where we add or subtract components (but not where we multiply or divide, unless the variables are independent of each other):

    E[\hat{\mu}_1] = E[X_1/N + X_2/N + \dots + X_N/N] = E[X_1/N] + E[X_2/N] + \dots + E[X_N/N]

In addition, we can always move a constant (like 1/N) outside of the expectation:

    E[\hat{\mu}_1] = \frac{1}{N} E[X_1] + \frac{1}{N} E[X_2] + \dots + \frac{1}{N} E[X_N]

Now we need to apply the expectation to each of those variables. Remember that the expected value of any Xᵢ is the population mean, µ:

    E[\hat{\mu}_1] = \frac{1}{N}\mu + \frac{1}{N}\mu + \dots + \frac{1}{N}\mu = N \cdot \left( \frac{1}{N}\mu \right) = \mu

Thus, we have shown that this formula, µ̂₁, is an unbiased estimator of µ.

What about the other estimators? As it turns out, µ̂₂ is actually an unbiased estimator of the population mean: E[µ̂₂] = E[X₁] = µ. Our first observation is just as likely to be above average as below average. Even though it seems silly to discard the rest of our sample, using only one observation is unbiased. The third estimator, however, is generally biased (unless we're sampling calculations of pi, I guess): E[3.14] = 3.14 ≠ µ, in general.

What about consistency? As the sample grows larger and approaches the entire population, the sample mean should indeed approach the population mean, so the first estimator is consistent. The second is not: if we were to keep expanding the sample, this estimator wouldn't change. We'd still expect it to be above average half the time and below average half the time. Finally, the third estimator is wrong no matter how large our sample is. Usually, there's little difference between consistency and unbiasedness: if an estimator is one, then it is the other, as well.
(In fact, if we wanted to show consistency formally, we'd usually show that the estimator is unbiased, and then we'd show that the variability in the estimator gets smaller and smaller as the sample size gets larger, meaning that the estimator is perfectly precise when the sample is infinitely large.) Exceptions are usually a bit weird. Using a limited sub-sample to

estimate a mean is unbiased but not consistent; using the formula Σ(Xᵢ − X̄)²/N to estimate the variance (instead of Σ(Xᵢ − X̄)²/(N − 1)) is biased, but the amount of the bias gets smaller and smaller as the sample grows larger, and the bias disappears when the sample is infinitely large.

Finally, we should talk about efficiency, which means that we need to address the variance in an estimator. Estimates are random variables themselves: they vary with the random sample that we collected. Two different samples drawn from the same distribution, using the same estimator, will generally yield different estimates. We want to know how much we would expect the value of the estimator to vary from sample to sample, so we calculate Var(θ̂) = E[(θ̂ − E[θ̂])²] for our estimator. For example, the variance in µ̂₁ would be:

    Var(\hat{\mu}_1) = E[(\hat{\mu}_1 - E[\hat{\mu}_1])^2] = E[(\textstyle\sum X_i / N - \mu)^2]

Usually, the first steps in calculating the variance are plugging in the expected value of the estimator (the population mean, µ, in this case) and substituting the formula in place of µ̂₁. We also need to specify a bit more about the distribution of X: let's assume that each Xᵢ has a variance of σ², and that Xᵢ and Xⱼ are uncorrelated (as would be the case with a truly random sample). Now, we have to plow through the algebra:

    Var(\hat{\mu}_1) = E[(\textstyle\sum X_i / N - \mu)^2]
                     = E[(\textstyle\sum X_i / N - N \cdot (\mu/N))^2]
                     = E[(X_1/N + X_2/N + \dots + X_N/N - \mu/N - \mu/N - \dots - \mu/N)^2]
                     = E[((X_1/N - \mu/N) + (X_2/N - \mu/N) + \dots + (X_N/N - \mu/N))^2]

Now let's square terms. We'll have:

    Var(\hat{\mu}_1) = E[(X_1/N - \mu/N)^2 + (X_2/N - \mu/N)^2 + \dots + (X_N/N - \mu/N)^2
                       + 2(X_1/N - \mu/N)(X_2/N - \mu/N) + \dots + 2(X_1/N - \mu/N)(X_N/N - \mu/N)
                       + 2(X_2/N - \mu/N)(X_3/N - \mu/N) + \dots + 2(X_2/N - \mu/N)(X_N/N - \mu/N)
                       + \dots + 2(X_{N-1}/N - \mu/N)(X_N/N - \mu/N)]

Now we can break up the expectation at each summation, and we can factor a constant 1/N² out of each term:

    Var(\hat{\mu}_1) = \frac{1}{N^2} E[(X_1 - \mu)^2] + \frac{1}{N^2} E[(X_2 - \mu)^2] + \dots + \frac{1}{N^2} E[(X_N - \mu)^2]
                       + \frac{2}{N^2} E[(X_1 - \mu)(X_2 - \mu)] + \dots + \frac{2}{N^2} E[(X_1 - \mu)(X_N - \mu)]
                       + \frac{2}{N^2} E[(X_2 - \mu)(X_3 - \mu)] + \dots + \frac{2}{N^2} E[(X_2 - \mu)(X_N - \mu)]
                       + \dots + \frac{2}{N^2} E[(X_{N-1} - \mu)(X_N - \mu)]

We can now apply the expectation. In the first line of this expression, we have the variances in the variables, and E[(Xᵢ − µ)²] = σ². In the following lines, we have a bunch of covariances, and E[(Xᵢ − µ)(Xⱼ − µ)] = 0. Thus, the variance in µ̂₁ is:

    Var(\hat{\mu}_1) = \frac{1}{N^2}(\sigma^2 + \sigma^2 + \dots + \sigma^2) + \frac{2}{N^2}(0 + 0 + \dots + 0) = \frac{1}{N^2}(N\sigma^2) = \frac{\sigma^2}{N}

That's it. If we wanted to talk about the efficiency of the estimators, we would compare the variance of µ̂₁ to the variance in the other estimators:

    Var(\hat{\mu}_2) = E[(X_1 - E[X_1])^2] = E[(X_1 - \mu)^2] = \sigma^2

As long as our sample has more than one observation, N > 1, the first estimator is more efficient (more precise), since it has less variance. The variance in the third estimator is:

    Var(\hat{\mu}_3) = E[(3.14 - E[3.14])^2] = E[(3.14 - 3.14)^2] = 0

This is actually the most efficient estimator. However, since it's biased (that is, wrong), it doesn't make sense to use it. Often our criterion for the best estimator is the minimum variance unbiased estimator (sometimes called the MVUE), the most efficient of the unbiased estimators.

In general, when we're faced with an estimator, we need to figure out three things: its mean, its variance, and its distribution. The estimator inherits all of these properties from the variables used to calculate it. Once we know these characteristics, we can construct confidence intervals for our estimate and test hypotheses about the variable.
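The claims about µ̂₁ and µ̂₂ (both unbiased, but the sample mean far more efficient, with Var(µ̂₁) = σ²/N versus Var(µ̂₂) = σ²) can be checked by simulation. A sketch of my own, assuming a normal population with µ = 5 and σ = 2 and samples of size N = 10:

```python
import random
import statistics

random.seed(42)
mu, sigma, n = 5.0, 2.0, 10    # assumed population parameters and sample size
reps = 20_000                  # number of simulated samples

est1, est2 = [], []            # estimates from mu-hat-1 and mu-hat-2
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    est1.append(sum(sample) / n)   # mu-hat-1: the sample mean
    est2.append(sample[0])         # mu-hat-2: the first observation only

# Both estimators average out to roughly mu = 5 (unbiasedness)...
print(statistics.mean(est1), statistics.mean(est2))
# ...but the sample mean varies far less: sigma^2/N = 0.4 versus sigma^2 = 4.
print(statistics.variance(est1), statistics.variance(est2))
```

Each run draws 20,000 fresh samples, so the averages of the two estimators sit near 5 while their variances differ by roughly the factor of N derived above.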

A Q% confidence interval for a parameter estimate is a range of values that has a Q% chance of including the true value of the parameter. We should note that it is technically incorrect to say that there is a Q% chance that the true value of the parameter lies within this range. Since the true value is a fixed, constant, non-random quantity, it makes no sense to talk about a "chance" relating to something non-random. The parameter estimate is a random variable, since it depends on the specific values in our random sample. The confidence interval is also something random, since it is based around our parameter estimate. It is correct to think about the chance that this confidence interval does something: in this case, including the true value of the parameter.

When we want to test a hypothesis, we must define two things: the null hypothesis, which is the thing to be tested, and the alternative hypothesis, which is what we believe to be true if the null is not. For example, we might want to test whether the true value of the parameter θ could be 2.13. This would be our null hypothesis, H₀: θ = 2.13. A natural alternative is that θ is not equal to 2.13, H_A: θ ≠ 2.13. Often the alternative hypothesis is simple and innocuous like this, but sometimes we make stronger assumptions.

To test a hypothesis, we define a confidence level, which I will call Q%, and then we go through the following steps:

1. Assume that the null hypothesis is true.
2. If the null hypothesis is true, then find the distribution of the estimator (its mean, variance, and type of distribution).
3. Find the probability of obtaining this estimate, or one even further from the hypothesized value, if the null is true. This probability is called a p-value.
4. If the p-value is less than 1 − Q/100, you reject the null hypothesis; if the p-value is more than 1 − Q/100, you fail to reject the null hypothesis.

For example, suppose that we believe that the variable X is distributed normally with mean µ and variance σ² (but we don't know these values). We collect a random sample of ten observations:

1.91, 2.53, 2.81, 5.28, 5.29, 5.53, 5.60, 5.73, 5.96, 7.27

Our sample mean is 4.79, and our sample variance is 3.04. Now we want to test the hypothesis H₀: µ = 2.13 against the alternative H_A: µ ≠ 2.13. Our sample mean isn't 2.13, but we're asking the question: how unlikely would it be to observe a sample mean of 4.79 if the true mean were really 2.13?

If the null hypothesis is true, then X̄ = Σᵢ Xᵢ/N should be normally distributed (adding normal random variables gives us another normal random variable) with a mean of µ and a variance of σ²/N.
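Carrying the example through numerically, here is a sketch (my addition) that standardizes the sample mean under H₀: µ = 2.13 and computes a two-sided p-value from the standard normal c.d.f. Strictly speaking, using the estimated variance calls for the t-distribution with N − 1 = 9 degrees of freedom, but with a standardized statistic near 4.8 either reference distribution rejects decisively:

```python
import math
import statistics

data = [1.91, 2.53, 2.81, 5.28, 5.29, 5.53, 5.60, 5.73, 5.96, 7.27]
n = len(data)
xbar = statistics.mean(data)       # about 4.79
s2 = statistics.variance(data)     # about 3.04, using the (N - 1) formula

# Standardize the sample mean under H0: mu = 2.13; its estimated variance is s2/n
z = (xbar - 2.13) / math.sqrt(s2 / n)

# Two-sided p-value from the standard normal c.d.f.
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
print(z, p_value)                  # z is near 4.8; the p-value is tiny, so reject H0
```

At any conventional confidence level (90%, 95%, 99%), this p-value leads us to reject the null hypothesis that µ = 2.13.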


More information

Example: Find the expected value of the random variable X. X 2 4 6 7 P(X) 0.3 0.2 0.1 0.4

Example: Find the expected value of the random variable X. X 2 4 6 7 P(X) 0.3 0.2 0.1 0.4 MATH 110 Test Three Outline of Test Material EXPECTED VALUE (8.5) Super easy ones (when the PDF is already given to you as a table and all you need to do is multiply down the columns and add across) Example:

More information

Recall this chart that showed how most of our course would be organized:

Recall this chart that showed how most of our course would be organized: Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

More information

WEEK #22: PDFs and CDFs, Measures of Center and Spread

WEEK #22: PDFs and CDFs, Measures of Center and Spread WEEK #22: PDFs and CDFs, Measures of Center and Spread Goals: Explore the effect of independent events in probability calculations. Present a number of ways to represent probability distributions. Textbook

More information

STATISTICS 8: CHAPTERS 7 TO 10, SAMPLE MULTIPLE CHOICE QUESTIONS

STATISTICS 8: CHAPTERS 7 TO 10, SAMPLE MULTIPLE CHOICE QUESTIONS STATISTICS 8: CHAPTERS 7 TO 10, SAMPLE MULTIPLE CHOICE QUESTIONS 1. If two events (both with probability greater than 0) are mutually exclusive, then: A. They also must be independent. B. They also could

More information

The Math. P (x) = 5! = 1 2 3 4 5 = 120.

The Math. P (x) = 5! = 1 2 3 4 5 = 120. The Math Suppose there are n experiments, and the probability that someone gets the right answer on any given experiment is p. So in the first example above, n = 5 and p = 0.2. Let X be the number of correct

More information

ECE302 Spring 2006 HW3 Solutions February 2, 2006 1

ECE302 Spring 2006 HW3 Solutions February 2, 2006 1 ECE302 Spring 2006 HW3 Solutions February 2, 2006 1 Solutions to HW3 Note: Most of these solutions were generated by R. D. Yates and D. J. Goodman, the authors of our textbook. I have added comments in

More information

Chapter 5. Discrete Probability Distributions

Chapter 5. Discrete Probability Distributions Chapter 5. Discrete Probability Distributions Chapter Problem: Did Mendel s result from plant hybridization experiments contradicts his theory? 1. Mendel s theory says that when there are two inheritable

More information

Solutions: Problems for Chapter 3. Solutions: Problems for Chapter 3

Solutions: Problems for Chapter 3. Solutions: Problems for Chapter 3 Problem A: You are dealt five cards from a standard deck. Are you more likely to be dealt two pairs or three of a kind? experiment: choose 5 cards at random from a standard deck Ω = {5-combinations of

More information

AMS 5 CHANCE VARIABILITY

AMS 5 CHANCE VARIABILITY AMS 5 CHANCE VARIABILITY The Law of Averages When tossing a fair coin the chances of tails and heads are the same: 50% and 50%. So if the coin is tossed a large number of times, the number of heads and

More information

COMMON CORE STATE STANDARDS FOR

COMMON CORE STATE STANDARDS FOR COMMON CORE STATE STANDARDS FOR Mathematics (CCSSM) High School Statistics and Probability Mathematics High School Statistics and Probability Decisions or predictions are often based on data numbers in

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Definition: Suppose that two random variables, either continuous or discrete, X and Y have joint density

Definition: Suppose that two random variables, either continuous or discrete, X and Y have joint density HW MATH 461/561 Lecture Notes 15 1 Definition: Suppose that two random variables, either continuous or discrete, X and Y have joint density and marginal densities f(x, y), (x, y) Λ X,Y f X (x), x Λ X,

More information

Section 5.1 Continuous Random Variables: Introduction

Section 5.1 Continuous Random Variables: Introduction Section 5. Continuous Random Variables: Introduction Not all random variables are discrete. For example:. Waiting times for anything (train, arrival of customer, production of mrna molecule from gene,

More information

WHERE DOES THE 10% CONDITION COME FROM?

WHERE DOES THE 10% CONDITION COME FROM? 1 WHERE DOES THE 10% CONDITION COME FROM? The text has mentioned The 10% Condition (at least) twice so far: p. 407 Bernoulli trials must be independent. If that assumption is violated, it is still okay

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

STA 130 (Winter 2016): An Introduction to Statistical Reasoning and Data Science

STA 130 (Winter 2016): An Introduction to Statistical Reasoning and Data Science STA 130 (Winter 2016): An Introduction to Statistical Reasoning and Data Science Mondays 2:10 4:00 (GB 220) and Wednesdays 2:10 4:00 (various) Jeffrey Rosenthal Professor of Statistics, University of Toronto

More information

If, under a given assumption, the of a particular observed is extremely. , we conclude that the is probably not

If, under a given assumption, the of a particular observed is extremely. , we conclude that the is probably not 4.1 REVIEW AND PREVIEW RARE EVENT RULE FOR INFERENTIAL STATISTICS If, under a given assumption, the of a particular observed is extremely, we conclude that the is probably not. 4.2 BASIC CONCEPTS OF PROBABILITY

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

9. Sampling Distributions

9. Sampling Distributions 9. Sampling Distributions Prerequisites none A. Introduction B. Sampling Distribution of the Mean C. Sampling Distribution of Difference Between Means D. Sampling Distribution of Pearson's r E. Sampling

More information

Statistics 104: Section 6!

Statistics 104: Section 6! Page 1 Statistics 104: Section 6! TF: Deirdre (say: Dear-dra) Bloome Email: dbloome@fas.harvard.edu Section Times Thursday 2pm-3pm in SC 109, Thursday 5pm-6pm in SC 705 Office Hours: Thursday 6pm-7pm SC

More information

Probability density function : An arbitrary continuous random variable X is similarly described by its probability density function f x = f X

Probability density function : An arbitrary continuous random variable X is similarly described by its probability density function f x = f X Week 6 notes : Continuous random variables and their probability densities WEEK 6 page 1 uniform, normal, gamma, exponential,chi-squared distributions, normal approx'n to the binomial Uniform [,1] random

More information

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Week 1 Week 2 14.0 Students organize and describe distributions of data by using a number of different

More information

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

More information

Random variables P(X = 3) = P(X = 3) = 1 8, P(X = 1) = P(X = 1) = 3 8.

Random variables P(X = 3) = P(X = 3) = 1 8, P(X = 1) = P(X = 1) = 3 8. Random variables Remark on Notations 1. When X is a number chosen uniformly from a data set, What I call P(X = k) is called Freq[k, X] in the courseware. 2. When X is a random variable, what I call F ()

More information

Statistics Review PSY379

Statistics Review PSY379 Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses

More information

Random variables, probability distributions, binomial random variable

Random variables, probability distributions, binomial random variable Week 4 lecture notes. WEEK 4 page 1 Random variables, probability distributions, binomial random variable Eample 1 : Consider the eperiment of flipping a fair coin three times. The number of tails that

More information

Chapter 4. Probability and Probability Distributions

Chapter 4. Probability and Probability Distributions Chapter 4. robability and robability Distributions Importance of Knowing robability To know whether a sample is not identical to the population from which it was selected, it is necessary to assess the

More information

Master s Theory Exam Spring 2006

Master s Theory Exam Spring 2006 Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

More information

Answer Key for California State Standards: Algebra I

Answer Key for California State Standards: Algebra I Algebra I: Symbolic reasoning and calculations with symbols are central in algebra. Through the study of algebra, a student develops an understanding of the symbolic language of mathematics and the sciences.

More information

Statistics and Random Variables. Math 425 Introduction to Probability Lecture 14. Finite valued Random Variables. Expectation defined

Statistics and Random Variables. Math 425 Introduction to Probability Lecture 14. Finite valued Random Variables. Expectation defined Expectation Statistics and Random Variables Math 425 Introduction to Probability Lecture 4 Kenneth Harris kaharri@umich.edu Department of Mathematics University of Michigan February 9, 2009 When a large

More information

FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL

FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL STATIsTICs 4 IV. RANDOm VECTORs 1. JOINTLY DIsTRIBUTED RANDOm VARIABLEs If are two rom variables defined on the same sample space we define the joint

More information

Exploratory Data Analysis

Exploratory Data Analysis Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction

More information

Probability Models.S1 Introduction to Probability

Probability Models.S1 Introduction to Probability Probability Models.S1 Introduction to Probability Operations Research Models and Methods Paul A. Jensen and Jonathan F. Bard The stochastic chapters of this book involve random variability. Decisions are

More information

The Normal Approximation to Probability Histograms. Dice: Throw a single die twice. The Probability Histogram: Area = Probability. Where are we going?

The Normal Approximation to Probability Histograms. Dice: Throw a single die twice. The Probability Histogram: Area = Probability. Where are we going? The Normal Approximation to Probability Histograms Where are we going? Probability histograms The normal approximation to binomial histograms The normal approximation to probability histograms of sums

More information

6.3 Conditional Probability and Independence

6.3 Conditional Probability and Independence 222 CHAPTER 6. PROBABILITY 6.3 Conditional Probability and Independence Conditional Probability Two cubical dice each have a triangle painted on one side, a circle painted on two sides and a square painted

More information

Discrete Structures for Computer Science

Discrete Structures for Computer Science Discrete Structures for Computer Science Adam J. Lee adamlee@cs.pitt.edu 6111 Sennott Square Lecture #20: Bayes Theorem November 5, 2013 How can we incorporate prior knowledge? Sometimes we want to know

More information

Sums of Independent Random Variables

Sums of Independent Random Variables Chapter 7 Sums of Independent Random Variables 7.1 Sums of Discrete Random Variables In this chapter we turn to the important question of determining the distribution of a sum of independent random variables

More information

FACTORING QUADRATICS 8.1.1 and 8.1.2

FACTORING QUADRATICS 8.1.1 and 8.1.2 FACTORING QUADRATICS 8.1.1 and 8.1.2 Chapter 8 introduces students to quadratic equations. These equations can be written in the form of y = ax 2 + bx + c and, when graphed, produce a curve called a parabola.

More information

Mathematics Pre-Test Sample Questions A. { 11, 7} B. { 7,0,7} C. { 7, 7} D. { 11, 11}

Mathematics Pre-Test Sample Questions A. { 11, 7} B. { 7,0,7} C. { 7, 7} D. { 11, 11} Mathematics Pre-Test Sample Questions 1. Which of the following sets is closed under division? I. {½, 1,, 4} II. {-1, 1} III. {-1, 0, 1} A. I only B. II only C. III only D. I and II. Which of the following

More information

You flip a fair coin four times, what is the probability that you obtain three heads.

You flip a fair coin four times, what is the probability that you obtain three heads. Handout 4: Binomial Distribution Reading Assignment: Chapter 5 In the previous handout, we looked at continuous random variables and calculating probabilities and percentiles for those type of variables.

More information

LOGNORMAL MODEL FOR STOCK PRICES

LOGNORMAL MODEL FOR STOCK PRICES LOGNORMAL MODEL FOR STOCK PRICES MICHAEL J. SHARPE MATHEMATICS DEPARTMENT, UCSD 1. INTRODUCTION What follows is a simple but important model that will be the basis for a later study of stock prices as

More information

Simple Random Sampling

Simple Random Sampling Source: Frerichs, R.R. Rapid Surveys (unpublished), 2008. NOT FOR COMMERCIAL DISTRIBUTION 3 Simple Random Sampling 3.1 INTRODUCTION Everyone mentions simple random sampling, but few use this method for

More information

Chi Square Tests. Chapter 10. 10.1 Introduction

Chi Square Tests. Chapter 10. 10.1 Introduction Contents 10 Chi Square Tests 703 10.1 Introduction............................ 703 10.2 The Chi Square Distribution.................. 704 10.3 Goodness of Fit Test....................... 709 10.4 Chi Square

More information

If A is divided by B the result is 2/3. If B is divided by C the result is 4/7. What is the result if A is divided by C?

If A is divided by B the result is 2/3. If B is divided by C the result is 4/7. What is the result if A is divided by C? Problem 3 If A is divided by B the result is 2/3. If B is divided by C the result is 4/7. What is the result if A is divided by C? Suggested Questions to ask students about Problem 3 The key to this question

More information

What are the place values to the left of the decimal point and their associated powers of ten?

What are the place values to the left of the decimal point and their associated powers of ten? The verbal answers to all of the following questions should be memorized before completion of algebra. Answers that are not memorized will hinder your ability to succeed in geometry and algebra. (Everything

More information

4.1 4.2 Probability Distribution for Discrete Random Variables

4.1 4.2 Probability Distribution for Discrete Random Variables 4.1 4.2 Probability Distribution for Discrete Random Variables Key concepts: discrete random variable, probability distribution, expected value, variance, and standard deviation of a discrete random variable.

More information

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon t-tests in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com www.excelmasterseries.com

More information

Basic Probability Theory II

Basic Probability Theory II RECAP Basic Probability heory II Dr. om Ilvento FREC 408 We said the approach to establishing probabilities for events is to Define the experiment List the sample points Assign probabilities to the sample

More information

36 Odds, Expected Value, and Conditional Probability

36 Odds, Expected Value, and Conditional Probability 36 Odds, Expected Value, and Conditional Probability What s the difference between probabilities and odds? To answer this question, let s consider a game that involves rolling a die. If one gets the face

More information

Factoring Polynomials

Factoring Polynomials Factoring Polynomials Hoste, Miller, Murieka September 12, 2011 1 Factoring In the previous section, we discussed how to determine the product of two or more terms. Consider, for instance, the equations

More information

MATH 21. College Algebra 1 Lecture Notes

MATH 21. College Algebra 1 Lecture Notes MATH 21 College Algebra 1 Lecture Notes MATH 21 3.6 Factoring Review College Algebra 1 Factoring and Foiling 1. (a + b) 2 = a 2 + 2ab + b 2. 2. (a b) 2 = a 2 2ab + b 2. 3. (a + b)(a b) = a 2 b 2. 4. (a

More information

Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab

Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab Monte Carlo Simulation: IEOR E4703 Fall 2004 c 2004 by Martin Haugh Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab 1 Overview of Monte Carlo Simulation 1.1 Why use simulation?

More information

calculating probabilities

calculating probabilities 4 calculating probabilities Taking Chances What s the probability he s remembered I m allergic to non-precious metals? Life is full of uncertainty. Sometimes it can be impossible to say what will happen

More information

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators... MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................

More information