Chapter 2
Confidence intervals and hypothesis tests

This chapter focuses on how to draw conclusions about populations from sample data. We'll start by looking at binary data (e.g., polling), and learn how to estimate the true fraction of 1s and 0s with confidence intervals, and then test whether that fraction is significantly different from some baseline value using hypothesis testing. Then, we'll extend what we've learned to continuous measurements.

2.1 Binomial data

Suppose we're conducting a yes/no survey of a few randomly sampled people[1], and we want to use the results of our survey to determine the answers for the overall population.

[1] We'll talk about how to choose and sample those people in Chapter 7.

2.1.1 The estimator

The obvious first choice is just the fraction of people who said yes. Formally, suppose we have samples $x_1, \ldots, x_n$ that can each be 0 or 1, and the probability that each $x_i$ is 1 is $p$ (in frequentist style, we'll assume $p$ is fixed but unknown: this is what we're interested in finding). We'll assume our samples are independent and identically distributed (i.i.d.), meaning that each one has no dependence on any of the others, and they all have the same probability $p$ of being 1. Then our estimate for $p$, which we'll call $\hat{p}$, or "p-hat", would be

$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

Notice that $\hat{p}$ is a random quantity, since it depends on the random quantities $x_i$. In statistical lingo, $\hat{p}$ is known as an estimator for $p$. Also notice that except for the factor of $1/n$ in front, $\hat{p}$ is almost a binomial random variable (that is, $n\hat{p} \sim B(n, p)$). We can compute its expectation and variance using the properties we reviewed:

$$E[\hat{p}] = \frac{1}{n} \cdot np = p, \tag{2.1}$$

$$\mathrm{var}[\hat{p}] = \frac{1}{n^2} \cdot np(1-p) = \frac{p(1-p)}{n}. \tag{2.2}$$
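Here's a minimal simulation sketch (Python with numpy; the particular values of $p$, $n$, and the number of trials are made up for illustration) that checks (2.1) and (2.2) numerically: averaged over many repeated surveys, $\hat{p}$ is close to $p$, and its variance is close to $p(1-p)/n$.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, n, n_trials = 0.3, 100, 10_000   # hypothetical values, chosen for illustration

# Each row is one simulated survey of n yes/no answers; p_hat is the fraction of 1s.
samples = rng.binomial(1, p_true, size=(n_trials, n))
p_hat = samples.mean(axis=1)

print(p_hat.mean())               # close to p = 0.3 (unbiasedness, eq. 2.1)
print(p_hat.var())                # close to p(1-p)/n (eq. 2.2)
print(p_true * (1 - p_true) / n)  # theoretical value: 0.0021
```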
Since the expectation of $\hat{p}$ is equal to the true value of what $\hat{p}$ is trying to estimate (namely $p$), we say that $\hat{p}$ is an unbiased estimator for $p$. Reassuringly, we can see that another good property of $\hat{p}$ is that its variance decreases as the number of samples $n$ increases.

2.1.2 Central Limit Theorem

The Central Limit Theorem, one of the most fundamental results in probability theory, roughly tells us that if we add up a bunch of independent random variables that all have the same distribution, the result will be approximately Gaussian. We can apply this to our case of a binomial random variable, which is really just the sum of a bunch of independent Bernoulli random variables. As a rough rule of thumb, if $p$ is close to 0.5, the binomial distribution will look almost Gaussian with $n = 10$. If $p$ is closer to 0.1 or 0.9 we'll need a value closer to $n = 50$, and if $p$ is much closer to 1 or 0 than that, a Gaussian approximation might not work very well until we have much more data.

This is useful for a number of reasons. One is that Gaussian variables are completely specified by their mean and variance: that is, if we know those two things, we can figure out everything else about the distribution (probabilities, etc.). So, if we know a particular random variable is Gaussian (or approximately Gaussian), all we have to do is compute its mean and variance to know everything about it.

2.1.3 Sampling Distributions

Going back to binomial variables, let's think about the distribution of $\hat{p}$ (remember that this is a random quantity since it depends on our observations, which are random). Figure 2.1a shows the sampling distribution of $\hat{p}$ for a case where we flip a coin that we hypothesize is fair (i.e. the true value $p$ is 0.5). There are typically two ways we use such sampling distributions: to obtain confidence intervals and to perform significance tests.

Figure 2.1: (a) The sampling distribution of the estimator $\hat{p}$: i.e. the distribution of values for $\hat{p}$ given a fixed true value $p = 0.5$. (b) The 95% confidence interval for a particular observed $\hat{p}$ of 0.49 (with a true value of $p = 0.5$). Note that in this case, the interval contains the true value $p$. Whenever we draw a set of samples, there's a 95% chance that the interval that we get is good enough to contain the true value $p$.
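As a quick numerical check of the Gaussian-approximation rule of thumb from Section 2.1.2 (a sketch with made-up $(n, p)$ pairs, not a figure from these notes), we can compare the exact binomial probabilities of $n\hat{p}$ to a Gaussian density with the same mean and variance:

```python
import numpy as np
from scipy import stats

# Compare the Binomial(n, p) pmf to a Gaussian with matching mean and variance.
for n, p in [(10, 0.5), (50, 0.1), (50, 0.9)]:
    k = np.arange(n + 1)
    binom_pmf = stats.binom.pmf(k, n, p)
    normal_pdf = stats.norm.pdf(k, loc=n * p, scale=np.sqrt(n * p * (1 - p)))
    print(f"n={n}, p={p}: largest pointwise gap = {np.abs(binom_pmf - normal_pdf).max():.4f}")
```

The pointwise gaps are small for these settings; they grow if $p$ moves much closer to 0 or 1 without increasing $n$.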
2.1.4 Confidence intervals

Suppose we observe a value $\hat{p}$ from our data, and want to express how certain we are that $\hat{p}$ is close to the true parameter $p$. We can think about how often the random quantity $\hat{p}$ will end up within some distance of the fixed but unknown $p$. In particular, we can ask for an interval around $\hat{p}$ for any sample so that in 95% of samples, the true mean $p$ will lie inside this interval. Such an interval is called a confidence interval. Notice that we chose the number 95% arbitrarily: while this is a commonly used value, the methods we'll discuss can be used for any confidence level.

We've established that the random quantity $\hat{p}$ is approximately Gaussian with mean $p$ and variance $p(1-p)/n$. We also know from last time that the probability of a Gaussian random variable being within about 2 standard deviations of its mean is about 95%. This means that there's a 95% chance of $\hat{p}$ being less than $2\sqrt{p(1-p)/n}$ away from $p$. So, we'll define the interval

$$\hat{p} \pm \underbrace{2}_{\text{coeff.}} \underbrace{\sqrt{\frac{p(1-p)}{n}}}_{\text{std. dev.}}. \tag{2.3}$$

With probability 95%, we'll get a $\hat{p}$ that gives us an interval containing $p$.

What if we wanted a 99% confidence interval? Since $\hat{p}$ is approximately Gaussian, its probability of being within 3 standard deviations from its mean is about 99%. So, the 99% confidence interval for this problem would be

$$\hat{p} \pm \underbrace{3}_{\text{coeff.}} \underbrace{\sqrt{\frac{p(1-p)}{n}}}_{\text{std. dev.}}. \tag{2.4}$$

We can define similar confidence intervals, where the standard deviation remains the same, but the coefficient depends on the desired confidence. While our variables being Gaussian makes this relationship easy for 95% and 99%, in general we'll have to look up or have our software compute these coefficients.

But there's a problem with these formulas: they require us to know $p$ in order to compute confidence intervals! Since we don't actually know $p$ (if we did, we wouldn't need a confidence interval), we'll approximate it with $\hat{p}$, so that (2.3) becomes

$$\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \tag{2.5}$$

This approximation is reasonable if $\hat{p}$ is close to $p$, which we expect to normally be the case. If the approximation is not as good, there are several more robust (but more complex) ways to compute the confidence interval.
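As a concrete sketch (Python with numpy/scipy; the data vector below is made up), here is the plug-in interval (2.5), using the exact Gaussian coefficient 1.96 in place of the rounded 2:

```python
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0])  # toy yes/no answers
n, p_hat = len(x), x.mean()

se = np.sqrt(p_hat * (1 - p_hat) / n)   # plug-in standard error from (2.5)
z = stats.norm.ppf(0.975)               # ~1.96: the exact 95% coefficient
print(f"95% CI: [{p_hat - z * se:.3f}, {p_hat + z * se:.3f}]")
```

This is the simple normal-approximation interval; the "more robust" alternatives mentioned above (e.g. the Wilson interval) differ mainly when $n$ is small or $\hat{p}$ is near 0 or 1.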
Figure 2.2: Multiple 95% confidence intervals computed from different sets of data, each with the same true parameter $p = 0.4$ (shown by the horizontal line). Each confidence interval represents what we might have gotten if we had collected new data and then computed a confidence interval from that new data. Across different datasets, about 95% of the intervals contain the true value $p$. But once we have a confidence interval, we can't draw any conclusions about where in the interval the true value is.

Interpretation

It's important not to misinterpret what a confidence interval is! This interval tells us nothing about the distribution of the true parameter $p$. In fact, $p$ is a fixed (i.e., deterministic) unknown number! Imagine that we sampled $n$ values for $x_i$ and computed $\hat{p}$ along with a 95% confidence interval. Now imagine that we repeated this whole process a huge number of times (including sampling new values for $x_i$). Then about 5% of the confidence intervals constructed won't actually contain the true $p$. Furthermore, if $p$ is in a confidence interval, we don't know where exactly within the interval it is. Similarly, widening a 95% confidence interval into a 99% confidence interval doesn't mean that there's a 4% chance that $p$ is in the extra little area that you added! The next example illustrates this.

In summary, a 95% confidence interval gives us a region where, had we redone the survey from scratch, then 95% of the time the true value $p$ would be contained in the interval. This is illustrated in Figure 2.2.
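The repeated-survey interpretation is easy to check by simulation. Here is a minimal sketch (made-up values of $p$, $n$, and the number of repeats) that mirrors Figure 2.2: rebuild the interval (2.5) from fresh data each time and count how often it covers the fixed $p$.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, n, n_repeats = 0.4, 200, 2000   # p = 0.4 as in Figure 2.2; n and n_repeats are arbitrary

covered = 0
for _ in range(n_repeats):
    x = rng.binomial(1, p_true, size=n)             # a fresh survey
    p_hat = x.mean()
    half_width = 2 * np.sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - half_width <= p_true <= p_hat + half_width)

print(covered / n_repeats)   # roughly 0.95: about 95% of the intervals contain the fixed p
```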
2.1.5 Hypothesis testing

Suppose we have a hypothesized or baseline value $p_0$ and obtain from our data a value $\hat{p}$ that's smaller than $p_0$. If we're interested in reasoning about whether $\hat{p}$ is significantly smaller than $p_0$, one way to quantify this would be to assume the true value were $p_0$ and then compute the probability of getting a value smaller than or as small as the one we observed (we can do the same thing for the case where $\hat{p}$ is larger). If this probability is very low, we might think the hypothesized value $p_0$ is incorrect. This is the hypothesis testing framework. We begin with a null hypothesis, which we call $H_0$ (in this example, the hypothesis that the true proportion is in fact $p_0$) and an alternative hypothesis, which we call $H_1$ or $H_a$ (in this example, the hypothesis that the true proportion is significantly smaller than $p_0$).

Usually (but not always), the null hypothesis corresponds to a baseline or boring finding, and the alternative hypothesis corresponds to some interesting finding. Once we have the two hypotheses, we'll use the data to test which hypothesis we should believe. Significance is usually defined in terms of a probability threshold $\alpha$, such that we deem a particular result significant if the probability of obtaining that result under the null distribution is less than $\alpha$. A common value for $\alpha$ is 0.05, corresponding to a 1/20 chance of error. Once we obtain a particular result, the probability, computed under the null hypothesis, of obtaining a result at least that extreme is known as a p-value.

This framework is typically used when we want to disprove the null hypothesis and show that the value we obtained is significantly different from the null value. In the case of polling, this may correspond to showing that a candidate has significantly more than 50% support. In the case of a drug trial, it may correspond to showing that the recovery rate for patients given a particular drug is significantly more than some baseline rate. (A minimal numerical sketch of such a test appears after the definitions below.)

Here are some definitions:

- In a one-tailed hypothesis test, we choose one direction for our alternative hypothesis: we either hypothesize that the test statistic is significantly big, or that the test statistic is significantly small.

- In a two-tailed hypothesis test, our alternative hypothesis encompasses both directions: we hypothesize that the test statistic is simply different from the predicted value.

- A false positive or Type I error happens when the null hypothesis is true, but we reject it. Note that the probability of a Type I error is $\alpha$.

- A false negative or Type II error happens when the null hypothesis is false, but we fail to reject it.[2]

- The statistical power of a test is the probability of rejecting the null hypothesis when it's false (or equivalently, 1 − (probability of a Type II error)). Power is usually computed based on a particular assumed value for the quantity being tested: if the true value is actually some alternative value $p_a$, then the power of the test is the probability of rejecting the null hypothesis when the data are generated with $p = p_a$. It also depends on the threshold determined by $\alpha$. Power is often useful when deciding how many samples to acquire in an experiment, as we'll see later.

[2] Notice our careful choice of words here: if our result isn't significant, we can't say that we "accept" the null hypothesis. The hypothesis testing framework only lets us say that we fail to reject it.
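Here's a minimal one-tailed test sketch in this framework (Python with scipy; the poll numbers are hypothetical), using the Gaussian approximation to the sampling distribution of $\hat{p}$ under the null:

```python
import numpy as np
from scipy import stats

# Null hypothesis: p = p0 = 0.5; alternative: p > 0.5 (one-tailed).
p0, n, successes = 0.5, 400, 220          # hypothetical poll: 220 of 400 support the candidate
p_hat = successes / n

se0 = np.sqrt(p0 * (1 - p0) / n)          # std. dev. of p-hat under the null
z = (p_hat - p0) / se0
p_value = stats.norm.sf(z)                # probability of a result at least this large under H0
print(z, p_value)                         # z = 2.0, p-value ~ 0.023 < 0.05, so we reject H0
```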
Figure 2.3: An illustration of statistical power in a one-sided hypothesis test on the proportion $p$, showing the null value $p_0$, the rejection threshold $p^*$, and the alternative value $p_a$.

Example

The concepts above are illustrated in Figure 2.3. Here, the null hypothesis $H_0$ is that $p = p_0$, and the alternative hypothesis $H_a$ is that $p > p_0$: this is a one-sided test. In particular, we'll use the value $p_a$ as the alternative value so that we can compute power. The null distribution is shown on the left, and an alternative distribution is shown on the right. The $\alpha = 0.05$ rejection threshold is shown as $p^*$.

When the null hypothesis is true, $\hat{p}$ is generated from the null (left) distribution, and we make the correct decision if $\hat{p} < p^*$, and make a Type I error (false positive) otherwise. When the alternative hypothesis is true, and if the true proportion $p$ is actually $p_a$, then $\hat{p}$ is generated from the right distribution, and we make the correct decision when $\hat{p} > p^*$, and make a Type II error (false negative) otherwise. The power is the probability of making the correct decision when the alternative hypothesis is true.

In the figure, the probability of a Type I error (false positive) is shown in blue, the probability of a Type II error (false negative) is shown in red, and the power is shown in yellow and blue combined (it's the area under the right curve minus the red part).

Notice that a threshold usually balances between Type I and Type II errors: if we always reject the null hypothesis, then the probability of a Type I error is 1 and the probability of a Type II error is 0, and vice versa if we always fail to reject the null hypothesis.
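To make Figure 2.3 concrete, here's a small sketch (hypothetical $p_0$, $p_a$, $n$, and $\alpha$; again using the Gaussian approximation) that computes the threshold $p^*$, the Type I and Type II error probabilities, and the power:

```python
import numpy as np
from scipy import stats

p0, p_a, n, alpha = 0.5, 0.56, 400, 0.05      # hypothetical values for illustration

se0 = np.sqrt(p0 * (1 - p0) / n)              # std. dev. of p-hat under the null
se_a = np.sqrt(p_a * (1 - p_a) / n)           # std. dev. of p-hat under the alternative

p_star = p0 + stats.norm.ppf(1 - alpha) * se0             # rejection threshold p*
type_I = stats.norm.sf(p_star, loc=p0, scale=se0)         # = alpha by construction
power = stats.norm.sf(p_star, loc=p_a, scale=se_a)        # P(reject H0 | p = p_a)
type_II = 1 - power
print(f"p* = {p_star:.3f}, Type I = {type_I:.3f}, Type II = {type_II:.3f}, power = {power:.3f}")
```

Increasing $n$ shrinks both sampling distributions, which moves $p^*$ closer to $p_0$ and pushes the power toward 1 for the same $p_a$.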
Example: Drug therapy results: a warning about data collection

Figure 2.4: Results of a simulated drug trial measuring the effects of statin drugs on lifespan. The top figure shows the lifespan of subjects who did not receive treatment, and the bottom figure shows the lifespan of subjects who did receive it.

Figure 2.4 shows results from a simulated drug trial.[a] At first glance, it seems clear that people who received the drug (bottom) tended to have a higher lifespan than people who didn't (top), but it's important to look at hidden confounders! In this simulation, the drug actually had no effect, but the disease occurred more often in older people: these older people had a higher average lifespan simply because they had to live longer to get the drug. Any statistical test we perform will say that the second distribution has a higher mean than the first, but this is not because of the treatment; it's because of how we sampled the data!

[a] Figure from: Støvring et al., "Statin Use and Age at Death: Evidence of a Flawed Analysis." The American Journal of Cardiology, 2007.

2.2 Continuous random variables

So far we've only talked about binomial random variables, but what about continuous random variables? Let's focus on estimating the mean of a random variable given $n$ observations of it. As you can probably guess, our estimator will be

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

We'll start with the case where we know the true population standard deviation; call it $\sigma$. This is somewhat unrealistic, but it'll help us set up the more general case.

2.2.1 When σ is known

Consider $n$ i.i.d. Gaussian samples $x_1, \ldots, x_n$, all with mean $\mu$ and variance $\sigma^2$. We'll compute the sample mean $\hat{\mu}$, and use it to draw conclusions about the true mean $\mu$.
Just like $\hat{p}$, $\hat{\mu}$ is a random quantity. Its expectation, which we computed in Chapter 1, is $\mu$. Its variance is

$$\mathrm{var}[\hat{\mu}] = \mathrm{var}\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{var}[x_i] = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{\sigma^2}{n}. \tag{2.6}$$

This quantity (or to be exact, the square root of this quantity) is known as the standard error of the mean. In general, the standard deviation of the sampling distribution of a particular statistic is called the standard error of that statistic.

Since $\hat{\mu}$ is the sum of many independent random variables, it's approximately Gaussian. If we subtract its mean $\mu$ and divide by its standard deviation $\sigma/\sqrt{n}$ (both of which are deterministic), we'll get a standard normal random variable. This will be our test statistic:

$$z = \frac{\hat{\mu} - \mu}{\sigma/\sqrt{n}}. \tag{2.7}$$

Hypothesis testing

In the case of hypothesis testing, we know $\mu$ (it's the mean of the null distribution), and we can compute the probability of getting $z$ or something more extreme. Your software of choice will typically do this by using the fact that $z$ has a standard normal distribution, and report the probability to you. This is known as a z-test.

Confidence intervals

What about a confidence interval? Since $z$ is a standard normal random variable, it has probability 0.95 of being within 2 standard deviations of its mean. We can compute the confidence interval by manipulating a bit of algebra:

$$P(-2 \le z \le 2) \approx 0.95$$
$$P\left(-2 \le \frac{\hat{\mu} - \mu}{\sigma/\sqrt{n}} \le 2\right) \approx 0.95$$
$$P\left(-2\,\frac{\sigma}{\sqrt{n}} \le \hat{\mu} - \mu \le 2\,\frac{\sigma}{\sqrt{n}}\right) \approx 0.95$$
$$P\Big(\hat{\mu} - \underbrace{2}_{\text{coeff.}}\underbrace{\tfrac{\sigma}{\sqrt{n}}}_{\text{std. dev.}} \le \mu \le \hat{\mu} + \underbrace{2}_{\text{coeff.}}\underbrace{\tfrac{\sigma}{\sqrt{n}}}_{\text{std. dev.}}\Big) \approx 0.95$$

This says that the probability that $\mu$ is within the interval $\hat{\mu} \pm 2\frac{\sigma}{\sqrt{n}}$ is 0.95. But remember: the only thing that's random in this story is $\hat{\mu}$! So when we use the word "probability" here, it's referring only to the randomness in $\hat{\mu}$. Don't forget that $\mu$ isn't random!
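A small sketch of both uses (Python with numpy/scipy; the data, the "known" σ, and the null mean are all made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma, mu0 = 4.0, 10.0                     # assumed known population std. dev.; hypothesized mean
x = rng.normal(10.8, sigma, size=50)       # toy data whose true mean differs slightly from mu0

mu_hat = x.mean()
se = sigma / np.sqrt(len(x))               # standard error of the mean, from (2.6)

z = (mu_hat - mu0) / se                    # test statistic (2.7)
p_value = 2 * stats.norm.sf(abs(z))        # two-tailed z-test
ci = (mu_hat - 1.96 * se, mu_hat + 1.96 * se)   # 95% confidence interval
print(z, p_value, ci)
```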
Also, remember that we chose the confidence level 0.95 (and therefore the threshold 2) somewhat arbitrarily, and we could just as easily compute a 99% confidence interval (which would correspond to a threshold of about 3) or an interval for any other level of confidence: we could compute the threshold by using the standard normal distribution.

Finally, note that for a two-tailed hypothesis test, the threshold at which we declare significance for some particular $\alpha$ corresponds exactly to the endpoints of a confidence interval with confidence level $1 - \alpha$: we reject the null value precisely when it falls outside that interval. Can you show why this is true?

Statistical power

If we get to choose the number of observations $n$, how do we pick it to ensure a certain level of statistical power in a hypothesis test? Suppose we choose $\alpha$ and a corresponding rejection threshold. Since the width of the sampling distribution is controlled by $n$, by choosing $n$ large enough, we can achieve enough power for particular values of the alternative mean (see the sketch after the example below). The following example illustrates the effect that sample size has on significance thresholds.

Example: Fertility clinics

Figure 2.5: A funnel plot showing conception statistics from fertility clinics in the UK. The x-axis indicates the sample size; in this case that's the number of conception attempts (cycles). The y-axis indicates the quantity of interest; in this case that's the success rate for conceiving. The funnels (dashed lines) indicate thresholds for being significantly different from the null value of 32% (the national average). This figure comes from http://understandinguncertainty.org/fertility.

Figure 2.5 is an example of a funnel plot. We see that with a small number of samples, it's difficult to judge any of the clinics as significantly different from the baseline value, since exceptionally high/low values could just be due to chance. However, as the number of cycles increases, the probability of consistently obtaining large values by chance decreases, and we can declare clinics like Lister and CARE Nottingham significantly better than average: while other clinics have similar success rates over fewer cycles, these two have a high success rate over many cycles. So, we can be more certain that the higher success rates are not just due to chance and are in fact meaningful.
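Returning to the sample-size question above: a minimal sketch (one-sided z-test with known σ; all numbers are hypothetical) using the standard formula $n \ge \big((z_{\alpha} + z_{\beta})\,\sigma/(\mu_a - \mu_0)\big)^2$:

```python
import numpy as np
from scipy import stats

# One-sided z-test of H0: mu = mu0 against mu > mu0, with known sigma.
mu0, mu_a, sigma = 0.0, 0.5, 2.0           # hypothetical effect size and std. dev.
alpha, target_power = 0.05, 0.8

z_alpha = stats.norm.ppf(1 - alpha)        # threshold coefficient for the test
z_beta = stats.norm.ppf(target_power)      # quantile corresponding to the desired power

n = int(np.ceil(((z_alpha + z_beta) * sigma / (mu_a - mu0)) ** 2))
print(n)   # ~99 samples needed under these assumptions
```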
2.2.2 When σ is unknown

In general, we won't know the true population standard deviation beforehand. We'll solve this problem by using the sample standard deviation: this means using $\hat{\sigma}^2/n$ instead of $\sigma^2/n$ for $\mathrm{var}(\hat{\mu})$. Throughout these notes, we'll refer to this quantity as the standard error of the mean (as opposed to the version given in Equation (2.6)). But once we replace the fixed $\sigma$ with the random $\hat{\sigma}$ (which we'll also write as $s$), our test statistic (Equation (2.7)) becomes

$$t = \frac{\hat{\mu} - \mu}{\hat{\sigma}/\sqrt{n}}. \tag{2.8}$$

Since the numerator and denominator are both random, this is no longer Gaussian. The denominator involves a (scaled) $\chi^2$-distributed quantity[3], and the overall statistic is t-distributed. In this case, our t distribution has $n - 1$ degrees of freedom.

[3] In fact, the quantity $(n-1)\hat{\sigma}^2/\sigma^2$ is $\chi^2$-distributed with $n-1$ degrees of freedom, and the test statistic $t = \frac{\hat{\mu} - \mu}{\sigma/\sqrt{n}} \cdot \frac{\sigma}{\hat{\sigma}}$ is therefore t-distributed.

Confidence intervals and hypothesis tests proceed just as in the known-σ case with only two changes: using $\hat{\sigma}$ instead of $\sigma$, and using a t distribution with $n - 1$ degrees of freedom instead of a Gaussian distribution. The confidence interval requires only $\hat{\mu}$ and the standard error $\hat{\sigma}/\sqrt{n}$, while the hypothesis test also requires a hypothesis, in the form of a value for $\mu$. For example, a 95% confidence interval might look like

$$\hat{\mu} \pm t^* \frac{\hat{\sigma}}{\sqrt{n}}. \tag{2.9}$$

To determine the coefficient $t^*$, we need to know the values between which a t distribution has 95% of its probability. This depends on the degrees of freedom (the only parameter of the t distribution) and can easily be looked up in a table or computed by any software package. For example, if $n = 10$, then the t distribution has $n - 1 = 9$ degrees of freedom, and $t^* = 2.26$. Notice that this produces a wider interval than the corresponding Gaussian-based confidence interval from before: if we don't know the standard deviation and have to estimate it, we're then less certain about our estimate $\hat{\mu}$.

To derive the t-test, we assumed that our data points were normally distributed. But the t-test is fairly robust to violations of this assumption.
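A brief sketch of both the t-based interval (2.9) and a one-sample t-test (Python with scipy; the data and the hypothesized mean are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(10.8, 4.0, size=10)           # small toy sample; sigma is treated as unknown
n, mu_hat, s = len(x), x.mean(), x.std(ddof=1)

# 95% confidence interval using a t distribution with n-1 degrees of freedom, as in (2.9)
t_star = stats.t.ppf(0.975, df=n - 1)        # ~2.26 for n = 10
ci = (mu_hat - t_star * s / np.sqrt(n), mu_hat + t_star * s / np.sqrt(n))

# One-sample t-test of H0: mu = 10 (two-tailed)
t_stat, p_value = stats.ttest_1samp(x, popmean=10.0)
print(ci, t_stat, p_value)
```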
2.3 Two-sample tests

So far, we've looked at the case of having one sample and determining whether it's significantly different from some hypothesized amount. But what about the case where we're interested in the difference between two samples? Here, we're usually interested in testing whether the difference is significantly different from zero. There are a few different ways of dealing with this, depending on the underlying data; a short sketch of all three settings appears after this list.

- In the case of matched pairs, we have a "before" value and an "after" value for each data point (for example, the scores of students before and after a class). Matching the pairs helps control the variance due to other factors, so we can simply look at the difference for each data point, $x_i^{\text{post}} - x_i^{\text{pre}}$, and perform a one-sample test against a null mean of 0.

- In the case of two samples with pooled variance, the means of the two samples might be different (this is usually the hypothesis we test), but the variances of each sample are assumed to be the same. This assumption allows us to combine, or pool, all the data points when estimating the sample variance. So, when computing the standard error, we'll use this formula:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$

Our test statistic is then

$$t = \frac{\hat{\mu}^{(1)} - \hat{\mu}^{(2)}}{s_p\sqrt{(1/n_1) + (1/n_2)}}.$$

This test still provides reasonably good power, since we're using all the data to estimate $s_p$. In this setting, where the two groups have the same variance, we say the data are homoskedastic.

- In the general case of two samples with separate (not pooled) variances, the variances must be estimated separately. The result isn't quite a t distribution (it's usually approximated by one with an adjusted number of degrees of freedom), and this variant is often known as Welch's t-test. It's important to keep in mind that this test will have lower statistical power, since we're using less data to estimate each quantity. But unless you have solid evidence that the variances are in fact equal, it's best to be conservative and stick with this test. In this setting, where the two groups have different variances, we say the data are heteroskedastic.
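Here is the promised sketch of the three settings, using scipy's built-in tests (the data below are simulated placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Matched pairs: test the per-subject differences against a null mean of 0.
pre = rng.normal(70, 10, size=30)                 # hypothetical before/after scores
post = pre + rng.normal(2, 5, size=30)
print(stats.ttest_rel(post, pre))                 # same as ttest_1samp(post - pre, 0)

# Two independent samples with pooled variance (assumes homoskedasticity).
a = rng.normal(5.0, 2.0, size=40)
b = rng.normal(6.0, 2.0, size=35)
print(stats.ttest_ind(a, b, equal_var=True))

# Welch's t-test: variances estimated separately (the conservative default).
print(stats.ttest_ind(a, b, equal_var=False))
```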
2.4 Some important warnings for hypothesis testing

- Correcting for multiple comparisons (very important): suppose you conduct 20 tests at a significance level of 0.05. Then on average, just by chance, even if the null hypothesis is true in every case, one of the tests will show a significant difference (see the relevant xkcd). There are a few standard ways of addressing this issue (a small sketch of both corrections appears after the example below):

  - Bonferroni correction: if we're doing m tests, use a significance level of α/m instead of α. Note that this is very conservative, and will dramatically reduce the number of acceptances.

  - False discovery rate (Benjamini-Hochberg): this technique controls the overall false discovery rate at level α by sorting the p-values and using the very small ones to allow slightly larger ones through as well.

- Rejecting the null hypothesis: you can never be completely sure that the null hypothesis is false from using a hypothesis test! Any statement stronger than "the data do not support the null hypothesis" should be made with extreme caution.

- Practical vs statistical significance: with large enough $n$, any minutely small difference can be made statistically significant. The first example below demonstrates this point. Sometimes small differences like this matter (e.g., in close elections), but many times they don't.

- Independent and identically distributed: many of our derivations and methods depend on samples being independent and identically distributed. There are ways of changing the methods to account for dependent samples, but it's important to be aware of the assumptions you need to use a particular method or test.

Example: Practical vs statistical significance

Suppose we are testing the fairness of a coin. Our null hypothesis might be $p = 0.5$. We collect 1,000,000 data points, observe a sample proportion $\hat{p} = 0.501$, and run a significance test. The large number of samples would lead to a p-value of about 0.03. At a 5% significance level, we would declare this significant. But for practical purposes, even if the true proportion were in fact 0.501, the coin is almost as good as fair. In this case, the statistical significance we obtained does not correspond to a practically significant difference. Figure 2.6 illustrates the null sampling distribution and the sampling distribution assuming a proportion of $p = 0.501$.

Figure 2.6: Sampling distributions for $p = 0.5$ (black) and $p = 0.501$ (blue) for $n = 1{,}000{,}000$. Note the scale of the x-axis: the large number of samples dramatically reduces the variance of each distribution.
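As promised above, here's a minimal sketch of the two multiple-comparison corrections (plain numpy; the list of p-values is invented, and the Benjamini-Hochberg function is a from-scratch implementation of the standard step-up procedure, not code from these notes):

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject hypotheses whose p-values survive the Bonferroni threshold alpha/m."""
    p = np.asarray(p_values)
    return p < alpha / len(p)

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject hypotheses using the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m        # alpha * k / m for the k-th smallest
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.nonzero(below)[0].max()              # largest rank meeting its threshold
        reject[order[:k_max + 1]] = True                # reject everything up to that rank
    return reject

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.74]   # made-up p-values from 8 tests
print(bonferroni(p_vals))           # only the very smallest p-value survives alpha/m
print(benjamini_hochberg(p_vals))   # lets a slightly larger one (0.008) through as well
```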
Example: Pitfall of the day: Interpretation fallacies and Sally Clark

In the late 1990s, Sally Clark was convicted of murder after both her sons died suddenly within a few weeks of birth. The prosecutors made two main claims:

- The probability of two children independently dying suddenly from natural causes like Sudden Infant Death Syndrome (SIDS) is 1 in 73 million. Such an event would occur by chance only once every 100 years, which was taken as evidence that the deaths were not natural.

- If the deaths were not due to two independent cases of SIDS (as asserted above), the only other possibility was that the children were murdered.

The assumption of independence in the first claim was later shown to be incorrect: the two children were not only genetically similar but were also raised in similar environments, causing dependence between the two events. This wrongful assumption of independence is a common error in statistical analysis. Once the dependence is accounted for, the probability of two natural deaths goes up dramatically.[a]

Also, showing the unlikeliness of two chance deaths does not imply any particular alternative! Even if it were true, it doesn't make sense to consider the "1 in 73 million" claim by itself: it has to be compared to the probability of two murders (which was later estimated to be even lower). This second error is known as the prosecutor's fallacy. In fact, tests later showed bacterial infection in one of the children!

[a] See "Royal Statistical Society concerned by issues raised in Sally Clark Case", October 2001.