Hypothesis Testing: General Framework


Lecture 2
K. Zuev
February 22, 26

In previous lectures we learned how to estimate parameters in parametric and nonparametric settings. Quite often, however, researchers are interested in checking a certain statement about a parameter rather than in its exact value. Suppose, for instance, that someone developed a new drug for reducing blood pressure. Let $\theta$ denote the average change in a patient's blood pressure after taking the drug. The big question is to test

$$H_0 : \theta = 0 \quad \text{versus} \quad H_1 : \theta \neq 0. \tag{1}$$

The hypothesis $H_0$ is called the null hypothesis. It states that, on average, the new treatment has zero effect [2] on blood pressure. The alternative hypothesis [3] $H_1$ states that there is some effect. In this context, testing $H_0$ against $H_1$ is the primary problem. Even if we find out that $\theta \neq 0$ [4], estimating the value of $\theta$ is important, yet secondary. The part of statistics that deals with this sort of yes/no problem is called hypothesis testing. In this lecture, we discuss a general framework of hypothesis testing. To get started, let us consider the following toy example, which will help us illustrate all the main notions and ideas.

[2] Hence the name "null".
[3] Also sometimes called the research hypothesis.
[4] Hopefully, $\theta < 0$!

Two Coins Example

Suppose that Alice has two coins, a fair one and an unfair one, with probabilities of heads $p_0 = 0.5$ and $p_1 = 0.7$ respectively. Alice chooses one of the coins, tosses it $n = 10$ times, and tells Bob the number of heads, but does not tell him which coin she tossed. Based on the number of heads $k$, Bob has to decide which coin it was. Intuitively, it is clear that the larger $k = 0, 1, \ldots, n$, the more likely it was the unfair coin.

Figure 1: Alice and Bob are two archetypal characters commonly used in cryptography, game theory, physics, and now... in statistics.

If Alice tossed coin $i$ ($i = 0$ is fair and $i = 1$ is unfair), then the probability of getting exactly $k$ heads is given by the binomial distribution $\mathrm{Bin}(n, p_i)$:

$$P_i(k) = \binom{n}{k} p_i^k (1 - p_i)^{n-k}, \quad i = 0, 1. \tag{2}$$
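The binomial probabilities in (2) are easy to tabulate. The following is a minimal sketch, assuming $n = 10$ tosses and head probabilities 0.5 (fair coin) and 0.7 (unfair coin); the helper names `binom_pmf` and `pmf_table` are illustrative and not part of the lecture. The last column, the ratio of the two probabilities, is added only for comparing the two coins.

```python
from math import comb

N_TOSSES = 10                  # n, assumed from the two-coin example
P_FAIR, P_UNFAIR = 0.5, 0.7    # head probabilities of the two coins


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n tosses when P(heads) = p, as in (2)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def pmf_table(n: int = N_TOSSES):
    """Rows (k, P_fair(k), P_unfair(k), P_fair(k) / P_unfair(k)) for k = 0, ..., n."""
    rows = []
    for k in range(n + 1):
        fair = binom_pmf(k, n, P_FAIR)
        unfair = binom_pmf(k, n, P_UNFAIR)
        rows.append((k, fair, unfair, fair / unfair))
    return rows


for k, fair, unfair, ratio in pmf_table():
    print(f"k={k:2d}  fair={fair:.4f}  unfair={unfair:.4f}  fair/unfair={ratio:8.3f}")
```

A ratio above one means the observed number of heads is better explained by the fair coin, and a ratio below one points to the unfair coin.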
Figure 2 shows the values of these probabilities for different $k$. Suppose that Bob observed only $k = 2$ heads. Then

$$\frac{P_0(k = 2)}{P_1(k = 2)} \approx 30, \tag{3}$$

and, therefore, the fair coin is about 30 times more likely to produce this result than the unfair one. On the other hand, if there were $k = 8$ heads, then

$$\frac{P_0(k = 8)}{P_1(k = 8)} \approx 0.19, \tag{4}$$

which would favor the unfair coin.

Figure 2: Probabilities (2) for coin 0 and coin 1. Axes: number of heads (horizontal), probability (vertical).

So, based on Fig. 2, Bob should guess that the coin is unfair if

$$k \in \{7, 8, 9, 10\}, \tag{5}$$

and fair otherwise. These are exactly the values of $k$ for which the ratio $P_0(k)/P_1(k)$ is below one. This is the simplest example of testing.

General Framework

Suppose that data $X_1, \ldots, X_n$ is modeled as a sample from a distribution $f \in \mathcal{F}$ [5]. Let $\theta$ be the parameter of interest, and $\Theta$ be the set of all its possible values, called the parameter space. Let $\Theta = \Theta_0 \sqcup \Theta_1$ be a partition of the parameter space into two disjoint sets [6]. Suppose we wish to test

$$H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \in \Theta_1. \tag{6}$$

We call $H_0$ the null hypothesis and $H_1$ the alternative hypothesis. Let $\Omega$ be the sample space, i.e. the range of the data, $X = (X_1, \ldots, X_n) \in \Omega$. We test a hypothesis by finding an appropriate subset of outcomes $R \subset \Omega$, called the rejection region:

$$\text{if } X \in R \Rightarrow \text{reject } H_0, \qquad \text{if } X \notin R \Rightarrow \text{accept } H_0. \tag{7}$$

Usually, the rejection region has the following form:

$$R = \{X \in \Omega : s(X) > c\}, \tag{8}$$

where $s$ is a test statistic and $c$ is a critical value. The problem of testing then boils down to finding an appropriate statistic $s$ and an appropriate critical value $c$.

[5] The statistical model $\mathcal{F}$ can be either parametric or nonparametric.
[6] Recall that $A = B \sqcup C$ means that $A = B \cup C$ and $B \cap C = \emptyset$.
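A test of the form (8) is fully described by a statistic and a critical value, which maps directly onto code. The sketch below is one possible reading, not the lecture's own implementation: `make_test` is a hypothetical helper, and Bob's rule (5) is encoded by taking "the coin is fair" as $H_0$, the number of heads as the statistic, and $c = 6$ as the critical value (any $c$ in $[6, 7)$ gives the same rule). The toss record is made up for illustration.

```python
from typing import Callable, Sequence


def make_test(statistic: Callable[[Sequence[float]], float], critical_value: float):
    """Build a decision rule with rejection region R = {x : statistic(x) > critical_value}."""

    def decide(data: Sequence[float]) -> str:
        # Reject H0 exactly when the data falls into the rejection region.
        return "reject H0" if statistic(data) > critical_value else "accept H0"

    return decide


# Bob's rule (5): data is a record of 10 tosses (1 = heads, 0 = tails),
# the statistic is the number of heads, and the critical value is assumed to be 6.
bob_test = make_test(statistic=sum, critical_value=6)

tosses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # hypothetical record with 8 heads
print(bob_test(tosses))  # prints "reject H0"
```

Separating the statistic from the critical value mirrors the two design choices named in the text: first pick $s$, then tune $c$.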

In the two coin example, the data is the total number of heads $X = k$, which is modeled as a sample from the binomial distribution $\mathrm{Bin}(n, \theta)$, where $n = 10$ and $\theta \in \Theta = \{0.5, 0.7\}$. The null hypothesis is that the coin is fair, $H_0 : \theta \in \Theta_0 = \{0.5\}$, and the alternative is that the coin is not fair, $H_1 : \theta \in \Theta_1 = \{0.7\}$. The sample space is $\Omega = \{0, 1, \ldots, 10\}$. Bob tested the hypothesis using the rejection region $R$ given by (5) [7].

[7] What is the test statistic and the critical value in this example?

The Null and Alternative

Mathematically, the null and alternative hypotheses seem to play symmetric roles. Traditionally, however, the null hypothesis $H_0$ says that nothing interesting is going on [8]: the current theory is correct, there are no new effects, etc. The null hypothesis is the status quo. The alternative hypothesis $H_1$, on the other hand, says that something interesting, something unexpected, is happening: the old theory needs to be updated, new previously unseen effects are present, etc. [9] It is useful to think of hypothesis testing as a legal trial. By default, we assume that someone is innocent (the null hypothesis) unless there is strong evidence that s/he is guilty (the alternative) [10].

[8] Recall the drug example from the beginning. $H_0$ says the new drug has no effect on blood pressure.
[9] This explains why we focus on the rejection region and not the acceptance region. The rejection region is where the surprise lives.
[10] Presumption of innocence.

Question: Suppose an engineer designed a new earthquake-resistant building. Let $p_F$ be the failure probability of the building under earthquake excitation. How would you formulate the null and alternative hypotheses if you wish to test whether or not the failure probability is smaller than a certain acceptable threshold?

Errors in Testing

Can we guarantee that we make no errors when drawing conclusions from data? Of course not. Data provides some, but not full, information about the unknown quantity of interest and helps to reduce the uncertainty, but not to eliminate it completely. Errors are thus unavoidable [11]. There are two types of errors in hypothesis testing, with very boring names:

Type I error: rejecting $H_0$ when it is true.
Type II error: accepting $H_0$ when it is not true.

[11] The unfair coin may produce 5 heads, in which case Bob will make an error by accepting the hypothesis that the coin is fair.

Figure 3: Unfortunately, the presumption of innocence does not always work in real life.

Purely mathematically [12], both errors are equally bad. But, given the context discussed in the previous section, making a type I error is much worse than making a type II error: declaring an innocent person guilty is much worse than declaring a guilty person innocent. Probabilities of both errors can be computed using the so-called power function.

[12] That is, when we focus on equations and forget about the context.
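For Bob's test in the two-coin example, both error probabilities can be computed exactly from the binomial distribution and checked by simulation. The following is a minimal sketch under the example's assumptions ($n = 10$ tosses, head probabilities 0.5 and 0.7, rejection of "the coin is fair" when $k \geq 7$); the function names and the simulation settings are illustrative choices, not part of the lecture.

```python
import random
from math import comb

N = 10                 # number of tosses in the two-coin example
REJECT = range(7, 11)  # Bob rejects "the coin is fair" when k is 7, 8, 9 or 10


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_reject(p: float) -> float:
    """Exact probability that the number of heads lands in the rejection region."""
    return sum(binom_pmf(k, N, p) for k in REJECT)


type_1 = prob_reject(0.5)      # coin is fair, but Bob says "unfair"
type_2 = 1 - prob_reject(0.7)  # coin is unfair, but Bob says "fair"


def simulate_reject_rate(p: float, trials: int = 20_000, seed: int = 0) -> float:
    """Fraction of simulated experiments in which Bob rejects the fair-coin hypothesis."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        heads = sum(rng.random() < p for _ in range(N))
        rejections += heads in REJECT
    return rejections / trials


print(f"type I error:  exact {type_1:.4f}, simulated {simulate_reject_rate(0.5):.4f}")
print(f"type II error: exact {type_2:.4f}, simulated {1 - simulate_reject_rate(0.7):.4f}")
```

The exact values are about 0.17 for the type I error and about 0.35 for the type II error, so even this sensible rule is wrong a substantial fraction of the time.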

Power Function

If $R$ is the rejection region, then the probability of a type I error is

$$P(\text{Type I error}) = P(X \in R \mid \theta \in \Theta_0). \tag{9}$$

The probability of a type II error is

$$P(\text{Type II error}) = P(X \notin R \mid \theta \in \Theta_1) = 1 - P(X \in R \mid \theta \in \Theta_1). \tag{10}$$

From (9) and (10), we see that the probabilities of both errors are determined by a single function on the parameter space, $P(X \in R \mid \theta)$. This leads to the following definition.

Definition 1. The power function of a hypothesis test with rejection region $R$ is the function of $\theta$ defined by

$$\beta(\theta) = P(X \in R \mid \theta). \tag{11}$$

In terms of error probabilities:

$$\beta(\theta) = \begin{cases} P(\text{Type I error}), & \text{if } \theta \in \Theta_0, \\ 1 - P(\text{Type II error}), & \text{if } \theta \in \Theta_1. \end{cases} \tag{12}$$

The ideal test will thus have a power function which is zero on $\Theta_0$ and one on $\Theta_1$, see Fig. 4. This ideal is rarely (never) achieved in practice.

Figure 4: The ideal power function. Axes: parameter space (horizontal), power function $\beta(\theta)$ (vertical).

Example: In the two coin example, the parameter space is a two point set $\Theta = \{0.5, 0.7\}$, $\Theta_0 = \{0.5\}$, $\Theta_1 = \{0.7\}$, and the power function is

$$\beta(\theta) = P(k \in \{7, 8, 9, 10\} \mid \theta) = \sum_{k=7}^{10} \binom{10}{k} \theta^k (1 - \theta)^{10-k} \approx \begin{cases} 0.17, & \text{if } \theta = 0.5, \\ 0.65, & \text{if } \theta = 0.7. \end{cases} \tag{13}$$

This power function is not exactly what Bob would like to have, but in some sense (to be discussed later) this is the best possible test. In reality, a reasonable test has a power function near zero on $\Theta_0$ and near one on $\Theta_1$. So, qualitatively, the power function of a good test looks like the one in Fig. 5.

Figure 5: The power function of a reasonably good test of size $\alpha$. Axes: parameter space (horizontal), power function $\beta(\theta)$ (vertical).

Controlling Errors

Usually it is impossible to control both types of errors and make their probabilities arbitrarily small. Roughly, the reason is the following. Choosing a test is choosing the rejection region $R \subset \Omega$. If we want to make the type I error probability (9) smaller, we need to shrink $R$. In the extreme case, we can completely exclude the type I error by taking $R = \emptyset$. On the other hand, to make the type II error probability (10) smaller, we need to inflate $R$. By taking $R = \Omega$, we can guarantee that the type II error will not be made. So, typically, a decrease in the probability of one error leads to an increase in the probability of the other [13]. As we discussed previously, the type I error is more dangerous, and therefore controlling its probability is more important. This leads to the following definition.

[13] The provided intuition is rough because instead of shrinking and inflating $R$ we could move it around.

Definition 2. The size of a test with power function $\beta(\theta)$ is

$$\alpha = \sup_{\theta \in \Theta_0} \beta(\theta). \tag{14}$$

A test is said to have level $\alpha$ if its size is $\leq \alpha$ [14].

In words, the size of the test is the largest possible probability of the type I error (rejecting $H_0$ when it is true). See Fig. 5. Researchers usually specify the size of the test they wish to use [15] (to make sure that the type I error is under control), and then search for the test with the highest power under $H_1$ (i.e. on $\Theta_1$) among all tests with level $\alpha$. Such a test, if it exists, is called most powerful. Finding most powerful tests is hard and, in many cases, they don't even exist. So in practice, researchers use a test with power which is high enough.

[14] In practice, the terms size and level are often used interchangeably because both are upper bounds for the type I error probability.
[15] With typical choices being $\alpha = 0.01$, $0.05$, and $0.1$.

Example: Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, where $\sigma^2$ is known [16]. We want to test

$$H_0 : \mu \leq 0 \quad \text{versus} \quad H_1 : \mu > 0. \tag{15}$$

[16] i.e. estimated.

So, here $\Theta = \mathbb{R}$, $\Theta_0 = (-\infty, 0]$, and $\Theta_1 = (0, \infty)$. It seems reasonable to use the sample mean $\bar{X}_n$ as a test statistic and reject $H_0$ if $\bar{X}_n$ is large enough. The rejection region is thus

$$R = \{(X_1, \ldots, X_n) : \bar{X}_n > c\} \subset \Omega = \mathbb{R}^n, \tag{16}$$

where $c$ is the critical value.
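Before any formula is derived, the rule "reject when the sample mean exceeds $c$" can be tried out by simulation. The sketch below estimates how often the rule rejects for a few values of $\mu$. All numerical settings ($n = 10$, $\sigma = 1$, $c = 0.5$, the number of trials, and the grid of $\mu$ values) are hypothetical choices for illustration, and the function names are not from the lecture.

```python
import random
from statistics import fmean


def rejects(sample, c: float) -> bool:
    """Decision rule of the test: reject H0 when the sample mean exceeds c."""
    return fmean(sample) > c


def estimate_rejection_probability(mu: float, c: float, n: int, sigma: float = 1.0,
                                   trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(sample mean > c) when X_1, ..., X_n ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        hits += rejects(sample, c)
    return hits / trials


# Hypothetical settings: n = 10 observations, sigma = 1, critical value c = 0.5.
n, c = 10, 0.5
for mu in (-0.5, 0.0, 0.5, 1.0):
    print(f"mu = {mu:5.2f}: estimated rejection probability "
          f"{estimate_rejection_probability(mu, c, n):.3f}")
```

The estimates grow with $\mu$: rejections are rare when $\mu$ lies in $\Theta_0$ and frequent when $\mu$ is well inside $\Theta_1$, which is the qualitative shape a reasonable test should have.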
Let us find the power function of this test:

$$\beta(\mu) = P(\bar{X}_n > c \mid \mu). \tag{17}$$

Since $\bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, we have that

$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \sim N(0, 1).$$

Therefore,

$$\beta(\mu) = P\left(\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} > \frac{\sqrt{n}(c - \mu)}{\sigma}\right) = 1 - \Phi\left(\frac{\sqrt{n}(c - \mu)}{\sigma}\right), \tag{18}$$

where $\Phi$ is the standard normal CDF. The power function is an increasing function of $\mu$. It is shown in Fig. 6 for fixed $n$ and $\sigma$ and different values of $c$. As expected, when the rejection region (16) shrinks (the critical value $c$ increases), the size of the test $\alpha$ decreases, meaning that it becomes less and less likely to make the type I error. On the other hand, the type II error probability increases.

Figure 6: The normal power function (18) for fixed $n$ and $\sigma$ and different values of $c$. Notice that as $c$ increases (the rejection region shrinks), the size of the test decreases, as expected. Axes: parameter $\mu$ (horizontal), power function $\beta(\mu)$ (vertical).

To make a test with a specific size $\alpha$, we need to find the corresponding critical value $c$. Thanks to the monotonicity of $\beta$, $\alpha = \sup_{\mu \leq 0} \beta(\mu) = \beta(0)$. Together with (18), this gives an equation for $c$, whose solution is

$$c = \frac{\sigma \, \Phi^{-1}(1 - \alpha)}{\sqrt{n}}. \tag{19}$$

A halfway summary: the test which rejects $H_0$ whenever $\bar{X}_n > c$, where $c$ is given by (19), has size $\alpha$.

Suppose now that we can also control the sample size $n$ [17]. Note that the power function does depend on the sample size, and by choosing $n$ large enough we can hope to reduce the type II error probability. Since $\beta$ is continuous and $\beta(0) = \alpha$, we have $\beta(\mu) \approx \alpha$ in the neighborhood of zero, and, therefore, the type II error probability is large (close to $1 - \alpha$) in this neighborhood. However, we may step away from zero by a small $\delta > 0$ and ask the power function to be large at $\delta$:

$$\beta(\delta) = 1 - \epsilon, \tag{20}$$

where a small $\epsilon > 0$ plays a role for the type II error similar to the one $\alpha$ plays for the type I error. Combining (18), (19), and (20) gives an equation for the sample size, whose solution is

$$n = \left(\frac{\sigma \left(\Phi^{-1}(1 - \alpha) - \Phi^{-1}(\epsilon)\right)}{\delta}\right)^2. \tag{21}$$

[17] For example, we are designing an experiment, and trying to determine what sample size is appropriate.

Figure 7: The normal power function (18) with $n$ and $c$ defined from (21) and (19), with $\alpha$, $\delta$, and $\epsilon$ set to a common value. Notice the sample size increase! Axes: parameter $\mu$ (horizontal), power function $\beta(\mu)$ (vertical).
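The three formulas of the normal example, the power function, the critical value for a given size, and the sample size for a given type II error target, fit in a few lines of code. The sketch below uses Python's standard `statistics.NormalDist` for $\Phi$ and $\Phi^{-1}$. The numerical values $\alpha = 0.05$, $\epsilon = 0.1$, $\delta = 0.5$, $\sigma = 1$ are hypothetical and are not the settings of the figures; rounding $n$ up to an integer is also an added assumption, since a sample size must be a whole number.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}


def power(mu: float, c: float, n: int, sigma: float = 1.0) -> float:
    """Power function of the test 'reject when the sample mean exceeds c'."""
    return 1.0 - STD_NORMAL.cdf(sqrt(n) * (c - mu) / sigma)


def critical_value(alpha: float, n: int, sigma: float = 1.0) -> float:
    """Critical value c that gives the test size alpha."""
    return sigma * STD_NORMAL.inv_cdf(1.0 - alpha) / sqrt(n)


def sample_size(alpha: float, eps: float, delta: float, sigma: float = 1.0) -> int:
    """Smallest integer n for which the size-alpha test has power at least 1 - eps at mu = delta."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha) - STD_NORMAL.inv_cdf(eps)
    return ceil((sigma * z / delta) ** 2)


# Hypothetical design targets.
alpha, eps, delta = 0.05, 0.1, 0.5
n = sample_size(alpha, eps, delta)
c = critical_value(alpha, n)

print(f"n = {n}, c = {c:.4f}")
print(f"size  beta(0)     = {power(0.0, c, n):.4f}")
print(f"power beta(delta) = {power(delta, c, n):.4f}")
```

Because $n$ is rounded up, the power at $\delta$ comes out slightly above $1 - \epsilon$, while the size stays equal to $\alpha$ because $c$ is recomputed for the rounded $n$.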

Thus, the test which rejects $H_0$ whenever $\bar{X}_n > c$, where $n$ and $c$ are given by (21) and (19), has size $\alpha$ and, moreover, the type II error probability is at most $\epsilon$ if $\mu \in [\delta, \infty) \subset \Theta_1$. If $\mu \in (0, \delta)$, this probability is, unfortunately, larger. Figure 7 shows the power function for a common value $\alpha = \delta = \epsilon$.

The strategy described in this example is often employed in other cases. Namely, to design a test, we need to specify the rejection region $R = \{X \in \Omega : s(X) > c\}$ by choosing a test statistic $s$ and its critical value $c$. Choosing the test statistic is an art, but often reasonable candidates are rather obvious [18]. After choosing $s$, the rejection region is parametrized by $c$. We choose $c$ to get the desired size $\alpha$. To this end, we need to solve [19] for $c$ the following equation:

$$\sup_{\theta \in \Theta_0} P(X \in R_c \mid \theta) = \alpha, \tag{22}$$

where $X = (X_1, \ldots, X_n)$ and $R_c$ is the rejection region with critical value $c$. If we can't control $n$ [20], then we are done. If we can control $n$, then we may try to reduce the type II error probability by exploiting the fact that the power function depends on $n$.

[18] Sometimes after long analysis :)
[19] Analytically if you are lucky, but most likely numerically.
[20] The data comes from an observational study, or experiments are too expensive.

Further Reading

The most complete book on testing is Lehmann (1986), Testing Statistical Hypotheses, Wiley.

Next Time

In real applications, finding most powerful tests is a very hard problem which often does not have a solution. So, instead of focusing on the theory of most powerful tests, we will consider several widely used tests that often perform reasonably well.