Chapter Five. Hypothesis Testing: Concepts

The Purpose of Hypothesis Testing
An Initial Look at Hypothesis Testing
Formal Hypothesis Testing
    Introduction
    Null and Alternate Hypotheses
    Procedure for Formal Hypothesis Tests
    Examples
Errors in Hypothesis Testing
    Introduction
    False Positive Errors
    False Negative Errors
    Summary: Choosing the Confidence Level
Chapter Checkpoint

The Purpose of Hypothesis Testing

The purpose of obtaining measurements of a chemical system is usually to draw some conclusions about the properties of the system. One of the simplest uses of statistics, one that has largely concerned us to this point, is to obtain an estimate of the system properties through the use of confidence intervals. This is an aspect of statistical estimation theory. Now, however, we turn our attention to decision theory, where we learn how we can use measurement statistics to draw general conclusions about chemical systems. The following are examples of situations where we want to draw some kind of conclusion based on measurements:

- Two reactants are mixed, and the concentrations of the products are monitored as a function of time in order to determine the rate constant, k, of the reaction. You want to compare the result of your measurement with a value calculated from theory.

- You have just come up with a new synthetic procedure for a certain commercial product that you believe increases the yield over the currently accepted method. You measure the yield by both methods, and you find that your method gives a 65% yield while the older method gave a 60% yield. You must show that your method is actually superior to the older method, and that the increase in yield is not due to the uncertainty in the measured values.

For a more detailed example, consider the following situation. Let's say we obtain the following measurements of the pH of a particular solution:

    pH measurements: 9.5, 9.9, 9.8

Now we wish to know whether it is possible to state, with confidence, that the pH of the solution is less than 10. If we can assume that the measurements are unbiased, we can restate this question in a form that can be evaluated with statistics, namely: is it true that pH < 10?

Assuming no measurement bias, the fact that none of the measured pH values is greater than 10 seems to support the notion that the true pH of the solution is less than ten. However, since a measurement of pH is a random variable, there is always a chance that the actual pH is indeed greater than ten, and that the three measurements, by random chance, all happen to be less than 10, just as there is a chance that three coin flips in a row will come up tails, even though there is a fifty-fifty chance of getting heads on any single toss.

Our problem is this: at what point can we say that random variability is an unlikely explanation for the difference between the measured pH values and a fixed value (e.g., a pH of 10)? In other words, when do the measured values differ significantly from the fixed value? The meaning of the word "significantly" must be very clear: a statistically significant difference in the values is a greater difference than could reasonably be explained by random error. This is exactly the type of question that hypothesis testing answers. Hypothesis tests are sometimes called significance tests, since they detect significant differences in numbers, differences that are unlikely to be due to random chance.
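To make the pH question concrete before the formal machinery is introduced, here is a minimal sketch of the calculation, assuming Python with NumPy and SciPy are available (the text itself works in a spreadsheet/worksheet environment); the variable names are illustrative only.

    import numpy as np
    from scipy import stats

    ph = np.array([9.5, 9.9, 9.8])              # the three pH measurements
    mean = ph.mean()                             # sample mean
    se = ph.std(ddof=1) / np.sqrt(len(ph))       # standard error of the mean

    # Studentized distance of the mean from the fixed value 10
    t_obs = (mean - 10.0) / se

    # One-tailed P-value for the claim "true pH < 10":
    # probability of a mean this far below 10 if the true pH were exactly 10
    p_one_tailed = stats.t.cdf(t_obs, df=len(ph) - 1)
    print(t_obs, p_one_tailed)

A small P-value here would mean that random error alone is an unlikely explanation for measurements this far below 10; the rest of the chapter develops exactly this logic.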

An Initial Look at Hypothesis Testing

Let's use an example to help us see how we might derive conclusions using random variables (i.e., measurements).

Example 5.1

A cigarette manufacturer states that the nicotine level of its cigarettes is 14 mg per cigarette. You wish to test this claim. You collect a random sample of 5 cigarettes and test for nicotine content. The measured nicotine levels (in mg) of the cigarettes in the sample are 14.05, 14.33, 16.36, 18.55, … Do these measurements indicate a nicotine level different than that claimed by the manufacturer?

Basically, what we would like to do is test the following statement:

    Hypothesis: The true nicotine level of the cigarettes is different from that claimed (14 mg) by the manufacturer.

Let's calculate the mean of the measured nicotine levels:

    x_bar = mean(x) = 15.61 mg

So the mean measured level of nicotine in the five cigarettes was 15.61 mg/cigarette. Obviously, this value is somewhat larger than the nicotine level stated by the manufacturer. The question is, however: is the difference between the nicotine levels significant? Do we have any justification for challenging the nicotine level claimed by the manufacturer?

In order to answer this question, we need more information than simply the measurement average: we must also make use of the observed variability of the five measurements to construct a confidence interval.

    s_x = stdev(x) = 1.87 mg                 sample standard deviation
    se = s_x / sqrt(5) = 0.84 mg             standard error of the mean value
    t = 2.776                                critical t-value for 4 df at the 5% level
    width = t * se = 2.32 mg
    x_lower = x_bar - t * se = 13.29 mg      lower boundary of CI
    x_upper = x_bar + t * se = 17.93 mg      upper boundary of CI

In this instance, the 95% confidence interval is 15.61 ± 2.32 mg/cigarette. Recall exactly what this interval represents: assuming no bias, this range of values (13.29 to 17.93 mg) contains the true amount of nicotine in the cigarettes analyzed, with 95% probability.
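For reference, the same confidence-interval calculation can be written as a short function; this is a sketch assuming NumPy and SciPy (the worksheet above is the text's own calculation), with `measurements` standing in for any set of replicate values.

    import numpy as np
    from scipy import stats

    def t_confidence_interval(measurements, confidence=0.95):
        """Two-sided t confidence interval for the population mean."""
        x = np.asarray(measurements, dtype=float)
        n = len(x)
        mean = x.mean()
        se = x.std(ddof=1) / np.sqrt(n)              # standard error of the mean
        t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
        half_width = t_crit * se
        return mean - half_width, mean + half_width

    # For five replicate measurements, t_crit = t(4, 0.025) = 2.776,
    # so the interval is mean +/- 2.776 * se, as in the worksheet above.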

Since the confidence interval calculated from the measurements on five cigarettes includes 14 mg, we cannot support the original hypothesis that the manufacturer's claimed nicotine level is incorrect. In other words, the difference between the measurement mean of 15.61 mg and the manufacturer's stated level of 14 mg is not significant.

Note that we must be very careful in how we phrase our conclusion. Even though the confidence interval includes the value 14 mg, we have not proven that the manufacturer's claim is true. In other words:

- We do not prove that [nicotine] = 14 mg/cigarette. We can only state that there is a 95% probability that the true nicotine content is somewhere between 13.29 and 17.93 mg; our best estimate of the nicotine content is 15.61 mg.

- We cannot prove (with 95% probability) that [nicotine] ≠ 14 mg/cigarette, since the 95% confidence interval contains this value.

We have just had our first brush with hypothesis testing, where we use data (containing random error) from an experiment to test an assertion. This is obviously an important area of statistics, and one that we will discuss in detail.

Formal Hypothesis Testing

Introduction

In the last section, a confidence interval was constructed in order to test a specific hypothesis. In scientific endeavors, there are a wide variety of hypotheses that may need to be tested using the results of one or more experiments. In this section, we will formalize the procedure to be used in hypothesis testing. Although the procedure may seem a little rigid, it can be adapted to almost any situation. The price for the general applicability of the procedure is the use of somewhat abstract language and concepts.

Null and Alternate Hypotheses

All hypothesis tests actually involve at least two statements, called the null hypothesis (H0) and the alternate (or working) hypothesis (H1). A statistical hypothesis is an assertion or conjecture concerning one or more population parameters. Basically, this step is a translation from words to population parameters.

The null hypothesis, H0, will generally involve an equality and one or more population parameters. In our nicotine example, the null hypothesis would be:

    null hypothesis    H0: µx = 14 mg/cigarette

In other words, we accept as the null hypothesis the manufacturer's claim that each cigarette contains 14 mg of nicotine. If the null hypothesis is true, and if there is no bias in the measurements, then the population mean µx of all measurements will be 14 mg. As you can see, the null hypothesis involves a population parameter (µx, the population mean of the measurements) and a statement of equality. As we will stress time and again, the null hypothesis cannot be proven. It is assumed as fact unless the data prove otherwise.

The alternate hypothesis, H1, will be a statement involving the same population parameters, in such a way that H1 and H0 cannot both be true. Usually the alternate hypothesis involves one of the following relational operators: ≠, <, or >. For our example,

    alternate hypothesis    H1: µx ≠ 14 mg/cigarette    (two-tailed test)

Alternate hypotheses such as this one, with a not-equals (≠) relationship, result in two-tailed tests. This statement claims that the measurement population mean is not 14 mg; if we assume no measurement bias, this hypothesis disputes the manufacturer's claim of nicotine level.

The form of both hypotheses is very important, particularly that of the alternate hypothesis, because it is the alternate hypothesis that we test in the hypothesis test procedure. Suppose we actually suspect that the manufacturer is underestimating the nicotine level in the cigarettes; in this case, we would use the following alternate hypothesis:

    a different alternate hypothesis    H1: µx > 14 mg/cigarette    (one-tailed test)

or, H1: the true nicotine content is greater than 14 mg/cigarette.

This form of H1 would result in a slightly different hypothesis test. Alternate hypotheses such as this one, with a greater-than (>) or less-than (<) relationship, result in one-tailed tests.

In the hypothesis testing procedure, we assume that the null hypothesis is true, and it is not tested. The goal of the procedure is to test the assertion embodied by the alternate hypothesis, H1. If H1 is proven to be true, then obviously H0 will be false. This format is exactly the same as that of the US criminal legal system, as represented in the famous statement "innocent until proven guilty." In statistical hypothesis testing, H0 is assumed to be true unless H1 can be proven to be true with reasonable certainty.

Procedure for Formal Hypothesis Tests

For easy reference, here is a list of the steps in hypothesis testing; each step will be discussed in detail.

1. Form the null hypothesis, H0, and the alternate hypothesis, H1, in terms of statistical population parameters.
2. Choose the desired confidence level (this is also sometimes expressed as a significance level).
3. Choose a test statistic and calculate its value.
4. Calculate the critical values; alternately, determine the P-value of the test statistic.
5. State the conclusion clearly, avoiding statistical jargon.

Step 1: State the null hypothesis (H0) and the alternate hypothesis (H1)

We have described the null and alternate hypotheses. Formulating these is the most difficult but most crucial part of the test procedure. Remember that we begin with an assumption that H0 is true, and that we are trying to test H1. We may be interested in either proving or disproving H1. The following table gives the null hypotheses for three common statistical tests. Note that the null hypothesis always involves population parameters, and (in these cases) is expressed as an equality.

    Situation: comparison of a random variable, x, and a fixed value, k
    Null hypothesis: H0: µx = k
    Question answered: Is there a significant difference between the mean of some measurements and some fixed value?

    Situation: comparison of the means of two variables, x and y
    Null hypothesis: H0: µx = µy
    Question answered: Is there a significant difference between the means of two sets of measurements?

    Situation: comparison of the variances of two variables, x and y
    Null hypothesis: H0: σx² = σy²
    Question answered: Is there a significant difference between the variances of two sets of measurements?

The alternate hypotheses, H1, in these cases may involve an inequality (≠) or a relational operator (< or >). As discussed previously, the form of H1 determines whether we use a one-tailed or a two-tailed test.

Step 2: Choose the desired level of confidence/significance

Remember that any confidence interval has an associated confidence level. The purpose of a confidence interval is to bracket the possible values of a population parameter such as µx. Random variables always add a little spice (i.e., uncertainty) to any conclusion; there is always a chance that we are wrong, since random variables are, well, random. So the confidence level is needed to state the probability that the population parameter is truly contained within our confidence interval. It is a measure of how much we trust the interval, how confident we are in our result.

Since confidence intervals play a crucial role in hypothesis testing, it is not surprising that we generally choose a confidence level when testing assertions using the results of experiments, which are almost always random variables. The meaning of the confidence level in hypothesis testing is slightly different than in confidence intervals, however.

Consider our example. We have two competing hypotheses: H0: µx = 14 mg and H1: µx ≠ 14 mg. We are testing the alternate hypothesis, H1, and there are two possible outcomes:

1. We succeed in proving that H1 is true, in which case H0 is known to be false.
2. We fail to prove that H1 is true. [Remember: we cannot prove that H0 is true.]

The confidence level in hypothesis testing measures our certainty when we succeed in proving H1. It is the probability that the conclusion that H1 is true and H0 is false is correct. Let's assume that we want to test at the 95% level for our example. That means that, if our test proves that the nicotine level is not 14 mg, there is a 95% probability that our data have led us to the proper conclusion.

You might wonder: why wouldn't I want to be very certain in my conclusion? In other words, shouldn't I always choose a high confidence level in hypothesis testing (at least 95%, and maybe 99% or even 99.9%)?

We will defer a discussion of the appropriate confidence level in testing to later in the chapter. But for now, ask yourself this question: why don't you similarly always choose a high confidence level in constructing confidence intervals? A 95% confidence interval is commonly given; why not always use 99%, or 99.9%? What effect would that have on the confidence interval? There are both advantages and disadvantages to choosing high confidence levels, as we will discover.

In statistics, the term significance level is probably more common than confidence level in hypothesis testing. The significance level (SL) is directly related to the confidence level (CL): SL = 100% - CL. Thus, instead of testing at the 95% confidence level, we may instead test at the 5% significance level and arrive at the same conclusions. Although we will tend to use the term confidence level in this text, you should be familiar with both terms.

Step 3: Choose a test statistic and calculate its value

The next step in hypothesis testing is to choose a statistic (the test statistic) appropriate for testing the hypotheses. The test statistic (like any statistic) is a value calculated in some manner from the data. Since the data presumably contain random error, the test statistic will likewise be a random variable. There are two requirements for a test statistic:

1. Its probability distribution must be known; preferably, tables of critical values exist for the statistic.
2. The test statistic should result in a reasonably good (or "efficient") hypothesis test.

What factors might make one test better than another? Let's come back to that point in a little bit. In example 5.1, the null and alternate hypotheses both deal with the population mean µx of the measurements, so it would seem that we could use the sample mean of the measurements as the basis for the test statistic. In constructing a confidence interval for µx, the t-distribution is used (when σx is not known). This suggests that the following test statistic, T, could be used in this hypothesis test:

    possible test statistic    T = (x̄ - 14) / s(x̄)

The test statistic is the studentized sample mean. It has a t-distribution; if H0 is true, then µT = 0. The sample mean is not the only possible basis for the test statistic: we could instead use the sample median, or some other form of weighted average. It turns out that for normally distributed data, the studentized sample mean is the best test statistic to use for hypothesis tests such as example 5.1.

Let's calculate the observed value of the test statistic for the five measurements in example 5.1:

    T_obs = (x_bar - 14 mg) / se = 1.92    the "studentized" mean: the number of std devs of the mean from 14 mg

In this equation, se is the standard error of the sample mean, x_bar. According to the observed test statistic, the mean of the measurements, 15.61 mg/cigarette, is 1.92 standard deviations from the manufacturer's claimed value of 14 mg/cigarette.
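As a cross-check on the worksheet arithmetic, the studentized mean is the same statistic returned by SciPy's one-sample t-test; a sketch, assuming SciPy, using hypothetical replicate values rather than the example's data:

    import numpy as np
    from scipy import stats

    data = np.array([14.1, 14.3, 16.4, 18.6, 15.2])   # hypothetical replicates, not example 5.1's values
    claimed = 14.0                                     # manufacturer's claimed level, mg/cigarette

    se = data.std(ddof=1) / np.sqrt(len(data))
    t_obs = (data.mean() - claimed) / se               # studentized sample mean

    # ttest_1samp computes the same statistic, along with a two-tailed P-value
    t_check, p_two_tailed = stats.ttest_1samp(data, popmean=claimed)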

Step 4: Calculate the critical value(s) or the P-value

It is important to keep in mind that the null hypothesis, H0, is "innocent until proven guilty." The probability distribution of the test statistic, T, assuming that the null hypothesis is true, is called the null distribution. The next step in hypothesis testing is to calculate the critical value(s) of the null distribution. For two-tailed tests, such as the one we must use for example 5.1, there are two critical values. (One-tailed tests have only a single critical value.) The null distribution of T is a t-distribution with four degrees of freedom and a mean of zero. Recalling that we chose 95% as our confidence level, the critical values are

    T_crit = ± t(4, 0.025) = ±2.776

Figure 5.1: Decision criteria for the hypothesis test for example 5.1. If the observed test statistic is above the upper critical value or below the lower critical value, then we accept the alternate hypothesis, H1, and reject the null hypothesis, H0.

The critical values are the boundaries between two decision-making regions:

- the acceptance region, between the two critical values. If the test statistic assumes a value in this region, then the null hypothesis, H0, is accepted. We cannot prove the alternate hypothesis, H1, with the desired confidence level.

- the rejection region, where T_obs > T_upper or T_obs < T_lower. If the test statistic is in this region, then H0 is rejected and H1 is accepted. We have proven that H1 is true at the desired confidence level.
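The critical values themselves come straight from the t-distribution; a minimal sketch, assuming SciPy:

    from scipy import stats

    confidence = 0.95
    df = 4                                   # degrees of freedom for five measurements
    alpha = 1 - confidence

    # Two-tailed test: put alpha/2 in each tail of the null distribution
    t_upper = stats.t.ppf(1 - alpha / 2, df)     # approximately +2.776
    t_lower = -t_upper                           # approximately -2.776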

By inspecting the null distribution, we can see how the critical values are chosen, and we can understand the role of the confidence level in hypothesis testing. Figure 5.1 shows the situation for a two-tailed test at the 95% confidence level. We choose the critical values so that 95% of the area under the null distribution is between them. What this means is that, if the null hypothesis is true, there is a 95% probability that the observed test statistic will fall within the acceptance region.

It is not strictly necessary to calculate the critical values. An alternative approach makes use of the concept of the P-value, which has been mentioned before. The P-value can be interpreted in terms of the null distribution; in particular, for a two-tailed test, the P-value is

    two-tailed P-value    P_obs = P(T > |T_obs|) + P(T < -|T_obs|) = 2 · P(T > |T_obs|)

Consider example 5.1: the mean of five measurements of nicotine content was 15.61 mg/cigarette, which is 1.92 standard deviations from the manufacturer's claimed value. Most statistical programs and spreadsheets will calculate the P-value; for example 5.1, the two-tailed P-value is P_obs = 0.1266. In other words, if the null hypothesis were true, there is a 12.66% probability that we would obtain a sample mean that is farther than 1.92 standard deviations from 14 mg/cigarette (in either direction).

The P-value is used instead of (or in addition to) critical values. It indicates the weight of the evidence in favor of the alternate hypothesis: the smaller the P-value, the less likely it is that random variability can account for the observed data. To tie the P-value approach to the critical-region approach, consider this: the P-value tells us the maximum confidence level that we can adopt and still prove the alternate hypothesis. We calculate this value by

    maximum confidence level:    CL = 100% · (1 - P_obs)

where CL is the confidence level as a percentage. For example 5.1, if we choose a confidence level of 87.34% or less, then we can prove that the alternate hypothesis is true. Of course, a smaller confidence level means that we are less confident of our conclusion, so we want a P-value as small as possible.

We may interpret the P-value more directly in terms of the significance level: the P-value is the largest significance level at which we may accept the alternate hypothesis. Thus, in this example, we can prove H1 at the 12.66% significance level, at best. Remember: a smaller significance level means we are more certain of this conclusion.
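The two-tailed P-value defined above can be evaluated directly from the t-distribution; a sketch, assuming SciPy, using the observed statistic quoted for example 5.1:

    from scipy import stats

    t_obs = 1.9243     # observed studentized mean from example 5.1
    df = 4

    # Two-tailed P-value: probability of a statistic at least this far from zero, in either direction
    p_two_tailed = 2 * stats.t.sf(abs(t_obs), df)     # roughly 0.127

    # Largest confidence level at which H1 could still be accepted
    max_confidence = 100 * (1 - p_two_tailed)         # roughly 87%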

Aside: calculating P-values in Excel

When the null distribution is a t-distribution, the P-value is calculated in Excel with the TDIST() function:

    calculating P-values in Excel    P_obs = TDIST(T_obs, df, tails)

where T_obs is the observed value of the test statistic, df is the degrees of freedom of the t-distribution, and tails is either one or two (for 1- or 2-tailed P-values). For example 5.1, you would enter =TDIST(1.9243, 4, 2) into any cell to obtain the 2-tailed P-value. Other Excel functions would be needed when the null distribution does not follow a t-distribution.

Step 5: State the conclusion

After we decide whether to accept H0 or H1, we must state our conclusion in a manner that is accurate and yet can be understood by anyone who does not have a background in statistics. Essentially, we must translate our conclusions from statistic-ese (e.g., "reject H0," "accept H1") into normal language. We should give both our conclusion and the confidence level, even though the confidence level is most properly understood in a statistical framework. For example 5.1, we accepted H0; we couldn't prove H1. In other words, our conclusion would be:

    We cannot prove with 95% confidence that the nicotine level in the cigarettes is different than 14 mg/cigarette.

This statement sounds like poor English (basically a double negative), but the wording was very carefully chosen. We begin with the assumption that the cigarettes have 14 mg of nicotine, and we fail to prove otherwise. This is similar to a jury returning a verdict of "Not Guilty" in a criminal trial. Notice that the verdict is not that the defendant was innocent, simply that guilt was not proven beyond a reasonable doubt. In hypothesis testing, the level of reasonable doubt is determined when the confidence level is set.

Examples

Let's try another two-tailed test. This test is similar in nature to example 5.1.

Example 5.2

A certain analytical procedure is being tested for the presence of measurement bias. Twenty measurements are made on a solution whose concentration (in µM) has been certified; call the certified value ξx and the sample mean of the measurements x_bar. The RSD of the individual measurements is 5.0%. Is there any evidence of measurement bias?

First let's set up the null and alternate hypotheses:

    H0: µx = ξx    There is no bias in the measurements.
    H1: µx ≠ ξx    Bias exists (two-tailed test).

The standard deviation of the individual measurements is s_x = RSD · x_bar, and the standard error of the mean is std_err = s_x / sqrt(20).

Let's use the studentized mean as the test statistic, and calculate its observed value:

    T_obs = (x_bar - ξx) / std_err    the sample mean is this many standard errors from the true value
    P_obs = 0.3978                    the two-tailed P-value of the observed test statistic

Now we look up the critical values from the t-tables. For 19 degrees of freedom, a 95% confidence level, and a two-tailed test, the critical values are -2.093 and +2.093. Since the observed value of the test statistic is within the acceptance region, we must accept the null hypothesis. Thus, we cannot prove bias in these measurements at the 95% confidence level. Note: from the observed P-value for this example, we see that we can prove H1 with only 60.22% confidence, at best.

Now let's try a one-tailed test.

Example 5.3

It is suspected that a series of six tests of blood alcohol level proves that the alcohol level is above the legal limit of 0.10%. Do these measurements prove legal intoxication with 95% confidence?

As always, the first step is to set up the null and alternate hypotheses. In this case, we should use the following:

    null         H0: µx = 0.10 %    blood alcohol level at the legal limit (assuming no bias)
    alternate    H1: µx > 0.10 %    blood alcohol level above the legal limit

It may be a little difficult to see why the null hypothesis should be that the blood alcohol level is exactly 0.10%. In setting up the hypotheses, it is best always to ask yourself: what is it that I want to test? What are the possible conclusions? The answers to these questions determine the form of the alternate hypothesis; the null hypothesis will follow. For this example, we want to test whether or not the alcohol level is above the legal limit. Remember that the purpose of the statistical test procedure is actually to test the alternate hypothesis, so we would propose as the alternate hypothesis that the alcohol level is too high.

The nature of the testing procedure is such that we either prove or fail to prove this hypothesis; i.e., our conclusion will be either that we can prove that the alcohol level is too high (a "guilty" verdict) or that we cannot prove an excessive alcohol level ("not guilty"). These conclusions are proper for our intentions in this example. Since we propose µx > 0.10% as our alternate hypothesis, the corresponding null hypothesis is µx = 0.10%.

The other thing to notice about the form of H1 in this example is that it results in a one-tailed test. This will affect the critical value (and the P-value, if we calculate it). Let's continue with our testing procedure by calculating the observed test statistic:

    x_bar = mean(x)                       sample mean of the six measurements
    std_err = stdev(x) / sqrt(6)          standard error of the mean
    T_obs = (x_bar - 0.10%) / std_err     studentized measurement mean
    P_obs = 0.00379                       probability of seeing a value larger than T_obs

The P-value is standard output for many statistical programs. In this case, the one-tailed P-value is 0.379%, which means that we could prove H1 at the 99.62% confidence level if we desired; certainly at the 95% level we may reject H0 and accept H1. However, it is difficult to use t-tables to calculate P_obs, so we will confirm this decision using the critical value approach. For a one-tailed test, there is only a single critical value, as shown in the next figure.

Figure 5.2: An example of a one-tailed test (H0: µ = k, H1: µ > k). There is only a single critical value. The top panel shows the null distribution; the critical value is chosen such that the area under the curve to the left of the critical value equals the chosen confidence level (95% for this example). The lower panel shows the decision process: if the observed test statistic is larger than the critical value, T_obs > T_crit, then the null hypothesis is rejected and the alternate hypothesis is proven.

Recall that the null distribution is the probability distribution of the test statistic, T, assuming that H0 is true. As the upper panel shows, we must choose the critical value such that, for the null distribution,

    P(T < T_crit) = CL

where CL is the chosen confidence level. For our example, we have chosen a confidence level of 95%. We can determine the critical value from the t-tables:

    one-tailed critical value    T_crit = t(ν, α) = t(5, 0.05)

where ν is the appropriate number of degrees of freedom, and α is the area in the right tail of the t-distribution. We determine the value of α from the confidence level:

    CL = (1 - α) · 100%

For our example, the t-tables tell us that the critical value is T_crit = 2.015. The observed test statistic calculated above is larger than this critical value, so we reject the null hypothesis and accept the alternate hypothesis. Our conclusion is:

    Assuming no measurement bias, the data show that the blood alcohol level is above the legal limit (at the 95% confidence level).
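The whole one-tailed procedure of example 5.3 fits in a few lines; a sketch, assuming SciPy, with `measurements` standing in for the six blood-alcohol readings (the values shown are hypothetical):

    import numpy as np
    from scipy import stats

    # Hypothetical values standing in for the six blood-alcohol measurements (%)
    measurements = np.array([0.104, 0.108, 0.102, 0.110, 0.105, 0.107])
    legal_limit = 0.10

    n = len(measurements)
    se = measurements.std(ddof=1) / np.sqrt(n)
    t_obs = (measurements.mean() - legal_limit) / se

    # One-tailed test at the 95% confidence level: single critical value t(5, 0.05)
    t_crit = stats.t.ppf(0.95, df=n - 1)          # about 2.015
    p_one_tailed = stats.t.sf(t_obs, df=n - 1)

    accept_h1 = t_obs > t_crit    # True means the level is proven to exceed the limit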

Errors in Hypothesis Testing

Introduction

Since they involve random variables, there is always an element of uncertainty in hypothesis tests. Specifically, there is always a chance that the conclusion of a test is in error. This uncertainty is the reason that you must specify a confidence level when you perform statistical tests. Choosing the confidence level allows you to determine the degree of uncertainty in your test: basically, you can control the likelihood that your conclusion is correct. As we will see, the confidence level also indirectly determines the ability of the statistical test to detect small differences and label them as significant.

How can the conclusion from a hypothesis test be in error? For tests with a single null hypothesis, H0, and a single alternate hypothesis, H1, the following table shows all the possibilities:

    decision                          H1 is not true    H1 is true
    accept H0 ("negative" result)     correct           false negative
    accept H1 ("positive" result)     false positive    correct

Let's illustrate with an example. Let's say someone undergoes a pregnancy test. Now the reality of the matter is that the person either is pregnant or she isn't. The test will either decide in favor of pregnancy (called a positive test result) or will decide that the subject is not pregnant (a negative result). We can draw an analogy to statistical hypothesis tests. We begin with the assumption (the null hypothesis) that the subject is not pregnant. The alternate hypothesis, the one we want to test, is that the subject is pregnant. A conclusion in favor of pregnancy (H1 is accepted) is considered a positive test result; however, if the subject actually is not pregnant (H0 is actually true), then our conclusion is in error. This situation, an incorrect acceptance of H1, is called a false positive. On the other hand, if the conclusion of the test is that the subject is not pregnant (H0 is accepted), and this conclusion is in error (H1 is actually true), then the test gives a false negative.

In the remainder of this section, we will describe how to calculate the probability that the result of a hypothesis test is in error (either a false positive or a false negative).

False Positive Errors

All of the hypothesis tests presented so far in this chapter have been of the following type: the null hypothesis is

    H0: µx = k    the true measurement mean is some fixed value, k

while the alternate hypothesis is one of the following:

    H1: µx ≠ k    the true measurement mean is not some fixed value, k (a two-tailed test)
    H1: µx > k    the true measurement mean is larger than some fixed value, k (a one-tailed test)
    H1: µx < k    the true measurement mean is smaller than some fixed value, k (a one-tailed test)

The decision criterion of the test is the following: if the observed test statistic, T_obs, is outside of the interval defined by the critical value(s), then we reject H0 and accept H1. A false positive occurs when T_obs is outside the H0 acceptance region when, in fact, H0 is true. The probability of a false positive is controlled by choosing the appropriate confidence level in a statistical test. To be exact,

    CL = 1 - α

where CL is the chosen confidence level and α is the probability of a false positive. In other words, when testing at the 90% confidence level, there is a 10% chance of falsely accepting H1.

Let's imagine that we are comparing a mean value, µx, to a fixed value k. Unknown to us, the null hypothesis is actually true. The following figure shows the null distribution of the test statistic, i.e., the probability distribution of the test statistic when the null hypothesis is actually true.

Figure 5.3: Choosing the critical values for a two-tailed test. If T_obs occurs between the critical values, then the null hypothesis is accepted; if not, then H1 is accepted. The shaded area (α/2 in each tail) is the probability of a false positive: it is the probability that T_obs does not fall between the critical values even though H0 is true.

Now we can see how the critical values are chosen for two-tailed tests: each tail must contain an area of α/2, so that the total probability of a false positive is α, the desired value.
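The statement that α is the false-positive probability can be checked by simulation; a sketch, assuming NumPy and SciPy, that repeatedly generates data for which H0 is true and counts how often the two-tailed test rejects it:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, trials, confidence = 5, 20_000, 0.95
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)

    false_positives = 0
    for _ in range(trials):
        # Data generated with H0 true: population mean exactly equal to the fixed value
        x = rng.normal(loc=14.0, scale=1.0, size=n)
        t_obs = (x.mean() - 14.0) / (x.std(ddof=1) / np.sqrt(n))
        if abs(t_obs) > t_crit:
            false_positives += 1

    print(false_positives / trials)    # should come out close to alpha = 0.05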

Now let's consider the probability of a false positive error for a one-tailed test. In such a test, there is only a single critical value. Let's imagine that we are testing for values that are greater than a fixed value, k; in other words, our alternate hypothesis is H1: µx > k. The next figure shows the null distribution, together with the critical value and the probability of a false positive.

Figure 5.4: Choosing the critical value for a one-tailed test. If T_obs is less than the critical value, then the null hypothesis is accepted; if not, then H1 is accepted. The shaded area (α, in the single tail) is the probability of a false positive. Note that the critical value was chosen such that the probability of a false positive, α, is the same as in figure 5.3.

To summarize, we set the probability of false positive error when we choose the confidence level. We must then choose the critical values according to our desired value of α. This means that, for a two-tailed test, the area in each tail of the null distribution must be α/2; for a one-tailed test, the area in the single tail (since there is only one critical value) will be α.

False Negative Errors

A false negative occurs when we incorrectly accept H0 when we should actually reject H0 and accept H1. In other words, the alternate hypothesis is actually true, but the test statistic still falls within the acceptance region (so that the null hypothesis is accepted). The next figure shows the probability distribution of the test statistic when the alternate hypothesis is true.

Figure 5.5: The probability distribution (not the null distribution) of the test statistic in a situation where the alternate hypothesis is actually true (in this case, µx > k). If the test statistic nevertheless falls below the critical value shown in the figure, the null hypothesis will be accepted: this is a false negative error. The shaded area shows the probability, β, of this occurring.

As we see in the figure, even when the alternate hypothesis is true, there is some chance (β) that the test statistic will be less than the critical value. This chance is the probability of a false negative error, β. In order to calculate β, we must know the value of the population parameter, µx. We can always calculate the value of β for some hypothetical situation in which we postulate a value for the population parameter. This type of exercise gives us some idea of how sensitive our testing procedure is in situations where the alternate hypothesis is actually true. The next example illustrates this point.

Example 5.4

You wish to develop a procedure to test for bias in the analysis of fluoride in water. During the analytical procedure, three independent measurements are obtained on a sample and averaged to determine the fluoride concentration. The standard solution to be used in the test is known to contain 0.45 w/w% F, and the RSD of the entire analytical procedure is known to be 0.10 (i.e., 10% RSD for the average of the three measurements).

(a) What are the critical values that can be used to determine if there is bias in a measurement?
(b) What values of the population measurement mean, µx, would result in a 90% probability that bias will be detected? In other words, what bias would result in acceptance (with 90% probability) of the alternate hypothesis in part (a)?

The true fluoride concentration, ξx, of the standard solution is 0.45 w/w%. The analytical procedure in this situation consists of obtaining three measurements and averaging them to obtain a point estimate of the fluoride concentration. We can calculate the standard error of the mean of three measurements:

    ξx = 0.45 w/w%
    RSD = 0.10
    σ_overall = RSD · ξx = 0.045 w/w%    the true standard error (a population parameter) is known

The null and alternate hypotheses will be

    H0: µx = ξx    there is no measurement bias
    H1: µx ≠ ξx    measurement bias exists (two-tailed test)

One thing is different about this hypothesis test, compared to all the others we have done: the true (i.e., population) standard deviation of the mean, σ(x̄3), is known. Thus, the test statistic will be the standardized difference between the mean of three measurements and the true concentration of the solution:

    test statistic    T = (x̄3 - ξx) / σ(x̄3)

where x̄3 is the mean of 3 measurements. Assuming that x̄3 is normally distributed, T will follow a normal distribution with a standard deviation of one. The null distribution, which assumes that µx = ξx, will follow a z-distribution (i.e., a standard normal distribution).

Let's set our confidence level at 99%; in other words, we are limiting the probability of false positives to 1%: α = 0.01. Now we can find the critical values. From the z-tables, we see that z(0.005) ≈ 2.576 (you should verify this; the actual value is 2.5758, as reported by Excel). Our decision rules for this hypothesis test are:

- if -2.576 < T_obs < +2.576, then accept H0. We cannot prove measurement bias with 99% confidence.
- if T_obs < -2.576 or T_obs > +2.576, then reject H0 and accept H1. We can prove bias with 99% confidence.

In this instance, it is useful to note that there is an equivalent way of stating these decision rules: if the observed measurement mean, x̄3, is more than 2.576 standard errors from the true concentration, ξx, then we have evidence of bias.

    crit_lower = ξx - z_crit · σ_overall = 0.334 w/w%
    crit_upper = ξx + z_crit · σ_overall = 0.566 w/w%

Alternate decision rules:

- if 0.334 w/w% < x̄3 < 0.566 w/w%, then we must accept H0
- if x̄3 < 0.334 w/w% or x̄3 > 0.566 w/w%, then we reject H0 and accept H1 at the 99% confidence level

You should realize that these rules are not different than the first ones; they would result in exactly the same conclusion for a given set of data. They just give another way of looking at the hypothesis test process.

Now let's look at part (b). We want to find the measurement population mean, µx, that would result in a 90% chance that measurement bias would be detected. Let's imagine that there is actually a certain amount of positive bias in the measurements. The probability that the bias will actually be detected is the area under the probability distribution curve that lies above the upper critical value. In other words, if we want to find the minimum amount of positive bias that will be detected with 90% probability, we need to find the measurement mean, µx, that satisfies

    P(x̄3 > 0.566 w/w%) = 0.90

This situation is shown in the following figure.

Figure 5.6: The critical values associated with the decision rules for two-tailed bias detection at the 99% confidence level are represented by dashed vertical lines on the axis of the measurement mean (w/w%). The probability distribution describes the mean of three positively biased measurements and results in β = 0.10; in other words, for measurements described by this distribution, there is a 10% chance of a false negative result when testing for bias at the 99% confidence level.

From the z-tables, we know that z(0.90) = 1.2816, which leaves a right-tailed area of 0.10. We must solve for µx in the following expression:

    (x_crit - µx) / σ(x̄3) = -z(0.90)

where x_crit is the upper critical value for the testing procedure, and σ(x̄3) is the standard error of the mean of three measurements. Solving for µx gives

    µx = x_crit + z(0.90) · σ(x̄3)

This is the mean of the probability distribution shown in the figure. Substituting 0.566 w/w% for the critical value and a standard error of 0.045 w/w% gives µx = 0.624 w/w%. This corresponds to a bias, γx, of

    γx = µx - ξx = 0.174 w/w%

If you repeat this procedure to find the negative bias that gives β = 0.10, you will find that a bias of γx = -0.174 w/w% gives the desired false negative probability. In other words, our calculations tell us that when testing for bias at the 99% confidence level under these conditions, we have a 90% chance of detecting a bias of 0.174 w/w%. This is useful information.
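The arithmetic of example 5.4 is easy to reproduce; a sketch, assuming SciPy, using the values stated in the example:

    from scipy import stats

    true_conc = 0.45                       # certified fluoride level, w/w%
    sigma_mean = 0.10 * true_conc          # standard error of the mean of three measurements

    alpha = 0.01                           # 99% confidence level, two-tailed
    z_crit = stats.norm.ppf(1 - alpha / 2)          # about 2.576

    crit_lower = true_conc - z_crit * sigma_mean    # about 0.334
    crit_upper = true_conc + z_crit * sigma_mean    # about 0.566

    # Smallest positive bias detected with 90% probability (beta = 0.10)
    z_power = stats.norm.ppf(0.90)                  # about 1.2816
    mu_detectable = crit_upper + z_power * sigma_mean
    bias = mu_detectable - true_conc                # roughly 0.17 w/w%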

If, for example, the sensitivity of our hypothesis test for bias detection is unacceptable, then we have two options: lower our confidence level from 99% (which would decrease our critical values) or average more measurements to decrease our standard error. We could also try to improve the precision of our method, so that the standard deviation of the individual measurements is smaller.

Summary: Choosing the Confidence Level

Choosing the confidence level directly determines the critical values and the value of α, the probability of a false positive error. Let's consider a two-tailed test,

    H0: µx = k
    H1: µx ≠ k

for which there are two critical values on the axis of the test statistic. Choosing a larger confidence level will cause the critical values to move further apart. True, this means that there is less chance of a false positive error; however, the power of the test to detect small differences between µx and k is decreased. In other words, there is a greater chance of a false negative error (i.e., β has increased). Thus there is always a compromise to consider in choosing the confidence level; values of 95% and 99% are very common. A numerical illustration of this trade-off is sketched after the examples below. The value chosen may depend on the potential consequences of errors. Consider the following situations:

- In employee drug testing, no employer wants to deal with false accusations. In such a situation, a high confidence level (99% or even higher) might be appropriate, because the consequences of a false positive (wrongly accusing an employee of taking drugs) are perceived to be more severe than missing the borderline cases.

- In screening patients for HIV, the consequences of a false negative (incorrectly concluding that the patient is not infected) are very severe. In this case, the confidence level might be set relatively low. To be sure, there will be an increase in false positives, but a separate, independent test can be performed on these patients.
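To see the compromise numerically, the sketch below (assuming SciPy, and using the standard error from example 5.4 with a hypothetical fixed bias) computes the false-negative probability β at several confidence levels; β grows as the confidence level is raised.

    from scipy import stats

    sigma_mean = 0.045     # standard error of the mean from example 5.4, w/w%
    bias = 0.10            # hypothetical fixed positive bias, w/w%

    for confidence in (0.90, 0.95, 0.99):
        z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
        # False negative: the biased mean still lands below the upper critical value
        # (the tiny probability of falling below the lower critical value is neglected)
        beta = stats.norm.cdf(z_crit - bias / sigma_mean)
        print(confidence, round(beta, 3))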

Chapter Checkpoint

The following terms/concepts were introduced in this chapter:

    acceptance region
    alternate hypothesis
    critical value
    false positive
    false negative
    hypothesis test
    null hypothesis
    null distribution
    one-tailed test
    P-value
    rejection region
    significance level
    significance test
    statistical hypothesis
    statistical significance
    test statistic
    two-tailed test

In addition to being able to understand and use these terms, after mastering this chapter you should be able to:

- use formal hypothesis testing procedures to determine if there is a significant difference between a normally distributed random variable and a fixed value, using either a one- or two-tailed test
- interpret P-values from a hypothesis test
- explain the trade-offs in choosing a confidence level


More information

HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1. used confidence intervals to answer questions such as...

HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1. used confidence intervals to answer questions such as... HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1 PREVIOUSLY used confidence intervals to answer questions such as... You know that 0.25% of women have red/green color blindness. You conduct a study of men

More information

6: Introduction to Hypothesis Testing

6: Introduction to Hypothesis Testing 6: Introduction to Hypothesis Testing Significance testing is used to help make a judgment about a claim by addressing the question, Can the observed difference be attributed to chance? We break up significance

More information

Estimation and Confidence Intervals

Estimation and Confidence Intervals Estimation and Confidence Intervals Fall 2001 Professor Paul Glasserman B6014: Managerial Statistics 403 Uris Hall Properties of Point Estimates 1 We have already encountered two point estimators: th e

More information

1 Hypothesis Testing. H 0 : population parameter = hypothesized value:

1 Hypothesis Testing. H 0 : population parameter = hypothesized value: 1 Hypothesis Testing In Statistics, a hypothesis proposes a model for the world. Then we look at the data. If the data are consistent with that model, we have no reason to disbelieve the hypothesis. Data

More information

Odds ratio, Odds ratio test for independence, chi-squared statistic.

Odds ratio, Odds ratio test for independence, chi-squared statistic. Odds ratio, Odds ratio test for independence, chi-squared statistic. Announcements: Assignment 5 is live on webpage. Due Wed Aug 1 at 4:30pm. (9 days, 1 hour, 58.5 minutes ) Final exam is Aug 9. Review

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

DDBA 8438: Introduction to Hypothesis Testing Video Podcast Transcript

DDBA 8438: Introduction to Hypothesis Testing Video Podcast Transcript DDBA 8438: Introduction to Hypothesis Testing Video Podcast Transcript JENNIFER ANN MORROW: Welcome to "Introduction to Hypothesis Testing." My name is Dr. Jennifer Ann Morrow. In today's demonstration,

More information

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES SCHOOL OF HEALTH AND HUMAN SCIENCES Using SPSS Topics addressed today: 1. Differences between groups 2. Graphing Use the s4data.sav file for the first part of this session. DON T FORGET TO RECODE YOUR

More information

Testing Hypotheses About Proportions

Testing Hypotheses About Proportions Chapter 11 Testing Hypotheses About Proportions Hypothesis testing method: uses data from a sample to judge whether or not a statement about a population may be true. Steps in Any Hypothesis Test 1. Determine

More information

II. DISTRIBUTIONS distribution normal distribution. standard scores

II. DISTRIBUTIONS distribution normal distribution. standard scores Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

More information

The Wilcoxon Rank-Sum Test

The Wilcoxon Rank-Sum Test 1 The Wilcoxon Rank-Sum Test The Wilcoxon rank-sum test is a nonparametric alternative to the twosample t-test which is based solely on the order in which the observations from the two samples fall. We

More information

1.5 Oneway Analysis of Variance

1.5 Oneway Analysis of Variance Statistics: Rosie Cornish. 200. 1.5 Oneway Analysis of Variance 1 Introduction Oneway analysis of variance (ANOVA) is used to compare several means. This method is often used in scientific or medical experiments

More information

Hypothesis Testing for Beginners

Hypothesis Testing for Beginners Hypothesis Testing for Beginners Michele Piffer LSE August, 2011 Michele Piffer (LSE) Hypothesis Testing for Beginners August, 2011 1 / 53 One year ago a friend asked me to put down some easy-to-read notes

More information

Point and Interval Estimates

Point and Interval Estimates Point and Interval Estimates Suppose we want to estimate a parameter, such as p or µ, based on a finite sample of data. There are two main methods: 1. Point estimate: Summarize the sample by a single number

More information

Test Positive True Positive False Positive. Test Negative False Negative True Negative. Figure 5-1: 2 x 2 Contingency Table

Test Positive True Positive False Positive. Test Negative False Negative True Negative. Figure 5-1: 2 x 2 Contingency Table ANALYSIS OF DISCRT VARIABLS / 5 CHAPTR FIV ANALYSIS OF DISCRT VARIABLS Discrete variables are those which can only assume certain fixed values. xamples include outcome variables with results such as live

More information

Paired 2 Sample t-test

Paired 2 Sample t-test Variations of the t-test: Paired 2 Sample 1 Paired 2 Sample t-test Suppose we are interested in the effect of different sampling strategies on the quality of data we recover from archaeological field surveys.

More information

MONT 107N Understanding Randomness Solutions For Final Examination May 11, 2010

MONT 107N Understanding Randomness Solutions For Final Examination May 11, 2010 MONT 07N Understanding Randomness Solutions For Final Examination May, 00 Short Answer (a) (0) How are the EV and SE for the sum of n draws with replacement from a box computed? Solution: The EV is n times

More information

3.4 Statistical inference for 2 populations based on two samples

3.4 Statistical inference for 2 populations based on two samples 3.4 Statistical inference for 2 populations based on two samples Tests for a difference between two population means The first sample will be denoted as X 1, X 2,..., X m. The second sample will be denoted

More information

8 6 X 2 Test for a Variance or Standard Deviation

8 6 X 2 Test for a Variance or Standard Deviation Section 8 6 x 2 Test for a Variance or Standard Deviation 437 This test uses the P-value method. Therefore, it is not necessary to enter a significance level. 1. Select MegaStat>Hypothesis Tests>Proportion

More information

CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE 1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,

More information

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I.........................................................

More information

Regression Analysis: A Complete Example

Regression Analysis: A Complete Example Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

More information

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r), Chapter 0 Key Ideas Correlation, Correlation Coefficient (r), Section 0-: Overview We have already explored the basics of describing single variable data sets. However, when two quantitative variables

More information

Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Be able to explain the difference between the p-value and a posterior

More information

Unit 26 Estimation with Confidence Intervals

Unit 26 Estimation with Confidence Intervals Unit 26 Estimation with Confidence Intervals Objectives: To see how confidence intervals are used to estimate a population proportion, a population mean, a difference in population proportions, or a difference

More information

Introduction. Statistics Toolbox

Introduction. Statistics Toolbox Introduction A hypothesis test is a procedure for determining if an assertion about a characteristic of a population is reasonable. For example, suppose that someone says that the average price of a gallon

More information

NCSS Statistical Software

NCSS Statistical Software Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, two-sample t-tests, the z-test, the

More information

Week 3&4: Z tables and the Sampling Distribution of X

Week 3&4: Z tables and the Sampling Distribution of X Week 3&4: Z tables and the Sampling Distribution of X 2 / 36 The Standard Normal Distribution, or Z Distribution, is the distribution of a random variable, Z N(0, 1 2 ). The distribution of any other normal

More information

Chapter 7 Section 7.1: Inference for the Mean of a Population

Chapter 7 Section 7.1: Inference for the Mean of a Population Chapter 7 Section 7.1: Inference for the Mean of a Population Now let s look at a similar situation Take an SRS of size n Normal Population : N(, ). Both and are unknown parameters. Unlike what we used

More information

UNDERSTANDING THE DEPENDENT-SAMPLES t TEST

UNDERSTANDING THE DEPENDENT-SAMPLES t TEST UNDERSTANDING THE DEPENDENT-SAMPLES t TEST A dependent-samples t test (a.k.a. matched or paired-samples, matched-pairs, samples, or subjects, simple repeated-measures or within-groups, or correlated groups)

More information

p ˆ (sample mean and sample

p ˆ (sample mean and sample Chapter 6: Confidence Intervals and Hypothesis Testing When analyzing data, we can t just accept the sample mean or sample proportion as the official mean or proportion. When we estimate the statistics

More information

Two Related Samples t Test

Two Related Samples t Test Two Related Samples t Test In this example 1 students saw five pictures of attractive people and five pictures of unattractive people. For each picture, the students rated the friendliness of the person

More information

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name: Glo bal Leadership M BA BUSINESS STATISTICS FINAL EXAM Name: INSTRUCTIONS 1. Do not open this exam until instructed to do so. 2. Be sure to fill in your name before starting the exam. 3. You have two hours

More information

Statistics 2014 Scoring Guidelines

Statistics 2014 Scoring Guidelines AP Statistics 2014 Scoring Guidelines College Board, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks of the College Board. AP Central is the official online home

More information

Two-sample hypothesis testing, II 9.07 3/16/2004

Two-sample hypothesis testing, II 9.07 3/16/2004 Two-sample hypothesis testing, II 9.07 3/16/004 Small sample tests for the difference between two independent means For two-sample tests of the difference in mean, things get a little confusing, here,

More information

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96 1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

More information

Chapter 7 Notes - Inference for Single Samples. You know already for a large sample, you can invoke the CLT so:

Chapter 7 Notes - Inference for Single Samples. You know already for a large sample, you can invoke the CLT so: Chapter 7 Notes - Inference for Single Samples You know already for a large sample, you can invoke the CLT so: X N(µ, ). Also for a large sample, you can replace an unknown σ by s. You know how to do a

More information

What is a P-value? Ronald A. Thisted, PhD Departments of Statistics and Health Studies The University of Chicago

What is a P-value? Ronald A. Thisted, PhD Departments of Statistics and Health Studies The University of Chicago What is a P-value? Ronald A. Thisted, PhD Departments of Statistics and Health Studies The University of Chicago 8 June 1998, Corrections 14 February 2010 Abstract Results favoring one treatment over another

More information

1. How different is the t distribution from the normal?

1. How different is the t distribution from the normal? Statistics 101 106 Lecture 7 (20 October 98) c David Pollard Page 1 Read M&M 7.1 and 7.2, ignoring starred parts. Reread M&M 3.2. The effects of estimated variances on normal approximations. t-distributions.

More information

1.7 Graphs of Functions

1.7 Graphs of Functions 64 Relations and Functions 1.7 Graphs of Functions In Section 1.4 we defined a function as a special type of relation; one in which each x-coordinate was matched with only one y-coordinate. We spent most

More information

Math 4310 Handout - Quotient Vector Spaces

Math 4310 Handout - Quotient Vector Spaces Math 4310 Handout - Quotient Vector Spaces Dan Collins The textbook defines a subspace of a vector space in Chapter 4, but it avoids ever discussing the notion of a quotient space. This is understandable

More information

Topic #6: Hypothesis. Usage

Topic #6: Hypothesis. Usage Topic #6: Hypothesis A hypothesis is a suggested explanation of a phenomenon or reasoned proposal suggesting a possible correlation between multiple phenomena. The term derives from the ancient Greek,

More information

Experimental Analysis

Experimental Analysis Experimental Analysis Instructors: If your institution does not have the Fish Farm computer simulation, contact the project directors for information on obtaining it free of charge. The ESA21 project team

More information

8 Divisibility and prime numbers

8 Divisibility and prime numbers 8 Divisibility and prime numbers 8.1 Divisibility In this short section we extend the concept of a multiple from the natural numbers to the integers. We also summarize several other terms that express

More information

Chapter 7. One-way ANOVA

Chapter 7. One-way ANOVA Chapter 7 One-way ANOVA One-way ANOVA examines equality of population means for a quantitative outcome and a single categorical explanatory variable with any number of levels. The t-test of Chapter 6 looks

More information

NCSS Statistical Software

NCSS Statistical Software Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, two-sample t-tests, the z-test, the

More information

Correlational Research

Correlational Research Correlational Research Chapter Fifteen Correlational Research Chapter Fifteen Bring folder of readings The Nature of Correlational Research Correlational Research is also known as Associational Research.

More information

BA 275 Review Problems - Week 5 (10/23/06-10/27/06) CD Lessons: 48, 49, 50, 51, 52 Textbook: pp. 380-394

BA 275 Review Problems - Week 5 (10/23/06-10/27/06) CD Lessons: 48, 49, 50, 51, 52 Textbook: pp. 380-394 BA 275 Review Problems - Week 5 (10/23/06-10/27/06) CD Lessons: 48, 49, 50, 51, 52 Textbook: pp. 380-394 1. Does vigorous exercise affect concentration? In general, the time needed for people to complete

More information

HYPOTHESIS TESTING: POWER OF THE TEST

HYPOTHESIS TESTING: POWER OF THE TEST HYPOTHESIS TESTING: POWER OF THE TEST The first 6 steps of the 9-step test of hypothesis are called "the test". These steps are not dependent on the observed data values. When planning a research project,

More information

Chapter 6: Probability

Chapter 6: Probability Chapter 6: Probability In a more mathematically oriented statistics course, you would spend a lot of time talking about colored balls in urns. We will skip over such detailed examinations of probability,

More information

Simple Inventory Management

Simple Inventory Management Jon Bennett Consulting http://www.jondbennett.com Simple Inventory Management Free Up Cash While Satisfying Your Customers Part of the Business Philosophy White Papers Series Author: Jon Bennett September

More information