Introduction to Hypothesis Testing

A Hypothesis Test for μ

Heuristic

Hypothesis testing works a lot like our legal system. In the legal system, the accused is innocent until proven guilty. After examining the evidence, he is found either guilty or not guilty by a jury of his peers. How much evidence does there need to be to convict? The answer is different for every jury. Also, this is not a perfect process, meaning mistakes are made. A mistake can be made by sending an innocent man to prison or by letting a guilty man go free. Let's put these ideas into the framework of hypothesis testing.

Statistical

Let's say a researcher has reason to believe the population mean μ is different from what has been accepted. The belief that has been around for some time (the status quo) will be called the null hypothesis, denoted by H0. The belief that the true mean may actually be different from this null hypothesized value is called the alternative hypothesis, denoted by HA.

Stating the hypothesis

We'll state our null hypothesis in the following way:

H0: μ = μ0    (the = sign always goes with H0)

Then, the alternative hypothesis can be one of the following three statements:

HA: μ > μ0
HA: μ < μ0
HA: μ ≠ μ0

Finding the evidence

We'll use X̄ and knowledge of its distribution to gather our evidence. Intuitively, we know that the further away X̄ is from μ0, the more evidence we have that the null hypothesis is not true. Let

t_s = (x̄ − μ0) / (s / √n)

be our test statistic. If the null hypothesis were true (remember: innocent until proven guilty!), then t_s ~ t with df = n − 1, and we can compute probabilities associated with it.

When x̄ is close to μ0, then t_s is close to 0.
When x̄ is larger than μ0, then t_s is positive.
When x̄ is smaller than μ0, then t_s is negative.
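The test statistic above can be computed directly. A minimal Python sketch (the numbers plugged in come from the blood-pressure example later in these notes; the function name is my own):

```python
import math

def t_statistic(xbar, mu0, s, n):
    """t_s = (x̄ − μ0) / (s / √n), the test statistic defined above."""
    return (xbar - mu0) / (s / math.sqrt(n))

# x̄ below μ0 gives a negative t_s; further from μ0 means larger magnitude
print(round(t_statistic(126.1, 128, 15.2, 72), 3))  # → -1.061
```

Note that t_s measures how many estimated standard errors x̄ lies from μ0, which is why distance from 0 quantifies evidence against H0.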
[Diagram: sampling distribution of t_s under H0, with the area giving the evidence shaded for each of the three possible tests:
H0: μ = μ0 vs. HA: μ > μ0 (right tail)
H0: μ = μ0 vs. HA: μ < μ0 (left tail)
H0: μ = μ0 vs. HA: μ ≠ μ0 (both tails)]

Hypothesis Test for μ, Errors, and Power
Definition: The P value of a test statistic is the probability, given that the null hypothesis is true, of observing a test statistic that extreme or more extreme in the direction of the alternative hypothesis.

The Decision

So, the P value quantifies how extreme our test statistic would be, given that the null hypothesis is true. This is evidence against the null hypothesis.

Question: How much evidence is needed to conclude the null hypothesis is incorrect?

Answer: This varies from researcher to researcher, so we'll make a pre-specified cutoff, α, before we conduct the test of hypothesis. We call α the significance level of the test.

We reject H0 when P ≤ α. We fail to reject H0 when P > α.

Steps for Carrying Out a Hypothesis Test
(1) Set α (significance level)
(2) State hypotheses
(3) Compute test statistic
(4) Compute P value
(5) Make decision
(6) State conclusion in context of the setting
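Step (4) depends on which alternative was chosen in step (2): the P value is the tail area in the direction of HA. A hedged Python sketch, using the standard normal curve as a large-sample approximation to the t distribution (function names are my own):

```python
import math

def normal_cdf(z):
    """Φ(z) via the error function; close to the t CDF when df is large."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value(t_s, alternative):
    """Tail area matching the direction of the alternative hypothesis."""
    if alternative == "less":      # HA: μ < μ0 → left tail
        return normal_cdf(t_s)
    if alternative == "greater":   # HA: μ > μ0 → right tail
        return 1.0 - normal_cdf(t_s)
    # HA: μ ≠ μ0 → both tails, i.e., twice the smaller tail area
    return 2.0 * normal_cdf(-abs(t_s))

print(round(p_value(-1.061, "less"), 3))  # → 0.144
```

For small samples, tail areas should come from the t distribution with df = n − 1 rather than this normal approximation.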
Example: The National Center for Health Statistics reports that the mean systolic blood pressure for males aged 35–44 is 128 mmHg. A medical researcher believes the mean systolic blood pressure for male executives in this age group is lower than 128 mmHg. A random sample of 72 male executives in this age group results in a sample mean of 126.1 mmHg and a sample standard deviation of 15.2 mmHg. Is there evidence to support the researcher's claim? Test this hypothesis at the 0.05 level of significance.
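As an arithmetic check on this example, here is a sketch of the six steps in Python, using the standard normal CDF as a large-sample approximation to t with df = 71 (reasonable here since n = 72):

```python
import math

def normal_cdf(z):
    """Φ(z) via the error function; close to the t CDF when df is large."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Steps 1-2: α = 0.05; H0: μ = 128 vs. HA: μ < 128 (left-tailed test)
n, xbar, s, mu0, alpha = 72, 126.1, 15.2, 128, 0.05

t_s = (xbar - mu0) / (s / math.sqrt(n))   # step 3: test statistic
p_val = normal_cdf(t_s)                   # step 4: left-tail P value
decision = "reject H0" if p_val <= alpha else "fail to reject H0"  # step 5

print(round(t_s, 3), round(p_val, 3), decision)
# → -1.061 0.144 fail to reject H0
```

Since P ≈ 0.144 > 0.05, the data do not give sufficient evidence at the 0.05 level that the executives' mean systolic blood pressure is below 128 mmHg.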
What would the P value be for:

HA: μ < μ0?
HA: μ > μ0?

Errors

When we make a decision (reject or fail to reject H0), are we always correct? We can make two types of errors in hypothesis testing.

Definition: The False Positive Rate (a.k.a. the Type I Error Rate) of a test is the probability of rejecting H0 when it is true.
NOTATION: α = P{reject H0 | H0 true}

Definition: The False Negative Rate (a.k.a. the Type II Error Rate) of a test is the probability of failing to reject H0 when it is false.
NOTATION: β = P{fail to reject H0 | H0 false}
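The definition α = P{reject H0 | H0 true} can be checked by simulation: generate many samples from a population where H0 really is true and count how often the test rejects. A sketch reusing the blood-pressure numbers as a convenient setting (the normal critical value −1.6449 approximates the t critical value for a left-tailed test at α = 0.05):

```python
import math, random

def type1_rate(mu0=128, sigma=15.2, n=72, trials=10000, seed=1):
    """Estimate α = P{reject H0 | H0 true} by simulating data from H0."""
    rng = random.Random(seed)
    z_crit = -1.6449  # left-tail critical value for α = 0.05 (normal approx)
    rejects = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t_s = (xbar - mu0) / (s / math.sqrt(n))
        if t_s <= z_crit:
            rejects += 1
    return rejects / trials

print(type1_rate())
```

The estimated rejection rate comes out near 0.05, as the definition of α says it should when H0 is true.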
Choosing α

If we think of α (the significance level of a test) as the probability of rejecting the null hypothesis given that the null hypothesis is actually true, then we would certainly want to choose a very small α to guard against this type of error. Right? It turns out we cannot simultaneously minimize both α and β. Traditionally, we attend to α:

If a false positive error is worse than a false negative, drive α very low (.01, .005, ...)
If a false negative error is worse than a false positive, let α rise (.10, or even .15)
If you're not sure / can't distinguish, then a traditional middle ground is α = 0.05.

Example

Suppose some sort of immunotherapy is being proposed as an effective therapy against cancer. Suppose the immunotherapy is tested on cancer patients who are already taking chemotherapy, and some measure of change in response (change in tumor size?) is recorded, with

H0: no effect of immunotherapy
HA: beneficial effect of immunotherapy

A Type I Error would waste a lot of patients' money on a useless immunotherapy.
A Type II Error would dismiss an effective cure as useless.

Deciding which type of error is worse isn't always easy to determine!

Power

Definition: The power of a test is the probability of rejecting H0 when it is false.
NOTATION: P{reject H0 | H0 false}

Notice: P{reject H0 | H0 false} = 1 − P{fail to reject H0 | H0 false} = 1 − β.

So, power is the complement of the false negative error rate. We can estimate the power of a hypothesis testing procedure in advance (the details are beyond the scope of this course), and often we try to design experiments so that power = 1 − β ≥ 0.80.
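For a left-tailed test with known σ, power has a simple closed form: H0 is rejected when x̄ falls at or below μ0 − 1.6449·σ/√n (the α = 0.05 cutoff), so power is the probability of landing there when the true mean is some specific value. A sketch in the blood-pressure setting, where the "true mean of 125 mmHg" is a purely hypothetical value chosen for illustration:

```python
import math

def normal_cdf(z):
    """Φ(z) via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_left_tailed(mu0, mu_true, sigma, n):
    """Power = P{reject H0 | true mean is mu_true}, left-tailed z-test, α = 0.05."""
    se = sigma / math.sqrt(n)
    cutoff = mu0 - 1.6449 * se   # reject when x̄ falls at or below this value
    return normal_cdf((cutoff - mu_true) / se)

# Hypothetical: if the executives' true mean were 125 mmHg
print(round(power_left_tailed(128, 125, 15.2, 72), 3))  # → 0.512
```

A power of about 0.51 falls well short of the 0.80 target mentioned above, which is exactly the kind of calculation that motivates choosing a larger sample size before running a study.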