Chapter 4. Hypothesis Tests

4.1 Introducing Hypothesis Tests

Key Concepts

- Motivate hypothesis tests
- Null and alternative hypotheses
- Introduce the concept of statistical significance

Timing

One class. If your students have weaker backgrounds, you might delay talking about statistical significance until the next class and use one and a half classes for the section.

Class Notes

Motivate hypothesis tests

Main idea: How do we tell whether data from a sample provide clear evidence for a hypothesis about a population, or whether the sample results might just be due to random chance? Emphasize that we are testing whether we can go from data in a single sample to a conclusion about a population. Start class with a class activity and an example or two to motivate this idea. Suggestions are given below.

Null and alternative hypotheses

Main idea: The null hypothesis is usually "no effect" or "no difference," while the alternative hypothesis is what we are seeking evidence of.

Additional points for the students:

- In this text (until Unit D), the null hypothesis will always have an equals sign, while the alternative will include less than, greater than, or not equals.
- The notation in the hypotheses must be in terms of population parameters rather than sample statistics.
- Parameters used in this chapter are the same as those used in Chapter 3: proportion, mean, difference in proportions, difference in means, and correlation.

After introducing hypotheses in class, return to the motivating examples from earlier and write the hypotheses. Then a variety of short, quick examples will help students get used to stating the hypotheses. Be sure to encourage them to always define the parameter(s).

Introduce the concept of statistical significance

Main idea: If sample results are strongly in support of the alternative hypothesis and unlikely to occur by random chance when assuming the null hypothesis is true, we call them statistically significant.

This is a hard concept, and it may take some time for students to fully understand it. At this stage, a minimal understanding would be that "significant" means evidence for the alternative, while "not significant" means we can make no conclusion. We build to a deeper understanding over the next two sections. In class, talk about what significant would mean for each of the motivating examples and also what insignificant means. Be sure the students always state the result in context. Use an additional example for students to work through at the end of class.

Class Activity

Begin class with one of the following class activities to motivate the ideas behind hypothesis tests:

Option 1: Collect data from the class by having all students choose and write down a random number between 1 and 100. Poll the class to find the proportion of odd numbers. Is it greater than 0.5? (If the sample proportion is less than 0.5, do the following for even numbers.) If so, can we conclude that college students prefer odd numbers? Note that if the proportion picking odd is 0.51, the sample proportion is greater than 0.5, but we also know that a result such as that could easily happen just by random chance. If the proportion is 0.9, that would provide much more evidence that college students prefer odd numbers. How much evidence do we need before we can make such a conclusion? Hypothesis testing helps us figure that out!

Option 2: Collect data from the class by having students pair up and test each other for extrasensory perception (ESP). See the class PowerPoint slides or the class activity handout. One student will select and write down one of the five letters A, B, C, D, or E. The other student will try to guess what the letter is. Students keep track of the number of attempts and the number correctly guessed, so that we have a proportion of correct guesses. If no one has ESP, we expect the proportion to be 0.2. Is the sample proportion greater than 0.2? If so, can we conclude that people have ESP and can communicate telepathically? Note that if the proportion of correct guesses is 0.21, the sample proportion is greater than 0.2, but we also know that a result such as that could easily happen just by random chance. If the proportion is 0.9, that would provide much more evidence that people have ESP. How much evidence do we need before we can make such a conclusion? Hypothesis testing helps us figure that out!
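For instructors who want numbers behind the "could easily happen just by random chance" remark in the two options, the following is a minimal Python sketch rather than part of the original notes. It assumes a hypothetical total of 100 responses (the notes do not give a class size or a number of attempts), and the function name `upper_tail_probability` is made up for illustration. It computes exactly how often chance alone would give a proportion at least as large as 0.51 or 0.9 for Option 1, and at least 0.21 or 0.9 for Option 2.

```python
from math import comb


def upper_tail_probability(n, p, k_min):
    """Exact P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Hypothetical number of responses; this value is an assumption, not from the notes.
N_RESPONSES = 100

# Option 1: if there is no preference for odd numbers, p = 0.5.
odd_51 = upper_tail_probability(N_RESPONSES, 0.5, 51)  # proportion >= 0.51
odd_90 = upper_tail_probability(N_RESPONSES, 0.5, 90)  # proportion >= 0.90

# Option 2: if no one has ESP, p = 0.2 (one of five letters).
esp_21 = upper_tail_probability(N_RESPONSES, 0.2, 21)  # proportion >= 0.21
esp_90 = upper_tail_probability(N_RESPONSES, 0.2, 90)  # proportion >= 0.90

print(f"Odd numbers: P(prop >= 0.51) = {odd_51:.3f}, P(prop >= 0.90) = {odd_90:.2e}")
print(f"ESP:         P(prop >= 0.21) = {esp_21:.3f}, P(prop >= 0.90) = {esp_90:.2e}")
```

With these assumed sizes, the proportions just above the chance value turn up a large share of the time, while a proportion of 0.9 is vanishingly rare, which is the contrast the activity is meant to draw out.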
Use the activity example to motivate the null and alternative hypotheses:

Option 1: H0: p = 0.5 vs Ha: p > 0.5, where p is the proportion of college students picking odd numbers.

Option 2: H0: p = 0.2 vs Ha: p > 0.2, where p is the proportion of people guessing a letter correctly out of five choices.

Return to this example at the end to discuss what statistical significance would mean in this context. Ask students what "assuming the null hypothesis is true" means in each case.

Class Examples

In addition to the class example that arises from a class activity, here are a couple of other suggestions. See also the PowerPoint slides and the class handout. Also, check the exercises for good ideas for examples. There are more exercises than needed, and many are quite interesting and would make great class examples.

Example: Sleep or Caffeine for Memory?

In an experiment, students were given words to memorize, then were randomly assigned to either take a 90-minute nap or take a caffeine pill. A couple of hours later, they were tested on their recall ability. We wish to test whether the sample provides evidence that there is a difference in the mean number of words people can recall depending on whether they take a nap or have some caffeine. State the hypotheses. Indicate what statistically significant results would mean in this context. What can we conclude if the results are statistically insignificant?

Solution

Lead the students through recognizing this as a difference in means test in which the alternative hypothesis is "not equals." We have H0: µ_s = µ_c vs Ha: µ_s ≠ µ_c, where µ_s and µ_c are the mean number of words people can recall after sleep or caffeine, respectively. Remind them that the hypotheses are always about population parameters, not sample statistics. In this case, statistically significant results would indicate that there is evidence of a difference in the average number of words recalled between people who take a nap and people who take caffeine. Statistically insignificant results would give us no information, and we couldn't make any concrete conclusion.

Example: BPA in Canned Tomatoes?

A sample of 50 cans of tomatoes is tested for levels of the chemical BPA to see if there is evidence that the mean level is greater than 100 ppb (parts per billion). State the hypotheses. Indicate what statistically significant results would mean in this context. What can we conclude if the results are statistically insignificant? Discuss what kind of sample results might be significant and what might be insignificant.

Solution

Lead the students through recognizing this as a test for a single mean in which the alternative hypothesis is "greater than." We have H0: µ = 100 vs Ha: µ > 100, where µ is the mean BPA level in the population of all canned tomatoes. Remind them that the hypotheses are always about population parameters, not sample statistics. In this case, statistically significant results would indicate that there is evidence that the mean BPA level in canned tomatoes is greater than 100 ppb. Statistically insignificant results would give us no information, and we couldn't make any concrete conclusion.

Sample results would definitely be insignificant if the sample mean is less than 100, since that would provide no evidence for a population mean greater than 100. Sample means close to 100 are also likely to be inconclusive, since they are likely to happen just by random chance even if the population mean equals or is less than 100, so these are unlikely to provide enough evidence to generalize to the population. Very large sample means are likely to give the most evidence for the alternative hypothesis, since they are unlikely to happen by random chance if the null hypothesis is true. (Of course, all this depends on the sample size and the standard deviation, but at this point we just want to help students develop a general idea about evidence for the alternative.)

Exercise Notes

Exercises 1-4, 20, 27, 28, 37, 38 ask students to think about evidence for a claim. Exercises 16, 17, 39, 40 explicitly ask about statistical significance. Almost all the exercises ask students to state null and alternative hypotheses. There are no exercises needing technology beyond a graphing calculator.

Suggested Exercises: Skill Builders: 1-15 odd (or even); Exercises: 16, 17, 19, 28

4.2 Measuring Evidence with P-values

Key Concepts

- A p-value measures the strength of evidence against the null hypothesis
- A randomization distribution allows us to use the meaning of a p-value to calculate a p-value
- The smaller the p-value, the stronger the evidence for the alternative hypothesis

Timing

One class. Take your time! If students really understand a p-value, the rest is easy.

Class Notes

Start by reminding students of the roles of the null and alternative hypotheses, and then ask them the key question: How unusual is it to see a sample statistic as extreme as the sample statistic we observed, if H0 is true? Introduce this idea with one example or activity (such as Paul the Octopus; see below). Ask what it means to assume the null hypothesis is true in this case.

We create a randomization distribution assuming the null hypothesis is true. The main focus of this section, however, is on understanding p-values. We return to the details of creating randomization distributions in Section 4.4. The main thing students should realize now is that the randomization distribution consists of simulated statistics that would happen just by random chance if the null hypothesis is true. The fact that we assume H0 is true is a key difference between randomization distributions and the bootstrap distributions students saw in Chapter 3.

Use StatKey or other technology to generate randomization distributions. Ideally, do this on a computer projector in class. [If you don't have access to a computer projector in class, create slides from images on StatKey. Instructions for downloading images are available there.]

Once we have the randomization distribution, we can answer the key question: How unusual is it to see a sample statistic as extreme as the sample statistic we observed, if H0 is true? We find the proportion of simulated statistics in the tail beyond the sample statistic from our data. This is the p-value: the proportion of times we might obtain a sample statistic as extreme as (or more extreme than) the observed sample statistic, if the null hypothesis is true.

Do a couple of examples using only right-tail tests (Ha: >) until they get the idea, then introduce them to left-tail tests (Ha: <) and two-tail tests (Ha: ≠).

Point out to them that in some sense this is the reverse of the process they used for confidence intervals. For confidence intervals, they enter an area under the distribution (such as 95% or 90%) and find the cutoffs. To find a p-value, the process works in reverse: they enter a cutoff (the relevant sample statistic) and find the area beyond that cutoff.

The most important point to make in the class is this: The smaller the p-value, the stronger the evidence for the alternative hypothesis. Point out that sample statistics that give the strongest evidence for the alternative hypothesis will be far out in the tail of a randomization distribution, which leads to small p-values.

Don't get ahead of the game: We don't worry about significance level and making conclusions such as "Reject H0" until the next section. At this point, we just want them to know: The smaller the p-value, the stronger the evidence for the alternative hypothesis.

Class Activity

Paul the Octopus makes a great class activity to begin this class. You will need to bring a bag of pennies or other coins to class to distribute.

During the 2010 World Cup tournament, Paul the Octopus (in a German aquarium) became famous for correctly predicting the winner in all 8 games it was asked to predict. Two containers of food were lowered into Paul's tank, each with the flag of one of the opposing teams. He made a selection by choosing which container to eat from. (It is great fun, if possible, to show the YouTube video in class. The link is in the PowerPoint slides, or just google Paul the Octopus.) The question is: Does this sample of 8 games provide evidence that Paul has psychic powers and can choose correctly more than half the time?

Ask students to state the null and alternative hypotheses. (H0: p = 0.5 vs Ha: p > 0.5)

Ask students to compute Paul's sample proportion. (p̂ = 8/8 = 1.0)

We want to see how unlikely it is to have an 8-for-8 record if Paul is just randomly guessing. We can simulate this with a coin! Each coin flip represents a guess between two teams, with heads standing for a correct guess and tails for an incorrect one. Ask students why this method works for assuming the null hypothesis is true. (Because the null hypothesis is p = 0.5, and a coin lands heads 0.5 of the time and tails the other 0.5.)

Now the activity: Have each student (or each group of students) flip a coin 8 times and count the number of heads. Have them compute their sample proportion of heads. Then ask the class whether anyone got heads on all 8 flips. This begins to give them the idea of measuring how unlikely something is.

Then turn to StatKey and show a simulated distribution of many sample proportions with samples of size 8 and p = 0.5. Point out that every dot is a simulated sample proportion just like the one they just found (but the computer can do it much faster!). Now that you have the randomization distribution, the question is: how unlikely is the sample proportion p̂ = 1.0 that Paul had? We find this by clicking on "Right tail" and entering 1.0 as the cutoff. (This will be new to them, since previously they only changed the center area box.) The area at or above 1.0 is the p-value! It will probably be about 0.004. This tells us that we will only see 8 heads out of 8 flips of a coin on about 4 out of 1000 tries. This is pretty unlikely and might make you wonder whether the coin is fair or whether Paul has psychic powers!

Class Examples

In addition to the class example that arises from a class activity (Paul the Octopus as described above, which can also work well as a class example without the activity, or some other), here are a couple of other suggestions. See also the PowerPoint slides and the class handout. Also, check the exercises for good ideas for examples. There are more exercises than needed, and many are quite interesting and would make great class examples.

Example: Support for the Death Penalty

In 1980 and again in 2010, a Gallup poll asked a random sample of 1000 US citizens, "Are you in favor of the death penalty for a person convicted of murder?" In 1980, the proportion saying yes was 0.66. In 2010, it was 0.64. Do these data provide evidence that the proportion of US citizens favoring the death penalty was higher in 1980 than it was in 2010? Using p_1 for the proportion in 1980 and p_2 for the proportion in 2010:

(a) State the null and alternative hypotheses.

(b) What is the sample statistic?
(c) To create the randomization distribution, what do we have to assume?
(d) (Show a randomization distribution on StatKey or a slide and ask:) Which of the following is closest to the p-value? 0.001, 0.02, 0.15, 0.5
(e) (Then use StatKey or other technology to find the p-value exactly, making sure they understand the steps involved.)

Solution

(a) This is a difference in proportions test, with hypotheses H0: p_1 = p_2 vs Ha: p_1 > p_2.
(b) The sample statistic is the difference in sample proportions: p̂_1 − p̂_2 = 0.66 − 0.64 = 0.02.
(c) To create the simulated statistics, we assume the proportions are equal, as stated in the null hypothesis.
(d) Have the students visually estimate the area in the tail beyond 0.02. It should be closest to 0.15. (An image of the randomization distribution is available on the PowerPoint slides and the handout.)
(e) The p-value comes out to be about 0.16.

Example: Sleep or Caffeine for Memory?

[This example is continued from earlier class notes.] In an experiment, students were given words to memorize, then were randomly assigned to either take a 90-minute nap or take a caffeine pill. A couple of hours later, they were tested on their recall ability. We wish to test whether the sample provides evidence that there is a difference in the mean number of words people can recall depending on whether they take a nap or have some caffeine. The hypotheses are H0: µ_s = µ_c vs Ha: µ_s ≠ µ_c, and the sample statistic is x̄_s − x̄_c = 3.0. Use a randomization distribution to find the p-value.

Solution

Use StatKey or other technology to generate the randomization distribution, and point out that those simulated samples show what might happen by random chance if we assume the null hypothesis (no difference in means between sleep and caffeine) is true. Ask the students how to find the p-value. This is a two-tail test, and we enter the sample statistic of 3.0 on the right tail. Then we need to double the area to account for the two tails! This is hard for them to remember, so be sure to do an example such as this one that is two-tailed. The p-value is approximately 2(0.022) = 0.044.

Example: More Short Examples!

Have some short examples at the end where you just ask them quick questions such as: Which p-value gives stronger evidence against the null hypothesis and for the alternative hypothesis: 0.95 or 0.02? Make sure they internalize that a small p-value means more evidence!

Exercise Notes

Exercises 41-44, 52, 53, 59 ask students to use the fact that a small p-value gives more evidence. Exercises 45-47, 54, 55, 58, 60 ask students to estimate a p-value from a randomization distribution. Exercises 48-51, 56, 57 ask both of the above. Exercise 61 asks students to think clearly about the definition of a p-value, and 62 and 63 are more challenging exercises. There are no exercises needing technology beyond a graphing calculator.

Suggested Exercises: Skill Builders: 41-51 odd (or even); Exercises: 53, 55, 56 or 57, 58
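If StatKey is not available, the Paul the Octopus coin-flip simulation from this section can be approximated in a few lines of Python. The sketch below is an illustration, not a description of how StatKey works. The function names, the seed, and the 20,000 repetitions are assumptions chosen for this example. It also shows the "double the tail" convention used for two-tail tests in the Sleep or Caffeine example.

```python
import random


def simulate_null_proportions(n, p0, reps, seed=0):
    """Simulate `reps` sample proportions from samples of size n, assuming H0: p = p0."""
    rng = random.Random(seed)
    return [sum(rng.random() < p0 for _ in range(n)) / n for _ in range(reps)]


def randomization_p_value(simulated, observed, tail="right"):
    """Proportion of simulated statistics at least as extreme as `observed`."""
    right = sum(s >= observed for s in simulated) / len(simulated)
    left = sum(s <= observed for s in simulated) / len(simulated)
    if tail == "right":
        return right
    if tail == "left":
        return left
    if tail == "two":
        # Double the smaller tail area (capped at 1), as in the two-tail example.
        return min(1.0, 2 * min(left, right))
    raise ValueError("tail must be 'right', 'left', or 'two'")


# Paul the Octopus: 8 predictions, H0: p = 0.5, observed proportion 8/8 = 1.0.
null_props = simulate_null_proportions(n=8, p0=0.5, reps=20000)
paul_p_value = randomization_p_value(null_props, observed=1.0, tail="right")

# Exact chance of 8 heads in 8 fair flips, for comparison with the simulation.
exact = 0.5 ** 8

print(f"Simulated p-value: {paul_p_value:.4f}")
print(f"Exact probability: {exact:.4f}")
```

The simulated value varies from run to run when the seed changes, which is itself a useful talking point: every class's randomization distribution is slightly different, but the tail area stays close to the exact probability.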

4.3 Determining Statistical Significance

Key Concepts

- Make a formal decision in a hypothesis test by comparing the p-value to a significance level
- Correctly interpret a conclusion of "Reject H0" or "Do not reject H0" in context

Timing

One class. Be sure to talk about making informal conclusions based on strength of evidence in the p-value as well as the formal "Reject H0" or "Do not reject H0." It is fine to omit Type I and Type II errors. The analogy to law is quite compelling for students and worth covering.

Class Notes

Start with the key point from the previous class: The smaller the p-value, the stronger the evidence for the alternative hypothesis. The question is: how small is small enough? This class is about deciding where to draw the line.

There is a fantastic class activity to do at this point, which involves bringing in a deck of all red cards and a bag of candy! See details below.

After defining significance level and introducing the formal conclusions of "Reject H0" or "Do not reject H0," you will probably want to do several examples. You might want to do several using only a significance level of 0.05 and focus on the appropriate way to state a conclusion in context. Make it clear that "Reject H0" means there is evidence for the alternative hypothesis, while "Do not reject H0" means the test was inconclusive and there is not really evidence of anything. These are the only two possible outcomes of a formal hypothesis test! (The elephant example in the text can help to make this point.)

Once you have illustrated a couple of examples, have students try one at their seats. The class handout provided in the Instructor Resources has examples ready to go; check it out.

Introduce additional significance levels and have students notice that significant at 1% is stronger evidence than significant at 10%. It is worth talking about the advantages of always giving the p-value so that the strength of evidence is clear, rather than just giving the "Reject H0" or "Do not reject H0" conclusion. This is described at the end of the section in the discussion about informal conclusions with strength of evidence.

Feel free to omit the material on Type I and Type II errors. They are not essential to a deep understanding or to what comes later. Even if you omit a discussion of these errors, however, it is worth covering the analogy to law, as we have found this connection to be very compelling for students.

Class Activity: Why 5%?

Create a deck of cards that is all red cards. Make it appear to be a new deck. Bring it to class, open it, and shuffle it obviously in the front of the room. Also bring in a bag of candy. (Reese's peanut butter cups happen to be the candy of choice for one of the Lock authors.) Announce that you will let students draw a card, and the first one to get a black card gets the bag of candy. Proceed to walk around the room, letting each student you come to pick a card. Take your time and play up the "Is it black?" anticipation with each draw. Don't give anything away! The students will get increasingly suspicious as they see red card after red card. Most likely, one of the students will claim that something isn't right after about 5 red cards in a row.

The point? Our intuition starts to tell us something is going on at about 5%. (The probability of seeing that many red cards in a row from a standard deck goes below 5% at 5 in a row and below 1% at 7 in a row.) A significance level of 5% makes sense because this is when our intuition begins to tell us that something beyond just random chance is going on. This activity makes a very compelling and memorable point with the students! (So it is worth the price of a bag of candy, which can then be shared with everyone, which also makes them happy.)

Class Examples

Here are a couple of suggestions. See also the PowerPoint slides and the class handout. Also, check the exercises for good ideas for examples. There are more exercises than needed, and many are quite interesting and would make great class examples.

Example: Red Wine and Weight Loss

Resveratrol, an ingredient in red wine and grapes, has been shown to promote weight loss in animals. In one study, a sample of lemurs had various measurements taken before and after receiving resveratrol supplements for 4 weeks. For each p-value given, indicate the formal generic conclusion as well as a conclusion in context. Use a 5% significance level.

(a) In the test to see if the mean resting metabolic rate is higher after treatment, the p-value is 0.013.
(b) In the test to see if the mean body mass is lower after treatment, the p-value is 0.007.
(c) In the test to see if locomotor activity changes after treatment, the p-value is 0.980.
(d) In the test to see if mean food intake changes after treatment, the p-value is 0.035.
(e) Which of the results given in (a)-(d) above are significant at a 1% level?
Solution Be sure the students understand that it is always important to include a conclusion in context. (a) Reject H 0. There is evidence that mean metabolism rate is higher if resveratrol supplements are taken. (b) Reject H 0. There is strong evidence that mean body mass is lower if resveratrol supplements are taken. (c) Do not reject H 0. We did not find any evidence that resveratrol is associated with activity levels. (d) Reject H 0. There is evidence that mean food intake is related to resveratrol consumption. (e) The one one for which the evidence is strong enough to still be significant at a 1% level is the test on mean body mass. Example Multiple Sclerosis and Sunlight

4.3. DETERMINING STATISTICAL SIGNIFICANCE 11 It is believed that sunlight offers some protection against multiple sclerosis, but the reason is unknown. Is it the vitamin D, the UV light, or something else? In an experiment, mice were injected with a substance to give them MS and were randomly assigned to either a control group (with no treatment), a group that received vitamin D supplements, or a group that got exposed regularly to UV light. The scientists found that mice exposed to UV light were significantly less likely to get MS than the control mice, but that vitamin D did not seem to reduce the likelihood of getting MS compared to the control group. For these two tests, one of the p-values was 0.470 and one was 0.002. Which p-value goes with which test? Also, for each test, indicate whether we Reject H 0 or Do not reject H 0. Solution The scientists found a significant effect with the UV light, so the low p-value of 0.002 goes with that test. The conclusion is to reject H 0. The scientists did not find an effect with vitamin D, so the high p-value of 0.470 goes with that test. The conclusion is to not reject H 0. Example More Short Examples! Include some additional short examples in which you report some p-values (0.03, 0.67, 0.001, etc.) and have the students decide the formal conclusion of Reject H 0 or Do not reject H 0 for a significance level of 5% (and then do some with a level of 1% and/or 10%.) Get them very comfortable before they even leave class with making this generic conclusion! Here is another example if you decide to cover Type I and Type II errors: Example BPA in Tomato Soup A consumer protection agency is testing a sample of cans of tomato soup from a company. If they find evidence that the average level of the chemical bisphenol A (BPA) in tomato soup from this company is greater than 100 ppb (parts per billion), they will recall all the soup and sue the company. (a) State the null and alternative hypotheses. 
(b) What does a Type I error mean in this situation?
(c) What does a Type II error mean in this situation?
(d) Which is more serious, a Type I error or a Type II error? (There is no right answer to this one. It is a matter of opinion and one could argue either way.)

Solution
(a) This is a test for a single mean. The hypotheses are H 0 : µ = 100 vs H a : µ > 100.
(b) A Type I error means the company's mean is within normal bounds of 100 ppb (the null hypothesis is true) but the sample obtained happens to show (incorrectly) that the mean is too high, and the agency ends up recalling all the soup and suing the company when it shouldn't have.
(c) A Type II error means the company's mean is too high (the null hypothesis is false) but the sample obtained doesn't give sufficient evidence to show that it is too high, and the agency (incorrectly) decides not to recall the soup or sue the company.
(d) Both seem pretty serious, so you really want to try to not make an error. (Good time to remind them of the benefits of a larger sample size!)
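For the short p-value drills suggested above, the formal decision rule can be written out in a few lines of Python. This is only a sketch for instructors who like to show the logic explicitly: the function name `formal_conclusion` is made up for illustration, and it assumes the usual convention that we reject H 0 when the p-value is below the significance level.

```python
def formal_conclusion(p_value, alpha=0.05):
    """Return the generic conclusion of a test at significance level alpha.

    Assumed convention: reject H0 when the p-value is less than alpha.
    """
    if p_value < alpha:
        return "Reject H0"
    return "Do not reject H0"


if __name__ == "__main__":
    # The p-values suggested in the notes, checked at 10%, 5%, and 1%.
    for p in (0.03, 0.67, 0.001):
        for alpha in (0.10, 0.05, 0.01):
            print(f"p-value {p}, level {alpha}: {formal_conclusion(p, alpha)}")
```

Running the loop makes one point from the drills easy to see: the same p-value of 0.03 leads to rejecting H 0 at the 5% level but not at the 1% level.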

Exercise Notes
Exercises 67-76, 77, 79-90 all ask students to make appropriate conclusions to tests. Exercises 91, 92, 93-98 ask students to think about significance levels. Exercises 99-103, 104, 105 ask about Type I and Type II errors. Exercise 78 asks students about the definition of a p-value, and 106 is a more challenging exercise. There are no exercises needing technology beyond a graphing calculator.
Suggested Exercises: Skill Builders: 67-76 odd (or even); Exercises: 77, 83, 84, 85, 88, 90

4.4 Creating Randomization Distributions

Key Concepts
Using StatKey or other technology to create randomization distributions and conduct hypothesis tests
Understanding the process behind creating randomization distributions

Timing
One class but with a wide range. Some instructors will minimize the details of creating randomization distributions, and the section will probably take less than one class for these instructors. Instructors who opt to cover the process of creating randomization distributions in greater detail (especially those getting into the process of data collection) will probably want to spend more than one class.

Class Notes
The key idea for this class is that randomization distributions are created assuming the null hypothesis is true. This is key to helping students understand the p-value. If, by the end of this class, they get only this point and also see how to use StatKey or other technology to conduct a randomization hypothesis test, you will have done your job! As a part of this point, be sure they recognize that the randomization distribution will always be centered at the value given by the null hypothesis.
If you have a computer projector, do a variety of examples on StatKey. (If you have a lab with students at their own computers, even better! Have students do a variety of examples on StatKey.) Ask the students for input at every stage: Where will this randomization distribution be centered? What do we do after we have the randomization distribution? Is this a left-tail, right-tail, or two-tail test? How do we find the p-value? How do we interpret the p-value in terms of likelihood of the results happening by random chance? What is the conclusion of the test? This class should strongly reinforce material from the previous sections of Chapter 4.
How much detail you go into on the process of creating randomization distributions is optional and up to you.
Some instructors will minimize this part, and that is fine. If you do cover it, the point is not to have students memorize a list of different methods, but to expose them to the general process and have them think about what it means in different situations. The general process is to find a randomization method that is consistent with the null hypothesis and uses the data in the original sample. Ask students to brainstorm in different situations and come up with methods that fit these goals! Because there are a variety of different methods that work and not just one method to memorize, students might find this section difficult. For this reason, some instructors will simplify things a bit by not spending much time (if any) on the goal to match the randomization used in data collection. Instructors who do spend time on this part, however, may find it is a powerful connection back to Chapter 1.

Class Activity: Creating One Randomization Statistic
If you plan to use this activity, designed to go with the Cocaine Addiction example below, you will need to bring one or more pre-made decks of cards to class. Each deck of cards can be made of 3 x 5 cards cut in half or whatever is easiest for you. You will need 48 cards, with a red R (relapse) printed on 28 of them and a black NR (no relapse) printed on the other 20. Separate them into one deck of 24 with 10 R cards and 14 NR cards (labeled desipramine with a post-it note), and another deck of 24 with 18 R cards and 6 NR cards (labeled lithium with a post-it note), so that the decks match the two-way table in the example. Bring one of these decks of 48 cards to class. Bring more of them if you want to split the class into groups and give each group a deck of cards to play with. (Or, if you have class time to spare, bring blank cards and have the students create their own decks.) Once you make these decks of cards, save them for future semesters! Or share with colleagues!
As a part of the Cocaine Addiction study below, have each student or group of students (or just one student you select to demonstrate, if you only have one deck of cards) create one randomization sample as follows: Since the null hypothesis says the drug doesn't matter in determining who relapses and who doesn't, we can combine the desipramine cards with the lithium cards and shuffle them all together in one big deck of 48 cards. Since we want to see what can happen just by random chance of assignment to the two groups, we then deal the cards into two piles of 24, one designated the simulated desipramine group and one designated the simulated lithium group. This is one randomization sample. We then compute the difference in proportion of relapses between the two groups, and that is one randomization statistic. We can plot it on a (very small!) dotplot and that is the beginning of a randomization distribution!

Class Examples
Here are a couple suggestions. See also the PowerPoint slides and the class handout. Also, check the exercises for good ideas for examples. There are more exercises than needed and many are quite interesting and would make great class examples.
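The card-shuffling activity described above can also be mirrored in software, for instructors who want a counterpart to the physical decks. The following Python sketch is not how StatKey is implemented. It simply shuffles 48 simulated cards (28 relapse, 20 no relapse, taken from the two-way table in the Cocaine Addiction example) and deals them into two groups of 24, many times over. The function names, the number of simulations, and the seed are illustrative choices.

```python
import random


def diff_in_proportions(group_d, group_l):
    """Proportion of relapses in group_d minus proportion in group_l (1 = relapse)."""
    return sum(group_d) / len(group_d) - sum(group_l) / len(group_l)


def cocaine_randomization_test(n_sims=10000, seed=0):
    """Left-tail randomization test of H0: pD = pL vs Ha: pD < pL."""
    rng = random.Random(seed)
    # Data from the two-way table: 1 = relapse (R card), 0 = no relapse (NR card).
    desipramine = [1] * 10 + [0] * 14
    lithium = [1] * 18 + [0] * 6
    observed = diff_in_proportions(desipramine, lithium)

    # Under H0 the drug label does not matter, so pool all 48 cards.
    deck = desipramine + lithium
    as_extreme = 0
    for _ in range(n_sims):
        rng.shuffle(deck)                        # shuffle the big deck
        sim_d, sim_l = deck[:24], deck[24:]      # deal two piles of 24
        stat = diff_in_proportions(sim_d, sim_l)  # one randomization statistic
        if stat <= observed + 1e-9:              # at least as far into the left tail
            as_extreme += 1
    return observed, as_extreme / n_sims


if __name__ == "__main__":
    observed, p_value = cocaine_randomization_test()
    print(f"observed difference: {observed:.3f}")
    print(f"approximate left-tail p-value: {p_value:.3f}")
```

Because the shuffles are random, the estimated p-value varies a little from run to run (and from StatKey's value), which is itself a useful point to raise with students.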
Example Cocaine Addiction
In a randomized experiment on treating cocaine addiction, 48 cocaine addicts who were trying to quit were randomly assigned to take either desipramine (a new drug) or lithium (an existing drug). The response variable is whether or not the person relapsed (which means the person was unable to break out of the cycle of addiction and returned to using cocaine). We are testing to see if desipramine is better than lithium at treating cocaine addiction. The results are shown in the two-way table.

               Relapse   No relapse   Total
Desipramine       10         14         24
Lithium           18          6         24
Total             28         20         48

(a) Using p D for the proportion of desipramine users who relapse and p L for the proportion of lithium users who relapse, write the null and alternative hypotheses.
(b) Compute the appropriate sample statistic.
(c) We compute a randomization statistic by assuming the null hypothesis is true. What does that mean in this case?
(d) How might we compute a randomization sample for this data? What statistic would we compute as the randomization statistic? Do the class activity with cards at this stage.
(e) Use StatKey to generate a randomization dotplot for the difference in proportions based on this sample and what we might see by random chance if the null hypothesis is true. Describe the resulting distribution. Where is it centered?

(f) Use StatKey to see how extreme the sample statistic from part (b) is in the randomization distribution. This tells us how unlikely the sample data is if the null hypothesis is true (which is the p-value!). Explicitly state the p-value and use it to make a conclusion in the test.

Solution
(a) H 0 : p D = p L vs H a : p D < p L
(b) The sample proportions are p̂ D = 10/24 = 0.417 and p̂ L = 18/24 = 0.75, so we have p̂ D − p̂ L = 0.417 − 0.75 = −0.333.
(c) It means that the two proportions are equal and the drug has no effect on the relapse rate. It doesn't matter what drug is taken.
(d) Since drug doesn't matter, we combine all 48 patients together and see that 28 relapsed and 20 didn't. To see what happens by random chance, we randomly divide them into two groups and compute the difference in proportions of relapses between the two groups. The difference in proportions is the statistic.
(e) The resulting distribution will be bell-shaped and centered at the value of the difference in proportions from the null hypothesis, which is zero.
(f) This is a left-tail test, and we see on StatKey that the p-value is about 0.016. It is very unlikely to see this large a difference in relapse rates if the drug doesn't matter. We reject the null hypothesis and conclude that desipramine is significantly better than lithium at helping people kick the cocaine habit.

Example Normal Human Body Temperature
Normal human body temperature is generally considered to be 98.6°F. We wish to test to see if there is evidence that mean body temperature is different from 98.6°F. We collect data from a random sample of 50 people and find x̄ = 98.26.
(a) State the null and alternative hypotheses.
(b) A randomization distribution requires that the null hypothesis is true. What does that mean in this case?
(c) Brainstorm: How can we use the data in the sample as much as possible while also forcing the null hypothesis to be true?
(d) Use StatKey to create a randomization distribution for this test, and then use it to find the p-value. Use the p-value to make a conclusion in the test.

Solution
(a) H 0 : µ = 98.6 vs H a : µ ≠ 98.6
(b) It means that the population mean for the simulated samples must be 98.6.
(c) Ask the students to think about this: we need to use the 50 data values that we have, while also somehow forcing the mean to be 98.6. How can we do that? They might come up with this on their own: we shift all the data values up by 0.34 (since 98.6 − 98.26 = 0.34). This lets us keep using the data (same sample size and same spread) while also forcing the mean to be 98.6.
(d) Notice that the distribution is centered at 98.6 as it should be. We see how extreme the sample statistic of 98.26 is in the tail of the randomization distribution and we remember to double the tail proportion since this is a two-tail test. We see that the tail proportion is very small, so even doubling it, we still get a p-value very close to zero. There is very strong evidence that average human body temperature is not 98.6°F.
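The shifting idea in part (c) can be sketched in Python. The 50 body temperatures are not reproduced in these notes, so this sketch uses the six laugh counts (16, 22, 9, 31, 6, 42) from the laughing example that follows, testing H 0 : µ = 20 vs H a : µ > 20. It assumes, as a labeled simplification, that simulated samples are drawn with replacement from the shifted data. The function name, number of simulations, and seed are illustrative.

```python
import random


def shifted_randomization_test(data, null_mean, n_sims=10000, seed=0):
    """Right-tail randomization test for a single mean using shifted data."""
    rng = random.Random(seed)
    n = len(data)
    observed = sum(data) / n

    # Shift every value by the same amount so the data have mean null_mean.
    # This keeps the sample size and spread but forces H0 to be true.
    shift = null_mean - observed
    shifted = [x + shift for x in data]

    as_extreme = 0
    for _ in range(n_sims):
        # Assumption: resample n values with replacement from the shifted data.
        sample = rng.choices(shifted, k=n)
        if sum(sample) / n >= observed - 1e-9:  # right tail
            as_extreme += 1
    return observed, shifted, as_extreme / n_sims


if __name__ == "__main__":
    laughs = [16, 22, 9, 31, 6, 42]
    observed, shifted, p_value = shifted_randomization_test(laughs, null_mean=20)
    print(f"sample mean: {observed:.1f}")
    print(f"shifted data: {shifted}")
    print(f"approximate right-tail p-value: {p_value:.3f}")
```

For a two-tail test such as the body temperature example, the tail proportion found this way would be doubled.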

Example How Many Times a Day do you Laugh?
If you want to do another example (or a different example) on a test for a single mean, try using the data in Example 3.20 in Section 3.3 on the number of times a day people laugh. It is easy to use since there are only six data points (16, 22, 9, 31, 6, 42) with a mean of 21.0. Try testing whether this provides evidence that the mean number of times a day people laugh is greater than 20 (or maybe less than 25). Students will have to figure out how much to add or subtract to move the sample data to the null parameter mean.

Example More Short Examples!
Include some additional short examples in which you use data available on StatKey to generate some randomization distributions (after stating the null and alternative hypotheses). Ask them in advance where the distribution will be centered and whether it is a left-tail, right-tail, or two-tail test. Then make sure they understand how to locate the sample statistic in the randomization distribution to find the p-value. Emphasize the visual impact of seeing how extreme the sample results are compared to what might happen just by random chance.

Exercise Notes
Exercises 107-111, 112-116, 125, 135 ask students to answer general questions about randomization distributions. Exercises 123, 126, 127, 128, 130, 137 ask students to think carefully about how to actually create a randomization distribution. Exercises 117-122, 124, 129, 130, 131, 132, 133, 134, 136, 140 ask students to use StatKey to conduct a randomization test. The exercises at the end are more challenging: 138 requires a knowledge of Type I and Type II errors, 139 is particularly challenging as it requires students to think about different statistics to use, 141 and 142 involve matched pairs, and 143, 144, 145 all ask students to think more deeply about how to create a randomization distribution.
Questions requiring StatKey or equivalent: 117-122, 124, 129, 130, 131, 132, 133, 134, 136, 138, 140, 142, 145 Suggested Exercises: Skill Builders: 107-122 odd (or even); Exercises: (for using a randomization distribution) 125, 129, 131, 132, 134. (For also assessing how to create a randomization distribution, add 127, 128, 135)

4.5 Confidence Intervals and Hypothesis Tests

Key Concepts
Using the range of plausible values given in an interval to make a conclusion in a two-tailed test
Recognizing the problem of multiple tests (and also recognizing that statistical significance is not necessarily the same as practical significance)

Timing
Half of a class. This is important material, but it is not hard for the students and can be covered quickly.

Class Notes
Start with a summary of confidence intervals (estimating something) and hypothesis tests (testing a claim). Then give several examples of the key idea for the class: A confidence interval gives the range of plausible values for the parameter, so it allows us to make conclusions in a two-tail hypothesis test about the parameter. If the null parameter is a plausible value (inside the interval), we do not reject H 0. If the null parameter is not a plausible value (outside the interval), we have evidence to reject H 0. This is a straightforward idea but it really helps students connect the ideas.
It is worth also talking about the fact that statistical significance and practical significance are not the same thing, and you should definitely make them aware of the problem of multiple tests and publication bias. If they remember this after the course is done, that will be a very good thing!

Class Examples
Here are a couple suggestions. See also the PowerPoint slides and the class handout. Also, check the exercises for good ideas for examples. There are more exercises than needed and many are quite interesting and would make great class examples.

Example Normal Human Body Temperature
Using bootstrapping, we found a 95% confidence interval for the mean body temperature µ to be 98.05 to 98.47. What is the conclusion of a test of H 0 : µ = 98.6 vs H a : µ ≠ 98.6?

Solution
The value 98.6 is not inside the confidence interval, so 98.6 is not a plausible value for µ and we reject H 0.
There is evidence that mean body temperature is not 98.6°F. The significance level used is 5%, since the confidence level used was 95% for the interval. (There is a great image in the text and also provided in the PowerPoint slides of a bootstrap distribution and randomization distribution together for this example. It provides a great visual display of the connection between plausible/not plausible values for µ in the bootstrap distribution and sample statistics for which we do not reject/do reject the null hypothesis. Show this image if possible!)
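The interval-based rule from the class notes above can be written as a small Python sketch. The function name `two_tail_conclusion` is hypothetical, and treating the interval endpoints as plausible values is an assumption made here for concreteness. The rule applies to a two-tail test at significance level 100% minus the confidence level.

```python
def two_tail_conclusion(lower, upper, null_value):
    """Conclusion of a two-tail test of H0: parameter = null_value,
    based on a confidence interval (lower, upper).

    Assumption: a null value exactly on an endpoint counts as plausible.
    """
    if lower <= null_value <= upper:
        return "Do not reject H0"   # null value is a plausible value
    return "Reject H0"              # null value is outside the interval


if __name__ == "__main__":
    # Body temperature: 95% interval 98.05 to 98.47, H0: mu = 98.6
    print(two_tail_conclusion(98.05, 98.47, 98.6))
    # Pew intervals from the Happy Family example, H0: p = 0.5
    print(two_tail_conclusion(0.487, 0.573, 0.5))
    print(two_tail_conclusion(0.533, 0.607, 0.5))
```

Each call corresponds to a test at the 5% level, since all three intervals are 95% confidence intervals.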

Example Happy Family?
The Pew Research Center asked a random sample of US adults aged 18 to 29, "Does a child need both a father and a mother to grow up happily?" A 95% confidence interval is given below for p, the proportion of all US adults aged 18 to 29 who say yes. Use the interval to determine the conclusion to a hypothesis test of H 0 : p = 0.5 vs H a : p ≠ 0.5.
(a) In 2010, the 95% confidence interval was 0.487 to 0.573.
(b) In 1997, the 95% confidence interval was 0.533 to 0.607.

Solution
(a) Since 0.5 is in the confidence interval 0.487 to 0.573, and thus is a plausible value for p, we do not have evidence against the null hypothesis, so we do not reject H 0. At a 5% level, we do not have evidence in 2010 that the proportion is different from 0.5.
(b) Since 0.5 is not in the confidence interval 0.533 to 0.607, and thus is not a plausible value for p, we do have evidence against the null hypothesis, so we reject H 0. At a 5% level, we have evidence that the proportion in 1997 is different from 0.5.

Example Vitamin E and Heart Attacks?
Suppose 100 tests are conducted to determine whether taking vitamin E increases one's chance of having a heart attack. Suppose also that vitamin E has absolutely no effect on one's likelihood of having a heart attack. The tests will use a 5% significance level.
(a) How many of the tests are likely to show significance, just by random chance?
(b) If only the significant tests are reported, what information is the public likely to hear?

Solution
(a) Since there are 100 tests and we are using a 5% significance level, we expect 0.05(100) = 5 of them to be significant, just by random chance.
(b) Even though only 5 of the tests were significant, and even though there is nothing really going on other than normal random chance, if only those five test results get published, all the headlines will say that vitamin E causes heart attacks! Be wary of this publication bias.
All significant tests should be replicated in further tests before we are confident in the results.

Example More Short Illustrations!
The PowerPoint slides include some good illustrations (and cartoons!) of the problem of multiple testing and practical vs statistical significance. These aren't exactly examples but would be great to show.

Exercise Notes
Exercises 146-149, 150-152, 153, 154, 155, 156, 157 use intervals to give a result in a hypothesis test. Exercises 158, 159, 160, 161 are more of the same and also require StatKey. Exercises 162 and 163 involve practical vs statistical significance. Exercises 164, 165, 166, 167 illustrate the problem of multiple tests.
Questions requiring StatKey or equivalent: 158, 159, 160, 161
Suggested Exercises: Skill Builders: 146-152 odd (or even); Exercises: 154, 155, 156, 157, 162 or 163, 164, 166
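The arithmetic in the Vitamin E example above can be illustrated with a short Python sketch. It assumes the 100 tests are independent and that, when the null hypothesis is true, each test comes out significant with probability equal to the significance level. The function names and the seed are illustrative.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of significant results when every null is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance that at least one of n independent tests is significant under H0."""
    return 1 - (1 - alpha) ** n_tests


def simulate_significant_count(n_tests=100, alpha=0.05, seed=1):
    """Simulate one batch of tests where nothing is really going on.

    Assumption: each test is significant with probability alpha,
    independently of the others.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


if __name__ == "__main__":
    print("expected significant tests:", expected_false_positives(100, 0.05))
    print("chance of at least one:",
          round(prob_at_least_one_false_positive(100, 0.05), 3))
    print("one simulated batch:", simulate_significant_count())
```

Under these assumptions it is almost certain that at least one of the 100 tests will be significant by chance alone, which is why reporting only the significant ones is so misleading.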