Statistical Foundations: Hypothesis Testing Psychology 790 Lecture #10 9/26/2006

Today's Class Hypothesis Testing. An Example. Types of errors illustrated. Misconceptions about hypothesis testing.

Upcoming Schedule Today (9/26): Hypothesis Testing. Thursday 9/28: Confidence Intervals. Tuesday 10/3: t and two-sample tests. Thursday 10/5: Midterm review. Tuesday 10/10: Midterm (20-item multiple choice). (Wednesday 10/11: national holiday for my wife's 30th birthday.) Thursday 10/12: Fall break. Tuesday 10/17: Correlation and Regression from Hays.

Hypothesis Testing Example

An Example of Hypothesis Testing Recall that last week we talked a bit about the Wechsler Adult Intelligence Scale (WAIS). In the general population, the test has a mean of 100 and a standard deviation of 15. Let's try a hypothesis test to see if KU students have a similar mean WAIS score. We will sample 100 KU students at random and administer the WAIS.

Buy the WAIS on ebay!!

Example Setup What is the null hypothesis? $H_0: \mu_{KU} = 100$. What is the alternative hypothesis? $H_A: \mu_{KU} \neq 100$.

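Distributional Setup The key element in our example is to find the assumed distribution of the test statistic under $H_0$. In our case, we will be sampling 100 subjects and taking a sample mean. What does the sampling distribution of the mean look like for N = 100? With $\mu = 100$, $\sigma = 15$, and $N = 100$: $\bar{x} \sim N(\mu, \sigma/\sqrt{N}) = N(100, 15/\sqrt{100}) = N(100, 1.5)$, where the second parameter is the standard deviation of the sampling distribution (the standard error).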
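As a quick check on the arithmetic for the sampling distribution, the standard error can be computed directly. This is a minimal Python sketch; the function name `standard_error` and the extra sample sizes 25 and 400 are illustrative assumptions, not part of the lecture.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)

# WAIS population SD is 15; the lecture samples N = 100 students.
se_100 = standard_error(15, 100)
print(se_100)  # 1.5

# Hypothetical alternative sample sizes, shown for comparison only.
for n in (25, 400):
    print(n, standard_error(15, n))
```

Quadrupling the sample size halves the standard error, which is why larger samples give a narrower null distribution.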

Distribution of Test Statistic Under Null Hypothesis Using R, the plot to the right is a picture of the distribution of the test statistic ($\bar{x}$) under the null hypothesis. [Figure: distribution of $\bar{x}$ under the null hypothesis]

Step 1: Set the Type I Error Rate Before we collect our sample, we must first set the Type I error rate for our experiment. Recall the Type I error rate (or α) is the maximum probability we will allow for rejecting the null hypothesis when the null hypothesis is true. This sets up the decision rule for our test. From this, we can obtain a critical value to which we can compare our test statistic. What rate do you want to set? Let's set α = 0.05, for tradition's sake.

Decision Rule Using α = 0.05, we can then assign a region of our null distribution where we will reject the null hypothesis. Because we have no idea in which direction KU's sample mean will fall, we will split our region into two halves: an upper tail and a lower tail. We then want to find the following points: $X_U$ such that $P(\bar{x} \geq X_U) = \alpha/2 = 0.025$, and $X_L$ such that $P(\bar{x} \leq X_L) = \alpha/2 = 0.025$. Find these two points.

Decision Rule Plot We will reject $H_0$ if our sample mean is in either of these two regions: below 97.06 or above 102.94.
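The two critical values can be recovered from the null sampling distribution N(100, 1.5). The sketch below uses Python's standard-library `statistics.NormalDist`; the variable names are illustrative assumptions, and the lecture itself used R.

```python
from statistics import NormalDist

# Sampling distribution of the mean under H0: N(100, 1.5).
null_dist = NormalDist(mu=100, sigma=1.5)
alpha = 0.05

x_lower = null_dist.inv_cdf(alpha / 2)      # P(xbar <= x_lower) = 0.025
x_upper = null_dist.inv_cdf(1 - alpha / 2)  # P(xbar >= x_upper) = 0.025

print(round(x_lower, 2), round(x_upper, 2))  # 97.06 102.94
```

Equivalently, the cut-offs are 100 ± 1.96 × 1.5.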

Our Sample [Figure: KU students and a laser]

Our Sample Lucky for you, I have tweaked my laser pointer to now give me the WAIS score (up to 5 digits) for individuals when hit with the laser beam.

Test Statistic Our sample mean was 107.79. The sample SD was 15.55. Now, what do we decide about our hypothesis test? We reject $H_0$ because our sample mean of 107.79 falls into the rejection region (it is greater than the upper critical value of 102.94).
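The same decision can be expressed as a z statistic and a two-tailed p-value. This is a hedged sketch: it assumes, as the lecture's null distribution does, that the population SD of 15 is known, so the sample SD of 15.55 is not used. Variable names are illustrative.

```python
from statistics import NormalDist

mu_0, sigma, n = 100, 15, 100   # null mean, population SD, sample size
sample_mean = 107.79            # observed in the lecture's sample

se = sigma / n ** 0.5           # standard error of the mean: 1.5
z = (sample_mean - mu_0) / se   # distance from the null mean in SE units
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value
reject = p_value < 0.05

print(round(z, 2), reject)  # 5.19 True
```

A sample mean more than five standard errors above 100 is far beyond the 1.96 cut-off, so the p-value is well below .05.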

Errors in Hypothesis Testing

Inferential Errors and NHST

                               Real world: null is true    Real world: null is false
Test concludes null is true    Correct decision            Type II error
Test concludes null is false   Type I error                Correct decision

Errors and Our Example Knowing a bit about the truth (from simulated data), we can revisit our example for a better description of Type I and Type II errors with graphics. From the example, we knew that the null population sampling distribution of the mean was N(100,1.5). The KU student population sampling distribution for mean WAIS scores was N(105,1.5). We can overlay the two populations and draw regions representing Type II errors.

Type I Error [Figure: null distribution and alternative distribution]

Type II Error [Figure: null distribution and alternative distribution]

Power [Figure: null distribution and alternative distribution]
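The regions in these three plots can be turned into numbers using the two sampling distributions stated earlier, N(100, 1.5) under the null and N(105, 1.5) for KU students. The following is a minimal sketch with assumed variable names, not the code used to draw the slides.

```python
from statistics import NormalDist

null_dist = NormalDist(mu=100, sigma=1.5)  # sampling distribution under H0
alt_dist = NormalDist(mu=105, sigma=1.5)   # sampling distribution for KU (simulated truth)

alpha = 0.05
x_lower = null_dist.inv_cdf(alpha / 2)
x_upper = null_dist.inv_cdf(1 - alpha / 2)

# Type I error: the sample mean lands in a rejection region although H0 is true.
type_1 = null_dist.cdf(x_lower) + (1 - null_dist.cdf(x_upper))

# Type II error: the sample mean lands between the cut-offs although H0 is false.
type_2 = alt_dist.cdf(x_upper) - alt_dist.cdf(x_lower)

# Power: the sample mean lands in a rejection region and H0 is false.
power = 1 - type_2
print(round(type_1, 3), round(type_2, 3), round(power, 3))
```

Under these assumptions the Type II error rate comes out a little under 10%, so power is a little over 90%.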

Points of Interest The example we explored previously was an example of what is called a z-test of a sample mean. Significance tests have been developed for a number of statistics: the difference between two group means (t-test), differences between two or more group means (ANOVA), and differences between proportions (chi-square).

How do we control Type I errors? The Type I error rate is typically controlled by the researcher. It is called the alpha rate, and corresponds to the probability cut-off that one uses in a significance test. By convention, researchers often use an alpha rate of .05. In other words, they will only reject the null hypothesis when a statistic at least as extreme as the one observed would occur 5% of the time or less if the null hypothesis were true. In principle, any probability value could be chosen for making the accept/reject decision; 5% is used by convention.

Type I errors What does 5% mean in this context? It means that we will only make a decision error 5% of the time if the null hypothesis is true. If the null hypothesis is false, the Type I error rate is undefined.
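The long-run meaning of the 5% rate can be illustrated by simulation. The sketch below repeatedly samples 100 scores from a population in which the null hypothesis is true and counts how often the z-test rejects. The seed, the number of simulated studies, and all names are arbitrary assumptions.

```python
import random
from statistics import NormalDist, fmean

random.seed(790)  # arbitrary seed so the run is repeatable

# Null sampling distribution of the mean for N = 100: N(100, 1.5).
null_sampling = NormalDist(mu=100, sigma=15 / 100 ** 0.5)
x_lower = null_sampling.inv_cdf(0.025)
x_upper = null_sampling.inv_cdf(0.975)

n_studies = 2000
rejections = 0
for _ in range(n_studies):
    # Draw 100 scores from a population where H0 is true (mean 100, SD 15).
    sample = [random.gauss(100, 15) for _ in range(100)]
    xbar = fmean(sample)
    if xbar < x_lower or xbar > x_upper:
        rejections += 1

type_1_rate = rejections / n_studies
print(type_1_rate)
```

The printed proportion should sit close to .05; it will not equal .05 exactly because only a finite number of studies is simulated.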

How do we control Type II errors? Type II errors can also be controlled by the experimenter. The Type II error rate is sometimes called beta. How can the beta rate be controlled? The easiest way to control Type II errors is by increasing the statistical power of a test.

Statistical Power Statistical power is defined as the probability of rejecting the null hypothesis when it is false, which is a correct decision (1 − beta). Power is strongly influenced by sample size. With a larger N, we are more likely to reject the null hypothesis if it is truly false. (As N increases, the standard error shrinks. Sampling error becomes less problematic, and true differences are easier to detect.)
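To see the effect of N on power, the z-test from the WAIS example can be evaluated at several sample sizes. This is a sketch under stated assumptions: the true mean of 103 and the sample sizes 25, 100, and 400 are hypothetical values chosen for illustration, and `z_test_power` is an assumed helper name.

```python
from statistics import NormalDist

def z_test_power(true_mean, n, mu_0=100, sigma=15, alpha=0.05):
    """Power of the two-tailed z-test of a mean when the true mean is true_mean."""
    se = sigma / n ** 0.5                 # standard error shrinks as n grows
    null_dist = NormalDist(mu_0, se)
    alt_dist = NormalDist(true_mean, se)
    x_lower = null_dist.inv_cdf(alpha / 2)
    x_upper = null_dist.inv_cdf(1 - alpha / 2)
    # Probability that the sample mean falls in either rejection region.
    return alt_dist.cdf(x_lower) + (1 - alt_dist.cdf(x_upper))

# Hypothetical true mean of 103 (a 3-point difference from the null).
powers = {n: z_test_power(103, n) for n in (25, 100, 400)}
for n, power in powers.items():
    print(n, round(power, 2))
```

With the difference held fixed, power climbs from well under one half at N = 25 to nearly 1 at N = 400.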

Power and correlation This graph shows how the power of the significance test for a correlation varies as a function of sample size. Notice that when N = 80, there is about an 80% chance of correctly rejecting the null hypothesis (beta = .20). When N = 45, we only have a 50% chance of making the correct decision, a coin toss (beta = .50). [Figure: power (0 to 1.0) versus sample size (50 to 200) for population r = .30]

Power and correlation Power also varies as a function of the size of the correlation. When the population correlation is large (e.g., .80), it requires fewer subjects to correctly reject the null hypothesis that the population correlation is 0. When the population correlation is smallish (e.g., .20), it requires a large number of subjects to correctly reject the null hypothesis. When the population correlation is 0, the probability of rejecting the null is constant at 5% (alpha). Here power is technically undefined because the null hypothesis is true. [Figure: power (0 to 1.0) versus sample size (50 to 200) for r = .00, .20, .40, .60, and .80]
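Curves like these can be approximated with Fisher's z transformation, under which atanh(r) is roughly normal with standard error 1/sqrt(N − 3). This is a swapped-in approximation, not necessarily the method used to draw the slides, and `correlation_power` is an assumed helper name.

```python
import math
from statistics import NormalDist

def correlation_power(rho, n, alpha=0.05):
    """Approximate power of the two-tailed test of H0: rho = 0 (Fisher z)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected test statistic: atanh(rho) divided by its SE, 1 / sqrt(n - 3).
    shift = math.atanh(rho) * math.sqrt(n - 3)
    # Probability of landing in the lower or the upper rejection region.
    return std_normal.cdf(-z_crit - shift) + (1 - std_normal.cdf(z_crit - shift))

# Population r = .30 at the two sample sizes discussed in the lecture.
for n in (45, 80):
    print(n, round(correlation_power(0.30, n), 2))

# With a population correlation of 0 the rejection rate stays at alpha.
print(round(correlation_power(0.0, 50), 2), round(correlation_power(0.0, 200), 2))
```

Under this approximation, N = 45 gives roughly a coin-toss chance and N = 80 gives a little under 80%, in line with the values read off the graph.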

Low Power Studies Because correlations in the .2 to .4 range are typically observed in non-experimental research, one would be wise not to trust research based on sample sizes less than 60ish. Why? Because such research only stands a 50% chance of yielding the correct decision, if the null is false. It would be more efficient (and, importantly, just as accurate) to flip a coin to make the decision rather than collecting data and using a significance test. [Figure: power (0 to 1.0) versus sample size (50 to 200) for r = .00, .20, .40, .60, and .80]

A Sad Fact In 1962, Jacob Cohen surveyed all articles in the Journal of Abnormal and Social Psychology and determined that the typical power of research conducted in this area was 53%. An even sadder fact: In 1989, Sedlmeier and Gigerenzer surveyed studies in the same journal (now called the Journal of Abnormal Psychology) and found that the power had decreased slightly. Researchers, unfortunately, pay little attention to power. As a consequence, the Type II error rate of research in psychology is likely to be dangerously high, maybe as high as 50%.

Power in Research Design Power is important to consider, and should be used to design research projects. Given an educated guess about what the population parameter might be (e.g., a correlation of .30, a mean difference of .5 SD), one can determine the number of subjects needed for a desired level of power. Cohen and others recommend that researchers try to obtain a power level of about 80%.
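The sample-size calculation described here can be sketched for a correlation using the Fisher z approximation. This approximation is an assumption (published power tables may differ by a subject or two), and `required_n` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def required_n(rho, power=0.80, alpha=0.05):
    """Approximate N needed to detect population correlation rho (Fisher z)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    # Solve atanh(rho) * sqrt(N - 3) = z_alpha + z_power for N, rounding up.
    return math.ceil(((z_alpha + z_power) / math.atanh(rho)) ** 2 + 3)

print(required_n(0.30))  # 85
```

For an assumed population correlation of .30, about 85 subjects are needed for 80% power at an alpha of .05. Smaller assumed correlations push the requirement up quickly.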

Power in Research Design Thus, if one used an alpha level of 5% and collected enough subjects to ensure a power of 80% for an assumed effect, one would know, before the study was done, what the theoretical error rates are for the statistical test. Although these error rates correspond to long-run outcomes, one could get a sense of whether the research design was a credible one, that is, whether it is likely to minimize the two kinds of errors that are possible in NHST and, correspondingly, maximize the likelihood of making a correct decision.

Misconceptions About Hypothesis Testing

Three Common Misinterpretations of Significance Tests and p-values 1. The p-value indicates the probability that the results are due to sampling error or chance. 2. A statistically significant result is a reliable result. 3. A statistically significant result is a powerful, important result.

Misinterpretation # 1 The p-value is a conditional probability. It is the probability of observing a specific range of sample statistics GIVEN (i.e., conditional upon) that the null hypothesis is true: $P(D \mid H_0)$. This is not equivalent to the probability of the null hypothesis being true, given the data: $P(H_0 \mid D) \neq P(D \mid H_0)$.
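A small numerical illustration shows how far apart the two conditional probabilities can be. It treats D as "a significant result" and applies Bayes' rule. The prior probabilities of the null (.5 and .9), the power of .80, and the function name are all hypothetical values chosen only for illustration.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.80):
    """P(H0 | significant result) via Bayes' rule."""
    p_sig_and_null = alpha * prior_null        # P(D | H0) * P(H0)
    p_sig_and_alt = power * (1 - prior_null)   # P(D | HA) * P(HA)
    return p_sig_and_null / (p_sig_and_null + p_sig_and_alt)

# P(D | H0) is .05 in both cases, yet P(H0 | D) depends on the assumed prior.
for prior in (0.5, 0.9):
    print(prior, round(prob_null_given_significant(prior), 3))
```

With the prior at .5 the posterior probability of the null is about .06; with the prior at .9 it is .36, even though the probability of the data given the null is .05 in both cases.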

Misinterpretation # 2 Is a significant result a reliable, easily replicated result? Not necessarily. The p-value is a poor indicator of the replicability of a finding. Replicability (assuming a real effect exists, that is, that the null hypothesis is false) is primarily a function of statistical power.

Misinterpretation # 2 If a study had a statistical power equivalent to 80%, what is the probability of obtaining a significant result twice? The probability of two independent events both occurring is the simple product of the probability of each of them occurring: .80 × .80 = .64. If power = 50%? .50 × .50 = .25. Bottom line: The likelihood of replicating a result is determined by statistical power, not the p-value derived from a significance test. When the power of the test is low, the likelihood of a long-run series of replications is even lower.
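The product rule above extends to any number of independent studies. A minimal sketch, with `prob_all_significant` as an assumed helper name:

```python
def prob_all_significant(power: float, k: int) -> float:
    """Probability that k independent studies of a real effect are all significant."""
    return power ** k

# The two cases worked in the lecture: two studies at 80% and at 50% power.
for power in (0.80, 0.50):
    print(power, round(prob_all_significant(power, 2), 2))
```

Raising `k` shows how quickly a run of successful replications becomes unlikely, even at 80% power.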

Misinterpretation # 3 Is a significant result a powerful, important result? Not necessarily. The importance of the result, of course, depends on the issue at hand, the theoretical context of the finding, etc.

Misinterpretation # 3 We can measure the practical or theoretical significance of an effect using an index of effect size. An effect size is a quantitative index of the strength of the relationship between two variables. Some common measures of effect size are correlations, regression weights, t-values, and R-squared.

Misinterpretation # 3 Importantly, the same effect size can have different p-values, depending on the sample size of the study. For example, a correlation of .30 would not be statistically significant with a sample size of 30, but would be statistically significant with a sample size of 130. Bottom line: The p-value is a poor way to evaluate the practical significance of a research result.
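The correlation example can be checked with an approximate test. The sketch below uses Fisher's z transformation with a normal reference distribution, which is an assumption made here for simplicity (the usual t-test of a correlation gives slightly different p-values but the same two decisions). The function name is hypothetical.

```python
import math
from statistics import NormalDist

def correlation_p_value(r, n):
    """Approximate two-tailed p-value for H0: rho = 0 using Fisher's z."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same observed correlation, two different sample sizes.
for n in (30, 130):
    p = correlation_p_value(0.30, n)
    print(n, round(p, 4), "significant" if p < 0.05 else "not significant")
```

The effect size is identical in both rows; only N changes, and with it the verdict.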

Wrapping Up Today was another fun lecture about the philosophy of hypothesis testing. We do hypothesis testing all the time. That doesn't make it something without error, though.

Next Time Confidence Intervals and their association with hypothesis tests. Confidence Intervals (Ch. 6.8 to 6.11).