Statistical Foundations: Hypothesis Testing

1 Statistical Foundations: Hypothesis Testing Psychology 790 Lecture #9 9/19/2006

2 Today's class: Hypothesis Testing. General terms and philosophy. Specific examples.

3 Hypothesis Testing

4 Rules of the NHST Game Recall our discussion about Null Hypothesis Significance Testing from the last lecture: This probability value is often called a p-value or p. When p < .05, a result is said to be statistically significant. In short, when a result is statistically significant (p < .05), we conclude that the difference we observed was unlikely to be due to sampling error alone. We reject the null hypothesis. If the statistic is not statistically significant (p > .05), we conclude that sampling error is a plausible interpretation of the results. We fail to reject the null hypothesis.
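The decision rule above can be sketched in a few lines of Python. This is only an illustration, assuming a z-test of a sample mean with a known population standard deviation; the function name `z_test_p_value` and all of the numbers (null mean 100, sigma 15, observed mean 106, N = 25) are hypothetical and do not come from the lecture.

```python
from statistics import NormalDist

def z_test_p_value(sample_mean, null_mean, sigma, n):
    """Two-tailed p-value for a z-test of a sample mean (sigma assumed known)."""
    standard_error = sigma / n ** 0.5
    z = (sample_mean - null_mean) / standard_error
    # Probability of a z at least this extreme, in either tail, if the null is true.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical data, chosen only for illustration.
z, p = z_test_p_value(sample_mean=106, null_mean=100, sigma=15, n=25)
decision = "reject the null" if p < 0.05 else "fail to reject the null"
print(f"z = {z:.2f}, p = {p:.3f}: {decision}")
```

The comparison `p < 0.05` is the whole "game": the p-value is computed assuming the null is true, and the cut-off turns it into a yes/no decision.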

5 Hypothesis Testing Notes It is important to keep in mind that NHSTs were developed for the purpose of making yes/no decisions about the null hypothesis. As a consequence, the null is either accepted or rejected on the basis of the p-value. For logical reasons, some people are uneasy accepting the null hypothesis when p > .05, and prefer to say that they failed to reject the null hypothesis instead.

6 Hypothesis Testing Items of Interest Very important points about significance testing: 1. The term significant does not mean important, substantial, or worthwhile.

7 Points, continued 2. The null and alternative hypotheses are often constructed to be mutually exclusive. If one is true, the other must be false. As a consequence: When you reject the null hypothesis, you accept the alternative. When you fail to reject the null hypothesis, you reject the alternative. This may seem tricky because NHSTs do not test the research hypothesis per se. Formally, only the null hypothesis is tested.

8 Points, continued 3. Because NHSTs are often used to make a yes/no decision about whether the null hypothesis is a viable explanation, mistakes can be made.

9 Errors in Hypothesis Testing

10 Errors in Inference using NHST NHST can lead to decisions which are not correct: Type I error: Your test is significant (p < .05), so you reject the null hypothesis, but the null hypothesis is actually true. Type II error: Your test is not significant (p > .05), so you don't reject the null hypothesis, but you should have because it is false.

11 Errors in Inference using NHST The probability of making a Type I error is determined by the experimenter. Often called the alpha value. Usually set to 5%. The probability of making a Type II error is determined by the experimenter. Often called the beta value. Usually ignored by social science researchers.

12 Errors in Inference using NHST The complement of the Type II error rate is called power: the probability of rejecting the null hypothesis when it is false (a correct decision). Power = 1 - beta.

13 More on Power Power is strongly influenced by sample size. With a larger N, we are more likely to reject the null if it is false. Power analyses are conducted to determine the size of a sample needed to reject a null hypothesis.

14 Inferential Errors and NHST
Conclusion of the test | Real world: null is true | Real world: null is false
Null is true | Correct decision | Type II error
Null is false | Type I error | Correct decision

15 Points of Interest The example we explored previously was an example of what is called a z-test of a sample mean. Significance tests have been developed for a number of statistics. Difference between two group means: t-test. Difference between two or more group means: ANOVA. Differences between proportions: chi-square.

16 How do we control Type I errors? The Type I error rate is typically controlled by the researcher. It is called the alpha rate, and corresponds to the probability cut-off that one uses in a significance test. By convention, researchers often use an alpha rate of .05. In other words, they will only reject the null hypothesis when a statistic is likely to occur 5% of the time or less when the null hypothesis is true. In principle, any probability value could be chosen for making the accept/reject decision. 5% is used by convention.

17 Type I errors What does 5% mean in this context? It means that we will only make a decision error 5% of the time if the null hypothesis is true. If the null hypothesis is false, the Type I error rate is undefined.
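One way to see what the 5% means is to simulate many studies in which the null hypothesis really is true and count how often a z-test rejects it. The sketch below is an illustration only: the function name `type_i_error_rate`, the number of simulated studies, the sample size, and the seed are all arbitrary assumptions.

```python
import random
from statistics import NormalDist, fmean

def type_i_error_rate(n_studies=10000, n=20, alpha=0.05, seed=1):
    """Proportion of simulated studies that reject a TRUE null (mean = 0, sigma = 1)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cut-off, about 1.96
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # the null is true here
        z = fmean(sample) / (1.0 / n ** 0.5)              # sample mean over its standard error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_studies

rate = type_i_error_rate()
print(f"Proportion of true nulls rejected: {rate:.3f}")
```

Across many such studies the proportion should land close to alpha, which is the long-run sense in which the Type I error rate is "set" by the researcher.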

18 How do we control Type II errors? Type II errors can also be controlled by the experimenter. The Type II error rate is sometimes called beta. How can the beta rate be controlled? The easiest way to control Type II errors is by increasing the statistical power of a test.

19 Statistical Power Statistical power is defined as the probability of rejecting the null hypothesis when it is false: a correct decision (1 - beta). Power is strongly influenced by sample size. With a larger N, we are more likely to reject the null hypothesis if it is truly false. (As N increases, the standard error shrinks. Sampling error becomes less problematic, and true differences are easier to detect.)
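The link between N, the standard error, and power can be made concrete with a small calculation. The sketch below assumes a two-tailed z-test of a sample mean with sigma = 1 and a hypothetical true difference of .5 SD; the function name `z_test_power` and the sample sizes are illustrative choices, not values from the lecture.

```python
from statistics import NormalDist

def z_test_power(effect_sd, n, alpha=0.05):
    """Power of a two-tailed z-test when the true mean differs from the null by effect_sd SDs."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected z statistic: the true difference divided by the standard error 1/sqrt(n).
    shift = effect_sd * n ** 0.5
    # Probability that the observed z falls beyond either critical value.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

for n in (10, 20, 40, 80):
    standard_error = 1 / n ** 0.5
    print(f"N = {n:3d}  SE = {standard_error:.3f}  power = {z_test_power(0.5, n):.2f}")
```

As N grows, the standard error column shrinks and the power column rises, which is the pattern the slide describes.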

20 Power and correlation This graph shows how the power of the significance test for a correlation varies as a function of sample size. Notice that when N = 80, there is about an 80% chance of correctly rejecting the null hypothesis (beta = .20). When N = 45, we only have a 50% chance of making the correct decision: a coin toss (beta = .50). [Figure: power by sample size for a single population r.]
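A curve like this one can be approximated with the Fisher z transformation. The sketch below is not the method used to draw the slide's graph; it is a standard normal approximation, and the population correlation of .30 is an assumption made here for illustration (the value on the slide is not shown in this transcript). The function name `correlation_power` is likewise hypothetical.

```python
import math
from statistics import NormalDist

def correlation_power(rho, n, alpha=0.05):
    """Approximate two-tailed power for testing H0: rho = 0, via the Fisher z transformation."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # atanh(rho) is the Fisher z of the population correlation; its SE is 1/sqrt(n - 3).
    shift = math.atanh(rho) * math.sqrt(n - 3)
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

assumed_rho = 0.30  # assumption for illustration only
for n in (45, 80):
    print(f"N = {n}: approximate power = {correlation_power(assumed_rho, n):.2f}")
```

Under that assumed correlation, the approximation gives power near one half at N = 45 and somewhat below 80% at N = 80, roughly in line with the figures quoted on the slide.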

21 Power and correlation Power also varies as a function of the size of the correlation. When the population correlation is large (e.g., .80), it requires fewer subjects to correctly reject the null hypothesis that the population correlation is 0. When the population correlation is smallish (e.g., .20), it requires a large number of subjects to correctly reject the null hypothesis. When the population correlation is 0, the probability of rejecting the null is constant at 5% (alpha). Here power is technically undefined because the null hypothesis is true. [Figure: power by sample size, with one curve each for r = .80, .60, .40, .20, and .00.]

22 Low Power Studies Because correlations in the .2 to .4 range are typically observed in non-experimental research, one would be wise not to trust research based on sample sizes less than 60ish. Why? Because such research only stands a 50% chance of yielding the correct decision, if the null is false. It would be more efficient (and, importantly, just as accurate) to flip a coin to make the decision rather than collecting data and using a significance test. [Figure: power by sample size, with curves for r = .80, .60, .40, and .20.]

23 A Sad Fact In 1962 Jacob Cohen surveyed all articles in the Journal of Abnormal and Social Psychology and determined that the typical power of research conducted in this area was 53%. An even sadder fact: In 1989, Sedlmeier and Gigerenzer surveyed studies in the same journal (now called the Journal of Abnormal Psychology) and found that the power had decreased slightly. Researchers, unfortunately, pay little attention to power. As a consequence, the Type II error rate of research in psychology is likely to be dangerously high, maybe as high as 50%.

24 Power in Research Design Power is important to consider, and should be used to design research projects. Given an educated guess about what the population parameter might be (e.g., a correlation of .30, a mean difference of .5 SD), one can determine the number of subjects needed for a desired level of power. Cohen and others recommend that researchers try to obtain a power level of about 80%.
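The planning step described here can be sketched as a sample-size calculation. The example below uses the Fisher z normal approximation for a correlation, which is one common approach and not necessarily the one the lecture has in mind; the function name `required_n_for_correlation` is hypothetical, and the correlations passed in are educated guesses of the kind the slide mentions.

```python
import math
from statistics import NormalDist

def required_n_for_correlation(rho, power=0.80, alpha=0.05):
    """Approximate N needed to detect a population correlation rho (two-tailed test of rho = 0)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = norm.inv_cdf(power)          # normal quantile matching the desired power
    # Solve atanh(rho) * sqrt(N - 3) = z_alpha + z_power for N, then round up.
    return math.ceil(((z_alpha + z_power) / math.atanh(rho)) ** 2 + 3)

for guess in (0.20, 0.30, 0.40):
    print(f"Assumed r = {guess:.2f}: N of about {required_n_for_correlation(guess)} for 80% power")
```

Smaller assumed effects demand much larger samples, so the educated guess about the population parameter drives the whole design.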

25 Power in Research Design Thus, if one used an alpha level of 5% and collected enough subjects to ensure a power of 80% for an assumed effect, one would know, before the study was done, what the theoretical error rates are for the statistical test. Although these error rates correspond to long-run outcomes, one could get a sense of whether the research design was a credible one: whether it is likely to minimize the two kinds of errors that are possible in NHST and, correspondingly, maximize the likelihood of making a correct decision.

26 Misconceptions About Hypothesis Testing

27 Three Common Misinterpretations of Significance Tests and p-values 1. The p-value indicates the probability that the results are due to sampling error or chance. 2. A statistically significant result is a reliable result. 3. A statistically significant result is a powerful, important result.

28 Misinterpretation # 1 The p-value is a conditional probability: the probability of observing a specific range of sample statistics GIVEN (i.e., conditional upon) that the null hypothesis is true, $P(D \mid H_0)$. This is not equivalent to the probability of the null hypothesis being true, given the data: $P(H_0 \mid D) \neq P(D \mid H_0)$.
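A small Bayes' rule calculation shows how far apart the two conditional probabilities can be. This is a simplified sketch: "the data" is reduced to the event "the test came out significant", and the prior probabilities and the 50% power figure are hypothetical inputs chosen for illustration. The function name `prob_null_given_significant` is also an assumption.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.50):
    """P(H0 | significant result) by Bayes' rule, given a prior P(H0), alpha, and power."""
    p_sig_and_null = alpha * prior_null        # P(significant | H0) * P(H0)
    p_sig_and_alt = power * (1 - prior_null)   # P(significant | H1) * P(H1)
    return p_sig_and_null / (p_sig_and_null + p_sig_and_alt)

for prior in (0.50, 0.90):
    posterior = prob_null_given_significant(prior)
    print(f"P(H0) = {prior:.2f}: P(significant | H0) = 0.05, P(H0 | significant) = {posterior:.3f}")
```

The probability of a significant result given the null stays fixed at alpha, while the probability of the null given a significant result moves with the prior and with power, so the one cannot be read off from the other.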

29 Misinterpretation # 2 Is a significant result a reliable, easily replicated result? Not necessarily. The p-value is a poor indicator of the replicability of a finding. Replicability (assuming a real effect exists, that is, that the null hypothesis is false) is primarily a function of statistical power.

30 Misinterpretation # 2 If a study had a statistical power equivalent to 80%, what is the probability of obtaining a significant result twice? The probability of two independent events both occurring is the simple product of the probability of each of them occurring: .80 × .80 = .64. If power = 50%? .50 × .50 = .25. Bottom line: The likelihood of replicating a result is determined by statistical power, not the p-value derived from a significance test. When the power of the test is low, the likelihood of a long-run series of successful replications is even lower.
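The product rule extends directly to longer runs of studies. The sketch below assumes independent studies of a real effect, each with the same power; the function name `prob_all_significant` is hypothetical.

```python
def prob_all_significant(power, k):
    """Probability that k independent studies of a real effect are ALL significant."""
    return power ** k

for power in (0.80, 0.50):
    run = ", ".join(f"{prob_all_significant(power, k):.2f}" for k in range(1, 6))
    print(f"power = {power:.2f}: P(all significant) for 1 to 5 studies = {run}")
```

Even at the recommended 80% power, an unbroken string of significant results becomes unlikely after only a few studies.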

31 Misinterpretation # 3 Is a significant result a powerful, important result? Not necessarily. The importance of the result, of course, depends on the issue at hand, the theoretical context of the finding, etc.

32 Misinterpretation # 3 We can measure the practical or theoretical significance of an effect using an index of effect size. An effect size is a quantitative index of the strength of the relationship between two variables. Some common measures of effect size are correlations, regression weights, t-values, and R-squared.

33 Misinterpretation # 3 Importantly, the same effect size can have different p-values, depending on the sample size of the study. For example, a correlation of .30 would not be statistically significant with a sample size of 30, but would be statistically significant with a sample size of 130. Bottom line: The p-value is a poor way to evaluate the practical significance of a research result.
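The example can be checked numerically. The sketch below uses the Fisher z normal approximation rather than the exact t-test for a correlation, so the p-values are approximate; the function name `correlation_p_value` is hypothetical.

```python
import math
from statistics import NormalDist

def correlation_p_value(r, n):
    """Approximate two-tailed p-value for H0: rho = 0, using the Fisher z transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)  # Fisher z of r divided by its SE, 1/sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

for n in (30, 130):
    p = correlation_p_value(0.30, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"r = .30, N = {n}: p = {p:.4f} ({verdict})")
```

The effect size is identical in both rows; only N changes, and with it the verdict of the significance test.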

34 Wrapping Up Today was another fun lecture about the philosophy of hypothesis testing. We do hypothesis testing all the time. That doesn't make it something without error, though.

35 Next Time Office hours today (1pm-4pm, 449 Fraser). Lab tonight (examples of hypothesis tests). Hypothesis testing example. Confidence Intervals (Ch ).