Lecture Notes: Module 1

Study Populations

A study population is a clearly defined collection of people, animals, plants, or objects. In psychological research, a study population usually consists of a specific group of people. Some examples of human study populations are: all UCSC freshmen, all Arizona public school teachers, all spouses of Alzheimer's patients in Minnesota, and all preschool children in Chicago.

Measurement Properties

In addition to specifying the study population of interest, a psychologist will specify some attribute to measure. When studying human populations, the attribute of interest might be a specific type of academic ability, a personality trait, a particular behavior (e.g., hours of TV watching), an attitude, an interest, an opinion, or a physiological measure (e.g., heart rate, blood pressure, blood flow in specific parts of the brain, brain waves). The measurement of the attribute that the psychologist wants to examine is called the response variable. To measure some attribute of a person's behavior is to assign a numerical value to that person. These measurements can have different properties.

A ratio scale measurement has the following three properties: 1) a score of 0 represents a complete absence of the attribute being measured, 2) a ratio of any two scores correctly describes the ratio of attribute quantities, and 3) a difference between two scores correctly describes the difference in attribute quantities. Suppose a person's heart rate is measured. This measurement is a ratio scale measurement because a score of 0 beats per minute (bpm) represents a stopped heart and a heart rate of, say, 100 bpm is twice as fast as a heart rate of 50 bpm. In addition, the difference between two heart rates of, say, 50 and 60 bpm describes the same change in heart rate as the difference between 70 and 80 bpm.
With interval scale measurements, a score of 0 does not represent a complete absence of the attribute being measured and a ratio of two scores does not correctly describe the ratio of attribute quantities, but a difference between two interval scale scores correctly describes the difference in attribute quantities. Most measurements of psychological attributes are not ratio scale measurements but are assumed to be interval scale measurements. For instance, the Beck Depression Inventory (BDI) is scored on a 0-63 scale, with higher scores representing higher levels of depression. However, a BDI score of 0 does not indicate a complete absence of depression, nor does a BDI score of, say, 40 represent twice the amount
of depression as a BDI score of 20. It is assumed that a difference between two BDI scores correctly describes the difference in depression levels, so that a person who initially obtained a BDI score of, say, 30 and then obtained a score of 20 after therapy is assumed to have the same level of improvement as a person who initially scored 25 on the BDI and dropped to 15 after therapy. Ratio and interval scale measurements will be referred to simply as quantitative scores.

Population Parameters

A population parameter is a single unknown numeric value that describes the measurements that could have been assigned to all N people in a specific study population. Psychologists would like to know the value of a particular population parameter because this information could be used to make an important decision or to advance knowledge in some area of research. The population mean, denoted by the Greek letter μ (mu), is a population parameter that is frequently of interest to psychologists. Imagine every person in a study population of size N being assigned a quantitative score. The population mean (μ) is defined as

μ = (Σ_{i=1}^{N} y_i)/N    (1.1)

where y_i is the quantitative score for the i-th person in the study population. The summation notation Σ_{i=1}^{N} y_i is a more compact way of writing y_1 + y_2 + ... + y_N.

Consider a study population of 2,450 elementary school teachers in a particular school district. Imagine giving a job burnout questionnaire (scored on a quantitative scale of 1 to 25) to all 2,450 teachers. The population mean job burnout score would be μ = (y_1 + y_2 + ... + y_2450)/2450, where y_i is the burnout score for the i-th teacher.

Another important population parameter is the population standard deviation, which is defined as

σ = √[Σ_{i=1}^{N} (y_i - μ)²/N]

and describes the variability of the quantitative measurements. Note that σ cannot be negative. Note also that if all N scores were identical (no variability), every y_i value would equal μ and σ would be zero.
The squared standard deviation (σ²) occurs frequently in statistical formulas and is called the variance.
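The definitions of μ, σ, and σ² can be made concrete with a short computation. The sketch below uses an invented study population of N = 5 scores (real study populations are far larger, which is exactly why these parameters are usually unknown):

```python
import math

# Hypothetical quantitative scores for every member of a tiny study
# population (N = 5); invented purely for illustration
y = [10, 12, 14, 16, 18]
N = len(y)

# Population mean (Equation 1.1): add all N scores and divide by N
mu = sum(y) / N

# Population variance: average squared deviation from the population mean
sigma_sq = sum((yi - mu) ** 2 for yi in y) / N

# Population standard deviation: square root of the variance
sigma = math.sqrt(sigma_sq)

print(mu, sigma_sq)  # 14.0 8.0
```

If all five scores were identical, every deviation (yi - mu) would be zero and σ would be zero, matching the note above.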
Normal (Gaussian) Curve

A histogram is a graph that visually describes a set of quantitative scores. A histogram is constructed by specifying several equal-length score intervals and counting the number of people who have scores that fall within each interval. An example of a histogram of scores on the Attention Deficit Checklist (ADC) for 4,810 young adults is shown below.

[Figure: histogram of ADC scores (y) for N = 4,810 young adults, with an overlaid normal curve; Mean = 10.00, Std. Dev. = 3.02]

Scientists discovered decades ago that histograms for many different types of quantitative scores could be closely approximated by a certain type of symmetric bell-shaped curve called a normal (or Gaussian) curve. The histogram above includes a graph of a normal curve that closely approximates the shape of the histogram in this particular application. If a set of quantitative scores is approximately normal, the scores will have the following characteristics:

- about half of the scores are above the mean and about half are below the mean
- about 68% of the scores are within 1 standard deviation of the mean
- about 95% of the scores are within 2 standard deviations of the mean
- almost all (99.7%) of the scores are within 3 standard deviations of the mean
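The 68-95-99.7 percentages listed above are properties of the normal curve itself and can be verified with Python's standard library (statistics.NormalDist); this is just a quick check, not part of the original notes:

```python
from statistics import NormalDist

# Standard unit normal distribution: mean 0, standard deviation 1
z = NormalDist(mu=0, sigma=1)

def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

print(round(within(1), 3))  # 0.683
print(round(within(2), 3))  # 0.954
print(round(within(3), 3))  # 0.997
```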
A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard unit normal distribution. The symbol z_{α/2} will be used to denote the point on a standard unit normal distribution for which 100(1 - α)% of the distribution is between the values -z_{α/2} and z_{α/2}. For instance, it can be shown that 95% of a standard unit normal distribution is between the values -1.96 and 1.96, and so z_{α/2} = 1.96 for α = .05.

Random Samples and Parameter Estimates

In applications where the study population is large or the cost of measurement is high, the psychologist may not have the necessary resources to measure all N people in the study population. In these applications, the psychologist could take a random sample of n people from the study population of N people. In studies where random sampling is used, the study population is defined as the population from which the random sample was obtained. A random sample of size n is selected in such a way that every possible sample of size n has the same chance of being selected. Computer programs are typically used to obtain a random sample of size n from a study population of size N.

A population mean can be estimated from a random sample. The sample mean

μ̂ = (Σ_{i=1}^{n} y_i)/n    (1.2)

is an estimate of μ (some statistics texts use X̄ to denote the sample mean). A population standard deviation can be estimated from a random sample. The sample standard deviation

σ̂ = √[Σ_{i=1}^{n} (y_i - μ̂)²/(n - 1)]    (1.3)

is an estimate of σ (some statistics texts use s to denote the sample standard deviation). Squaring Equation 1.3 gives an estimate of the population variance.

Of course, psychologists want to know the exact value of μ, but they usually must settle for a sample estimate of μ because the study population size is either too large or the measurement process is too costly. However, the sample mean by itself can be misleading because the estimation error μ̂ - μ will be positive or negative and the direction of the error will be unknown.
In other words, the psychologist will not know if the sample mean has overestimated or underestimated the population mean. Furthermore, the magnitude of μ̂ - μ will be unknown. Thus, the value of the sample mean might be too small or too large, and it might be close to or very different from the value of μ.
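Equations 1.2 and 1.3 correspond directly to functions in Python's statistics module (statistics.stdev uses the n - 1 divisor of Equation 1.3), and z_{α/2} can be recovered from NormalDist. The sample scores below are invented for illustration:

```python
import statistics
from statistics import NormalDist

# A hypothetical random sample of n = 8 quantitative scores
sample = [21, 25, 19, 30, 22, 27, 24, 28]

mu_hat = statistics.mean(sample)      # Equation 1.2: sample mean
sigma_hat = statistics.stdev(sample)  # Equation 1.3: uses the n - 1 divisor

# Two-sided critical z value: 95% of the standard normal lies in (-z, z)
z = NormalDist().inv_cdf(1 - 0.05 / 2)

print(mu_hat)       # 24.5
print(round(z, 2))  # 1.96
```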
Standard Error

The standard error of a parameter estimate numerically describes the accuracy of the estimate. A small standard error indicates that the parameter estimate is likely to be close to the unknown population parameter value, while a large standard error indicates that the parameter estimate could be very different from the study population parameter value. A standard error of a parameter estimate can be estimated from a random sample. The estimated standard error of μ̂ is

SE_μ̂ = √(σ̂²/n).    (1.4)

From Equation 1.4 it is clear that increasing the sample size (n) will decrease the value of the standard error and increase the accuracy of the sample mean. From Equation 1.4 it also can be seen that variability in the quantitative scores affects the accuracy of the estimate of a population mean, with greater variability in the scores leading to less accuracy in the sample mean.

Confidence Interval for μ

We can learn something about the unknown value of μ by using information from a random sample. By using an estimate of μ (Equation 1.2) and its standard error (Equation 1.4), which can be obtained from one random sample, it is possible to say something about the unknown value of μ in the form of a confidence interval. A confidence interval is a range of values that is believed to contain an unknown population parameter value with some specified degree of confidence. A 100(1 - α)% confidence interval for μ is

μ̂ ± t_{α/2;df} SE_μ̂    (1.5)

where t_{α/2;df} is a two-sided critical t-value. The value of t_{α/2;df} can be found in a table of critical t-values given in most statistics texts. The symbol df refers to degrees of freedom and is equal to n - 1 in this type of application. The value 100(1 - α)% is the confidence level. In psychological studies, it is common to set α = .05 to give a 95% confidence level.

Example 1.1. The EPA estimates that lead in drinking water is responsible for more than 500,000 new cases of learning disabilities in children each year.
Lead-contaminated drinking water is most prevalent in homes built before 1970. A random sample of n = 10 homes was obtained from a listing of about 240,000 pre-1970 homes in the San Francisco
area. Drinking water from the 10 homes was tested for lead (the test costs about $25 per house). The legal lead concentration limit for drinking water is 15 ppb. The measured lead concentrations (in ppb) for the 10 homes are given below.

16  14  11  35  29  22  52  21  20  27

The sample mean, sample variance, and standard error for this sample of 10 homes are computed below.

μ̂ = (16 + 14 + ... + 27)/10 = 24.7
σ̂² = [(16 - 24.7)² + (14 - 24.7)² + ... + (27 - 24.7)²]/(10 - 1) = 144.0
SE_μ̂ = √(σ̂²/n) = √(144.0/10) = 3.79

With a sample size of 10 homes, df = n - 1 = 9 and t_{.05/2;9} = 2.26. The lower 95% confidence limit is 24.7 - 2.26(3.79) = 16.1 and the upper 95% limit is 24.7 + 2.26(3.79) = 33.3. We can be 95% confident that the mean lead concentration in the drinking water of the 240,000 older homes is between 16.1 ppb and 33.3 ppb.

Properties of Confidence Intervals

There are two important properties of confidence intervals: increasing the sample size will tend to reduce the width of the confidence interval, and increasing the level of confidence (e.g., from 95% to 99%) will increase the width of the confidence interval. Increasing the level of confidence increases the proportion of all possible samples in which a confidence interval will capture the unknown population parameter value. These properties are illustrated in an analysis of 50 different random samples of n = 30 from a study population of about 15,000 nurses, all of whom were given an emotional exhaustion questionnaire; their mean score was 22.5. In this hypothetical example, we know that μ = 22.5, but in practice we will not be able to measure all members of the study population and we will estimate μ using the information contained in just one random sample.
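The computations in Example 1.1 can be reproduced in a few lines of Python; the critical value t_{.025;9} ≈ 2.262 is taken from a t table, as described in the notes:

```python
import math
import statistics

# Lead concentrations (ppb) for the 10 sampled homes, from Example 1.1
lead = [16, 14, 11, 35, 29, 22, 52, 21, 20, 27]
n = len(lead)

mu_hat = statistics.mean(lead)       # sample mean, 24.7
var_hat = statistics.variance(lead)  # sample variance (n - 1 divisor), ~144.0
se = math.sqrt(var_hat / n)          # estimated standard error, ~3.79

t_crit = 2.262  # t table value for alpha = .05, df = 9
lower = mu_hat - t_crit * se
upper = mu_hat + t_crit * se

print(round(lower, 1), round(upper, 1))  # 16.1 33.3
```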
The table above displays 95% confidence intervals computed from the 50 different random samples. Note that the 95% confidence intervals for μ failed to capture the actual population mean value of 22.5 in sample 7 and sample 34. The table below displays 99% confidence intervals computed from the same 50 random samples. Note that these confidence intervals are wider (less precise) but all of them captured the population mean value.

Choosing a Confidence Level

The American Psychological Association recommends using 95% confidence intervals. A 95% confidence interval represents a good compromise between the level of confidence and the confidence interval width, as shown in the following graph. Notice that the confidence interval width increases almost linearly up to a confidence level of about 95%, and then the width increases dramatically with increasing confidence. Thus, small increases in the level of confidence beyond 95% lead to relatively large increases in the confidence interval width.

[Figure: confidence interval width plotted against confidence level (0-100%)]
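The claim that a 95% confidence interval captures μ in about 95% of all possible samples can be checked with a short simulation, in the spirit of the 50-sample illustration above. This sketch assumes normally distributed scores with μ = 22.5 and σ = 5 (σ is used only to generate the data, not by each simulated "psychologist"); t_{.025;29} ≈ 2.045 is taken from a t table:

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the simulation is reproducible
MU, SIGMA, N_SAMPLE, TRIALS = 22.5, 5.0, 30, 2000
T_CRIT = 2.045  # t table value for alpha = .05, df = 29

captured = 0
for _ in range(TRIALS):
    # Draw one random sample and compute its 95% confidence interval
    sample = [random.gauss(MU, SIGMA) for _ in range(N_SAMPLE)]
    mu_hat = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N_SAMPLE)
    if mu_hat - T_CRIT * se <= MU <= mu_hat + T_CRIT * se:
        captured += 1

print(captured / TRIALS)  # close to 0.95
```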
Hypothesis Testing

In some applications, the psychologist simply needs to decide if the population parameter is greater than some value or less than some value. If the parameter is greater than that value, then one course of action will be taken; if the parameter is less than that value, then another course of action will be taken. The following notation is used to specify a set of hypotheses regarding μ:

H0: μ = h
H1: μ > h
H2: μ < h

where h is some number specified by the psychologist. H0 is called the null hypothesis, and H1 and H2 are called the alternative hypotheses. In virtually all applications, H0 is known to be false (because it is extremely unlikely that μ will exactly equal h) and the psychologist's goal is to decide if H1 or H2 is true.

Consider the following example. If the mean job satisfaction score in a study population of employees is less than 5, then a company will increase year-end bonuses; otherwise, the standard bonus will be given. In this specific application, the set of hypotheses is shown below.

H0: μ = 5
H1: μ > 5
H2: μ < 5

A confidence interval for μ can be used to choose between H1: μ > h and H2: μ < h. If the upper limit of a 100(1 - α)% confidence interval is less than h, then H0 is rejected and H2 is accepted. If the lower limit of a 100(1 - α)% confidence interval is greater than h, then H0 is rejected and H1 is accepted. If the confidence interval includes h, then H0 cannot be rejected. This general hypothesis testing procedure is called a three-decision rule because one of the following three decisions will be made: 1) accept H1, 2) accept H2, or 3) fail to reject H0. A failure to reject H0 is called an inconclusive result.

A test of H0: μ = h is commonly referred to as a one-sample t-test and involves the computation of the test statistic t = (μ̂ - h)/SE_μ̂. Statistical packages such as SPSS or R will compute the p-value that corresponds to the value of the test statistic. The p-value can be used to reject H0: μ = h.
Specifically, H0 is rejected if the p-value is less than α (α is usually set equal to .05).
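The three-decision rule can be written as a small function; the function name and the returned labels are illustrative strings, not part of any statistical package:

```python
def three_decision_rule(lower, upper, h):
    """Apply the three-decision rule using a confidence interval for mu.

    lower, upper: limits of the 100(1 - alpha)% confidence interval
    h:            the hypothesized value in H0: mu = h
    """
    if upper < h:
        return "accept H2: mu < h"  # whole interval lies below h
    if lower > h:
        return "accept H1: mu > h"  # whole interval lies above h
    return "fail to reject H0 (inconclusive)"  # interval includes h

# Job satisfaction example with h = 5 (invented interval limits)
print(three_decision_rule(5.4, 6.2, 5))  # accept H1: mu > h
print(three_decision_rule(3.1, 4.6, 5))  # accept H2: mu < h
print(three_decision_rule(4.2, 5.8, 5))  # fail to reject H0 (inconclusive)
```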
The p-value is related to the sample size, with larger sample sizes leading to smaller p-values. With a sufficiently large sample size, the p-value for a test of H0: μ = h will be less than .05. It is a common practice to report the results of a statistical test as significant if the p-value is less than .05 and nonsignificant if the p-value is greater than .05. It is important to remember that a p-value of less than .05 (a significant result) simply indicates that the sample size was large enough to reject the null hypothesis (which is known to be false in virtually all applications) and does not indicate if the population mean is meaningfully different from the hypothesized value. Also, a p-value greater than .05 does not imply that H0 is true.

In a three-decision rule, a directional error occurs when H1: μ > h has been accepted but μ < h is true, or when H2: μ < h has been accepted but μ > h is true. The probability of making a directional error is at most α/2. For instance, if a 95% confidence interval is used to select H1 or H2, the probability of making a directional error is at most .025. Most social science journals require authors to use α = .05.

Power of a Hypothesis Test

In hypothesis testing applications, the goal is to reject H0: μ = h and then choose either H1: μ > h or H2: μ < h. The power of a test is the probability of rejecting H0. If the power of the test is low, then the probability of an inconclusive result will be high. The power of a test of H0: μ = h depends on the sample size, the absolute value of (μ - h)/σ (the standardized effect size), and the α level. Increasing the sample size will increase the power of the test, as illustrated below for α = .05 and (μ - h)/σ = 0.5.
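The effect of sample size on power can also be seen by simulation. The sketch below simplifies the one-sample t-test to a z test with σ treated as known (so only the standard library is needed); for a standardized effect size of 0.5 the simulated power should be close to the theoretical z-test values of about 0.35 at n = 10 and about 0.89 at n = 40:

```python
import math
import random
import statistics

def simulated_power(n, effect=0.5, z_crit=1.96, trials=2000):
    """Approximate power of a two-sided test of H0: mu = h.

    Uses a z test with sigma treated as known (a simplification of the
    one-sample t-test). Scores are drawn with standardized effect size
    (mu - h)/sigma = effect, taking h = 0 and sigma = 1.
    """
    rejections = 0
    for _ in range(trials):
        sample = [random.gauss(effect, 1.0) for _ in range(n)]
        z = statistics.mean(sample) / (1.0 / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

random.seed(2)  # fixed seed for reproducibility
p10 = simulated_power(10)
p40 = simulated_power(40)
print(p10, p40)  # power increases with sample size
```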
Decreasing α will reduce the probability of a directional error but will also decrease the power of the test, as illustrated in the graph below for n = 30 and (μ - h)/σ = 0.5. Note that there is little loss in power for reductions in α down to about .05, with power decreasing more dramatically for α values below .05, which is why α = .05 is a recommended value. For a given sample size and α level, the power of the test increases as the absolute value of (μ - h)/σ increases, as illustrated in the graph below for n = 30 and α = .05.

Interpreting a Confidence Interval

Consider a 95% confidence interval for μ. If a 95% confidence interval for μ were computed from every possible sample of size n in a given study population, about 95% of these confidence intervals would capture the unknown value of μ. With random sampling, we know that every possible sample of size n has the same
chance of being selected. Knowing that a 95% confidence interval for μ will capture μ in about 95% of all possible samples, and knowing that the one sample the psychologist has used to compute the 95% confidence interval is a random sample, we can say that the probability is .95 (or we are 95% confident) that the computed confidence interval includes μ.

Another way to think about confidence intervals is to consider a test of H0: μ = h for many different values of h. For a given value of α, if H0 is tested for all possible values of h, a 100(1 - α)% confidence interval for μ is the set of all values of h for which H0 cannot be rejected. All values of h that are not included in the confidence interval are values for which H0 would have been rejected at the specified α level. For instance, if a 95% confidence interval for μ is [14.2, 18.5], then a test of H0: μ = h will not reject H0 if h is any value in the range 14.2 to 18.5, but will reject H0 for any value of h that is less than 14.2 or greater than 18.5.

Sample Size Planning

A narrow confidence interval for μ is desirable because it provides a more precise and informative description of μ than a wider confidence interval. It is possible to approximate the sample size that will give the desired width (upper limit minus lower limit) of a confidence interval with a desired level of confidence. The sample size needed to obtain a 100(1 - α)% confidence interval for μ having a desired width of w is approximately

n = 4σ̃²(z_{α/2}/w)²    (1.6)

where σ̃² is a planning value of the response variable variance and z_{α/2} is a two-sided critical z-value. Planning values are obtained from expert opinion, pilot studies, or previously published research. If the maximum and minimum possible values of the response variable scale are known, [(max - min)/4]² provides a crude planning value of the population variance.
Equation 1.6 shows that larger sample sizes are needed for narrower confidence interval widths, greater levels of confidence, and greater variability of the response variable. The value given by Equation 1.6 should be rounded up to the nearest integer.

Example 1.2. A psychologist wants to estimate the mean job satisfaction score for a population of 4,782 public school teachers. The psychologist plans to use a job satisfaction questionnaire (measured on a 1 to 10 scale) that has been used in previous studies. A review of the literature suggests that the variance of the job satisfaction scale is about 6.0. The psychologist would like the 95% confidence interval for μ (the mean job satisfaction score for all 4,782 teachers) to have a width of about 1.5. The required sample size is approximately n = 4(6.0)(1.96/1.5)² = 40.98 ≈ 41.
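Equation 1.6 and the crude [(max - min)/4]² planning value are easy to implement; the function below reproduces Example 1.2 (the function name is illustrative):

```python
import math

def sample_size_for_ci(var_plan, width, z=1.96):
    """Approximate n for a CI of the desired width (Equation 1.6).

    var_plan: planning value of the response variable variance
    width:    desired confidence interval width (upper minus lower limit)
    z:        two-sided critical z value (1.96 for 95% confidence)
    """
    return math.ceil(4 * var_plan * (z / width) ** 2)

# Example 1.2: planning variance 6.0, desired width 1.5, 95% confidence
print(sample_size_for_ci(6.0, 1.5))  # 41

# Crude variance planning value for a 1-to-10 scale: [(max - min)/4]^2
crude = ((10 - 1) / 4) ** 2
print(crude)  # 5.0625
```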
Note that Equation 1.6 does not include the value of the study population size (N). Actually, the sample size requirement does depend on N according to the formula n′ = n(1 - n/N), where n is given by Equation 1.6 and n′ is the revised sample size requirement. In most applications, n will be a small fraction of N, and then n′ will be about the same as n. For instance, if N = 3,000 and Equation 1.6 gives n = 40, then n′ = 40(1 - 40/3000) = 39.5 ≈ 40.

Sampling in Two Stages

In applications where sample data can be collected in two stages, the confidence interval obtained in the first stage can be used to determine how many more participants should be sampled in the second stage. If the 100(1 - α)% confidence interval width from a first-stage total sample size of n is w₀, then the number of participants that should be added to the original sample (n₊) in order to obtain a 100(1 - α)% confidence interval width of w is approximately

n₊ = [(w₀/w)² - 1]n.    (1.7)

Example 1.3. In a study with 25 participants, the 95% confidence interval for μ had a width of 4.38. The psychologist suspects that the results of this study are unlikely to be published because the confidence interval is too wide. The psychologist would like to obtain a 95% confidence interval for μ that has a width of 2.0. To achieve this goal, the number of participants that should be added to the initial sample is [(4.38/2.0)² - 1]25 = 94.9 ≈ 95.

Target Population

The confidence interval for μ (Equation 1.5) provides information about the study population from which the random sample was taken. In most applications, the study population will be a small subset of some larger and more interesting population called the target population. For instance, a psychologist may take a random sample of 100 undergraduate students from a particular university directory consisting of 12,000 student names because the psychologist has easy access to this directory.
The results of Equation 1.5 will apply only to those 12,000 undergraduate students, but the psychologist is more interested in the value of μ for a target population of all young adults. It might be possible for the psychologist to make a persuasive argument that the study population mean should be very similar to some target population mean. For instance, suppose the psychologist computed a confidence interval for the mean eye pupil diameter in a small room lit only by a 40-watt light bulb, using a
random sample from the 12,000 undergraduate students. The psychologist could argue convincingly that the mean eye pupil diameter in the study population of 12,000 undergraduates should be no different from the mean eye pupil diameter of all young adults. As an example where the study population mean would probably not be similar to some target population mean, suppose that the psychologist instead computed a confidence interval for the mean score on an abortion attitude scale using a sample of students from a Jesuit university. In this situation, the psychologist does not believe that the mean abortion attitude in the Jesuit study population is similar to the mean abortion attitude in a target population of all young adults.

Researchers in the physical and biological sciences seldom worry about the distinction between a study population and a target population because the parameter values for many physical or biological attributes (like the eye pupil diameter example) are much less likely to differ across different study populations, and consequently the study population parameter values are almost automatically assumed to generalize to some large target population. In contrast, psychologists, who study complex human behavior that can vary considerably across different study populations, need to be very cautious about how they interpret their confidence interval and hypothesis testing results. Psychologists should clearly describe the characteristics of the study population so that the statistical results are interpreted in a proper context.

Assumptions for Confidence Intervals and Tests

Confidence intervals and hypothesis tests for μ require three assumptions. One assumption, the random sampling assumption, requires the sample to be a random sample from the study population. A second assumption, the independence assumption, requires the responses of the participants in the sample to be independent of one another.
In other words, no participant in the study should influence the responses of any other participant in the study. A third assumption, the normality assumption, requires the quantitative scores in the study population to have an approximately normal distribution.

Confidence intervals and hypothesis tests for μ will be uninterpretable if the random sampling assumption has been violated. If the independence assumption has been violated, the true probability of a directional error can be greater than α/2, and the true confidence level can be less than 100(1 - α)%. Recall that the interpretation of a confidence interval for μ assumed that a 100(1 - α)% confidence interval would capture the unknown population mean in about 100(1 - α)% of all possible samples of a given size. However, when the
independence assumption is violated, the percentage of samples in which a 100(1 - α)% confidence interval captures the population parameter can be far less than 100(1 - α)%, and the psychologist's confidence in the computed confidence interval result will be mistakenly too high.

Violating the normality assumption will have little effect on the confidence interval and test for μ unless the quantitative scores in the study population are extremely non-normal and the sample size is small (n < 30). If the sample size is small and the study population quantitative scores are extremely non-normal, the proportion of all possible 95% confidence intervals that would capture μ can be less than .95, and the psychologist's confidence in the computed confidence interval result will be mistakenly too high.

Assessing the Normality Assumption

Recall that the normal distribution is symmetric. If the quantitative scores in the sample exhibit a clear asymmetry, this would suggest a violation of the normality assumption. The asymmetry in a set of quantitative scores can be described using a coefficient of skewness. The skewness coefficient is equal to zero if the scores are perfectly symmetric, positive if the scores are skewed to the right, and negative if the scores are skewed to the left. SPSS and R provide a test of the null hypothesis that the population skewness coefficient is zero. If the p-value is less than .05, the psychologist may conclude that the normality assumption has been violated and that the population scores are skewed, but a p-value greater than .05 does not imply that the normality assumption has been satisfied.

The population distribution of quantitative scores can be non-normal even if the distribution is symmetric. The coefficient of kurtosis describes the degree to which a distribution is more or less peaked, or has shorter or thicker tails, than a normal distribution.
SPSS and R provide a test of the null hypothesis that there is no kurtosis in the population distribution of scores. If the p-value is less than .05, the psychologist may conclude that the normality assumption has been violated and that the population scores have kurtosis, but a p-value greater than .05 does not imply that the normality assumption has been satisfied.

Data Transformations

A transformation of the quantitative scores can reduce skewness. When the quantitative score is a frequency count, such as the number of facts that can be recalled or the number of spelling errors in a writing sample, a square root transformation (√y_i) may reduce non-normality. When the score is a time-to-
event, such as the time required to solve a problem or a reaction time, a natural log transformation (ln(y_i)) or a reciprocal transformation (1/y_i) may reduce non-normality.

Example 1.4. A histogram of 80 highly skewed scores is shown below (left). A histogram of the log-transformed scores (right) is much more symmetric.

Although data transformations may reduce non-normality, the mean of the transformed scores may then be difficult to interpret. However, in some applications the value of μ could be interpretable after a data transformation. For instance, if the response variable is measured in squared units, such as the brain surface area showing activity measured in squared centimeters, a square root transformation could be interpreted as a linear measure of the size of the activated region. Or if the response variable is reaction time measured in seconds, then a reciprocal transformation could be interpreted as responses per second.
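The effect illustrated in Example 1.4 can be mimicked with simulated data. The sketch below computes a sample skewness coefficient with one common formula (mean cubed deviation divided by the cubed standard deviation; SPSS and R use slightly different variants) and shows it dropping toward zero after a log transformation of right-skewed scores:

```python
import math
import random
import statistics

def skewness(scores):
    """Sample skewness: mean cubed deviation / sd^3 (one common variant)."""
    m = statistics.mean(scores)
    s = statistics.pstdev(scores)
    n = len(scores)
    return sum((y - m) ** 3 for y in scores) / (n * s ** 3)

random.seed(3)  # fixed seed for reproducibility
# Right-skewed scores: exponentiated normal (lognormal) values
raw = [math.exp(random.gauss(0, 1)) for _ in range(200)]
logged = [math.log(y) for y in raw]

print(round(skewness(raw), 2))     # clearly positive (skewed to the right)
print(round(skewness(logged), 2))  # close to zero (roughly symmetric)
```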