
Test Bias

As we have seen, psychological tests can be well-conceived and well-constructed, but none are perfect. The reliability of test scores can be compromised by random measurement error (unsystematic error), and the validity of test score interpretations can be compromised by response biases that systematically obscure the psychological differences among respondents. Now we will examine the possibility that the validity of test score interpretations can be compromised further by test biases that systematically obscure the differences (or lack thereof) among groups of respondents. Psychological tests are often used to make important decisions that affect the lives of real people: which colleges (if any) will decide to accept you, in which class will your child be enrolled, and will an employer decide to hire you? To the degree that such decisions are based on tests that are biased in favor of or against specific groups of people, such biases have extremely important personal and societal implications. Suppose you are interested in studying the possibility that gender differences exist in mathematical ability. You give a reasonably reliable mathematics test to a representative group of males and females, and you find that, on average, males have higher math scores than females. As a researcher you would be tempted to interpret your test scores in terms of the psychological construct that they are intended to reflect: that males tend to have greater mathematical ability than females. However, it is possible that the participants' test scores should not be interpreted as reflecting purely their mathematical ability. That is, it is possible that the test is biased in some way. For example, if the males' test scores overestimated their true mathematical ability and the females' test scores underestimated their true ability, then the test is biased. In this case,

the difference between the test scores for males and females might be due to test score bias, not due to a difference in their true mathematical abilities. There are two general methods used to detect test biases. Roughly speaking, the two types of test bias reflect biases in the meaning of a test and biases in the use of a test. Construct bias occurs when a test has different meanings for two groups, in terms of the precise construct that the test is intended to measure. Construct bias has to do with the relationship of observed scores to true scores on a psychological test. If this relationship can be shown to be systematically different for different groups, then we might conclude that the test is biased. Construct bias can lead to situations in which two groups have the same average true score on a psychological construct but different average test scores. The second type of bias is predictive bias, which occurs when a test's use has different implications for two groups. Predictive bias has to do with the relationship between scores on two different tests. One of these tests (the predictor test) is thought to provide values that can be used to predict scores on the other test (the outcome test or measure). For example, college admissions officers might use SAT test scores to predict freshman GPAs. The SAT would be the predictor test and GPAs would be the outcome measure. In this context, test bias concerns the extent to which the link between predictor test true scores and outcome test observed scores differs for two groups. If the SAT is more strongly predictive of GPA for one group than for another, then the SAT suffers from predictive bias, in terms of its use as a predictor of GPA. The two types of bias, construct and predictive, are independent. For example, a test might have no construct bias but suffer from predictive bias. The SAT might accurately reflect true academic aptitude differences among groups of people (and thus have no construct bias),

but academic aptitude might not be associated with freshman GPA equally for two groups of people (and thus predictive bias would exist). There are several ways to operationally define and identify test score bias. There are at least two categories of procedures that can be used to identify test score bias: a) internal methods to identify construct bias and b) external methods to identify predictive bias. We emphasize the operational nature of this task to remind you that test score bias in both of its forms is a theoretical concept, in part because both types of bias depend on the theoretical notion of a true score. There is no one way to detect test score bias any more than there is one way to calculate directly such psychometric test score properties as reliability or validity. There are, however, various generally accepted ways to estimate the degree to which test bias exists. An overarching issue in the definition and detection of test bias is that the existence of a group difference in test scores does not necessarily mean that test scores are biased. Suppose you find that females have higher self-esteem scores than males. This difference is not prima facie evidence that the test is biased (Jensen, 1980, 1998; Thorndike, 1971). The participants' test scores might in fact be good estimates of their true self-esteem. In such a case, the test is not biased, and the group difference in test scores reflects a real difference in average self-esteem. Consider doing a study in which you weigh representative groups of males and females. You would doubtless find that the average weight of females is lower than the average weight of males. You would not take this difference to mean that the scale you used to measure weight produced scores that were biased.

Why Worry about Test Score Bias?

It is likely that everyone reading this book has taken a psychological test of some kind.
Virtually all children schooled in the United States or other industrialized countries are exposed

on a regular basis to academic achievement tests. In the United States, most students who plan to attend an institution of higher education have taken the SAT or ACT test. Most graduate schools in the United States require student applicants to take the GRE. Applicants for most Federal Government jobs are required to take a civil service examination, and corporations regularly test job applicants and sometimes even employees using psychological tests. Scores on these and other types of psychological tests are often used to make important decisions about people. In educational settings, intelligence test scores are used to place children in special programs. Intelligence test scores are used by law courts to make decisions about who can and who cannot be sentenced to death following a murder conviction. Educational institutions use scores on standardized tests to make admissions decisions. Corporations and governments often make job decisions about people based, at least in part, on test scores. In the United States, most public school teachers have to take and pass standardized tests to become certified school teachers. The use of psychological tests in our society is pervasive, and scores on these tests can have an important impact on people's lives and on our public and private institutions. Because testing is a pervasive feature of our society and because test scores have important consequences for people, we would like to develop tests that produce scores that allow us to differentiate among people based on true score differences and not on particular group membership. For example, if we have a self-esteem test, then we would like to be sure that scores are determined only by self-esteem and not contaminated by some other extraneous factor such as the biological sex of the person taking the test. In other words, we want unbiased tests.
Our desire for unbiased test scores is rooted in our belief that we should not discriminate for or against a person because of their biological sex, their ethnicity, their race, their religious

preference, or their age. In some cases, the list of groups that should be protected from test score bias has been expanded to include factors such as sexual preference, pregnancy, marital status, linguistic background, and various disabilities. In each of these cases, we should be confident that observed score differences on psychological tests are a function of true score differences. It is especially important to be able to show that test scores are not biased in those instances in which average observed scores on some type of psychological test differ between groups.

Detecting Construct Bias: Internal Evaluation of a Test

Construct bias is often evaluated by examining responses to individual items on a test. An item on a test is biased if people belonging to different groups respond in different ways to the item and if it can be shown that these differing responses are not related to group differences associated with the psychological attribute that the test was designed to measure. For example, suppose you had a 100-item mechanical aptitude test. If you selected one item from the test and found that males' responses were similar to females' responses, then the item would not appear to be biased (assuming that the males and females had the same level of mechanical aptitude). On the other hand, if males and females with the same level of aptitude responded in different ways to the item, then you would suspect some type of bias in the item. Most psychological tests are composite tests: they contain multiple items. For such composite tests, the overall test bias is a function of the bias associated with each of the items. If none of the items on a test seem to be biased, then we would assume that the total test score is unbiased. However, if one or more items seem to be biased, then we would suspect that the total test score might also be biased.
Remember that test bias concerns the relationship between group differences in true scores and group differences in observed test scores. In the case of construct bias, a test item

would be biased if responses to the item for people who belong to one group reflect their true scores on the relevant psychological attribute but responses to the item from people in another group are not simply a function of that attribute (we are assuming some minimum degree of reliability for the test of interest). Of course, we can never know a person's true score with respect to any attribute. Therefore, the procedures that we are going to discuss are estimates of the existence and degree of construct bias. Construct bias is related to the meaning of test scores. Evidence of construct bias suggests that scores on a test might have different meanings for different groups of people. If we had evidence that suggested that scores on our mechanical aptitude test suffered from construct bias related to the biological sex of those taking the test, we would have to entertain the possibility that the scores measure different psychological attributes in the two groups. For example, the males' responses to the test might be determined primarily by a single construct, mechanical aptitude, but the females' responses might be determined by two constructs, mechanical aptitude and stereotype threat (the tendency to behave in ways that confirm stereotypes about one's group; Spencer, Steele, & Quinn, 1999). Thus, the mechanical aptitude test does not measure the exact same psychological attributes for the two sexes. Several procedures that can be used to estimate the existence and degree of construct bias will be described. These procedures focus on the internal structure of the test. The internal structure of a test has to do with the way that the parts of a test are related to each other. Most simply, internal structure refers to the pattern of correlations among items and/or the correlations between each item and the total test score. To evaluate the presence of construct bias, we examine the internal structure of a test separately for two groups.
If the two groups exhibit the same internal structure to their test responses, then we conclude that the test is unlikely to suffer

from construct bias. However, if the two groups exhibit different internal structures to their test responses, then we conclude that the test is likely to suffer from construct bias. There are at least four methods for detecting construct bias.

Item Discrimination Index (I_d)

One method of detecting construct bias is by computing item discrimination indexes separately for two groups. An item's discrimination index reflects the degree to which the item is related to the total test score (i.e., people who answered the item correctly tended to do better on the test as a whole than people who answered the item incorrectly), and, by implication, it indicates that the item is highly similar to most of the other items on a test. In this way, item discrimination indexes reflect the structure of associations among test items. The I_d is calculated by computing a total score on the test and then rank ordering the scores from highest to lowest. You then select those people whose test scores put them in the top 30% of the scorers and those whose scores put them in the bottom 30%. Now you go to each test item (each question) and compute the proportion of people in each of the two groups who answer the item correctly. For example, suppose that there are 50 people in the top group of test takers and you find that 40 of these people answered question #1 correctly. The proportion of people in this group to answer the item correctly would be .80. Now imagine that in the low-scoring group only 10 of the 50 people answered question #1 correctly. The proportion of low scorers to answer the item correctly would be .20. If you subtract the low-score proportion from the high-score proportion, you get the discrimination index for item #1 (see the item discrimination example in Table 1.1):

I_d = P_hi - P_low

where P = the proportion of people in a group who answer an item correctly.
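As a minimal sketch, the calculation above can be written out in Python (the function name and the 30% cutoff parameter are our own choices for illustration, not part of any standard library):

```python
import numpy as np

def item_discrimination(total_scores, item_correct, cut=0.30):
    """Item discrimination index I_d = P_hi - P_low.

    total_scores: total test score for each respondent
    item_correct: 1 if that respondent answered this item correctly, else 0
    cut: fraction of respondents in each extreme group (top/bottom 30% here)
    """
    total_scores = np.asarray(total_scores)
    item_correct = np.asarray(item_correct)
    k = int(len(total_scores) * cut)
    order = np.argsort(total_scores)    # respondents sorted by total score
    low, high = order[:k], order[-k:]   # bottom and top scorers
    return item_correct[high].mean() - item_correct[low].mean()

# The text's example: 40 of 50 high scorers and 10 of 50 low scorers
# answer item #1 correctly, so I_d = .80 - .20 = .60.
totals = [10] * 50 + [90] * 50                      # 50 low, 50 high scorers
item1 = [1] * 10 + [0] * 40 + [1] * 40 + [0] * 10   # who got item #1 right
print(item_discrimination(totals, item1, cut=0.50))  # approximately 0.6
```

Computing this index separately for each group (e.g., males and females) and comparing the two values is exactly the construct-bias check described next.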

Historically, the item discrimination index was developed in association with classical test theory. The index is an important measure of the extent to which responses to test items can be used to differentiate among people on the basis of the amount of their knowledge of some topic, or on the amount of some other type of psychological attribute. Again, assume we give a mechanical aptitude test to a group of people. If people who have high mechanical aptitude have a high probability of answering a particular aptitude question correctly while people with low mechanical aptitude have a low probability of answering the item correctly, the question would have a high item discrimination index value (e.g., .90). On the other hand, if the item discrimination index for a question was low (e.g., .10), then the low-aptitude respondents answered the item correctly nearly as often as the high-aptitude respondents. Thus, the item does not clearly discriminate among people with varying levels of the construct being measured. The item discrimination index can be used to estimate construct bias. Specifically, we would select an item, compute its discrimination index separately for two groups of people, and then compare the groups' indexes. If the two discrimination index values are approximately equal, then we conclude that the item is probably not biased. However, if the two discrimination index values are not approximately equal, then we conclude that the item is probably biased in some way. That is, we would conclude that the item seems to belong on the test for one group, but not for the other group. By including the item on the test for both groups, the test seems to be somewhat different for the two groups. This analysis would be conducted for each of the items on the test. An important feature of the item discrimination index as a measure of construct bias is that it is independent of the number of people in the groups being compared who answer an item correctly.
For example, we might find that one of our mechanical aptitude items was

answered correctly by only 40% of the males but by 60% of the females. Even so, the item discrimination index for the question could be the same for both groups. In this case, we would assume that the item is functioning as a measure of mechanical aptitude in the same way for both groups, but that females know more about the material than males (i.e., more of them answered the item correctly).

Factor Analysis

A second method for examining construct bias is by conducting a factor analysis of items separately for two or more groups of people. Factor analysis is an important method for evaluating the internal structure of a test. Factor analysis is a statistical procedure for partitioning the variance or covariance among test items into clusters of factors that in some sense hang together. It is sometimes the case that responses to a group of items on a test are more highly positively correlated with each other than they are to responses given to other items on the test. The group of items that are highly correlated with each other statistically hang together, and they are believed to reflect a factor. If all of the items on a test have similar correlations with each other (i.e., there is no evidence of multiple groups of items), then we say that the test is homogeneous, or that all of the test score variance, other than error variance, is accounted for by a single factor. Factor analysis can be used to evaluate the internal structure of a test separately for two groups of people. For example, we might find that, among males, the mechanical aptitude test has a strong single-factor structure: all of the items seem to be highly correlated with each other, suggesting that the test is essentially a measure of one and only one construct. To evaluate the potential presence of construct bias, we would need to examine the factor structure for females' responses to the test items as well. If we found a single factor among females'

responses, then we would conclude that the aptitude test has the same internal structure for males and females. Consequently, we would conclude that the test does not suffer from construct bias. However, we might conduct a factor analysis of females' responses and find two or more factors. In this case, we would conclude that the test has a different internal structure for males and females, and we would then conclude that the test does indeed suffer from construct bias. That is, the total test score reflects different psychological factors for males and for females.

Differential Item Functioning Analyses

Perhaps the best way to evaluate construct bias is a procedure called differential item functioning analysis. Differential item functioning analysis is a feature of a psychometric approach called Item Response Theory (IRT). An important aspect of IRT is the assumption that it is possible to estimate respondents' trait levels directly from empirical sources of data. The trait levels are, in essence, estimates of participants' true scores for the psychological attribute that is being measured. If we assume that we know the trait levels for all the people in two groups and we have their responses to a test item, then we can see if the trait levels and the item responses match up in the same way for both groups. If they do not, then it is possible that the item is biased. IRT is based on the idea that there is a function relating a participant's trait level to the probability that he or she will answer a question on a test correctly. For example, we might find that an individual with a trait level that is one standard deviation above the mean has a .80 probability of answering a particular item correctly, but that an individual with a trait level that is one standard deviation below the mean has only a .20 probability of answering the item correctly.
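This trait-to-probability function can be sketched with the two-parameter logistic model commonly used in IRT; the slope value below is chosen only so that the probabilities match the .80/.20 example, not estimated from real data:

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve.

    theta: trait level in standard-deviation units
    a: discrimination (slope of the curve)
    b: difficulty (trait level at which P = .50)
    Returns the probability of answering the item correctly.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# With a = ln 4 and b = 0, a person one SD above the mean has a .80
# probability of a correct answer, and one SD below the mean has .20.
a = math.log(4)
print(round(icc(1.0, a, 0.0), 2), round(icc(-1.0, a, 0.0), 2))  # 0.8 0.2
```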
If you have a group of people take a test and you know their respective trait levels, then you can use specialized statistical software to draw an item characteristic curve (ICC) to

illustrate this function for each item. Furthermore, if you have two groups of people, then you can draw ICCs separately for each group. To evaluate the presence of construct bias, you would compare the ICCs of the two groups. If the item is not biased, then the two groups' ICCs should be very similar. That is, the probability that two people will answer an item correctly should be the same if the two people have the same trait level. However, if the item is biased, then the two groups' ICCs will be dissimilar. That is, the probability that two people (e.g., a male and a female) will answer an item correctly might be different even if the two people have the same trait level. Such a situation would clearly reflect the presence of construct bias. For example, suppose that you want to determine whether an item on a mechanical aptitude test is biased with respect to biological sex. You could compute mechanical aptitude knowledge scores for each person in a study (these represent their trait levels), and you could compute the probability that the item is answered correctly for each person. You use this information to draw an ICC (see Figure 1.1). Now, you sort the people in your study into two groups (i.e., a group of males and a group of females) and draw ICC curves separately for each group. If the curves overlap, you would probably conclude that the item is not biased. Suppose, however, that you obtained the results illustrated in Figures 1.2 and 1.3. Results such as these would lead you to suspect item bias. Figure 1.2 is an example of uniform bias. In this example it appears that females, with the same mechanical knowledge as males, find the item more difficult to answer than males do. Figure 1.3 illustrates non-uniform bias, a situation in which the ICCs differ in shape as well as location. In this case it appears that the item is measuring different traits for males and females.
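The two patterns can be mimicked numerically with the same logistic curve (the parameter values below are invented purely for illustration): under uniform bias the gap between the groups' curves keeps the same sign across the whole trait range, while under non-uniform bias the curves cross and the sign of the gap reverses.

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve: P(correct) at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Uniform bias: same slope, different difficulty -- the item is harder
# for one group at every trait level, so the gap never changes sign.
uniform_gap = [icc(t, 1.5, -0.5) - icc(t, 1.5, 0.5) for t in thetas]

# Non-uniform bias: the slopes differ, the curves cross, and the
# direction of the gap reverses across the trait range.
nonuniform_gap = [icc(t, 2.0, 0.3) - icc(t, 0.5, -0.3) for t in thetas]

print(all(g > 0 for g in uniform_gap))             # True
print(nonuniform_gap[0] < 0 < nonuniform_gap[-1])  # True
```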
The ICC approach is a visual method for detecting construct bias, but there are IRT methods that are even more precise ways of evaluating the presence of construct bias (e.g., Smith & Reise, 1998).

Although IRT's differential item functioning analysis is a strong method for identifying construct bias, it has a downside. IRT analyses are quite complex in a variety of ways: which model to use; how to determine whether parameter differences between groups are real or simply due to measurement error; the need for very large sample sizes; the need for item samples and samples of people that are heterogeneous enough to represent the complete range of traits the test is designed to measure; and the need for specialized statistical software to conduct the analyses. These complexities are such that IRT is still only emerging as a widely appreciated and understood method of detecting construct bias.

Rank Order

There is another quick and computationally easy way to get an estimate of construct bias if you have test items that can be ranked in order of difficulty. Using our 100-item aptitude test as an example, some of the test questions will probably be easier to answer than others. These questions can be ranked in order of difficulty. The rankings can be done separately for different groups (e.g., males and females). If the item ranks differ across groups, then we would suspect that test score construct bias exists. We would suspect this because each item does not appear to be a measure of the same thing for both groups. You can use the ranks to compute Spearman's rank-order correlation coefficient (rho, interpreted in the same way as r) to index rank-order consistency across groups. If rho is low (e.g., < .90), we might suspect construct bias. If you found evidence of construct bias, you would probably want to follow up on the finding with additional analyses to identify the particular source of the low correlation coefficient (see Jensen, 1980). Notice that the correlation between the ranks can be high even if the proportion of correct responses to each item differs across groups.
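A quick sketch of that rank-order check (the per-item proportions below are invented for illustration):

```python
def spearman_rho(x, y):
    """Spearman rank-order correlation for untied data:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Proportion answering each of five items correctly, per group.
# Females do better on every item, yet the difficulty ordering is
# identical, so rho = 1.0 and no construct bias is suggested.
p_males = [0.30, 0.45, 0.55, 0.70, 0.85]
p_females = [0.40, 0.55, 0.65, 0.80, 0.95]
print(spearman_rho(p_males, p_females))  # 1.0
```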
Using our aptitude test as an example, males might be less likely than females to give correct answers to the test questions, but the rank ordering of

questions according to difficulty might be the same across groups. Again, as with the item discrimination index, group differences in correct responding are not by themselves an indication of test score bias.

Detecting Predictive Bias: External Evaluation of a Test

Predictive bias concerns the degree to which a test's scores are equally predictive of an outcome for two groups. For example, scores on the SAT are thought to measure academic achievement. On the assumption that academic achievement measured during secondary school years might be related to academic achievement during the freshman year in college (e.g., as measured by freshman GPA), institutions of higher education often use SAT scores to make admissions decisions. The idea is that it is possible to predict, at least with some degree of accuracy, students' freshman-year academic performance based on SAT scores. If it could be shown that the ability to successfully predict freshman academic achievement from SAT scores is different for different groups of people, then we might suspect that the SAT suffers from predictive test score bias. The existence of predictive bias is examined by obtaining scores on two variables or measures. Analyses are then conducted to examine the degree to which scores on the main test of interest (the predictor test) can be used to predict people's scores on another psychological measure (the outcome measure) that is thought to be conceptually related to scores on the main test of interest. Detection of predictive bias begins with the assumption that one size fits all, that is, that the test is equally predictive for all groups. As we will illustrate, analyses are conducted to evaluate this assumption. If those analyses confirm that the test is equally predictive for both groups, then we conclude that the test probably does not suffer from predictive bias (at least with regard to the specific outcome in question and the specific groups in question). However, if

those analyses indicate that one size does not fit all, that is, that the test is not equally predictive for both groups, then we conclude that the test might suffer from predictive bias.

Imagine that you are a training program selection officer working for a corporation that spends large sums of money training employees to develop the mechanical skills needed to run its operations. Your job is to select the most promising candidates for this training program. Because of the cost of the program, it is essential that you select only those people who are most likely to perform well in the training program, and your job depends on how well you make these selections. In an attempt to improve your selection success rate, you develop a mechanical aptitude test that you give to all trainee candidates. You assume that scores on the test will be related to some outcome measure of post-training performance. For example, following training, each trainee might be rated by a supervisor in terms of the trainee's level of mechanical competency. Further, you assume that there should be a positive linear relationship between the pre-training aptitude test scores and the post-training supervisor ratings of competence. That is, candidates with high aptitude scores (i.e., predictor scores) should receive better ratings (i.e., outcome scores) than candidates with lower aptitude scores.

In your development and evaluation of the aptitude test, you might be concerned about predictive test bias. Formally speaking, predictive bias has to do with the use of test scores to predict a relevant outcome (e.g., behavior, competency, or performance) in situations other than the testing situation in which the predictor test was administered. Thus, if you had reason to believe that the aptitude test was strongly predictive of supervisor ratings for males but not for females, then you would suspect that the test was predictively biased.
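The intuition that a test can predict well for one group but poorly for another can be illustrated with a toy computation. The numbers below are entirely hypothetical and are chosen only so that aptitude correlates perfectly with ratings in one group and not at all in the other:

```python
# Toy illustration: a predictor can correlate strongly with an outcome in
# one group and not at all in another. All scores here are hypothetical.

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Group A: ratings rise perfectly with aptitude. Group B: no relationship.
r_a = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])
r_b = pearson_r([1, 2, 3, 4], [30, 10, 40, 20])

print(round(r_a, 2), round(r_b, 2))   # 1.0 0.0
```

If a selection test behaved like this, using one prediction rule for everyone would clearly be inappropriate; the regression-based analyses described next make this comparison precise.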
To evaluate the efficacy of your new aptitude test and to evaluate any potential predictive bias, you will need to examine two issues: (a) does your test actually help you predict the

outcome of training, and (b) does your test predict the outcome of training equally well for various groups of trainees? To address both issues, you will need data that can be used to evaluate the predictive effectiveness of your test. Data of this kind could be obtained by testing all trainees before they enter the program and then recording their scores on the outcome measure at the end of the training program. The two issues are often addressed by using a statistical procedure called regression, with which you can use the pre-training mechanical aptitude test scores to calculate predicted post-training supervisor rating scores.

Basics of Regression Analysis

Regression analysis is based on the assumption that there is a linear relationship between aptitude test scores and outcome scores. If there is such a relationship, then the formula for a straight line can be used to predict outcome scores from aptitude scores:

Ŷ = a + b(X)

where Ŷ is the predicted training outcome score for an individual training candidate, a is the intercept (the predicted value of a person's outcome score if that person had an aptitude test score of zero), b is the regression coefficient or slope (a number that tells you how much change you would expect to see in Ŷ for a one-point increase in aptitude test scores), and X is an individual's aptitude test score. Many popular statistical software packages can be used to conduct the regression analysis, which produces the values of a and b. Once you have obtained the values of the intercept and slope of the regression equation, you can evaluate the predictive ability of the test. For example, you can take any individual's score on the aptitude test (X), plug it into the regression equation, and calculate a predicted score on the supervisor ratings (Ŷ) for that individual.

To illustrate this process, we will use the data in Table 1.2, which lists aptitude scores for four trainees, along with each trainee's outcome score (note that an analysis of this kind would ordinarily involve many more than four trainees). Based on a regression analysis conducted using SPSS, the intercept (a) is 56.03 and the slope (b) is .58. These results tell us that a trainee with an aptitude score of zero is predicted to obtain an outcome rating of 56.03 and that a one-point difference in aptitude scores is associated with a .58-point difference in outcome scores. As mentioned earlier, these values can be used to obtain predicted scores for all trainees, by plugging their aptitude scores into the following regression equation:

Ŷ = 56.03 + .58(X)

Predicted supervisor rating = 56.03 + .58(Aptitude score)

For example, a trainee with an aptitude score of 69 is predicted to earn a supervisor rating of 96.05:

Ŷ = 56.03 + .58(69) = 96.05

Similarly, a trainee with an aptitude score of 70 is predicted to earn a supervisor rating of 96.63:

Ŷ = 56.03 + .58(70) = 96.63

Note that the difference between these two predictions is .58 (96.63 − 96.05 = .58), which reflects the slope of the regression equation. That is, a one-point difference in aptitude test scores (70 − 69 = 1) is associated with a .58-point difference in predicted outcome scores.

If we calculate predicted rating scores for a wide range of aptitude test scores, then we can generate a regression line, or line of best fit. Each point on a regression line is associated

with the most likely (predicted) Ŷ value for each possible X value. The line is used to illustrate the association between predictor test scores and outcome scores. In Table 1.2, we have computed predicted scores (Ŷ) for each trainee. In Figure 1.4, we have plotted each of our four candidates' observed outcome scores against their observed predictor test scores, and we have drawn a regression line that reflects each candidate's predicted outcome score.

Notice that each trainee's predicted score on the outcome differs from his or her actual score on the outcome. For example, for trainee #1, the observed outcome score is 75 but the predicted outcome score is 74.59. The difference between a predicted score and an observed outcome score is referred to as a residual. The standard deviation of the residuals is the standard error of estimate (se_e) that we discussed previously. The se_e for the data in Table 1.2 is 6.76. This value reflects the inaccuracy of predictions; the larger the value, the less accurate the predictions. A test with strong predictive power will produce relatively small residuals, which will result in a relatively small se_e.

One Size Fits All: The Common Regression Equation

The evaluation of predictive bias usually begins by establishing what would happen if no bias existed. If a test is not biased, then one regression equation should be equally applicable to different groups of people. The assumption that different groups share a common regression equation is based on the idea that "one size fits all": regardless of biological sex, ethnicity, culture, or whichever group difference is being considered, a single regression equation adequately reflects the predictive ability of the test in question.

Imagine that you give your aptitude test to a large number of trainee candidates (e.g., 100).
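Before continuing with that example, the regression computations described above can be reproduced with a short script. A minimal sketch in Python, using the four-trainee data from Table 1.2 (coefficients rounded as in the text):

```python
# Ordinary least-squares fit of the Table 1.2 data:
#   b = S_xy / S_xx,   a = mean(Y) - b * mean(X)
aptitude = [32, 40, 57, 60]   # predictor scores (X)
ratings = [75, 80, 81, 98]    # observed outcome scores (Y)

n = len(aptitude)
mean_x, mean_y = sum(aptitude) / n, sum(ratings) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(aptitude, ratings)) / \
    sum((x - mean_x) ** 2 for x in aptitude)
a = mean_y - b * mean_x
print(round(a, 2), round(b, 2))   # 56.03 0.58

# Predicted ratings and residuals, using the rounded coefficients from the text.
predicted = [round(56.03 + 0.58 * x, 2) for x in aptitude]
residuals = [round(y - p, 2) for y, p in zip(ratings, predicted)]
print(predicted)   # [74.59, 79.23, 89.09, 90.83]
print(residuals)   # [0.41, 0.77, -8.09, 7.17]
```

The residuals show that the line over-predicts some trainees and under-predicts others; the standard deviation of these residuals is the standard error of estimate discussed above.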
Assume there is an equal number of male and female candidates and that you want to make sure your aptitude test is not biased with respect to the biological sex of the candidates. To

begin your examination of this issue, you could compute the regression equation based on the data from the entire sample, regardless of sex. Imagine that you found that the intercept from this regression equation is a = 56.03, the slope is b = .58, and the standard error of estimate is se_e = 6.76. These values constitute the common regression equation, and they will be called the common intercept, the common slope, and the common standard error of estimate, respectively. Again, if your aptitude test is unbiased with respect to sex, then the common regression equation (calculated from males and females together) should be equally applicable to males and to females.

To evaluate the presence of predictive bias, additional regression analyses must be conducted. To determine whether the common regression equation is indeed equally applicable to males and females, we must calculate one regression equation for males and one for females. We must then compare these group-level regression equations with the common regression equation. Three of the values that we just discussed can be used to assess predictive test bias: (a) the intercept, (b) the slope, and (c) the standard error of estimate. If the group-level values do not match those of the common regression equation, then you might suspect that your aptitude test scores are biased. In practice, a variety of sophisticated statistical analyses can be conducted on these values to estimate the presence of predictive test bias precisely, but our discussion will remain at the conceptual level.

To elucidate the meaning of various patterns of results, we first focus on the meaning of biased intercepts and then on biased slopes. In practice, however, it may be more likely that groups differ on both of these elements of prediction rather than being exactly equal on one while differing on the other. Thus, we will also illustrate the effect of simultaneous bias in intercepts and slopes.
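The comparison procedure itself can be sketched in code: fit one regression for the pooled sample and one per group, then compare the intercepts and slopes. The data below are hypothetical and exactly linear, chosen only so the fitted values are easy to check by hand:

```python
# Sketch of the group-comparison step: common (pooled) regression vs.
# separate group-level regressions. Data are hypothetical illustrations.

def fit_line(xs, ys):
    """Return (intercept, slope) from an ordinary least-squares fit."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return mean_y - slope * mean_x, slope

# Hypothetical scores: males follow y = 58 + 0.5x, females follow y = 54 + 0.5x.
male_x, male_y = [40, 50, 60], [78.0, 83.0, 88.0]
female_x, female_y = [40, 50, 60], [74.0, 79.0, 84.0]

common = fit_line(male_x + female_x, male_y + female_y)
males = fit_line(male_x, male_y)
females = fit_line(female_x, female_y)

print(common, males, females)   # (56.0, 0.5) (58.0, 0.5) (54.0, 0.5)
```

In this toy data set the group slopes match the common slope but the intercepts do not, which is the signature of intercept bias; a real analysis would use the full sample and a statistics package.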

Intercept Bias

Suppose that group-level regression analyses reveal that males and females have slopes and se_e values that are similar to those of the common regression equation, but that their intercept values differ from the common intercept. In this case, you would suspect that your test suffers from intercept bias. For example, imagine that, in your evaluation of your aptitude test, you conduct regression analyses separately for the 50 males and the 50 females. You find that, for both groups, the slope is b = .58 and the se_e is 6.76, which are equal to the common slope and common se_e. However, you find that the intercept for males is a = 58.03 and the intercept for females is a = 54.03. Note that these group-level intercept values differ from the common intercept, indicating that one size does not fit all, at least in terms of the intercept. Thus, the test appears to suffer from intercept bias.

What are the implications of intercept bias? The fact that the males' intercept is higher than the females' intercept indicates that males at any given level of aptitude will tend to receive higher supervisor ratings (outcome scores) than females at the same level of aptitude. To illustrate this, let us compute the predicted outcome score for a male and for a female who each have an aptitude score of 70:

Predicted outcome score for male = 58.03 + .58(70) = 98.63

Predicted outcome score for female = 54.03 + .58(70) = 94.63

These computations show that, for a male and a female who have exactly the same level of aptitude, the male is predicted to obtain a supervisor rating that is 4 points higher than the

female. If we assume that the supervisor ratings are themselves unbiased (an assumption that we will revisit later in this chapter), then this discrepancy indicates that the aptitude test does not work the same way for males and females. As we saw earlier, the common regression equation produced a predicted supervisor rating of 96.63 for a trainee with an aptitude score of 70. Compared with our group-level predictions, the common regression equation under-estimates the prediction for males and over-estimates the prediction for females. Thus, one size does not fit all, and the test appears to be predictively biased.

If a test suffers only from intercept bias (i.e., the group-level intercepts are not equivalent to the common intercept, but the slopes and se_e values are unbiased), then the size of the group discrepancy is constant across all aptitude scores. We saw a four-point discrepancy for a male and a female who both had an aptitude score of 70; if the aptitude test suffers only from intercept bias, then the sex difference will be four points at every level of aptitude. This is illustrated in Figure 1.5, which presents a common regression line (dashed) and two group-level regression lines. As the figure shows, the lines are parallel: a male trainee of a given aptitude level will always obtain a predicted rating that is four points higher than that of a female with the same level of aptitude.

Slope Bias

A second way in which a test can be predictively biased is through slope bias. Suppose that group-level regression analyses reveal that males and females have intercept and se_e values that are similar to those of the common regression equation, but that their slope values differ from the common slope. This would indicate that the connection between predictor scores and outcome scores differs between the two groups.
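The contrast between the two patterns can be sketched numerically. The sketch below uses the illustrative values from this section: group intercepts of 58.03 and 54.03 with the common slope of .58 (intercept bias), and the common intercept of 56.03 with group slopes of .53 and .60 (slope bias):

```python
# Intercept bias yields parallel lines and a constant gap between groups;
# slope bias yields diverging lines and a gap that grows with the predictor.

def predict(intercept, slope, x):
    """Predicted outcome score from the regression line Y-hat = a + b*X."""
    return intercept + slope * x

for aptitude in (50, 60, 70):
    # Intercept bias: intercepts 58.03 vs. 54.03, shared slope .58.
    intercept_gap = predict(58.03, 0.58, aptitude) - predict(54.03, 0.58, aptitude)
    # Slope bias: shared intercept 56.03, slopes .60 vs. .53.
    slope_gap = predict(56.03, 0.60, aptitude) - predict(56.03, 0.53, aptitude)
    print(aptitude, round(intercept_gap, 2), round(slope_gap, 2))
    # -> 50 4.0 3.5
    #    60 4.0 4.2
    #    70 4.0 4.9
```

The intercept-bias gap stays at 4.0 points at every aptitude level, while the slope-bias gap widens as aptitude increases, exactly the two patterns shown in Figures 1.5 and 1.6.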

For example, imagine that your analyses reveal that, for both groups, the intercept is a = 56.03 and the se_e is 6.76, which are equal to the common intercept and common se_e. However, you find that the slope for males is b = .53 and the slope for females is b = .60. Note that these group-level slope values differ from the common slope (i.e., .58), indicating that one size does not fit all in terms of the connection between predictor test scores and outcome scores.

Slope bias has important implications for the degree of discrepancy between the groups' predicted outcome scores. The fact that the males' slope is weaker than the females' slope indicates that the amount of bias is not constant across aptitude levels. To illustrate this, let us compute the predicted outcome score for a male and for a female who each have an aptitude score of 70:

Predicted male outcome score = 56.03 + .53(70) = 93.13

Predicted female outcome score = 56.03 + .60(70) = 98.03

This shows that, for a male and a female with an aptitude score of 70, the female is predicted to have an outcome score that is 4.9 points higher than the male's. Now, let us compute the predicted outcome scores for a male and a female who have aptitude scores of 60:

Predicted male outcome score = 56.03 + .53(60) = 87.83

Predicted female outcome score = 56.03 + .60(60) = 92.03

In this case, the female is predicted to have an outcome score that is 4.2 points higher than the male's. Thus, the bias (i.e., the degree to which the predicted outcome score

differs for males and females who have the same level of aptitude) is relatively small at relatively low levels of aptitude but larger at higher levels of aptitude. That is, the discrepancy between male and female predicted scores tends to increase as scores on the aptitude test increase. This type of pure slope bias is illustrated in Figure 1.6, which shows that the regression lines for males and for females gradually move apart.

Intercept and Slope Bias

So far, we have illustrated pure intercept bias and pure slope bias, cases in which either the intercept is biased or the slope is biased, but not both. To summarize, pure intercept bias indicates that there is a discrepancy between groups' predicted scores and that the size of this discrepancy does not change as aptitude scores increase or decrease. In contrast, pure slope bias indicates that the size of this discrepancy does change as aptitude scores increase or decrease. It is also possible (perhaps even more likely than either form of pure bias) for intercept and slope biases to exist simultaneously. In this case, there will be a complex relationship between aptitude scores and the predicted outcome scores for the different groups. For example, we might find that, for people who have low levels of aptitude, the predicted outcome scores for males are higher than the predicted outcome scores for females, whereas for people who have high levels of aptitude, the predicted outcome scores for males are lower than those for females. Although many patterns of discrepancy might occur, one possible outcome of this type is illustrated in Figure 1.7.

Standard Error of Estimate

The standard error of estimate is a value that represents the accuracy of predictions of an outcome score based on scores from the test used to make the prediction. In our example, we

are using aptitude test scores to predict scores on a training program effectiveness outcome measure. Our aptitude test would be considered biased if the group-level se_e values differed from the common se_e value. A bias of this kind would indicate that you can make more accurate predictions for people in one group than for people in the other.

Outcome Score Bias

Our discussion of predictive bias has focused on the possibility that scores on the predictor test are biased. However, it is also possible that scores on the outcome variable are biased. For example, the supervisors who provide the post-training ratings of competence might be biased in favor of one group and against another. A test used to measure outcomes, such as our 100-item mechanical competency test, could also be biased. We have been assuming that the outcome measure is unbiased, but of course it might not be.

The Effect of Reliability

As a final note, we should acknowledge that the standard error of estimate, the regression coefficient, and the intercept are all sensitive to test reliability. In our discussion of predictive bias, we have been assuming high predictor test and outcome test score reliabilities (e.g., R_XX greater than .90). A drop in test score reliability can have a profound effect on these parameters and can thereby, at least potentially, affect predictive bias. These effects are complex and beyond the scope of our discussion; interested readers are referred to Jensen (1980).

Summary

We have been focusing on test bias, which traditionally refers to the possibility that true differences among groups are systematically obscured (or artificially created) by a test. Although there are widely used methods for coping with response biases, the methods that have been proposed for coping with test bias tend to be somewhat controversial and are beyond the scope of our current

discussion. For a recent survey of the issues, interested readers are directed to Sackett, Schmitt, and Ellingson (2001).

In sum, the validity of test score interpretation and use is a fundamental concern for behavioral scientists who are interested in psychological measurement. Through decades of conceptual and methodological development, psychometricians, test users, and test developers have articulated the meaning and evaluation of validity. Although threats to validity do exist, psychologists and others interested in psychological measurement have made great strides in identifying such threats and in developing strategies for detecting, preventing, or minimizing them. Nevertheless, psychological tests should always be used and interpreted with close regard for the theoretical and evidential basis of their meaning and application.

Table 1.1
Item Discrimination Index Example

N_top = 20        N_bottom = 20        I_d = (proportion correct, top 30%) − (proportion correct, bottom 30%)

Item    Correct (Top 30%)    Correct (Bottom 30%)    Proportion Correct (Top)    Proportion Correct (Bottom)    I_d
 1             20                     2                       1.00                        0.10                   .90
 2             18                     5                       0.90                        0.25                   .65
 3             16                     5                       0.80                        0.25                   .55
 4             15                     8                       0.75                        0.40                   .35
 5             13                     9                       0.65                        0.45                   .20
 6             11                    10                       0.55                        0.50                   .05
 7              9                     3                       0.45                        0.15                   .30
 8              6                     2                       0.30                        0.10                   .20
 9              5                     9                       0.25                        0.45                  −.20
10              2                     1                       0.10                        0.05                   .05

Note: I_d does not depend solely on the proportion of test takers who answer an item correctly. For example, items 5 and 8 have exactly the same I_d (.20), although far fewer people answered item 8 correctly (8 of 40 versus 22 of 40). Although we used the top and bottom 30% of scorers to define the comparison groups, the cutoff does not have to be 30%; in practice, the percentage tends to range from 25% to 33%.
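The I_d values in Table 1.1 can be recomputed directly from the counts of correct answers in each comparison group. A minimal sketch in Python:

```python
# Item discrimination index: I_d = (proportion correct in the top group)
#                                - (proportion correct in the bottom group).
# Counts of correct answers per item from Table 1.1 (N = 20 per group).
top = [20, 18, 16, 15, 13, 11, 9, 6, 5, 2]
bottom = [2, 5, 5, 8, 9, 10, 3, 2, 9, 1]

i_d = [round(t / 20 - b / 20, 2) for t, b in zip(top, bottom)]
print(i_d)   # [0.9, 0.65, 0.55, 0.35, 0.2, 0.05, 0.3, 0.2, -0.2, 0.05]
```

Note that item 9 comes out negative: more low scorers than high scorers answered it correctly, which is why a negative I_d flags a problematic item.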

Table 1.2
Data for Illustrating Regression Analysis

                Aptitude       Supervisor       Predicted
Trainee       Test Score         Rating       Supervisor Rating
   1              32               75              74.59
   2              40               80              79.23
   3              57               81              89.09
   4              60               98              90.83

Standard error of estimate (se_e) = 6.76

Figure 1.1

Figure 1.2

Figure 1.3

Figure 1.4
Scatterplot and Regression Line for Trainees' Aptitude Scores and Supervisor Ratings

[Scatterplot of aptitude test scores (x-axis, ranging from 30 to 60) against supervisor ratings (y-axis, ranging from 70 to 100), with the regression line drawn through the four data points.]

Figure 1.5

Figure 1.6

Figure 1.7