Navigating Student Ratings of Instruction

Sylvia d'Apollonia and Philip C. Abrami
Concordia University

Many colleges and universities have adopted student ratings of instruction as one (often the most influential) measure of instructional effectiveness. In this article, the authors present evidence that although effective instruction may be multidimensional, student ratings of instruction measure general instructional skill, which is a composite of three subskills: delivering instruction, facilitating interactions, and evaluating student learning. The authors subsequently report the results of a meta-analysis of the multisection validity studies indicating that student ratings are moderately valid; however, administrative, instructor, and course characteristics influence student ratings of instruction.

In navigating oceans, sailors depend on a variety of nautical instruments. Their success in traversing poorly charted seas is founded on the knowledge that the different instruments measuring latitude will provide similar readings of latitude but different readings of longitude. Similarly, when practitioners interpret student ratings, they expect different measures of the trait in question (i.e., instructional effectiveness) to give substantially similar results. However, they expect measures of irrelevant traits (e.g., athletic prowess, grading leniency, charisma) to give substantially different results. That is, they expect student ratings of instruction to reflect the true impact of teaching on student learning.

Student ratings of instruction, first introduced into North American universities in the mid-1920s (Doyle, 1983), have been the subject of a huge body of literature, including primary studies, reviews, and books.
This literature has dealt not only with the psychometric properties of student ratings (Costin, Greenough, & Menges, 1971; Doyle, 1983; Feldman, 1978; Marsh, 1984) and the factor structure of ratings (Kulik & McKeachie, 1975; Marsh, 1991a, 1991b; McKeachie, 1979) but also with practical guides to faculty evaluation (Arreola, 1995; Braskamp & Ory, 1994; Centra, 1993). Many researchers have concluded that the reliability (Centra, 1993) and the validity (Abrami, d'Apollonia, & Cohen, 1990; Cohen, 1981; Feldman, 1989, 1990; Marsh, 1987) of student ratings are generally good and that student ratings are the best, and often the only, method of providing objective evidence for summative evaluations of instruction (Scriven, 1988). However, a number of controversies and unanswered questions persist. Our objective in this article is to briefly and selectively review some of the literature on student ratings of instruction that addresses the following three questions: What is the structure of student ratings of instructional effectiveness? Is instructional effectiveness substantially correlated with other indicators of instructor-mediated learning in students? To what extent do student ratings confound irrelevant variables with instructional effectiveness?

What Is the Structure of Student Ratings of Instructional Effectiveness?

Most postsecondary institutions have adopted student ratings of instruction as one (often the most influential) measure of instructional effectiveness. Student rating forms are paper-and-pencil instruments on which students indicate their responses to items on some numerically based scale. Specific items such as "Does the instructor speak clearly?" and "Rate the degree to which the instructor was friendly toward students" tap specific instructional dimensions such as presentation skills and rapport. Global items such as "How would you rate the overall effectiveness of the instructor?" tap students' overall impressions of the instructor.
Omnibus student rating forms, such as the Student Instructional Rating System (Michigan State University, 1971), the Students' Evaluation of Educational Quality (SEEQ; Marsh, 1987), and the Student Instructional Report (Linn, Centra, & Tucker, 1975), are usually multidimensional, consisting of several items tapping specific dimensions. They also may include global items assessing students' overall impressions of the course, the instructor, and the value of the course.

Characteristics of Effective Instruction

Items on student rating forms reflect the characteristics that experts believe (a) can be judged accurately by students and (b) are important to teaching. Nevertheless, researchers define instructional effectiveness from a number of different perspectives (Abrami, d'Apollonia, & Rosenfield, 1996). These definitions may focus on either aspects of the instructional process (e.g., preparation of course materials, provision of feedback, grading) or the products that effective instruction promotes in students (e.g., subject-matter expertise, skill in problem solving, positive attitudes toward learning). However, in the past 30 years, faculty members have begun to use a variety of teaching methods other than didactic teaching. These alternative methods range from interactive seminars, laboratory sessions, and cooperative learning to more independent methods such as computer-assisted instruction, individualized instruction, and internships. Whether the definitions of instructional effectiveness are based on instructional products, processes, or the causal relationships between the two, they do not necessarily generalize across these instructional contexts.

Many multidimensional student rating forms contain items that assess the instructor's beliefs, knowledge, attitudes, organization and preparation, ability to communicate clearly and to stimulate students, interaction with students, feedback, workload, and fairness in grading. Feldman (1976, 1983, 1984) reviewed studies that either investigated students' statements about superior instructors or correlated students' evaluations of instructors with instructors' characteristics. Feldman developed several schemas consisting of high-inference categories that subsequently have been used to code items in different student rating forms (Abrami, 1988; Abrami & d'Apollonia, 1990; Marsh, 1991a, 1991b). Abrami and d'Apollonia (1990) explored the consistency and uniformity of the student rating forms used in multisection validity studies by coding and analyzing the items using the aforementioned coding schema.

Author note: Sylvia d'Apollonia and Philip C. Abrami, Centre for the Study of Learning and Performance, Concordia University, Montreal, Quebec, Canada. The research reported in this article was supported by grants from the Social Sciences and Humanities Research Council (Government of Canada) and the Fonds pour la formation de chercheurs et l'aide à la recherche (Government of Quebec). Correspondence concerning this article should be addressed to Sylvia d'Apollonia, Centre for the Study of Learning and Performance, Concordia University, 1455 West DeMaisonneuve Boulevard, Montreal, Quebec, Canada H3G 1M8. Electronic mail may be sent via Internet to sdapoll@vax2.concordia.ca.

November 1997, American Psychologist, Vol. 52, No. 11, 1198-1208. Copyright 1997 by the American Psychological Association, Inc. 0003-066X/97/$2.00.
They concluded that student rating forms, each purported to measure instructional effectiveness, were not consistent in their operational definitions of instructional effectiveness. Thus, no one rating form represents effective instruction across contexts.

Factor Studies of Individual Student Rating Forms

A number of student rating forms have been factor analyzed to determine their dimensional structure (Frey, Leonard, & Beatty, 1975; Linn et al., 1975; Marsh, 1987). For example, Marsh and Hocevar (1984, 1991) identified nine instructional dimensions on the SEEQ and demonstrated that the factor structure of the SEEQ was invariant across different groups of students, academic disciplines, instructor levels, and course levels. However, the generalization of the factor structure across different pedagogical methods (e.g., lectures, small discussion groups) was not explored. Some researchers, notably Marsh and his colleagues (Marsh, 1983a; Marsh & Hocevar, 1984, 1991), have interpreted the factor invariance of student rating forms as providing evidence for the construct validity of distinct instructional dimensions. However, Abrami and d'Apollonia (1990, 1991), Abrami et al. (1996), and Feldman (1976) have criticized the interpretation of factor analysis as evidence of the construct validity of student ratings. They contended that there were a number of methodological problems with this use of factor analysis. First, factor solutions reflect the items in the specific rating instrument in question; they do not necessarily reflect the construct in question (i.e., instructional effectiveness). Second, analysts often extract all factors having eigenvalues greater than one. Zwick and Velicer (1986) reported that this practice seriously overestimates the number of factors and extracts minor factors at the expense of the first principal component. Finally, the choice of factor analytical method determines the factor solution.
For example, factor analysis conducted without rotation is designed to resolve one general or global component, whereas factor analysis conducted with rotation is designed to resolve specific factors of equal importance (Harman, 1976). However, factor solutions are indeterminate (Gorsuch, 1983); many solutions exist, each resolving exactly the same amount of the total variance. Therefore, the solution that best describes the construct cannot be determined solely empirically. The solution must be supported by theory and by empirical research other than factor analysis.

Abrami and d'Apollonia (1991) reanalyzed the factor structure of the SEEQ by reconstructing the reproduced correlation matrix from the published pattern and factor correlation matrices (Marsh & Hocevar, 1984, 1991). They subsequently factor analyzed the estimated correlation matrix and replicated Marsh and Hocevar's (1984, 1991) results within rounding error. They argued that because 31 of the 35 items loaded heavily (>.60) on the first principal component, the principal-components solution was interpretable and provided evidence of a Global Instructional Skill factor. The two global items (Overall Instructor Rating and Overall Course Rating) had the highest loadings (.94) on the first principal component, which explained almost 60% of the total variance in ratings. The remaining four items, concerning course difficulty and workload, loaded heavily on the second principal component, which explained an additional 11% of the variance. The results of similar reanalyses of the factor studies for seven other multidimensional student rating forms are presented in Table 1. The principal-components solutions indicate that for six of the seven student rating forms, there was a large global component, with 30% or more of the variance in student responses resolved by the first principal component.
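The reanalysis strategy described above (rebuilding a reproduced correlation matrix from a published pattern matrix P and factor correlation matrix Phi, then re-factoring it) can be sketched as follows. The matrices below are small hypothetical examples, not the published SEEQ values, and the dominant eigenvalue is recovered with a simple power iteration:

```python
# Hypothetical 4-item, 2-factor pattern matrix P and factor correlation
# matrix Phi -- illustrative values, NOT those published for the SEEQ.
P = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.9], [0.2, 0.8]]
Phi = [[1.0, 0.6], [0.6, 1.0]]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Reproduced correlation matrix: R_hat = P * Phi * P', with unit diagonal.
R_hat = matmul(matmul(P, Phi), [list(col) for col in zip(*P)])
for i in range(len(R_hat)):
    R_hat[i][i] = 1.0

# Power iteration recovers the largest eigenvalue, i.e., the variance
# resolved by the first principal component of the reproduced matrix.
v = [1.0] * len(R_hat)
for _ in range(200):
    w = [sum(r * x for r, x in zip(row, v)) for row in R_hat]
    scale = max(abs(x) for x in w)
    v = [x / scale for x in w]
lam1 = sum(sum(r * x for r, x in zip(row, v)) * vi
           for row, vi in zip(R_hat, v)) / sum(x * x for x in v)

# The trace of a correlation matrix equals the number of items, so the
# first component's share of total variance is lam1 / (number of items).
pc1_share = lam1 / len(R_hat)
print(round(pc1_share, 2))
```

With these correlated factors, a single component resolves roughly three quarters of the total variance, illustrating how an obliquely rotated multifactor solution can coexist with a strong global component.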
In general, most of the factor studies can be interpreted as providing evidence for one global component rather than several specific factors. Thus, the factor studies do not provide incontestable evidence that student rating forms measure distinct instructional factors. An alternative, and mathematically equivalent, interpretation is that student rating forms measure a global component, General Instructional Skill. Factor analysts may rotate the principal-components solution and reify specific factors, but these specific factors may be artifacts of the analytical procedures. The selection of which interpretation (specific factors or general skill) reflects instructional effectiveness as perceived by students cannot be made on the basis of further factor analyses, confirmatory or otherwise. Rather, this choice must be based on theory and evidence using other independent analytical methods.

Table 1
Principal-Components Solution of Student Rating Forms

Percentage of variance extracted by principal components, in order of extraction (eigenvalues > 1.0)

Factor study                                          1     2     3     4     5     6
Linn, Centra, & Tucker (1975)                      14.3  11.1   9.2   8.6   5.9   4.3
Bolton (personal communication, October 7, 1990);
  Bolton, Bonge, & Marr (1979)                     45.1  10.1   8.2   7.6   5.9
Michigan State University (1971)                   34.5  11.7  10.5   8.5   6.5
Frey, Leonard, & Beatty (1975)                     39.9  14.5   9.5   7.2   6.5   5.1
Hartley & Hogan (1972)                             39.2   5.7   4.4
Mintzes (1977)                                     30.0   7.5   6.7   5.9   5.2
Pambookian (1972)                                  33.4  26.1   9.7   7.8   6.6   5.2

Information-Processing Models of Student Ratings

Information-processing models of performance ratings (Cooper, 1981; Fiske & Neuberg, 1990; Judd, Drake, Downing, & Krosnick, 1991; Murphy, Jako, & Anhalt, 1993; Nathan & Lord, 1983) suggest that individuals minimize cognitive load by using common prototypes to organize their schemas of persons, roles, and events. Raters use superordinate features, such as general impressions, to attend to, store, retrieve, and integrate judgments of specific behaviors. To the degree that items on rating forms are semantically similar, they function as retrieval cues, activating the same node in the rater's associative network. Furthermore, because persons, roles, and behaviors are stored together, once an individual's personality schema is activated, specific behaviors can be inferred from roles, or vice versa (Trzebinski, 1985). B. E. Becker and Cardy (1986) and Cooper (1981) argued that this reliance on general impressions (halo effects) not only does not distort ratings but also may increase accuracy.
Thus, according to information-processing models of performance ratings, factor studies of student ratings reflect not only the structure of instructional effectiveness but also students' cognitive processes while they are rating their instructors. Research in field settings also demonstrates that students rate specific dimensions of instruction on the basis of their global evaluation. Marsh (1982, 1983b) reported that the halo effect accounted for 19% of the variance in student ratings. Whitely and Doyle (1976) demonstrated in a classroom setting that the factor structure of student ratings of instruction corresponded to students' valid implicit theories of teaching. Similarly, Murray's (1984) study indicated that students' global ratings of instructors were accurate when compared with trained observers' ratings of specific behaviors of instructors.

Abrami and d'Apollonia (1990, 1991) suggested that the general question that needs to be addressed is the dimensionality of instructional effectiveness across different forms and different types of classes. That is, the important question is not whether one student rating form has a robust factor structure but rather whether the construct of instructional effectiveness has a common meaning or structure across different forms and contexts. This question has been approached both logically and empirically.

Logical approaches to integration of factor studies. Several researchers (Feldman, 1976; Kulik & McKeachie, 1975; Marsh, 1987, 1991a, 1991b; Widlak, McDaniel, & Feldhusen, 1973) logically integrated factor studies across forms purporting to measure the same dimensions of effective instruction. Feldman as well as Widlak et al.
concluded that across multidimensional student rating forms, teaching is evaluated in terms of three roles: presenter, facilitator, and manager.¹

Empirical approaches to integration of factor studies. Abrami and his colleagues (Abrami et al., 1996; d'Apollonia, Abrami, & Rosenfield, 1993) used multivariate approaches to meta-analysis to uncover the common factor structure across student rating forms. They coded the 458 items in 17 student rating forms into common instructional categories, extracted a reliable² set of intercategory correlation coefficients from the reproduced correlation matrices, aggregated them to produce an aggregate correlation matrix, and subsequently factor analyzed that correlation matrix. They reported that there was a large common principal component across student rating forms that explained about 63% of the variance in instructional effectiveness. When the factor solution was rotated obliquely, four first-order factors were obtained. The first factor, in order of importance, was tentatively identified as representing the instructor's role in delivering information. It included the global scales (Overall Course and Instructor Ratings) as well as such behaviors as monitoring of learning, enthusiasm for teaching, clarity, presentation, and organization. The second factor was tentatively identified as representing the instructor's role in facilitating a social learning environment. It included such behaviors as general attitudes, concern for students, availability, respect for others, and tolerance of diversity. The third factor was tentatively identified as representing the instructor's role in evaluating student learning. It included such behaviors as evaluation, feedback, and the provision of a friendly classroom atmosphere. The fourth factor, which included four miscellaneous teaching behaviors (maintaining discipline, choosing appropriate required materials, knowing the content, and using instructional objectives), was not interpretable.

This first-order factor structure is very similar to the three clusters of instruction described by Widlak et al. (1973) and by Feldman (1976). When Factor 3 and Factor 4 from Abrami et al.'s (1996) study are combined, 18 of the 20 instructional categories common to Abrami et al.'s study and Feldman's (1976) study have the same distribution. Abrami et al. subsequently conducted a second-order factor analysis and reported that both the second-order factor analysis and the principal-components analysis indicated that student ratings of instruction measure General Instructional Skill, which consists of three subskills: delivering instruction, facilitating interactions, and evaluating student learning.

¹ Chau (1994) conducted a confirmatory higher order factor analysis on responses to the SEEQ and also demonstrated the presence of the same three factors.
² Intercategory correlation coefficients for 195 items were eliminated from the data set because the values were not consistent (homogeneous) across student rating forms.
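The aggregation step in this procedure (pooling the same intercategory correlation across forms before factoring the aggregate matrix) is commonly done with Fisher's z transformation. A minimal sketch for one cell of the matrix, using hypothetical coefficients and sample sizes rather than values from d'Apollonia et al.:

```python
import math

# Hypothetical correlations between the same two instructional categories
# (e.g., "clarity" and "overall rating") as reported by three different
# rating forms, together with their sample sizes -- illustrative only.
rs = [0.62, 0.55, 0.70]
ns = [120, 300, 90]

# Aggregate across forms via Fisher's z: transform each r to z,
# weight by n - 3 (the inverse sampling variance of z), average,
# and back-transform to the correlation metric.
zs = [math.atanh(r) for r in rs]
ws = [n - 3 for n in ns]
z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
r_bar = math.tanh(z_bar)
print(round(r_bar, 2))
```

Repeating this for every pair of instructional categories yields the aggregate correlation matrix that was then factor analyzed.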
In conclusion, Marsh and his colleagues (Marsh & Hocevar, 1984, 1991) contended that the factor invariance of well-designed rating forms, such as the SEEQ, supports the construct validity of specific instructional dimensions. However, an equally viable interpretation of the factor studies is that student rating forms measure a global component, General Instructional Skill. This interpretation is supported by three independent lines of evidence: secondary factor analyses, studies on human information processing during performance ratings, and the integration of factor studies.

Is Instructional Effectiveness Correlated With Other Indicators of Student Learning?

One important criterion for the validity of student ratings of instruction is that the teaching processes reflected in student rating forms should cause (or at least predict) student learning. That is, if student ratings measure instructional effectiveness, students whose instructors are judged the most effective also should learn more. One method for determining this aspect of the validity of student ratings is the multisection validity design.

Multisection Validity Design

In the multisection validity design, first used by Remmers, Martin, and Elliot (1949), researchers correlate the section-average score on student ratings with the section-average score on a common achievement test across multiple sections of a course. In the strongest multisection designs, all sections use a common textbook and a common syllabus. Ideally, either students are randomly assigned to sections or student ability is statistically controlled.
Although this validation design has been criticized by Marsh (1984, 1987; Marsh & Dunkin, 1992) on a number of grounds (problems of statistical validity, problems with the measurement of both instructional effectiveness and student learning, and problems with rival explanations), it minimizes the extent to which the correlation between student ratings and achievement can be explained by factors other than instructor influences. Abrami et al. (1996) argued that although the multisection validity design is not perfect, it has high internal validity. To date, more than 40 studies have used the multisection validity design to determine the validity of student ratings, and there have been numerous reviews, both qualitative and quantitative. Because this validation design has been used so extensively, with many student rating forms under diverse conditions, it provides the most generalizable evidence for the validity of student ratings.

Meta-Analysis of Multisection Validity Studies

The multisection validity literature has been extensively reviewed both qualitatively (Feldman, 1978; Kulik & McKeachie, 1975; Marsh, 1987) and quantitatively (Abrami, 1984; Abrami et al., 1990; Cohen, 1981, 1982; d'Apollonia & Abrami, 1996; Dowell & Neal, 1982; Feldman, 1989; McCallum, 1984). However, not only do the primary researchers reach opposite conclusions about the validity of student ratings, but so do some of the reviewers. For example, Cohen (1981) concluded that student ratings were valid measures of teaching effectiveness, whereas Dowell and Neal and McCallum concluded that ratings, at best, were poor measures of teaching effectiveness. Abrami, Cohen, and d'Apollonia (1988) explored the reasons for the different conclusions.
They compared six published meta-analyses of the multisection validity studies (Abrami, 1984; Cohen, 1981, 1982, 1983; Dowell & Neal, 1982; McCallum, 1984) and identified a number of decisions made by the reviewers that biased their findings. For example, reviewers differed in their inclusion criteria, in the number of outcomes they extracted, and in the analytical techniques they used. They recommended that future quantitative reviewers improve both the coding of study features and the analysis.
Subsequently, d'Apollonia and Abrami (1996) conducted a meta-analysis in which they extracted all the validity coefficients from 43 multisection validity studies (N = 741), coded them on the basis of the aforementioned common factor structure across rating forms, and used multivariate methods of meta-analysis (Raudenbush, Becker, & Kalaian, 1988) to model the dependencies among the data and to calculate population parameters. They reported that the mean validity coefficient of student ratings of general instructional skill across the 43 studies was .33, with the 95% confidence interval extending from .29 to .37.³ However, the correlation between student ratings and student learning was attenuated by unreliability in both the rating and achievement instruments. The average reliabilities of the student rating and achievement instruments in the multisection validity studies were .74 and .69, respectively. Therefore, when the correlation coefficient between student ratings of general instructional skill and student learning was corrected for attenuation (Downie & Heath, 1974), the correlation became .47, with a 95% confidence interval extending from .43 to .51. Thus, there was a moderate to large association between student ratings and student learning, indicating that student ratings of General Instructional Skill are valid measures of instructor-mediated learning in students. The variability in the validity coefficients reported by the primary investigators can be explained, in large part, by differences in the sampling variance of the individual studies.

Do Student Ratings Confound Irrelevant Variables With Instructional Effectiveness?

Some critics of student ratings argue that student ratings are not valid indices of instructor-mediated learning in students because irrelevant (biasing) characteristics are correlated with ratings.
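The attenuation correction applied in the preceding meta-analysis is Spearman's standard formula, dividing the observed correlation by the square root of the product of the two reliabilities. A quick check with the values reported above reproduces the corrected coefficient within rounding:

```python
import math

# Values reported in the meta-analysis above: observed validity .33,
# mean rating-form reliability .74, mean achievement-test reliability .69.
r_observed = 0.33
rel_ratings = 0.74
rel_achievement = 0.69

# Spearman's correction for attenuation: r_c = r / sqrt(r_xx * r_yy).
r_corrected = r_observed / math.sqrt(rel_ratings * rel_achievement)
print(round(r_corrected, 2))
```

The result is approximately .46, matching the reported .47 to within the rounding of the published inputs.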
Validation techniques based on construct-validity approaches, such as multitrait-multimethod approaches, assume that characteristics that reduce the discriminant validity of an instrument are irrelevant. However, the characteristics that covary with student ratings may be relevant, acting as intermediate variables. For example, if student ratings were correlated with grades, some critics (e.g., Greenwald & Gillmore, in press) would contend that student ratings are biased because students might be rewarding their instructors for lenient grading practices. However, as discussed by Abrami et al. (1996), Abrami and Mizener (1985), Marsh (1980, 1987), and Feldman (in press), this interpretation is correct only if student learning is not causally affected by the putative biasing variable. If instructors graded students leniently but this practice had no effect on learning, grading leniency would be a biasing variable. Alternatively, if grading leniency enhanced students' perceptions of efficacy, encouraged students to work harder, or facilitated student learning, it would not be an irrelevant or biasing variable. Although many variables have been shown to influence student ratings of instruction, unless they can be shown to moderate the validity coefficient (the correlation between student ratings and student learning), they cannot be described as biasing variables. Cohen (1981, 1982) reported that three study features (instructor experience, timing, and evaluation bias) moderated the validity coefficient of student ratings and, therefore, may bias student ratings. However, Abrami et al. (1988) suggested that the failure to identify more biasing variables may have been the result of both the incomplete identification of possible moderators and low statistical power. Therefore, they recommended that additional meta-analyses be conducted to address these two issues.
Subsequently, d'Apollonia and Abrami (1996, 1997) used nomological coding (Abrami et al., 1990) and more sophisticated meta-analytical techniques to conduct a more thorough meta-analysis of the multisection validity studies. They extracted all the validity coefficients from 43 multisection validity studies, coded them on the basis of the degree to which they represented general instructional skill, and coded 39 study features. d'Apollonia and Abrami (1996, 1997) reported that 23 individual study features representing methodological and implementation characteristics, instrument characteristics (rating forms and achievement tests), and explanatory characteristics (instructors, students, and settings) significantly moderated the validity of student ratings of instruction. For example, the research quality of the study examining the strength of the association between instructors' effectiveness and student learning influenced the size of the correlation. Studies in which instructors graded their own students' examinations, in which instructors knew the content of the final examination, and in which prior differences in student ability were not controlled reported lower validity coefficients than studies that controlled for these possible sources of bias. In addition to these study-quality variables, the rank, experience, and autonomy of the instructors and the discipline and size of the multisection course also biased the validity of student ratings of instruction.

d'Apollonia and Abrami (1996, 1997) subsequently constructed a hierarchical multiple regression model consisting of the following study features, entered in this sequence: structure (Factor 1: Presenting Material, Factor 2: Facilitating Interaction, Factor 3: Evaluating Learning), source of study (theses or articles), timing of the evaluation (after or before students knew their final grades), group equivalence (none, statistical control, or experimental control), and rank (teaching assistants or faculty members). Because many of the primary studies did not report the values for some of these study features (e.g., only about 50% of the primary studies reported the timing of the evaluation), a subset of the data (225 outcomes) for which complete information was given on these features was selected. The goodness-of-fit statistic (Q_E) was 210.5 (df = 200), indicating that the aforementioned study features explained all the systematic variation in the subset of data.

In conclusion, the published multisection validity literature suggests that under appropriate conditions (all instructors are faculty members, evaluation is carried out prior to students' knowing their final grades, and sections are equivalent in terms of students' prior ability or equivalence is experimentally controlled), and when the validity coefficient is corrected for attenuation, more than 45% of the variation in student learning among sections can be explained by student perceptions of instructor effectiveness. This 45% figure is, of course, an estimate of validity under the above circumstances.

³ Cohen (1981, 1982) conducted a meta-analysis using the method developed by Glass (1978) in which the unit of analysis is the outcome. Each study provided one or more estimates (outcomes) of the mean population validity coefficient. However, individual studies usually vary in sample size and, therefore, have different sampling variances. In the Glassian approach to meta-analysis, this variation in sampling variance is not taken into account, and all studies are treated equally. Thus, in Cohen's meta-analyses, the 95% confidence interval was computed with the outcome (number of courses in the multisection validity studies) as the unit of analysis and included the error due to the individual sampling variances. However, d'Apollonia and Abrami (1996) used the meta-analysis approach developed by Hedges and Olkin (1985) in which the variation in sampling variances among studies is taken into account; studies with smaller sample sizes (and larger sampling variances) contribute less in calculating the average. Thus, the population mean validity is a weighted average of all the estimates, where each estimate is weighted by the inverse of its sampling variance. Because the 95% confidence interval does not include these individual sources of error, it is smaller than the confidence interval computed on the basis of the unweighted approach of Glass (1978). There is controversy over whether the number of outcomes (number of courses in the multisection validity studies) or the number of participants (number of instructors in the multisection validity studies) should be the basis of the computation of the confidence interval (Rosenthal, 1995). Because the question we addressed concerns the validity of student ratings of instruction across instructors and not across multisection courses, we decided to use the instructor as the unit of analysis.
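The inverse-variance weighting and the homogeneity test that underlie these results can be sketched with a handful of hypothetical validity coefficients (the actual meta-analysis pooled 741 coefficients from 43 studies). The chi-square tail probability uses Fisher's classic normal approximation, which suffices at this scale:

```python
import math
from statistics import NormalDist

# Hypothetical validity coefficients and sample sizes for five studies --
# illustrative only, not values from the published meta-analysis.
rs = [0.40, 0.25, 0.55, 0.30, 0.10]   # observed validity coefficients
ns = [10, 25, 8, 40, 15]              # number of sections per study

zs = [math.atanh(r) for r in rs]      # Fisher z transform
ws = [n - 3 for n in ns]              # weight = inverse sampling variance of z

# Weighted mean validity in the Hedges & Olkin (1985) sense.
z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
r_bar = math.tanh(z_bar)

# Homogeneity statistic Q: weighted squared deviations from the mean,
# approximately chi-square distributed on k - 1 degrees of freedom.
Q = sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs))
df = len(rs) - 1

def chi2_upper_p(q, k):
    """Fisher's normal approximation to the chi-square upper-tail probability."""
    return 1 - NormalDist().cdf(math.sqrt(2 * q) - math.sqrt(2 * k - 1))

# The same approximation applied to the reported Q_E = 210.5 on df = 200
# gives a clearly nonsignificant p, consistent with the authors'
# conclusion that the study features explained the systematic variation.
p_reported = chi2_upper_p(210.5, 200)
print(round(r_bar, 2), round(Q, 2), round(p_reported, 2))
```

A Q close to (or below) its degrees of freedom, as in both the toy data and the reported Q_E, indicates a homogeneous set of effect sizes.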
In practice, with the usual reliability of measures and with the rating context only partially matching this set of conditions, the validity coefficient will be different. In any case, the data set is homogeneous, indicating that across different students, courses, and settings, student ratings are consistently valid.

Policy Implications and Recommendations

Many experts in faculty evaluation consider that the validity of summative evaluations based on student ratings is threatened by inappropriate data collection, analysis, reporting, and interpretation (Arreola, 1995; Franklin & Theall, 1990; Theall, 1994). The research described in this article indicates that student ratings of instruction measure General Instructional Skill; that interpretations of general instructional skill are substantially correlated with student learning; and that, with the exception of a few variables that influence the quality of evaluations (structure, timing, section equivalence, etc.), student ratings of instruction are not plagued by biasing variables. In this section, we briefly discuss policy implications on the use of student ratings of instruction for summative evaluation.

Use of Student Rating Forms for Summative Faculty Evaluation

Although most researchers agree that teaching is multidimensional, they disagree on whether global or specific ratings should be used for summative decisions on retention, tenure, promotion, and salary. Some researchers (Frey et al., 1975; Marsh, 1984, 1987) have recommended that summative evaluations be based on a set of specific scores derived from multidimensional rating forms. They argue that because teaching is multifaceted, it can be evaluated best by measures that reflect this dimensionality. Because there is little agreement on which pedagogical behaviors necessarily define effective instruction, the specific factors in the set can be differentially weighted to meet individual needs.
Other researchers (Abrami, 1985, 1988, 1989; Abrami & d'apollonia, 1990, 1991; Cashin & Downey, 1992; Cashin, Downey, & Sixbury, 1994) have argued, equally vigorously, that global ratings (or a single score representing a weighted average of the specific ratings4) are more appropriate. They have argued that different student rating forms assess different dimensions of effective instruction and that because specific ratings are less reliable, valid, and generalizable than global ratings, they should not be used for personnel decisions. The factor analysis across student rating forms indicates that specific dimensions are not distinct. Rather, they are chunked during rating into three factors representing three instructional roles. Moreover, these factors are highly correlated and can be represented by a single hierarchical factor--General Instructional Skill. Thus, the research summarized in this article supports the view that global ratings or a single score representing General Instructional Skill should be used for summative evaluations. Such global assessments of instructional effectiveness are judgmental composites "of what the rater believes is all of the relevant information necessary for making accurate ratings" (Nathan & Tippins, 1990, p. 291). Moreover, specific ratings of instructors' effectiveness add little to the explained variance beyond that provided by global ratings (Cashin & Downey, 1992; Hativa & Raviv, 1993). Therefore, we recommend that summative evaluations be based on a number of items assessing the overall effectiveness of instructors.

Best Administrative Practices

When student ratings are to be used for summative decisions, it is important that the circumstances under which student ratings are collected and analyzed be appropriate and rigorously applied. Although the composition of student rating forms explains a large amount (approximately 45%) of the systematic variation in the validity of student ratings, items must be chosen with care. Because specific items have lower validity coefficients and may not generalize across different instructional contexts, we recommend that short rating forms with a few global items be used. Such scales have higher validity coefficients than most scales containing a large number of heterogeneous items. The association between student ratings and student learning was also significantly higher when instructional evaluation was carried out after, rather than before, the final examination. Students may be rewarding instructors who have given them high grades, or they may be using grades or their performance on the final examination as one indicator of instructional effectiveness. In either case, the relationship between instructional behavior and student learning may be confounded. Thus, because of this ambiguity, we recommend that student ratings be collected before the final examination. However, other administrative conditions, such as student anonymity, the stated purpose of the evaluations, or whether the instructors themselves carried out the evaluation procedure, did not significantly influence the validity of student ratings of instruction. Nevertheless, for ethical and legal reasons, standardized procedures should be used.

4 Marsh (1991a, 1991b, 1994, 1995) agreed with Abrami that a weighted average of specific dimensions is a good compromise. However, there is no consensus on how to weight the specific dimensions.

November 1997 American Psychologist 1203

Controlling for Bias in Student Ratings of Instruction

The consensus has been that biasing variables play a minor role in student ratings of instruction (Marsh, 1987; Murray, 1984). However, the research summarized in this article indicates that instructors' rank, experience, and autonomy; the course discipline and class size; and section inequivalence moderate the validity of student ratings.
In addition, some researchers have suggested that student motivation, course level, instructors' grading leniency, and instructors' expressivity may also bias student ratings of instructors. For example, students rate instructors more accurately when they evaluate full-time faculty teaching large science courses than when they evaluate teaching assistants teaching medium-sized classes in language courses. Similarly, the validity of student ratings is higher when students are randomly assigned to sections than when there are preexisting section inequivalences in student ability. Because instructors have no control over variables such as class size and student registration, some researchers suggest that rating scores be either norm referenced or statistically controlled for putative biasing variables. Cashin (1992) and Aleamoni (1996) argued that student ratings of instruction should be interpreted relative to those of a reference group teaching similar students in similar courses in the same department. This solution has been criticized by Hativa (1993), Abrami (1993), and McKeachie (1996) on a number of grounds. First, how does one decide on which variables to base the reference groups? Should only the student ratings of instructors who are equally lenient graders (and presumably equally poor instructors) be compared? Second, although several variables have been shown to influence ratings in correlational studies (e.g., grading leniency and instructors' expressivity), these variables may exert their influence through student learning. Therefore, they would not be biasing variables. Third, in many cases, the norm group would be so small that the data would be unreliable. Moreover, if norms are used to rank instructors, the individual differences in rank may be too small to have any practical significance. Finally, Abrami and McKeachie argued that such norms have negative effects on faculty members' morale.
By definition, half the faculty would be below the norm, yet they could be excellent teachers. Other researchers (Greenwald & Gillmore, in press) have suggested that student ratings should be statistically controlled for grading leniency. However, some of the same arguments as those made against norm referencing also hold for statistical control. In effect, statistically controlling for a variable such as grading leniency computes the rating score as if all instructors graded their students with equal lenience. It achieves mathematically what norming does by selection. Although the aforementioned variables appear to bias student ratings, the multisection validity studies, like other correlational studies, do not indicate whether the influence of the biasing variable is on student ratings or on student learning. For example, science students' global ratings of instruction have relatively higher validity than those of art students; this difference may reflect problems either with the student ratings or with the achievement tests. Art students rate their instructors relatively more favorably (Centra & Creech, 1976; Feldman, 1978; Neumann & Neumann, 1985) than do science students. It may be that art students give their instructors undeserved high ratings. Alternatively, achievement tests in art courses may have relatively greater problems with validity and reliability and may not accurately measure what art students have learned. In the first case, student ratings would be biased. However, in the second case, the measurement of student learning would be biased, and student ratings may accurately reflect instructional effectiveness. Therefore, researchers must investigate the mechanisms whereby these putative biasing variables influence student ratings before attempting to control for their influence. Laboratory studies that have used the "Dr.
Fox" paradigm have shown that instructors' expressivity (Abrami, Leventhal, & Perry, 1982; Abrami, Perry, & Leventhal, 1982) and grading practices (Abrami, Perry, & Leventhal, 1982) can unduly influence student ratings of instruction. However, these studies also have indicated that these variables do not seriously bias student ratings. Expressivity has a practically meaningful influence on student ratings of instructors, with high-expressive instructors scoring about 1.20 standard deviations above low-expressive instructors. However, instructors' expressivity
also influences student learning. Researchers who have investigated instructors' expressivity have concluded that it is not a biasing variable but rather that it exerts its influence by affecting student learning (Murray, Rushton, & Paunonen, 1990). Liberal grading practices increased student ratings, at most, by slightly less than 0.5 on a 5-point scale. Moreover, in some cases they decreased ratings. Thus, grading practices are not a practical threat to the validity of student ratings. In both cases, controlling for the putative biasing variable "overcorrects" its influence, removing some of the genuine influence of effective instruction from the rating scores. In effect, poor instructors may be doubly rewarded, both by students and by evaluators. Instead, we recommend that student ratings not be overinterpreted. In general, experts recommend that comprehensive systems of faculty evaluation be developed, of which student ratings of instruction are only one, albeit important, component (Arreola, 1995; Braskamp & Ory, 1994; Centra, 1993). Within such a system, student ratings should be used to make only crude judgments of instructional effectiveness (exceptional, adequate, or unacceptable).

REFERENCES

Abrami, P. C. (1984). Using meta-analytic techniques to review the instructional evaluation literature. Postsecondary Education Newsletter, 6, 8.
Abrami, P. C. (1985). Dimensions of effective college instruction. Review of Higher Education, 8, 211-228.
Abrami, P. C. (1988). The dimensions of effective college instruction. Montreal, Quebec, Canada: Concordia University.
Abrami, P. C. (1989). Seeking the truth about student ratings of instruction. Educational Researcher, 18, 43-45.
Abrami, P. C. (1993). Using student rating norm groups for summative evaluation. Evaluation and Faculty Development, 13, 5-9.
Abrami, P. C., Cohen, P. A., & d'apollonia, S. (1988). Implementation problems in meta-analysis.
Review of Educational Research, 58, 151-179.
Abrami, P. C., & d'apollonia, S. (1990). The dimensionality of ratings and their use in personnel decisions. In M. Theall & J. Franklin (Eds.), Student ratings of instruction: Issues for improving practice (pp. 97-111). San Francisco: Jossey-Bass.
Abrami, P. C., & d'apollonia, S. (1991). Multidimensional students' evaluations of teaching effectiveness--generalizability of "N = 1" research: Comment on Marsh (1991). Journal of Educational Psychology, 83, 411-415.
Abrami, P. C., d'apollonia, S., & Cohen, P. A. (1990). The validity of student ratings of instruction: What we know and what we do not. Journal of Educational Psychology, 82, 219-231.
Abrami, P. C., d'apollonia, S., & Rosenfield, S. (1996). The dimensionality of student ratings of instruction: What we know and what we do not. In J. C. Smart (Ed.), Higher education: Handbook of theory and research (Vol. 11, pp. 213-264). New York: Agathon Press.
Abrami, P. C., Dickens, W. J., Perry, R. P., & Leventhal, L. (1980). Do teacher standards for assigning grades affect student evaluations of instruction? Journal of Educational Psychology, 72, 107-118.
Abrami, P. C., Leventhal, L., & Perry, R. P. (1982). Educational seduction. Review of Educational Research, 52, 446-464.
Abrami, P. C., & Mizener, D. A. (1985). Student/instructor attitude similarity, student ratings, and course performance. Journal of Educational Psychology, 77, 693-702.
Abrami, P. C., Perry, R. P., & Leventhal, L. (1982). The relationship between student personality characteristics, teacher ratings, and student achievement. Journal of Educational Psychology, 74, 111-125.
Aleamoni, L. M. (1996). Why we do need norms of student ratings to evaluate faculty: Reaction to McKeachie. Evaluation and Faculty Development, 14, 18-19.
Arreola, R. A. (1995). Developing a comprehensive faculty evaluation system. Bolton, MA: Anker.
Becker, B. E., & Cardy, R. L. (1986).
Influence of halo error on appraisal effectiveness: A conceptual and empirical reconsideration. Journal of Applied Psychology, 71, 662-671.
Becker, B. J. (1992). Models of science achievement: Forces affecting male and female performance in school science. In T. D. Cook, H. Cooper, D. S. Cordray, H. Hartmann, L. V. Hedges, R. J. Light, T. A. Louis, & F. Mosteller (Eds.), Meta-analysis for explanation: A casebook (pp. 209-281). New York: Russell Sage Foundation.
Bolton, A., Bonge, D., & Marr, J. (1979). Ratings of instruction, examination performance, and subsequent enrollment in psychology courses. Teaching of Psychology, 6, 82-85.
Braskamp, L. A., & Ory, J. C. (1994). Assessing faculty work: Enhancing individual and institutional performance. San Francisco: Jossey-Bass.
Cashin, W. E. (1992). Student ratings: The need for comparative data. Evaluation and Faculty Development, 12, 1-6.
Cashin, W. E., & Downey, R. G. (1992). Using global student rating items for summative evaluation. Journal of Educational Psychology, 84, 563-572.
Cashin, W. E., Downey, R. G., & Sixbury, G. R. (1994). Global and specific ratings of teaching effectiveness and their relation to course objectives: A reply to Marsh (1994). Journal of Educational Psychology, 86, 649-657.
Centra, J. A. (1993). Reflective faculty evaluation: Enhancing teaching and determining faculty effectiveness. San Francisco: Jossey-Bass.
Centra, J. A., & Creech, F. R. (1976). The relationship between student, teacher, and course characteristics and student ratings of teacher effectiveness (Project Rep. No. 76-1). Princeton, NJ: Educational Testing Service.
Chau, H. (1994, April). Higher-order factor analysis of multidimensional students' evaluations of teaching effectiveness. Paper presented at the 75th annual meeting of the American Educational Research Association, New Orleans, LA.
Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis of multisection validity studies.
Review of Educational Research, 51, 281-309.
Cohen, P. A. (1982). Validity of student ratings in psychology courses: A meta-analysis of multisection validity studies. Teaching of Psychology, 9, 78-82.
Cohen, P. A. (1983). Comment on a selective review of the validity of student ratings of teaching. Journal of Higher Education, 54, 448-458.
Cooper, H. M. (1981). On the significance of effects and the effects of significance. Journal of Personality and Social Psychology, 41, 1013-1018.
Costin, F., Greenough, W. T., & Menges, R. J. (1971). Student ratings of college teaching: Reliability, validity, and usefulness. Review of Educational Research, 41, 511-536.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
d'apollonia, S., & Abrami, P. (1996, April). Variables moderating the validity of student ratings of instruction: A meta-analysis. Paper presented at the 77th annual meeting of the American Educational Research Association, New York.
d'apollonia, S., & Abrami, P. (1997). Scaling the ivory tower, Pt. II: Student ratings of instruction in North America. Psychology Teaching Review, 6, 60-77.
d'apollonia, S., Abrami, P., & Rosenfield, S. (1993, April). The dimensionality of student ratings of instruction: A meta-analysis of the factor studies. Paper presented at the annual meeting of the American Educational Research Association, Atlanta, GA.
Dowell, D. A., & Neal, J. A. (1982). A selective review of the validity of student ratings of teaching. Journal of Higher Education, 53, 51-62.
Downie, N. M., & Heath, R. W. (1974). Basic statistical methods. New York: Harper & Row.
Doyle, K. O., Jr. (1983). Evaluating teaching. Lexington, MA: Lexington Books.
Feldman, K. A. (1976). The superior college teacher from the student's view. Research in Higher Education, 5, 243-288.
Feldman, K. A. (1978). Course characteristics and college students' ratings of their teachers and courses: What we know and what we don't. Research in Higher Education, 9, 199-242.
Feldman, K. A. (1983). Seniority and experience of college teachers as related to evaluations they receive from students. Research in Higher Education, 18, 3-124.
Feldman, K. A. (1984). Class size and college students' evaluations of teachers and courses: A closer look. Research in Higher Education, 21, 45-116.
Feldman, K. A. (1989). The association between student ratings of specific instructional dimensions and student achievement: Refining and extending the synthesis of data from multisection validity studies. Research in Higher Education, 30, 583-645.
Feldman, K. A. (1990). An afterword for "The Association Between Student Ratings of Specific Instructional Dimensions and Student Achievement: Refining and Extending the Synthesis of Data From Multisection Validity Studies." Research in Higher Education, 31, 315-318.
Feldman, K. A. (in press). Reflections on the study of effective college teaching and student ratings: One continuing quest and two unresolved issues. In J. C. Smart (Ed.), Higher education: Handbook of theory and research.
Fiske, S. T., & Neuberg, S. L. (1990). A continuum of impression formation, from category-based to individuating process: Influences of information and motivation on attention and interpretation. Advances in Experimental Social Psychology, 23, 1-74.
Franklin, J. L., & Theall, M. (1990). Communicating student ratings to decision makers: Design for good practice. In M. Theall & J. Franklin (Eds.), Student ratings of instruction: Issues for improving practice (pp. 75-93).
San Francisco: Jossey-Bass.
Frey, P. W., Leonard, D. W., & Beatty, W. W. (1975). Student rating of instruction: Validation research. American Educational Research Journal, 12, 435-447.
Gaski, J. F. (1987). On "Construct Validity of Measures of College Teaching Effectiveness." Journal of Educational Psychology, 79, 326-330.
Glass, G. V. (1978). Integrating findings: The meta-analysis of research. In L. S. Shulman (Ed.), Review of research in education (Vol. 5, pp. 351-379). Itasca, IL: Peacock.
Gorsuch, R. L. (1983). Factor analysis. Hillsdale, NJ: Erlbaum.
Greenwald, A. G. (1997a). Do manipulated course grades influence student ratings of instructors? A small meta-analysis. Manuscript in preparation.
Greenwald, A. G. (1997b). Validity concerns and usefulness of student ratings of instruction. American Psychologist, 52, 1182-1186.
Greenwald, A. G., & Gillmore, G. M. (1997). Grading leniency is a removable contaminant of student ratings. American Psychologist, 52, 1209-1217.
Greenwald, A. G., & Gillmore, G. M. (in press). No pain, no gain? The importance of measuring course workload in student ratings of instruction. Journal of Educational Psychology.
Harman, H. H. (1976). Modern factor analysis (3rd rev. ed.). Chicago: University of Chicago Press.
Hartley, E. L., & Hogan, T. P. (1972). Some additional factors in student evaluation of courses. American Educational Research Journal, 9, 241-250.
Hativa, N. (1993). Student ratings: A non-comparative interpretation. Evaluation and Faculty Development, 13, 1-4.
Hativa, N., & Raviv, A. (1993). Using a single score for summative evaluation by students. Research in Higher Education, 34, 625-646.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Judd, C. M., Drake, R. A., Downing, J. W., & Krosnick, J. A. (1991). Some dynamic properties of attitude structures: Context-induced response facilitation and polarization. Journal of Personality and Social Psychology, 60, 193-202.
Kulik, J. A., & McKeachie, W. J. (1975). The evaluation of teachers in higher education. Review of Research in Education, 3, 210-240.
Linn, R. L., Centra, J. A., & Tucker, L. (1975). Between, within, and total group factor analysis of student ratings of instruction. Multivariate Behavioral Research, 10, 277-288.
Marsh, H. W. (1980). The influence of student, course, and instructor characteristics on evaluations of university teaching. American Educational Research Journal, 17, 219-237.
Marsh, H. W. (1982). Validity of students' evaluations of college teaching: A multitrait-multimethod analysis. Journal of Educational Psychology, 74, 264-279.
Marsh, H. W. (1983a). Multidimensional ratings of teaching effectiveness by students from different academic settings and their relation to student/course/instructor characteristics. Journal of Educational Psychology, 75, 150-166.
Marsh, H. W. (1983b). Multitrait-multimethod analysis: Distinguishing between items and traits. Educational and Psychological Measurement, 43, 351-358.
Marsh, H. W. (1984). Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility. Journal of Educational Psychology, 76, 707-754.
Marsh, H. W. (1987). Students' evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11(Whole Issue No. 3).
Marsh, H. W. (1991a). A multidimensional perspective on students' evaluations of teaching effectiveness: Reply to Abrami and d'apollonia (1991). Journal of Educational Psychology, 83, 416-421.
Marsh, H. W. (1991b). Multidimensional students' evaluations of teaching effectiveness: A test of alternative higher-order structures. Journal of Educational Psychology, 83, 285-296.
Marsh, H. W. (1994).
Comments to "Review of the Dimensionality of Student Ratings of Instruction" at the 1994 Annual American Educational Research Association Meeting by d'apollonia, Abrami, and Rosenfield. Instructional Evaluation and Faculty Development, 14, 13-19.
Marsh, H. W. (1995). Still weighting for the right criteria to validate student evaluations of teaching in the IDEA system. Journal of Educational Psychology, 87, 666-679.
Marsh, H. W., & Dunkin, M. J. (1992). Students' evaluations of university teaching: A multidimensional perspective. In J. Smart (Ed.), Higher education: Handbook of theory and research (Vol. 8, pp. 143-233). New York: Agathon Press.
Marsh, H. W., & Hocevar, D. (1984). The factorial invariance of student evaluations of college teaching. American Educational Research Journal, 21, 341-366.
Marsh, H. W., & Hocevar, D. (1991). Multidimensional perspective on students' evaluations of teaching effectiveness: The generality of factor structures across academic discipline, instructor level and course level. Teaching and Teacher Education, 7, 9-18.
Marsh, H. W., & Roche, L. A. (1997). Making students' evaluations of teaching effectiveness effective: The critical issues of validity, bias, and utility. American Psychologist, 52, 1187-1197.
McCallum, L. W. (1984). A meta-analysis of course evaluation data and its use in the tenure decision. Research in Higher Education, 21, 150-158.
McKeachie, W. J. (1979). Student ratings of faculty: A reprise. Academe, 65, 384-397.
McKeachie, W. J. (1996). Do we need norms of student ratings to evaluate faculty? Instructional Evaluation and Faculty Development, 14, 14-17.
McKeachie, W. J. (1997). Student ratings: The validity of use. American Psychologist, 52, 1218-1225.
Michigan State University. (1971). Student Instructional Rating System: Stability of factor structure (SIRS Research Rep. No. 2). East Lansing: Michigan State University, Office of Evaluation Services.
Mintzes, J. J. (1977).
Field test and validation of a teaching evaluation
instrument: The student opinion of teaching. Windsor, Ontario, Canada.
Murphy, K. R., Jako, R. A., & Anhalt, R. L. (1993). Nature and consequence of halo error: A critical analysis. Journal of Applied Psychology, 78, 218-225.
Murray, H. G. (1984). Impact of formative and summative evaluation of teaching in North American universities. Assessment and Evaluation in Higher Education, 9, 117-132.
Murray, H. G., Rushton, J. P., & Paunonen, S. V. (1990). Teacher personality traits and student instructional ratings in six types of university courses. Journal of Educational Psychology, 82, 250-261.
Nathan, B. R., & Lord, R. G. (1983). Cognitive categorization and dimensional schemata: A process approach to the study of halo in performance ratings. Journal of Applied Psychology, 68, 102-114.
Nathan, B. R., & Tippins, N. (1990). The consequences of halo "error" in performance ratings: A field study of the moderating effect of halo on test validation results. Journal of Applied Psychology, 75, 290-296.
Neumann, L., & Neumann, Y. (1985). Determinants of students' instructional evaluation: A comparison of four levels of academic areas. Journal of Educational Psychology, 78, 152-158.
Pambookian, H. S. (1972). The effect of feedback from students to college instructors on their teaching behavior. Unpublished doctoral dissertation, University of Michigan.
Raudenbush, S. W., Becker, B. J., & Kalaian, H. (1988). Modeling multivariate effect sizes. Psychological Bulletin, 103, 111-120.
Remmers, H. H., Martin, F. D., & Elliot, D. N. (1949). Are student ratings of their instructors related to their grades? Purdue University Studies in Higher Education, 44, 17-26.
Rosenthal, R. (1995). Writing meta-analytic reviews. Psychological Bulletin, 118, 183-192.
Scriven, M. (1988). The validity of student ratings. Instructional Evaluation, 9, 5-18.
Theall, M. (1994). What's wrong with faculty evaluation: A debate on the state of practice.
Instructional Evaluation and Faculty Development, 14, 27-34.
Trzebinski, J. (1985). Action-oriented representations of implicit personality theories. Journal of Personality and Social Psychology, 48, 1387-1397.
Whitely, S. E., & Doyle, K. O. (1976). Implicit theories in student ratings. American Educational Research Journal, 13, 241-253.
Widlak, F. W., McDaniel, E. D., & Feldhusen, J. F. (1973, February). Factor analysis of an instructor rating scale. Paper presented at the 54th annual meeting of the American Educational Research Association, New Orleans, LA.
Zwick, W. R., & Velicer, W. F. (1986). Comparison of five rules for determining the number of components to retain. Psychological Bulletin, 99, 432-442.

Comment on Greenwald

Greenwald (1997b, this issue) states that discriminant validity measures the degree to which ratings are not "influenced by factors other than teaching effectiveness" (p. 1182). Thus, such variables must be independent of teaching effectiveness. We consider that studies that investigate the influence of biasing variables on the validity coefficient of student ratings address the question of discriminant validity. Thus, we disagree with Greenwald (1997b) and maintain that Cohen's (1981, 1982) and our own meta-analyses in this article (see also d'apollonia, Abrami, & Rosenfield, 1993), which measured the degree to which putative biasing variables distort the validity coefficient of student ratings, are as concerned with discriminant validity as they are with convergent validity. We disagree on two counts with the conclusion made by Greenwald (1997b) that there is a moderate to large influence of grading practices on student ratings. First, Abrami, Dickens, Perry, and Leventhal (1980) and Marsh (1987) criticized the field studies included in Greenwald's (1997a) meta-analysis for not controlling for viable rival hypotheses.
Second, one of the shortcomings of the meta-analytical approach used by Greenwald (1997a) is that moderate to large average effect sizes are misleading because the approach overlooks real differences in effect sizes (B. J. Becker, 1992). Abrami et al. (1980) demonstrated in several well-controlled laboratory studies that although grading standards influenced student ratings, both the degree and the direction of the influence depended on contextual variables. For example, lenient grading practices increased student ratings of high-expressive instructors but decreased student ratings of low-expressive instructors. Similarly, grading practices influenced student ratings under low-incentive conditions but not under the high-incentive conditions typical of college classes. Because the influence of grading practices on student ratings is not consistent, we consider it inappropriate to report an overall effect size for grading leniency.

Comment on Marsh and Roche

We agree with Marsh and Roche (1997, this issue) that teaching is multidimensional, as are well-designed student rating forms such as the SEEQ. However, students are neither trained raters nor psychometricians. When students are rating teaching, especially retroactively, they tend to rate specific dimensions of instruction on the basis of their global evaluation. This is not to suggest that ratings are invalid. Students' global ratings of instructors are accurate when compared with trained observers' ratings of specific behaviors of instructors (Murray, 1984). Thus, student ratings of instruction have a hierarchical structure, as demonstrated by Marsh (1991b), and can efficiently generate global scores of instructional effectiveness for personnel decisions. We also agree with Marsh and Roche (1997) that validation approaches that incorporate construct validation are desirable.
That is, researchers should demonstrate that the scores and interpretations derived from student rating forms correspond to other measures of the underlying theoretical trait (Cronbach & Meehl, 1955). However, the multitrait-multimethod approach described by Marsh and Roche, based on multiple raters (current students, former students, instructors themselves, colleagues, and administrators) using the same instrument, has been criticized on a number of grounds (Abrami et al., 1996; Gaski, 1987). Under these conditions, the multitrait-multimethod approach can be considered to assess reliability rather than validity. Thus, we believe that it provides evidence that the different raters may share a common theory of teaching and thus rate teaching behaviors similarly.

Comment on Greenwald and Gillmore

The controversy over grading leniency centers around three issues: (a) What does prior evidence suggest? (b) What weight should be assigned to the new evidence offered by Greenwald and Gillmore (1997, this issue)? and (c) What should be done to improve the interpretation of student ratings? Prior evidence includes some poorly controlled field studies, which are especially difficult to interpret, and several better controlled laboratory-simulation experiments. In the latter category, the experiments by Abrami et al. (1980) on the effects of grading
leniency found small and inconsistent effects. In one condition, a poor-quality instructor who assigned high grades received worse evaluations from students. Thus, we are of the view that prior evidence fails to demonstrate convincingly that there are large and consistent positive biases in ratings that are attributable to grading leniency. Greenwald and Gillmore (1997) have used their new evidence to suggest a strong grading-leniency effect. In our opinion, evidence that student ratings are correlated with expected or actual grades does not necessarily indicate bias. Grading leniency can influence learning, and to the degree that it does so, it does not bias student ratings. Greenwald and Gillmore (1997) constructed five alternative models to explain the grades-ratings correlation. They subsequently tested these models against their University of Washington data set. However, their methodology is appropriate only if each model is correctly specified (all relevant variables and paths are correctly specified). If this condition is not met, the exercise becomes one of setting up a straw man to knock down. We disagree with several of the assumptions Greenwald and Gillmore (1997) made to test these alternative models (the entries in their Table 1), for example, the assumption that a third variable (e.g., true student learning causing both high ratings and high grades) could not also generate a positive within-classes grades-ratings correlation, a larger correlation with relative grades, a halo effect, or a negative between-classes grades-workload correlation (d'apollonia & Abrami, 1997). We invite readers to create such scenarios. Correlational methods, such as path analyses, cannot substitute for well-constructed field experiments that investigate the causal relationship between grading leniency and student ratings.
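The overcorrection worry raised earlier can be made concrete with a small simulation. Everything below is invented for illustration (the causal model and effect sizes are our assumptions, not anyone's data): if leniency genuinely raises learning, and ratings track learning, then partialling leniency out of the ratings also strips out part of the true learning signal.

```python
import random

random.seed(0)

# Simulated causal chain: leniency -> learning -> ratings.
n = 2000
leniency = [random.gauss(0, 1) for _ in range(n)]
learning = [0.4 * l + random.gauss(0, 1) for l in leniency]
ratings = [0.8 * s + random.gauss(0, 0.5) for s in learning]

def corr(x, y):
    """Pearson correlation, computed from scratch."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residualize(y, x):
    """Residuals of y after regressing out x (simple OLS slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return [c - my - b * (a - mx) for a, c in zip(x, y)]

raw = corr(ratings, learning)
adjusted = corr(residualize(ratings, leniency), learning)
print(round(raw, 2), round(adjusted, 2))  # adjusted < raw: overcorrection
```

In this toy model the "adjusted" validity coefficient comes out lower than the raw one even though there is no bias at all, which is precisely the sense in which adjustment schemes can remove genuine influences of effective instruction.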
Therefore, we agree with the postscript by Greenwald and Gillmore (1997) that the "best way to study the effect of grading leniency . . . is by means of experiments with grading-leniency manipulations" (p. 1217). We urge readers to consider prior experiments, such as those of Abrami et al. (1980). Finally, the need to undertake adjustments in ratings requires that several criteria be met. First, there must be unequivocal evidence of bias. Second, simple adjustments require that the effects of the bias be uniform in size and consistent across teaching contexts. Third, the adjustment must not affect valid influences on student ratings (e.g., teaching-quality effects). Thus, we believe that it remains to be demonstrated whether adjustments in ratings are necessary and whether adjustment schemes improve the ability of ratings to predict teaching influences on student learning and other important outcomes of instruction.

Comment on McKeachie

We agree with McKeachie (1997, this issue) that the multisection validity studies, by virtue of the common objective achievement test, may reflect an unsophisticated model of teaching effectiveness. In addition to McKeachie's recommendation that student rating forms reflect models of teaching based on cognitive and motivational research, we suggest that they should also take into consideration students' attitudes about the evaluation process.