Educational Studies Vol. 35, No. 5, December 2009, 547–552

A revalidation of the SET37 questionnaire for student evaluations of teaching

Dimitri Mortelmans and Pieter Spooren*
Faculty of Political and Social Sciences, University of Antwerp, Sint Jacobstraat 2, B-2000 Antwerp, Belgium

In this study, the authors report on the validity and reliability of SET37, a paper-and-pencil instrument used for Student Evaluation of Teaching (SET) in higher education. Using confirmatory factor analysis on 2525 questionnaires, a revalidation of the SET37 demonstrates the construct and discriminant validity of the 12 dimensions included in the instrument. The retest of the instrument reveals (again) the existence of a second-order factor explaining a substantial amount of the variance in seven dimensions of the instrument. In sum, the results provide strong support both for the relevance of the questionnaire and for the hypothesis that the SET37 has a multidimensional structure that is nevertheless compatible with a general underlying factor, called Teacher professionalism.

Keywords: student evaluation of teaching; teacher evaluation; validity studies; teaching quality; higher education

Introduction

The extensive literature on the use and the validity of Student Evaluation of Teaching (SET) can be organised around four main research topics. First, much attention has gone to the reliability of student opinions in mapping educational quality (for extensive reviews on this topic, see among others Marsh 1987). The second question deals with the use of SET: for instance, should students be considered (capable of being) full-fledged stakeholders in the evaluation procedure? Third, does student feedback lead to the improvement of instruction (see among others Kember, Leung, and Kwan 2002)?
Fourth, is it possible to develop sound instruments for SET? With respect to this last question, it must be said that many instruments lack a theoretical foundation and/or information on the validation tests (including their results) these instruments were put through. Given that SET are used not only in a formative way (e.g. "What can I do to improve my teaching?") but on many college campuses also form part of personnel decisions (e.g. tenure and promotion decisions), it is our opinion that SET instruments should meet a panoply of requirements, such as reliability, construct validity and convergent validity. On top of that, frequent revalidation procedures on new datasets are desirable, since (for instance) educational principles change over time (e.g. from classic lectures towards more active and student-centred education).

*Corresponding author. Email: pieter.spooren@ua.ac.be
ISSN 0305-5698 (print)/ISSN 1465-4300 (online) © 2009 Taylor & Francis
DOI: 10.1080/03055690902880299
Objectives

In this contribution, we report on a revalidation procedure for our SET37 questionnaire, which we designed at the University of Antwerp (Belgium) on the basis of both educational theory and empirical testing (Spooren, Mortelmans, and Denekens 2007). The basic idea behind the instrument is that teaching skills should be considered latent constructs and thus cannot be measured by means of a single-item approach. We therefore decided to develop a Likert scale-based instrument consisting of item sets relevant to measuring students' attitudes towards several dimensions of teaching (e.g. presentation skills, clarity of course objectives, the teacher's help during the learning process) instead of capturing a particular quality (dimension) through one item. In addition, the Likert-type scale is very usable and allows a quality check on the results by means of the Cronbach's alpha statistic on each dimension every time a course is evaluated by the students. Finally, the reliability and validity of these multi-item scales can be tested extensively. The original study resulted in a 31-item questionnaire representing 10 dimensions of teaching. Since then, two of these 10 scales have been revised and two additional scales have been introduced using the same methodology, resulting in the validated 12-scale instrument SET37. Next to this validation study, we looked into the existence of a higher-order factor behind the various teaching dimensions in the instrument (Spooren and Mortelmans 2006). Such a halo factor might lead to incorrect interpretations of evaluation results. On the other hand, it has been shown that, although student ratings are considered to be multidimensional, students give similar ratings across many evaluation items. In other words, student ratings seem to have a multidimensional structure that is compatible with a very strong general underlying factor (Apodaca and Grad 2005).
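The per-dimension quality check mentioned above can be computed directly from raw scale responses; a minimal sketch of Cronbach's alpha, using hypothetical six-point Likert data rather than data from the study:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for one multi-item scale.

    items: one list of responses per item (same respondents, same order).
    """
    k = len(items)
    # Sum of the individual item variances.
    item_variance = sum(pvariance(responses) for responses in items)
    # Variance of the scale score (sum of each respondent's item responses).
    total_variance = pvariance([sum(scores) for scores in zip(*items)])
    return k / (k - 1) * (1 - item_variance / total_variance)

# Hypothetical responses from five students on a three-item scale.
scale = [
    [5, 4, 6, 2, 5],
    [5, 5, 6, 3, 4],
    [4, 4, 5, 2, 5],
]
print(round(cronbach_alpha(scale), 2))  # → 0.93
```

A scale whose items all move together yields an alpha near one; in practice one would flag a dimension whose alpha falls below a preset threshold each time a course is evaluated.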
We termed this higher-order factor Teacher professionalism.

Results

Instrument

We used the same SET37 instrument as in the original validation study (Spooren, Mortelmans, and Denekens 2007), representing 12 balanced scales. All 37 Likert items were administered in a paper-and-pencil survey in the students' classrooms. All items were measured on a six-point scale (from "strongly disagree" to "strongly agree").

Participants

Student evaluation forms were administered during the spring semester of the 2005–2006 academic year. A total of 2525 evaluation forms, completed by 620 students enrolled in 118 courses at a medium-sized multidisciplinary university (12,000 students), were analysed in the present study. Evaluations were completed in various educational programmes by both graduate and undergraduate students.

Revalidation

Confirmatory factor analysis is the most suitable technique for an extensive test of the reliability and validity of the different scales in the instrument. The results of the present study are summarised in Table 1. First, the analysis (for which we used the Lisrel 8.53 program)
Table 1. Revalidation of the evaluation instrument: confirmatory factor analysis.

Factor / item                                  Loading   t-value       p      R²
F1: Clarity of course objectives (variance extracted = 0.75; ρc = 0.90)
  Item 0101                                      0.87    113.84     0.001   0.76
  Item 0102                                      0.88    116.98     0.001   0.78
  Item 0103                                      0.84     95.83     0.001   0.70
F2: Value of subject matter (variance extracted = 0.64; ρc = 0.84)
  Item 0201                                      0.86     99.46     0.001   0.74
  Item 0202                                      0.78     76.11     0.001   0.61
  Item 0203                                      0.75     67.46     0.001   0.56
F3: Build-up of subject matter (variance extracted = 0.69; ρc = 0.87)
  Item 0301                                      0.80     83.83     0.001   0.64
  Item 0302                                      0.81     86.46     0.001   0.65
  Item 0303                                      0.89    116.55     0.001   0.79
F4: Presentation skills (variance extracted = 0.84; ρc = 0.94)
  Item 0401                                      0.93    175.31     0.001   0.86
  Item 0402                                      0.91    152.33     0.001   0.82
  Item 0403                                      0.91    146.31     0.001   0.83
F5: Harmony organisation course–learning (variance extracted = 0.66; ρc = 0.85)
  Item 0501                                      0.84     94.87     0.001   0.70
  Item 0502                                      0.82     86.57     0.001   0.67
  Item 0503                                      0.78     74.93     0.001   0.61
F6: Course materials (contribution to understanding the subject matter) (variance extracted = 0.79; ρc = 0.94)
  Item 0601                                      0.87    118.26     0.001   0.75
  Item 0602                                      0.91    151.06     0.001   0.82
  Item 0603                                      0.86    108.16     0.001   0.73
  Item 0604                                      0.93    170.98     0.001   0.86
F7: Course difficulty (variance extracted = 0.80; ρc = 0.92)
  Item 0701                                      0.86    114.81     0.001   0.74
  Item 0702                                      0.91    149.81     0.001   0.83
  Item 0703                                      0.92    159.04     0.001   0.85
F8: Help of the teacher during the learning process (variance extracted = 0.80; ρc = 0.92)
  Item 0801                                      0.89    124.99     0.001   0.79
  Item 0802                                      0.89    134.16     0.001   0.80
  Item 0803                                      0.91    139.15     0.001   0.82
Table 1. (Continued).

Factor / item                                  Loading   t-value       p      R²
F9: Authenticity of the examination(s) (variance extracted = 0.57; ρc = 0.80)
  Item 0901                                      0.63     45.86     0.001   0.39
  Item 0902                                      0.82     76.25     0.001   0.66
  Item 0903                                      0.81     75.68     0.001   0.66
F10: Linking-up with foreknowledge (variance extracted = 0.75; ρc = 0.90)
  Item 1001                                      0.82     86.70     0.001   0.67
  Item 1002                                      0.89    126.73     0.001   0.78
  Item 1003                                      0.90    140.63     0.001   0.82
F11: Content validity of the examination(s) (variance extracted = 0.75; ρc = 0.90)
  Item 1101                                      0.84     92.90     0.001   0.70
  Item 1102                                      0.82     96.75     0.001   0.68
  Item 1103                                      0.93    138.82     0.001   0.87
F12: Formative evaluation(s) (variance extracted = 0.67; ρc = 0.86)
  Item 1201                                      0.89    107.58     0.001   0.79
  Item 1202                                      0.75     70.70     0.001   0.56
  Item 1203                                      0.81     83.76     0.001   0.66

Note: For an overview of some items (one per scale), see Appendix 1.

shows a fair fit to the data (RMSEA = 0.04; GFI = 0.98; CFI = 0.96; NNFI = 0.96), although the χ² test of exact fit is significant where the objective is a non-significant p-value (χ²(563) = 2783.99, p < .05). However, Hatcher (1994) indicates that a significant χ² does not in itself make a confirmatory factor analysis model inadequate. Convergent validity of the constructs is shown by the large (between 0.63 and 0.93) and significant factor loadings of the items on their posited constructs. On top of that, none of the correlations between the constructs is high enough to challenge convergent validity (all correlations are lower than 0.90). Discriminant validity is evidenced by means of the confidence-interval test (the 95% confidence interval around the correlations never includes one) and the χ² test (all models in which the correlation between two constructs was set to zero showed a worse fit to the data). The variance-extracted test also demonstrates the discriminant validity of the constructs, since (with a few exceptions) the variance extracted by two constructs is greater than their squared correlation.
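The reliability and fit statistics reported above follow from standard formulas and can be checked against Table 1; a minimal sketch, using the common Fornell–Larcker expressions for composite reliability and variance extracted and the usual RMSEA point estimate (with N taken as the 2525 forms — an assumption, since the paper does not state the N used by Lisrel):

```python
from math import sqrt

def composite_reliability(loadings):
    # rho_c = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances),
    # assuming standardised loadings so each error variance is 1 - loading^2.
    s = sum(loadings)
    error = sum(1 - l * l for l in loadings)
    return s * s / (s * s + error)

def variance_extracted(loadings):
    # Average of the squared standardised loadings.
    return sum(l * l for l in loadings) / len(loadings)

def rmsea(chi2, df, n):
    # Point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1))).
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# F1 (Clarity of course objectives), standardised loadings from Table 1.
f1 = [0.87, 0.88, 0.84]
print(round(composite_reliability(f1), 2))   # → 0.9, matching rho_c = 0.90
print(round(variance_extracted(f1), 2))      # → 0.75, matching Table 1
# Global fit, from the reported chi2(563) = 2783.99.
print(round(rmsea(2783.99, 563, 2525), 2))   # → 0.04, matching the reported RMSEA
```

That all three reported figures are reproduced from the published loadings and χ² supports the internal consistency of the table.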
Finally, the internal consistency of all scales is confirmed, as the composite reliability (ρc in Table 1) is higher than 0.80 in every case. (The variance extracted for Factors 9–12 is 0.57, 0.75, 0.75 and 0.67, respectively.)

Second-order factor

As mentioned earlier, a (unidimensional) second-order factor may lie behind the various (multiple) dimensions in SET results. For instance, the general impression students have of a teacher and his or her course might influence SET scores
on dimensions which are considered free-standing at first sight. Since we found such a halo factor in an earlier study of this instrument, we decided to put it to the test again while revalidating the questionnaire. A confirmatory factor analysis in which a second-order factor was assumed behind the same seven scales as in the previous study (namely Factors 1–4 and 6–8 in Table 1; Spooren and Mortelmans 2006) shows a good fit to the data (RMSEA = 0.05; GFI = 0.98; CFI = 0.96; NNFI = 0.96). The loadings of the first-order factors on the second-order factor are high (between 0.82 and 0.95), and 82% of the variance in the scales is explained by this halo factor. These results thus again show the existence of a general underlying factor in the SET37 scores. This offers a promising perspective on the use of SET37 in both a formative and a summative way. When using the SET37 as an instrument for feedback, one could draw on the results for one or more particular dimensions when working on the improvement of (the teaching of) a course. On the other hand, an overall score derived from the (weighted, if necessary) SET37 scores on the dimensions known to belong to a (valid and reliable) general factor could be used for the evaluation of teaching staff, at least once sound answers have been found to the other SET-related research questions.

Notes on contributors

Dimitri Mortelmans is an associate professor in sociology at the Faculty of Political and Social Sciences of the University of Antwerp, Belgium. He teaches qualitative research methods, multivariate statistics and the advanced study of populations, families and the life course. He is head of the Research Centre for Longitudinal and Life Course Studies (CELLO) and also head of the Innovation and Quality of Education Centre. His main research topics cover divorce, work–life balance and the leisure time of youngsters.
Pieter Spooren is an educational advisor at the Faculty of Political and Social Sciences of the University of Antwerp, Belgium. His particular activities are educational innovation and the evaluation of the educational process and of educators. His main research interests focus on student evaluation of teaching (SET), in particular its use and validity.

References

Apodaca, P., and H. Grad. 2005. The dimensionality of student ratings of teaching: Integration of uni- and multi-dimensional models. Studies in Higher Education 30: 723–48.
Hatcher, L. 1994. A step-by-step approach to using the SAS system for factor analysis and structural equation modelling. Cary, NC: SAS Institute.
Kember, D., D. Leung, and K. Kwan. 2002. Does the use of student feedback questionnaires improve the overall quality of teaching? Assessment & Evaluation in Higher Education 27: 412–25.
Marsh, H.W. 1987. Students' evaluations of university teaching: Research findings, methodological issues, and directions for further research. International Journal of Educational Research 11: 253–388.
Spooren, P., and D. Mortelmans. 2006. Teacher professionalism and student evaluation of teaching: Will better teachers receive higher ratings and will better students give higher ratings? Educational Studies 32: 201–14.
Spooren, P., D. Mortelmans, and J. Denekens. 2007. Student evaluation of teaching quality in higher education: Development of an instrument based on 10 Likert scales. Assessment & Evaluation in Higher Education 32: 667–79.
Appendix 1. Overview of some items included in the questionnaire (one per scale)

Clarity of course objectives (Item 0103): The information presented by the lecturer at the start of the course clearly specified what I should learn and accomplish.
Value of subject matter (Item 0202): Some topics covered in this course are, in my opinion, completely redundant. (REVERSED)
Build-up of subject matter (Item 0302): The different topics covered in this course were completely unrelated. (REVERSED)
Presentation skills (Item 0401): The lecturer explained the material well.
Harmony organisation course–learning (Item 0503): The lecturer sometimes presented us with an assignment which forced me to think critically.
Course materials (contribution to understanding the subject matter) (Item 0602): The study material was not appealing to study. (REVERSED)
Course difficulty (Item 0702): The level of difficulty of this course was acceptable.
Help of the teacher during the learning process (Item 0801): The lecturer helped me with questions and problems which arose during this course.
Authenticity of the examination(s) (Item 0903): The evaluation procedure consisted of more than plainly reproducing the contents of the course.
Linking-up with foreknowledge (Item 1001): The contents of this course did not connect with anything we already knew or learned. (REVERSED)
Content validity of the examination(s) (Item 1103): The examination was a good reflection of the contents of this course.
Formative evaluation(s) (Item 1203): This lecturer did not give any performance feedback throughout the course.

Note: Translated from Dutch.