Validating LanguEdge Courseware Scores Against Faculty Ratings and Student Self-assessments

RESEARCH REPORT
April 2003
RR-03-11

Validating LanguEdge Courseware Scores Against Faculty Ratings and Student Self-assessments

Donald E. Powers
Carsten Roever
Kristin L. Huff
Catherine S. Trapani

Research & Development Division
Princeton, NJ 08541

Validating LanguEdge Courseware Scores Against Faculty Ratings and Student Self-assessments

Donald E. Powers, Carsten Roever, Kristin L. Huff, and Catherine S. Trapani
Educational Testing Service, Princeton, NJ

April 2003

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from:

Research Publications Office
Mail Stop 10-R
Educational Testing Service
Princeton, NJ 08541

Abstract

LanguEdge Courseware is a software tool designed to help teachers of English as a second language (ESL) build and assess the communicative skills of their students. The purpose of this study was to generate information to help LanguEdge Courseware users better understand the meaning (or validity) of assessment scores based on the LanguEdge Courseware. Specifically, the objective was to describe, for each of the four sections of the LanguEdge assessment, relevant characteristics of test takers at various test score levels. To accomplish this objective, we gathered data that represent two different perspectives: those of instructors and those of students themselves. Approximately 3,000 students each took one of two parallel forms of the LanguEdge assessment at domestic and international testing sites. Participants also completed a number of self-assessment questions about their English language skills. In addition, for some study participants, instructors rated selected language skills. LanguEdge test scores related moderately (correlations mostly in the .30s and .40s) with student self-assessments. Of the four LanguEdge tests, Listening exhibited the strongest relationships to self-assessments; Speaking, the next strongest; Reading, the next; and Writing, the least. The correlations of faculty ratings with each of the LanguEdge section test scores were generally in the .40s, with some reaching the .50s. The correlations between the various student self-assessment scales and faculty ratings were modest, mostly in the .30s. These correlations suggest that students and faculty had different perspectives on students' English language skills.

As isolated entities, summary test scores, even when accompanied by normative data, are not especially informative about what test takers know and can do. In an effort to make test scores more useful, some testing programs (for example, the National Assessment of Educational Progress, or NAEP) have implemented relatively sophisticated reporting procedures in order to facilitate test score interpretations. One such effort, generally known as proficiency scaling, is usually but not always based on item response theory (IRT) methods (Beaton & Allen, 1992) and entails procedures such as the following. Several ability levels are selected on an overall ability/proficiency score scale. For each of these levels, individual items are selected such that, at a given level of ability, examinees have a specified probability (say, 80%) of answering each item correctly. At lower levels of ability, however, examinees have a significantly lower probability of answering each of these items correctly, but a high probability of answering some other set of items correctly. Experts then judge the items that examinees correctly answer at each level in order to characterize examinee proficiency at various score points (see, for example, Mullis & Jenkins, 1988; Beaton & Allen, 1992).

The resulting scales have a number of attractive features. They are, however, not entirely problem-free. For example, the proficiencies that underlie success on test items at various score levels are not always readily inferred, especially when the domains being tested are either multidimensional or ill-defined. Such attempts can give rise to questionable inferences about examinee proficiency (Forsyth, 1991), possibly because test users do not adequately understand the score reports (Hambleton & Slater, 1994).

Another noteworthy aspect of proficiency scaling is that it is internally focused. That is, proficiency scales are given meaning by referencing performance on the test items that the scales comprise. Because score levels are interpreted according to the items that determine scores, the method may appear to be circular. At the least, the method has a bootstrapping nature insofar as it makes use of existing resources (i.e., test items) to improve an existing state (i.e., test score interpretations). In contrast, the effort undertaken here approached test score meaning from an external perspective. The aim was to relate test score levels to nontest, external indicators of examinees' language proficiency. The test scores of interest were those based on the LanguEdge Courseware software.
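To make the item-mapping step of proficiency scaling concrete, the following is a minimal, hypothetical sketch (not NAEP or ETS code): under an assumed two-parameter logistic (2PL) IRT model, it finds, for each item, the ability level at which the probability of a correct response reaches a chosen response-probability criterion (say, 80%) and then assigns items to selected reporting levels. All item parameters and names are invented for illustration.

```python
import numpy as np

# Hypothetical 2PL item parameters: (discrimination a, difficulty b).
ITEMS = {
    "item_01": (1.2, -0.5),
    "item_02": (0.9, -0.2),
    "item_03": (1.5,  0.6),
    "item_04": (1.0, -1.2),
}

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def rp_location(a, b, rp=0.80):
    """Ability at which the response probability equals rp (inverse of the 2PL)."""
    return b + np.log(rp / (1.0 - rp)) / a

def map_items_to_levels(items, levels, rp=0.80):
    """Assign each item to the lowest reporting level at which examinees reach rp."""
    mapping = {level: [] for level in levels}
    for name, (a, b) in items.items():
        loc = rp_location(a, b, rp)
        for level in sorted(levels):
            if level >= loc:                 # examinees at this level answer the item
                mapping[level].append(name)  # correctly with probability >= rp
                break                        # items above all levels remain unmapped
    return mapping

if __name__ == "__main__":
    reporting_levels = [-1.0, 0.0, 1.0, 2.0]   # selected points on the ability scale
    print(map_items_to_levels(ITEMS, reporting_levels))
```

In the procedure described above, subject-matter experts would then inspect the items mapped at each level in order to characterize what examinees at that level can do.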

Overview

LanguEdge Courseware (http://www.toefl.org/languedge.html) is a professional development tool designed to help teachers of English as a second language (ESL) build and assess the communicative skills of their students. The courseware package consists of interactive software (two full-length tests of reading, writing, speaking, and listening) and supporting materials (a teacher's guide, a scoring handbook, and a score interpretation guide). The package is based on the likely test format of a future version of the Test of English as a Foreign Language (TOEFL), which will employ tasks that integrate speaking and writing with reading and listening.

The purpose of this study was to try out procedures that might, eventually, prove useful for generating information to help LanguEdge Courseware users better understand the meaning (or validity) of LanguEdge test scores. Specifically, the objective was to describe, for each of the four sections of the test, relevant characteristics of test takers at various test score levels, thereby helping to establish the validity of test score distinctions among test takers.

To accomplish this objective, we gathered data that represent two different perspectives: those of instructors and those of students themselves. The collection of multiple sources of information is consistent with commonly accepted standards for test validation (Messick, 1989; American Educational Research Association, 1999). Instructors' assessments of students' English language skills were gathered because teachers seem well positioned to judge the academic skills of their students. The (less obvious) rationale for collecting student self-assessments was as follows. Self-assessments of various sorts (self-reports, checklists, self-testing, mutual peer assessment, diary-keeping, log books, behaviorally anchored questionnaires, global proficiency scales, and can-do statements; Oscarson, 1997) have proven to be useful indicators in a variety of evaluation contexts, especially in the assessment of language skills. Upshur (1975), for instance, noted that language learners typically have a wider view of their successes and failures than do external evaluators. More generally, Shrauger and Osberg (1981) concluded that there is substantial evidence, both empirical and conceptual, that self-assessors frequently have both the information and the motivation to make effective judgments about themselves.

Methods

Sample Selection

In the spring of 2002, approximately 3,000 candidates were recruited both internationally and domestically (United States and Canada) to participate in a field study of the LanguEdge Courseware. Each of these students took one of two parallel forms of the LanguEdge assessment at one of 18 domestic and 12 international test sites. After deleting records for test takers whose motivation was questionable, usable test data were available for 2,703 test takers.

The field study sample was generally representative of the TOEFL population in terms of native language. A majority (60%) of field study participants came from the following native language groups: Chinese (18%), Spanish (13%), Arabic (7%), Korean (7%), Japanese (5%), French (4%), Indonesian (3%), and Latvian (3%). These groups constitute approximately 61% of the TOEFL test-taking population and are represented in the following proportions: 23%, 5%, 5%, 12%, 13%, 2%, 1%, and <1%, respectively.

The field study sample was also generally representative of the TOEFL population in terms of level of English language proficiency as measured by the paper-and-pencil TOEFL. Both the domestic and international field study subsamples performed slightly better on each section of the TOEFL than did their counterparts in the operational TOEFL testing population. The mean scores on the Listening, Structure, and Reading sections, which range from 20 to 67 (or 68), were, respectively, 53.7, 51.8, and 52.9 for the study sample. The same mean scores for the TOEFL operational test population were 52.6, 49.3, and 51.6 (domestic test takers) and 50.5, 50.7, and 52.6 (international test takers). The differences between the study sample and the operational testing population were relatively small, ranging from approximately .03 to .34 standard deviation units on each of the three scales (Listening, Structure, and Reading).

Procedure/Instruments

Each study participant took the LanguEdge assessment along with a retired paper-based TOEFL test (TOEFL PPT). LanguEdge has four sections, corresponding to the four modalities of communication: Listening, Reading, Speaking, and Writing.

The LanguEdge assessment is composed of several different item types, including (a) conventional four-choice, single-correct-answer multiple-choice items, (b) multiple-choice items requiring one or more correct responses, (c) extended written response (essay) items, and (d) spoken response items. Productive response items (i.e., the Speaking and Writing items) require evaluation by trained human raters and are worth 1 to 5 points each.

With respect to scoring, raw score totals for Listening and Reading are calculated by summing the number of points awarded for each item answered correctly. Classical equipercentile equating methods were used to equate Listening and Reading scores across the two forms of the assessment. In addition to being equated, Listening and Reading scores were linearly scaled to have a minimum value of 1 and a maximum value of 25.

There are five Speaking tasks and three Writing tasks in each form of LanguEdge. Several of these tasks are designed to reflect the integrated nature of communicative language ability. Two of the Speaking tasks are integrated: one with Listening (Listening/Speaking) and one with Reading (Reading/Speaking). These tasks require examinees either to read or to listen to a stimulus and then to speak about it. Similarly, there are two integrated Writing tasks that are administered as part of the Listening and Reading sections (i.e., Listening/Writing and Reading/Writing). The remaining tasks (three Speaking and one Writing) are referred to as independent tasks, as responses do not require examinees to read or listen to an extended verbal stimulus. Scores on the five Speaking tasks and scores on the three Writing tasks comprise the Speaking and Writing total scores, respectively. Scores for these sections of the assessment have not been scaled or equated. Instead, scores are reported as the average of scores on each of the tasks.

Before they were tested, participants were also asked to complete a number of questions about their English language skills. Several kinds of self-assessment questions were developed. Two sets of can-do type statements were devised on the basis of reviews of existing statements (e.g., Tannenbaum, Rosenfeld, Breyer, & Wilson, 2003) and with regard to the claims being made for LanguEdge. Only statements that concerned academically related language competencies, not more general language skills, were written. One set (19 items) asked test takers to rate (on a 5-point scale ranging from "extremely well" to "not at all") their ability to perform each of several language tasks. The other set (20 items) asked test takers to indicate the extent to which they agreed or disagreed (on a 5-point scale ranging from "completely agree" to "completely disagree") with each of several other can-do statements. For each set, approximately equal numbers of questions addressed each of the four language modalities (Listening, Reading, Speaking, and Writing).
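Returning briefly to the scoring step described earlier in this subsection, the sketch below is a minimal, hypothetical illustration (not the operational ETS procedure) of classical equipercentile equating of one form to another, followed by linear rescaling to the reported 1-25 range. The score distributions, score range, and function names are all invented for illustration.

```python
import numpy as np

def percentile_ranks(scores):
    """Percentile rank of each possible raw score (midpoint convention)."""
    scores = np.asarray(scores)
    uniq = np.arange(scores.min(), scores.max() + 1)
    ranks = []
    for x in uniq:
        below = np.mean(scores < x)
        at = np.mean(scores == x)
        ranks.append(100 * (below + at / 2))   # midpoint percentile rank
    return uniq, np.array(ranks)

def equipercentile_equate(scores_b, scores_a):
    """Map each form-B raw score to the form-A raw score with the same percentile rank."""
    xb, pb = percentile_ranks(scores_b)
    xa, pa = percentile_ranks(scores_a)
    equated = np.interp(pb, pa, xa)            # interpolate form-A scores by percentile rank
    return dict(zip(xb.tolist(), equated.tolist()))

def rescale(raw, raw_min, raw_max, lo=1.0, hi=25.0):
    """Linearly rescale an (equated) raw score to the reported 1-25 scale."""
    return lo + (raw - raw_min) * (hi - lo) / (raw_max - raw_min)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    form_a = rng.binomial(40, 0.60, size=1500)   # hypothetical raw Reading scores, form A
    form_b = rng.binomial(40, 0.55, size=1500)   # hypothetical raw Reading scores, form B
    table = equipercentile_equate(form_b, form_a)
    print(round(rescale(table[24], 0, 40), 1))   # reported score for a form-B raw score of 24
```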

Test takers were also asked to compare (on a 5-point scale from "a lot higher" to "a lot lower") their English language ability in each of the four language modalities with that of other students, both in classes they were taking to learn English and also, if applicable, in subject classes (biology or business, for example) in which the instruction was in English. Test takers were also asked to provide a rating (on a 5-point scale from "extremely good" to "poor") of their overall English language ability. Finally, test takers who had taken some or all of their classes in English were asked to indicate (on a 5-point scale ranging from "not at all difficult" to "extremely difficult") how difficult it was for them to learn from courses because of problems with reading English or with understanding spoken English. They were also asked to indicate how much difficulty they had encountered when attempting to demonstrate what they had learned because of problems with speaking English or with writing English.

In addition to completing self-assessment questions about their language skills, study participants who tested at U.S. sites (but not international sites) were also asked to contact two people who had taught them during the past year and to give each one the Faculty Assessment Form. Study participants were asked to contact only people who had had some opportunity to observe their English language skills. Participants were told that after faculty completed the forms, faculty would mail the envelopes directly to us.

The instructions that accompanied the Faculty Assessment Form asked faculty to provide their opinions about the student's English language skills. Specifically, instructors were told that Educational Testing Service was developing a new TOEFL to facilitate the admission and placement of nonnative speakers of English in academic programs in North America and that, in conjunction with this effort, we were gathering a variety of information about the students who had taken the first version of the test in order to establish more firmly the meaning of scores on the new assessment. Instructors were also told that they had been asked to provide information because they had had relevant contact with the student who had contacted them. Finally, they were informed that their assessment would be treated confidentially and would not be shared with anyone, including the student.

The Faculty Assessment Form asked faculty to indicate (on a 5-point scale ranging from "not successful at all" to "extremely successful") how successful the student had been at:

(1) understanding lectures, discussions, and oral instructions
(2) understanding (a) the main ideas in reading assignments and (b) written instructions for exams/assignments
(3) making him/herself understood by you and other students during classroom and other discussions
(4) expressing ideas in writing and responding to assigned topics.

Faculty were also asked to compare (on a 7-point scale ranging from "well below average" to "well above average") the student's overall command of English with that of other nonnative English students they had taught. For each question, instructors were allowed to omit their rating, if appropriate, and to respond instead that they had not had adequate opportunity to observe the student's language skills. Instructors were also asked to indicate their current position or title, the approximate number of nonnative speakers of English they had taught at their current and previous academic institutions, and just how much opportunity they had had to observe the student's facility with the English language ("little if any," "some," "a moderate amount," or "a substantial amount"). A final item on the form requested the faculty member's telephone number and e-mail address for verification purposes only. This item was included only to discourage study participants from completing the form themselves.

Results

Student Self-assessments

It was important to first establish the extent to which test takers were consistent in reporting about their own language skills. For this purpose, 4-, 5-, or 6-item scales were formed by summing responses to individual items having the same response format (e.g., "how well" or "agree") for each language modality. Table 1 shows the number of items that comprised each scale, as well as the internal consistency reliability estimate (coefficient alpha) for each of the various scales. As is clear, each of the various scales exhibits reasonably high internal consistency, ranging from a low of .81 (for four items asking students to compare their English language skills with those of other students in English language classes) to .95 for the five-item scale asking students to rate how well they could perform various reading tasks.
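The following is a brief, hypothetical sketch of the two reliability-related quantities used in this report: coefficient alpha (the internal consistency estimate reported in Table 1) and the correction for attenuation applied in Tables 2 and 4, which, judging from the reported values, divides each observed correlation by the square root of the LanguEdge section reliability. The item data below are invented for illustration and are not the study data.

```python
import numpy as np

def coefficient_alpha(item_scores):
    """Cronbach's coefficient alpha for an (examinees x items) matrix of item scores."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

def disattenuate(r_observed, test_reliability):
    """Correct a correlation for unreliability of the test score only."""
    return r_observed / np.sqrt(test_reliability)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical responses of 500 students to a 5-item, 5-point self-assessment scale.
    ability = rng.normal(size=(500, 1))
    items = np.clip(np.rint(3 + ability + rng.normal(scale=0.8, size=(500, 5))), 1, 5)
    print(round(coefficient_alpha(items), 2))
    # Example consistent with the report: a raw Listening correlation of .47 and a
    # section reliability of .88 yield roughly the corrected value of .50 in Table 2.
    print(round(disattenuate(0.47, 0.88), 2))
```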

Table 1

Reliability Estimates for Language Skill Self-assessments

Scale                            Number of items   Coefficient alpha
How well scales
  Listening                             5                 .93
  Reading                               5                 .95
  Speaking                              5                 .93
  Writing                               4                 .89
  Composite                            19                 .97
Agreement scales
  Listening                             4                 .88
  Reading                               6                 .92
  Speaking                              5                 .89
  Writing                               5                 .91
Comparison scales
  Students in ESL classes               4                 .81
  Students in subject courses           4                 .88
Overall English ability                 4                 .84
Difficulty with English                 4                 .85

Note. Ns for scales range from 2,235 to 2,629 due to nonresponse to some questions.

The internal consistency reliability estimates for the LanguEdge test sections were .88, .89, .80, and .76 for the Listening, Reading, Speaking, and Writing sections, respectively. The intercorrelations among LanguEdge section scores ranged from .57 between Reading and Speaking to .76 between Listening and Reading. All other intercorrelations were in the mid to high .60s.

Table 2 shows the correlations of each of the various student self-assessment scales with performance on each section of LanguEdge. Generally, test scores related least strongly to the scales on which students were asked to compare their abilities to those of other students. They related most strongly, generally, to the various can-do scales (both those using a "how well" response format and those using an "agree" format). Of the four LanguEdge tests, Listening most often exhibited the strongest relationships to self-assessments; Speaking, the next strongest; Reading, the next; and Writing, the least.

Table 2

Correlations of Self-assessment Scales With LanguEdge Scores

                                              LanguEdge score
Self-assessment scale           M     SD   Listening   Reading    Speaking   Writing
How well scales
  Listening                   12.8   4.1   .47 (.50)  .31 (.33)  .49 (.55)  .29 (.33)
  Reading                     12.4   4.0   .46 (.49)  .41 (.43)  .42 (.47)  .31 (.36)
  Speaking                    14.0   4.1   .33 (.35)  .18 (.19)  .43 (.48)  .19 (.22)
  Writing                     11.4   3.1   .36 (.38)  .26 (.28)  .41 (.46)  .26 (.30)
  Composite                   51.0  13.5   .46 (.49)  .32 (.34)  .48 (.54)  .29 (.33)
Agreement scales
  Listening                    8.4   2.9   .48 (.51)  .34 (.36)  .46 (.51)  .28 (.32)
  Reading                     12.9   4.3   .51 (.54)  .43 (.46)  .44 (.49)  .32 (.37)
  Speaking                    11.0   3.7   .41 (.44)  .28 (.30)  .44 (.49)  .26 (.30)
  Writing                     11.4   3.7   .40 (.43)  .31 (.33)  .40 (.45)  .28 (.32)
  Composite                   43.8  13.3   .49 (.52)  .37 (.39)  .48 (.54)  .31 (.36)
Comparison scales
  Students in ESL classes     10.5   2.7   .25 (.27)  .14 (.15)  .33 (.37)  .16 (.18)
  Students in subject courses 11.1   3.1   .16 (.17)  .07 (.07)  .21 (.23)  .04 (.05)

(Table continues)

Table 2 (continued)

                                              LanguEdge score
Self-assessment scale           M     SD   Listening   Reading    Speaking   Writing
Overall English ability       11.2   3.0   .36 (.38)  .22 (.23)  .44 (.49)  .21 (.24)
Difficulty with English        8.1   2.9   .40 (.43)  .29 (.31)  .40 (.45)  .24 (.28)

Note. Ns range from 2,235 to 2,616 for Reading and Listening, from 818 to 952 for Speaking, and from 1,117 to 1,303 for Writing. The different Ns reflect mainly that not all responses could be scored in time to meet the schedule for data analysis. Entries in parentheses have been corrected for attenuation due to unreliability of LanguEdge scores.

Faculty Ratings

Faculty returned ratings for 819 of the study participants. For 637 participants, two ratings were available. The sample for whom faculty ratings were returned had slightly lower LanguEdge scores on average but was reasonably representative of the total study sample in terms of the range of test performances. Faculty who returned rating forms described their positions or titles as follows: faculty member (45%), teaching assistant (11%), ESL instructor (38%), and other (6%). Nearly all respondents reported having had an opportunity to observe the student's facility with English either "some" (17%), "a moderate amount" (40%), or "a substantial amount" (41%). (About 1% of the respondents said they had had "little if any" opportunity to observe the student's English language skills, and so they were deleted from the analysis.) Respondents reported having taught various numbers of nonnative speakers of English at their current and previous academic institutions, with 6% having taught fewer than 10 such students, 25% from 10 to 100, and 70% more than 100.

A scale consisting of all four faculty ratings (one for each language modality) was highly internally consistent, exhibiting a coefficient alpha of .91. Table 3 shows the agreement statistics between pairs of faculty raters for each of the four ratings, plus those for a fifth, which is an overall rating of students' language skills. As can be seen, the agreement rates are modest, indicating that instructors did not agree completely, possibly because of different perspectives, about the English language skills of the students they taught.

Rates of exact agreement ranged from 39% to 50%, and rates of agreement that were exact or within one point ranged from 74% to 94%. Correlations between pairs of faculty raters ranged from .47 to .52, and Cohen's kappa ranged from .21 to .26. Weighted kappas ranged from .33 to .39. (Kappa values of .21 to .40 have been described by Landis and Koch [1977] as "fair.")

Table 3

Agreement Statistics for Faculty Ratings

                                                      Exact           Exact or                       Weighted
Faculty rating                                     agreement (%)    adjacent (%)     r      Kappa      kappa
In general, how successful has this student been:
  in understanding lectures, discussions, and
  oral instructions                                    49.8             94.0        .52      .26        .39
  at understanding (a) the main ideas in reading
  assignments and (b) written instructions for
  exams/assignments                                    47.3             92.7        .47      n.e.       n.e.
  at making him/herself understood by you and
  other students during classroom and other
  discussions                                          47.0             90.5        .51      .25        .37
  at expressing ideas in writing and responding
  to assigned topics                                   44.9             89.2        .47      .21        .33
Compared to other nonnative English students you
have taught, how is this student's overall
command of the English language?                       38.9             73.7        .49      .21        .36

Note. N = 637 test takers for whom two faculty ratings were available. n.e. = not estimable.
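As an illustration of the agreement statistics in Table 3, the following is a minimal sketch (hypothetical ratings, not the study's analysis code) that computes exact agreement, Cohen's kappa, and a linearly weighted kappa for two raters using 5-point ratings. The report does not state which weighting scheme was used, so linear weights here are an assumption.

```python
import numpy as np

def agreement_stats(r1, r2, n_categories=5):
    """Exact agreement, Cohen's kappa, and linearly weighted kappa for two raters."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    k = n_categories
    # Contingency table of rating pairs (categories assumed to be 1..k).
    table = np.zeros((k, k))
    for a, b in zip(r1, r2):
        table[a - 1, b - 1] += 1
    p = table / table.sum()                       # observed joint proportions
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))   # chance-expected proportions

    exact = np.trace(p)                           # proportion of exact agreement
    kappa = (exact - np.trace(expected)) / (1 - np.trace(expected))

    # Linear disagreement weights: 0 on the diagonal, growing with distance.
    i, j = np.indices((k, k))
    w = np.abs(i - j) / (k - 1)
    weighted_kappa = 1 - (w * p).sum() / (w * expected).sum()
    return exact, kappa, weighted_kappa

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    true_skill = rng.integers(1, 6, size=637)                      # hypothetical 637 students
    rater1 = np.clip(true_skill + rng.integers(-1, 2, size=637), 1, 5)
    rater2 = np.clip(true_skill + rng.integers(-1, 2, size=637), 1, 5)
    print(agreement_stats(rater1, rater2))
```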

Table 4 shows the correlations of faculty ratings (the mean of two ratings when available) with each of the LanguEdge section test scores. With few exceptions, these correlations are all in the .40s, with some reaching the .50s. The correlations between the various student self-assessment scales and faculty ratings were modest, ranging from .09 to .41, with a majority (65%) falling in the .30s. These correlations suggest that students and faculty had different perspectives on students' English language skills.

Table 4

Correlation of Instructor Ratings With LanguEdge Scores

                                                        LanguEdge test score
Faculty rating                                        L          R          S          W
In general, how successful has this student been:
  in understanding lectures, discussions, and
  oral instructions                               .49 (.52)  .42 (.45)  .47 (.53)  .36 (.41)
  at understanding (a) the main ideas in reading
  assignments and (b) written instructions for
  exams/assignments                               .47 (.50)  .45 (.48)  .42 (.47)  .40 (.46)
  at making him/herself understood by you and
  other students during classroom and other
  discussions                                     .43 (.46)  .36 (.38)  .42 (.47)  .35 (.40)
  at expressing ideas in writing and responding
  to assigned topics                              .45 (.48)  .43 (.46)  .42 (.47)  .42 (.48)
Composite rating (sum of the four above)          .52 (.55)  .47 (.50)  .51 (.57)  .44 (.50)
Compared to other nonnative English students you
have taught, how is this student's overall
command of the English language?                  .51 (.54)  .45 (.48)  .53 (.59)  .41 (.47)

Note. Ns range from 400 to 465 for Writing, from 260 to 303 for Speaking, and from 716 to 819 for Listening and for Reading. All correlations are significant at the .001 level or beyond. Entries in parentheses have been corrected for attenuation due to unreliability of LanguEdge scores.

Characteristics of Test Takers at LanguEdge Score Levels

Tables 5-8 show, for each LanguEdge test section, the relationships between score level and both student self-assessments and instructor ratings. Table entries are percentages of either students or instructors who gave various responses to each question.

For instance, the first line in Table 5 shows, by test takers' score level, the percentages of instructors who judged that students at the score level had been more than moderately successful (i.e., "very successful" or "extremely successful") in understanding lectures, etc. Each table contains only the assessments and ratings for the language modality matching the test section. For example, Table 5 shows that, for Listening scores, 32% of faculty participants felt that test takers who scored at the lowest level (1-5) had been more than moderately successful in understanding lectures, discussions, and oral instructions. On the other hand, students who scored at the highest level on the Listening test (21-25) were judged much more often (by 90% of faculty raters) as being more than moderately successful. The corresponding faculty ratings for (a) Reading (success at understanding main ideas in reading assignments and written instructions for exams/assignments), (b) Speaking (success at making him/herself understood by faculty and students during classroom and other discussions), and (c) Writing (success in expressing ideas in writing and responding to assigned topics) are shown in Tables 6, 7, and 8, respectively.
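A minimal sketch of how entries like those in Tables 5-8 can be computed follows: group test takers into score bands and report, within each band, the percentage endorsing a given statement. The data, band boundaries as applied here, and endorsement coding are hypothetical.

```python
import numpy as np

# Listening score bands used in Table 5 of the report.
BANDS = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25)]

def percent_endorsing_by_band(scores, endorsed, bands=BANDS):
    """Percentage of test takers in each score band who endorsed a statement.

    scores   -- array of LanguEdge section scores
    endorsed -- boolean array, True if the test taker agreed with the statement
    """
    scores, endorsed = np.asarray(scores), np.asarray(endorsed)
    out = {}
    for lo, hi in bands:
        in_band = (scores >= lo) & (scores <= hi)
        n = in_band.sum()
        out[f"{lo}-{hi}"] = round(100 * endorsed[in_band].mean(), 1) if n else None
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    listening = rng.integers(1, 26, size=2000)              # hypothetical scaled scores
    # Hypothetical endorsement that becomes more likely as scores rise.
    agree = rng.random(2000) < (0.2 + 0.03 * listening)
    print(percent_endorsing_by_band(listening, agree))
```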

Table 5

Key Descriptors of LanguEdge Learners by Listening Score Level

                                                                 Test score level
Descriptor                                                   1-5   6-10  11-15  16-20  21-25

Faculty (%)
Judging that students had been more than moderately
successful at understanding lectures, discussions, and
oral instructions                                            32     42     60     78     90
Who felt that students' overall command of English was at
least somewhat above average when compared with other
nonnative students they had taught                           13     22     41     68     77

Students (%)
Who agreed that they could:
  remember the most important points in a lecture            34     37     45     63     78
  understand instructors' directions about assignments
  and their due dates                                        43     60     75     89     95
  recognize which points in a lecture are important and
  which are less so                                          33     41     57     72     84
  relate information that they hear to what they know        29     42     63     76     88
Who said they did not perform well at:
  understanding the main ideas of lectures and
  conversations                                              31     29     14      5      2
  understanding important facts and details of lectures      36     33     19      9      5
  understanding the relationships among ideas in a lecture   36     32     18     11      5
  understanding a speaker's attitude or opinion              38     28     15      9      5
  recognizing why a speaker is saying something              43     32     16     10      4

(Table continues)

Table 5 (continued)

                                                                 Test score level
Descriptor                                                   1-5   6-10  11-15  16-20  21-25

Students (%)
Who felt their listening ability was lower than that of
other students in ESL classes                                29     22     14     10      5
Who felt that problems understanding spoken English made
learning difficult                                           52     41     35     22     10

Table 6

Key Descriptors of LanguEdge Learners by Reading Score Level

                                                                 Test score level
Descriptor                                                   1-5   6-10  11-15  16-20  21-25

Faculty (%)
Judging that students had been more than moderately
successful at understanding the main ideas in reading
assignments and written instructions for exams               36     54     73     83     92
Who felt that students' overall command of English was at
least somewhat above average when compared with other
nonnative students they had taught                           20     33     50     66     83

Students (%)
Who agreed that they could:
  quickly find information in academic texts                 42     49     62     75     86
  understand the most important points when reading an
  academic text                                              40     55     71     83     91
  figure out the meaning of unknown words by using
  context and background knowledge                           34     43     60     71     83
  remember major ideas when reading an academic text         42     50     66     75     85
  understand charts and graphs in academic texts             42     53     73     84     91
  understand academic texts well enough to answer
  questions about them                                       41     45     63     75     85

(Table continues)

Table 6 (continued)

                                                                 Test score level
Descriptor                                                   1-5   6-10  11-15  16-20  21-25

Students (%)
Who said they did not perform well at:
  understanding vocabulary and grammar                       25     26     13      5      2
  understanding major ideas                                  20     13      6      2      0
  understanding how the ideas in a text relate to each
  other                                                      26     23     10      6      3
  understanding the relative importance of ideas             28     18      9      4      3
  organizing or outlining the important ideas and
  concepts in texts                                          29     26     12      7      4
Who felt their reading ability was lower than that of
other students in ESL classes                                18     14      5      4      2
Who felt that problems reading English made learning
difficult                                                    43     29     16     10      6

Table 7

Key Descriptors of LanguEdge Learners by Speaking Score Level

                                                                 Test score level
Descriptor                                                    1-2    2-3    3-4    4-5

Faculty (%)
Judging that students had been more than moderately
successful at making himself/herself understood by faculty
and students during classroom and other discussions           44     57     76     86
Who felt that students' overall command of English was at
least somewhat above average when compared with other
nonnative students they had taught                            22     34     72     83

Students (%)
Who agreed that they could:
  state and support their opinion                             31     51     68     85
  make themselves understood when asking a question           56     70     81     93
  talk for a few minutes about a familiar topic               39     66     73     90
  give prepared presentations                                 38     62     78     90
  talk about facts or theories they know well and explain
  them in English                                             28     55     68     82

(Table continues)

Table 7 (continued)

                                                                 Test score level
Descriptor                                                    1-2    2-3    3-4    4-5

Students (%)
Who said they did not perform well at:
  speaking for one minute in response to a question           53     36     23     17
  getting other people to understand them                     26     16      7      5
  participating in conversations or discussions               36     25     17      6
  orally summarizing information from a lecture listened
  to in English                                               47     38     25      8
  orally summarizing information they have read in English    40     23     16      7
Who felt their speaking ability was lower than that of
other students in ESL classes                                 19     17     11      6
Who felt that problems speaking English made it difficult
to demonstrate learning                                       46     41     25     13

Table 8

Key Descriptors of LanguEdge Learners by Writing Score Level

                                                                 Test score level
Descriptor                                                    1-2    2-3    3-4    4-5

Faculty (%)
Judging that students had been more than moderately
successful at expressing ideas in writing and responding
to assigned topics                                            35     68     77     83
Who felt that students' overall command of English was at
least somewhat above average when compared with other
nonnative students they had taught                            30     61     73     88

Students (%)
Who agreed that they could:
  express ideas & arguments effectively when writing in
  English                                                     43     62     70     76
  support ideas with examples or data when writing            48     63     77     77
  write texts that are long enough without writing too
  much                                                        41     58     68     73
  organize text so that the reader understands the main
  and supporting ideas                                        51     69     80     85
  write more or less formally depending on the purpose
  and the reader                                              42     58     68     75

(Table continues)

Table 8 (continued)

                                                                 Test score level
Descriptor                                                    1-2    2-3    3-4    4-5

Students (%)
Who said they did not perform well at:
  writing an essay in class on an assigned topic              30     19     11     10
  summarizing & paraphrasing in writing information read
  in English                                                  26     15     10      9
  summarizing in writing information that was listened to
  in English                                                  39     30     20     14
  using correct grammar, vocabulary, spelling and
  punctuation when writing                                    39     28     16     12
Who felt their writing ability was lower than that of
other students in ESL classes                                 20     13      8      8
Who felt that problems writing English made it difficult
to demonstrate learning                                       38     29     18     17

Student self-assessments are shown in a similar manner in each table. For example, Table 5 reveals that 34% of the students who obtained LanguEdge Listening scores of 1-5 agreed that they could remember important points in a lecture, whereas 78% of those at the highest level (21-25) agreed that they could do this. We note that for all but one of the various ratings (understanding vocabulary and grammar), percentages increase (or decrease) monotonically as expected.

Finally, it may be useful to LanguEdge users to know how test takers viewed the various tasks that make up the assessment, that is, how valid the tasks appeared to be. Table 9 shows the reactions of field study participants to each of the LanguEdge tasks. As can be seen, students generally viewed the tasks as being appropriate ones on which to demonstrate their English language skills. With the exception of two speaking tasks (speaking about a lecture and speaking about a reading passage), each of the tasks was deemed by nearly 80% (or more) of test takers to have been a good way in which to demonstrate their skills.

Table 9

Test Taker Agreement With Statements About LanguEdge Tasks

                                                                      Percent agreeing or
Statement                                                              strongly agreeing
Writing about a general topic was a good way to demonstrate my
ability to write in English.                                                   90
This was a good test of my ability to understand conversations
and lectures in English.                                                       82
Answering questions about single points or details in the reading
text was a good way for me to demonstrate my reading ability.                  82
Answering questions by organizing information from the entire
reading passage into a table was a good way for me to demonstrate
my reading ability.                                                            82

(Table continues)

Table 9 (continued)

                                                                      Percent agreeing or
Statement                                                              strongly agreeing
This was a good test of my ability to read and understand
academic texts in English.                                                     80
Writing about a reading passage was a good way to demonstrate my
ability to write in English.                                                   79
Speaking about general topics was a good way to demonstrate my
ability to speak in English.                                                   78
Writing about a lecture was a good way to demonstrate my ability
to write in English.                                                           78
Speaking about a lecture was a good way to demonstrate my ability
to speak in English.                                                           65
Speaking about a reading passage was a good way to demonstrate my
ability to speak in English.                                                   62

Note. Ns range from 2,685 to 2,694.

Discussion

Although faculty ratings and student self-assessments proved to relate only modestly to each other, both related significantly to scores on each section of the LanguEdge assessment. LanguEdge test scores related moderately (correlations mostly in the .30s and .40s) with student self-assessments. The correlations of faculty ratings with each of the LanguEdge section test scores were generally in the .40s, with some reaching the .50s. Moreover, individually, each of the faculty ratings and student self-assessment questions distinguished among test takers scoring at different levels on the assessments. This was true for each of the four LanguEdge test sections. The correlations between the various student self-assessment scales and faculty ratings were modest, mostly in the .30s, suggesting that students and faculty had different perspectives on students' English language skills.

How do the correlations between self-assessments and test scores found in this study compare with those detected in other efforts? The answer is generally quite favorably.

For instance, several reviews or meta-analyses have been conducted in which self-assessments have been shown to correlate, on average, about .35 with peer and supervisor ratings (Harris & Schaubroeck, 1988), about .29 with a variety of performance measures (Mabe & West, 1982), about .39 with teacher evaluations (Falchikov & Boud, 1989), and in the .60s for studies dealing with self-assessment in second and foreign languages (Ross, 1998). The correlations computed here also compare favorably with those typically found in test validity studies. For instance, in the context of graduate admissions, Graduate Record Examinations (GRE) General Test scores generally correlate in the .20 to .40 range with graduate grade averages (Briel, O'Neill, & Scheuneman, 1993; Kuncel, Hezlett, & Ones, 2001) and in the .30 to .50 range with such criteria as faculty ratings and performance on comprehensive examinations (Kuncel, Hezlett, & Ones, 2001).

We believe, therefore, that the validity criteria employed here (i.e., faculty judgments and student self-assessments) may prove useful in providing additional meaning to LanguEdge test scores. An obvious limitation of the study, however, is that we have provided no validation of students' self-assessments themselves. That is, we did not attempt to verify that students knew and could actually do what they said they could do (beyond, of course, obtaining somewhat similar ratings from faculty). Moreover, pairs of faculty members did not agree very strongly with regard to their assessments of the students they had taught. Despite this lack of agreement (which may simply reflect different but legitimate perspectives), LanguEdge scores correlated significantly with faculty ratings.

The strength of the study, we believe, is that, unlike previous efforts that have relied on internal anchors (i.e., the items constituting a test), we have enhanced test score meaning by referencing external anchors. A shortcoming of this study is that relatively few anchor items were administered, therefore precluding a more selective identification of the most discriminating items for score interpretation. Consequently, no attempt was made to summarize and interpret performance at the various score levels by generalizing across sets of items, as has been the practice for internal methods, for which much larger numbers of test items have usually been available.

Next steps in developing this methodology might be to take a more model-based (rather than solely data-driven) approach in order to provide more stable estimates of the relationships between test scores and validation criteria. In addition, a larger number of external anchors could be administered in order to select only those that exhibit the greatest ability to distinguish among score levels.
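One possible instantiation of the model-based approach suggested above (an assumption on our part, not a procedure described in this report) is to model the probability of endorsing each external anchor as a smooth function of the section score, for example with logistic regression, and then read off predicted endorsement rates at any score level rather than relying on banded observed percentages. The data and function names below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def smoothed_endorsement_curve(scores, endorsed, eval_points):
    """Fit P(endorse | score) with logistic regression and evaluate it at chosen scores."""
    model = LogisticRegression()
    model.fit(np.asarray(scores).reshape(-1, 1), np.asarray(endorsed).astype(int))
    probs = model.predict_proba(np.asarray(eval_points).reshape(-1, 1))[:, 1]
    return dict(zip(list(eval_points), np.round(100 * probs, 1)))

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    listening = rng.integers(1, 26, size=2000)               # hypothetical scaled scores
    agree = rng.random(2000) < 1 / (1 + np.exp(-(listening - 12) / 4))
    # Predicted percentage endorsing the statement at selected score levels.
    print(smoothed_endorsement_curve(listening, agree, [5, 10, 15, 20, 25]))
```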

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17, 191-204.

Briel, J. B., O'Neill, K. A., & Scheuneman, J. D. (Eds.). (1993). GRE technical manual: Test development, score interpretation, and research for the Graduate Record Examinations Program (pp. 67-88). Princeton, NJ: Educational Testing Service.

Falchikov, N., & Boud, D. (1989). Student self-assessment in higher education: A meta-analysis. Review of Educational Research, 59, 395-430.

Forsyth, R. A. (1991). Do NAEP scales yield valid criterion-referenced interpretations? Educational Measurement: Issues and Practice, 10, 3-9, 16.

Hambleton, R. K., & Slater, S. (1994, October). Using performance standards to report national and state assessment data: Are the reports understandable and how can they be improved? Paper presented at the Joint Conference on Standard Setting for Large-Scale Assessments, Washington, DC.

Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisor, self-peer, and peer-supervisor ratings. Personnel Psychology, 41, 43-62.

Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). Comprehensive meta-analysis of the predictive validity of the Graduate Record Examinations: Implications for graduate student selection and performance. Psychological Bulletin, 127, 162-181.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.

Mabe, P. A., & West, S. G. (1982). Validity of self-evaluation of ability: A review and meta-analysis. Journal of Applied Psychology, 67, 280-296.

Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: American Council on Education.

Mullis, I. V. S., & Jenkins, L. B. (1988). The science report card: Elements of risk and recovery. Princeton, NJ: Educational Testing Service.

Oscarson, M. (1997). Self-assessment of foreign and second language proficiency. In C. Clapham & D. Corson (Eds.), The encyclopedia of language and education: Vol. 7. Language testing and assessment (pp. 175-187). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of experiential factors. Language Testing, 15, 1-20.

Shrauger, J. S., & Osberg, T. M. (1981). The relative accuracy of self-predictions and judgments by others of psychological assessment. Psychological Bulletin, 90, 322-351.

Tannenbaum, R. J., Rosenfeld, M., Breyer, F. J., & Wilson, K. (2003). Linking TOEIC scores to self-assessments of English-language abilities: A study of score interpretation. Manuscript submitted for publication.

Upshur, J. (1975). Objective evaluation of oral proficiency in the ESOL classroom. In L. Palmer & B. Spolsky (Eds.), Papers on language testing 1967-1974 (pp. 53-65). Washington, DC: TESOL.