Technology and Writing Assessment: Lessons Learned from the US National Assessment of Educational Progress




Randy Elliot Bennett
Educational Testing Service, Princeton, NJ 08541
rbennett@ets.org

1 This paper is from a keynote presentation made at the annual conference of the International Association for Educational Assessment, Singapore, May 2006. Portions of the paper were adapted from a previous presentation to the annual conference of the Dutch Examinations Association (Nederlandse Vereniging voor Examens [NVE]), Almelo, the Netherlands, November 2005.

In this paper, I review the results of two research studies conducted for the US National Assessment of Educational Progress (NAEP). NAEP is a sample survey of what US students in grades 4, 8, and 12 know and can do. The studies I review concern doing writing assessment on computer.

Why would one want to do writing assessment on computer? The answer is simple. The tools of choice in the workplace and in advanced academic environments have changed; the writing tool of choice is now the computer. At the school level, a similar transition is occurring, not only in the US but in many other countries. PISA reports that, as of 2003, 48% of 15-year-olds across its participating countries indicated using a word processor at least a few times a week (see Figure 1). Three years later, that percentage is surely higher.

Figure 1: Percentage of students reporting they use word processing "almost every day" or "a few times each week," as of 2003. [Bar chart of percentages for individual countries, running from Canada and the United States at one end through Japan and the United Kingdom at the other, with the OECD average included.] Note. From Schleicher (2006).

As students change the medium in which they routinely write, the idea of testing them in a different medium becomes problematic from validity, fairness, and credibility perspectives. Even so, for practical reasons, the transition of large-scale writing assessment from paper to computer delivery will be a gradual one. Given that fact, some students may take their writing tests on paper while others take their tests on computer. Does it matter? That is, are the scores from paper- and computer-based writing assessments comparable?

Are the Scores from Computer and Paper Writing Assessments Comparable?
Why should we care about the comparability of scores across delivery modes? We should care because, if delivery mode affects scores, our ability to draw valid conclusions from test results may be reduced under some circumstances. For example, it may be reduced:

- if we want to compare results over time and the delivery mode has changed from paper to computer;
- if we want to compare individuals, or aggregate results across them, when some individuals have taken the test on paper and some have taken it on computer (especially if the assignment to delivery mode was not voluntary); or
- if we want to compare individuals (or groups) taking the test on computer and computer delivery affects one type of individual more than another, such that the difference in scores between two individuals (or groups) suddenly becomes larger or smaller than it was for paper testing.

Under any of these circumstances, the conclusions we draw from test results may be wrong. Recognizing this fact, and knowing that NAEP would eventually need to assess writing proficiency on computer, the National Center for Education Statistics commissioned the Writing Online (WOL) study (Horkay, Bennett, Allen, & Kaplan, 2005). This study administered the same two essay questions to one group of students on paper and to another, comparable group on computer. Both groups were selected to be nationally representative samples of 8th-grade students (i.e., about 13 years old). The students taking the online writing test used a simplified word processor to enter their responses. This simplified word processor contained functions for cutting, pasting, and copying text; for undoing the most recent action; and for checking spelling (see Figure 2).

Figure 2: Screen shot of the Writing Online computer interface. Note. A writing prompt is on the left and a simplified word processor is on the right. From Horkay et al. (2005).
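At its simplest, the WOL design reduces the mode-comparability question to comparing the score distributions of the paper group and the computer group. The sketch below illustrates one such check using Welch's t statistic; all scores are invented, and the actual NAEP analysis was more elaborate (weighted national samples, multiple prompts, trained raters).

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance, group a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance, group b
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Invented 1-6 essay scores for a paper group and a computer group.
paper    = [4, 3, 5, 2, 4, 3, 4, 5, 3, 4]
computer = [3, 4, 4, 3, 5, 2, 4, 4, 3, 5]

t = welch_t(paper, computer)
print(f"Welch t = {t:.2f}")  # prints: Welch t = 0.00
```

With real samples, a |t| comfortably below about 2 would be consistent with finding no significant mode difference.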

It's worth noting that, if a student is a good typist who often uses similar word-processing tools to write and revise, he or she might write a better essay using this simplified word processor than using pencil and paper. On the other hand, if the student can't type, the essay written here may well be worse than if it had been composed in the traditional way.

To find out how experienced students were with computers, the group taking the online version of this essay-writing test completed three indicators of computer familiarity. Two of the indicators were based on students' responses to questions about, in one case, their extent of general computer use and, in the other, their frequency of using the computer for writing. The third indicator was a hands-on measure that included typing speed (the number of words typed in two minutes from a 78-word passage), typing accuracy (the sum of errors made in typing the passage), and text editing (the number of editing tasks completed correctly, including deleting, inserting, modifying, and moving text). Finally, the group taking the online writing test had also previously taken a paper writing test. This paper test consisted of two essays that were different from the ones employed in the online test.

What did the nationally representative sample of 8th-grade students taking the online writing test report about the extent to which they used computers for writing? In 2002, when these data were collected, 93 percent of students reported using a computer at least to some extent to write. But substantial and almost equal numbers said "to a large extent" (30%) versus "to a small extent" or "not at all" (29%).

Did 8th-grade students perform differently on the paper and computer writing tests? This study found no significant difference in scores between the group taking the writing test on paper and the group taking the same two essays on computer.
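The hands-on measure combines three components recorded on different scales (words typed, errors made, editing tasks completed). One conventional way to form such a composite, shown here purely as an illustration since the study's actual scoring procedure is not described above, is to standardize each component and average them, reversing the error count so that higher always means more familiar:

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of raw scores to mean 0, SD 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical raw measures for five students (all values invented):
words_per_2min = [18, 30, 25, 40, 22]  # typing speed
typing_errors  = [6, 2, 4, 1, 5]       # typing accuracy (fewer is better)
edits_correct  = [3, 7, 5, 8, 4]       # text-editing tasks completed

# Composite familiarity: the mean of the standardized components,
# with typing errors negated so all components point the same way.
components = [
    zscores(words_per_2min),
    [-z for z in zscores(typing_errors)],
    zscores(edits_correct),
]
familiarity = [mean(parts) for parts in zip(*components)]
```

Under this construction, the fourth student (fastest, most accurate, most edits completed) receives the highest composite and the first student the lowest.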
That result is logically consistent with the distribution of computer use for writing just cited, where a substantial proportion of students used the computer to a large extent and an almost equal proportion used it hardly at all.

Was computer familiarity associated with online test performance? This study found that the hands-on measure of computer familiarity significantly predicted online writing score, after controlling for paper writing performance. The greater the computer familiarity, the higher the online writing score. To give a sense of the strength of this association: for the same paper writing performance, a student with computer familiarity one standard deviation below the mean would be predicted to get a computer writing score almost one point lower, on a 1-6 scale, than a student with computer familiarity one standard deviation above the mean. Larger differences in computer familiarity between students with the same paper writing proficiency would be associated with correspondingly bigger discrepancies in computer writing scores.

If We Can Score Student Writing Automatically, Should We Care How the Machine Does It?

Regardless of delivery mode, the need to process results efficiently is important because student writing is time-consuming and costly to grade. Assuming that the comparability issues just

described can be dealt with, computer-based testing offers a potentially significant efficiency advantage. Because students taking online writing assessments enter their responses in digital form, those responses can be scored automatically.

Several US testing programs already use automated essay scoring operationally. These programs usually employ it in conjunction with human examiners when the assessment is for making high-stakes decisions; no human examiner is used when the decisions are of less consequence. Testing programs that employ automated essay scoring include the Graduate Management Admission Test (GMAT) Analytical Writing Assessment; ACCUPLACER WritePlacer Plus and COMPASS e-Write (used by US postsecondary institutions for placement in remedial writing courses); and the Indiana English 11 End-of-Course Assessment, which is used for certifying public schools as outstanding and, optionally, as one piece of evidence in assigning course grades to secondary school students. There are at least four commercially available automated essay scoring systems, including PEG (Project Essay Grade) (Measurement, Inc.), IEA (Intelligent Essay Assessor) (Pearson Knowledge Technologies), IntelliMetric (Vantage), and e-rater (ETS). In general, the scores generated by these programs agree about as well with human ratings as the human ratings agree among themselves (see Keith, 2003, for a review).

How does a machine score essays? The same general process is used by each of the commercially available programs. First, the developers identify relevant text features that can be extracted by computer (e.g., the similarity of the words used in an essay to the words used in high-scoring essays, the average word length, the frequency of grammatical errors, the number of words in the response). Second, they create a program to extract those features. Third, they combine the extracted features to form a score. And finally, they evaluate the machine scores empirically.
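That general process can be made concrete with a toy version. The features and weights below are invented and far simpler than those in any commercial system, but they follow the same pattern: extract machine-countable properties of the text, then combine them linearly into a score.

```python
import re

def features(essay, high_scoring_vocab):
    """Steps 1-2: extract a few machine-countable text features (toy subset)."""
    words = re.findall(r"[a-z']+", essay.lower())
    n = len(words)
    return {
        # overlap with vocabulary drawn from previously high-scoring essays
        "vocab_overlap": sum(w in high_scoring_vocab for w in words) / n,
        "avg_word_len": sum(len(w) for w in words) / n,
        "length": n,
    }

# Step 3: combine the features with weights. These weights are invented;
# an operational system derives them empirically, as discussed below.
WEIGHTS = {"vocab_overlap": 3.0, "avg_word_len": 0.4, "length": 0.01}

def machine_score(essay, high_scoring_vocab):
    f = features(essay, high_scoring_vocab)
    return sum(WEIGHTS[k] * f[k] for k in WEIGHTS)
```

Step 4, the empirical evaluation, would then compare `machine_score` against human ratings on a held-out set of essays.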
But even if its agreement with human scores is good, does it matter how the machine does the scoring? For instance, does it matter how the developers choose to combine the features extracted by the machine to form scores? The method that appears to be used by each of the commercial programs is a brute-empirical one. That is, they use the weighted combination of features that best predicts the scores human judges would assign to those essays under operational grading conditions.

Would writing experts choose those same combinatorial weights? In a study funded by the National Center for Education Statistics, we asked writing experts to weight, according to their best judgment, 12 features extracted by one automated essay scoring program (Bennett & Ben-Simon, 2005). The essays were taken from the NAEP Writing Online investigation described above. For purposes of the analysis, we grouped the features into five dimensions cited in previous research with the scoring program: Grammar, usage, mechanics, and style; Organization and development; Topical analysis (content); Word complexity; and Essay length. Two committees of experts were asked to weight each dimension independently on a 0-100 scale. These judgments were made by committee members in the abstract, without knowing how the automated essay scoring program worked.
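The contrast between the two weighting approaches can be sketched numerically. Below, "empirical" weights come from a least-squares fit of invented human scores on two invented features (the brute-empirical route), while "expert" weights are 0-100 committee-style judgments renormalized to proportions; every number is illustrative.

```python
import numpy as np

# Invented training data: each row is an essay with two extracted features
# (say, an organization rating and a mechanics-error count); y holds the
# scores human judges gave those essays.
X = np.array([[2.0, 1.0],
              [1.0, 2.0],
              [3.0, 3.0],
              [4.0, 2.0]])
y = np.array([2.5, 2.0, 4.0, 4.5])

# Brute-empirical weighting: the linear combination (plus intercept) that
# best predicts the human scores in a least-squares sense.
design = np.column_stack([X, np.ones(len(X))])
w, *_ = np.linalg.lstsq(design, y, rcond=None)
empirical = np.abs(w[:2]) / np.abs(w[:2]).sum()  # share of emphasis per feature

# Expert weighting: judgments on a 0-100 scale, renormalized to proportions.
expert = np.array([70.0, 30.0])
expert = expert / expert.sum()
```

Comparing `empirical` with `expert` on a common proportion scale mirrors the study's comparison of machine-derived and committee-assigned dimension weights.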

Interestingly, the two committees agreed with one another to a remarkable extent on the relative importance of these dimensions in defining quality for essays written by US 8th-grade students. Equally important, the committees disagreed to a remarkable extent with the empirical weights assigned by the automated scoring system. The expert committees believed that over 60% of the essay score should be based on the combination of Organization and development and Topical analysis. The empirical weights, in contrast, gave only about 20% of the emphasis to these dimensions. What received most of the empirical weight? The combination of Grammar, usage, mechanics, and style and Essay length received almost 70% of it. Yet the committees had awarded only 20% to 26% of the weight to the combination of these dimensions.

As noted, these judgments were made in the abstract, before the mechanics of the automated scoring had been described. After both committees received information about the way in which the dimensions were measured, and after one committee saw the empirical weights, the judgment process was repeated. The committee that saw the empirical weights moved closer in its judgments to those weights. However, each committee still gave less emphasis to Grammar, usage, mechanics, and style and to Essay length than did the empirical weights. Similarly, each committee gave more consideration to Organization and development and to Topical analysis than did the machine scoring. Finally, scores generated using the empirically informed committee weights proved to be about as meaningful as scores based solely on the empirical weights, suggesting that moderating the optimal weights does not necessarily lead to a reduction in scoring accuracy.

Conclusion

What lessons about technology and writing assessment did we learn from the NAEP studies described in this paper?
The first lesson is that scores from paper and computer writing tests may mean different things, because computer-based tests may directly measure not just writing skill but, for some students, computer facility. The implication of this finding is that, as long as significant numbers of students write better in one or the other mode, the method by which students are tested may produce different group proficiency estimates. For example, testing students in the mode in which they typically write may lead to different findings than assessing all students in either mode alone.

The second lesson is that how automated scoring is done matters! The weighting of text features derived by an automated scoring system may not be the same as the one that would result from the judgments of writing experts. Statistically optimal approaches may work in a predictive sense, but they may not work scientifically or educationally, especially if the dimensions students must concentrate on to improve scores are not the ones that writing experts most value.

References

Bennett, R. E., & Ben-Simon, A. (2005). Toward theoretically meaningful automated essay scoring. Unpublished final project report, Educational Testing Service.

Horkay, N., Bennett, R. E., Allen, N., & Kaplan, B. (2005). Online assessment in writing. In B. Sandene, N. Horkay, R. E. Bennett, N. Allen, J. Braswell, B. Kaplan, & A. Oranje

(Eds.), Online assessment in mathematics and writing: Reports from the NAEP Technology-Based Assessment Project (NCES 2005-457). Washington, DC: National Center for Education Statistics, US Department of Education. Retrieved March 9, 2006, from http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2005457

Keith, T. Z. (2003). Validity of automated essay scoring systems. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Erlbaum.

Schleicher, A. (2006). Are students ready for a technology-rich world? What PISA studies tell us. Paris, France: Organisation for Economic Co-operation and Development (OECD). Retrieved March 6, 2006, from http://www.pisa.oecd.org/dataoecd/28/53/35996337.ppt