GOTHENBURG MONOGRAPHS IN LINGUISTICS 24

Automatic Detection of Grammar Errors in Primary School Children's Texts
A Finite State Approach

Sylvana Sofkova Hashemi

Doctoral Dissertation

Publicly defended in Lilla Hörsalen, Humanisten, Göteborg University, on June 7, 2003, at 10.15, for the degree of Doctor of Philosophy

Department of Linguistics, Göteborg University, Sweden
ISBN 91-973895-5-2
© 2003 Sylvana Sofkova Hashemi
Typeset by the author using LaTeX
Printed by Intellecta Docusys, Göteborg, Sweden, 2003
Abstract

This thesis concerns the analysis of grammar errors in Swedish texts written by primary school children and the development of a finite state system for finding such errors. Grammar errors are more frequent for this group of writers than for adults, and the distribution of error types is different in children's texts. In addition, other writing errors above the word level are discussed here, including punctuation errors and spelling errors that result in existing words. The method used in the implemented tool FiniteCheck involves subtraction of finite state automata that represent grammars with varying degrees of detail, creating a machine that classifies phrases in a text containing certain kinds of errors. The current version of the system handles errors concerning agreement in noun phrases, and verb selection of finite and non-finite forms. At the lexical level, we attach all lexical tags to words and do not use a tagger, which could eliminate information in incorrect text that might be needed later to find the error. At higher levels, structural ambiguity is treated by parsing order, grammar extension and some other heuristics. The simple finite state technique of subtraction has the advantage that the grammars one needs to write to find errors are always positive, describing the valid rules of Swedish rather than the structure of errors. The rule sets remain quite small and practically no prediction of errors is necessary. The linguistic performance of the system is promising: for the implemented error types it shows results comparable to other Swedish grammar checking tools, when tested on a small adult text not previously analyzed by the system. The performance of the other Swedish tools was also tested on the children's data collected for this study, revealing quite low recall rates.
This fact motivates the need to adapt grammar checking techniques to children, whose errors differ from those of adult writers and pose more of a challenge to current grammar checkers, which are oriented towards texts written by adults. The robustness and modularity of FiniteCheck make it possible to perform both error detection and diagnostics. Moreover, the grammars can in principle be reused for other applications that do not necessarily have anything to do with error detection, such as extracting information from a given text or even parsing.

KEY WORDS: grammar errors, spelling errors, punctuation, children's writing, Swedish, language checking, light parsing, finite state technology
Acknowledgements

Work on this thesis would not have been possible without contributions, support and encouragement from many people. The idea of developing a writing tool to support children in their text production and grammar emerged from a study on how primary school children write by hand in comparison to when they use a computer. Special thanks to my colleague Torbjörn Lager, who inspired me to do this study and whose children attended the school where I gathered my data. My main supervisor Robin Cooper awakened the idea of using finite state methods for grammar checking and launched the collaboration with the Xerox research group. I want to express my greatest gratitude to him for inspiring discussions during project meetings and supervision sessions, and for his patience with my writing, struggling to understand every bit of it, always raising questions and always full of new exciting ideas. I really enjoyed our discussions and look forward to more. I would also like to thank my assistant supervisor Elisabet Engdahl, who carefully read my writing and made sure that I expressed myself more clearly. Many thanks to all my colleagues at the Department of Linguistics for creating an inspiring research environment with interesting projects, seminars and conferences. I especially want to mention Leif Grönqvist for being the helping hand next door whenever needed, Robert Andersson for being my project colleague, Stina Ericsson for the loan of a LaTeX manual and for always being helpful, Ulla Veres for help with recruitment of new victims for writing experiments, Jens Allwood and Elisabeth Ahlsén for introducing me to the world of transcription and coding, Sally Boyd, Nataliya Berbyuk and Ulrika Ferm for support and encouragement, Shirley Nicholson for always being available with books and also milk for coffee, and Pia Cromberger for always being ready for a chat.
A special thanks to Ylva Hård af Segerstad for fruitful discussions leading to future collaboration that I am looking forward to, and for being a friend. I also want to thank the children in my study and their teachers for providing me with their text creations, and Sven Strömqvist and Victoria Johansson for sharing their data collection. A special thanks to Genie Perdin who carefully proofread this thesis and gave me some encouraging last minute kicks. I also want to thank all my friends, who reminded me now and then about life outside the university. My deepest gratitude to my family for being there for me and for always believing in me. My husband Ali - I know the way was long and there were times I could be distant, but I am back. My daughter Sarah for being the sunshine of my life, my inspiration, my everything. My mother, father, sister and my big little brother... Sylvana Sofkova Hashemi Göteborg, May 2003
Table of Contents

1 Introduction 1
1.1 Written Language in a Computer Literate Society 1
1.2 Aim and Scope of the Study 3
1.3 Outline of the Thesis 5

I Writing 7

2 Writing and Grammar 9
2.1 Introduction 9
2.2 Research on Writing in General 10
2.3 Written Language and Computers 11
2.3.1 Learning to Write 11
2.3.2 The Influence of Computers on Writing 12
2.4 Studies of Grammar Errors 14
2.4.1 Introduction 14
2.4.2 Primary and Secondary Level Writers 14
2.4.3 Adult Writers 15
2.5 Conclusion 18

3 Data Collection and Analysis 21
3.1 Introduction 21
3.2 Data Collection 21
3.2.1 Introduction 21
3.2.2 The Sub-Corpora 23
3.3 Error Categories 25
3.3.1 Introduction 25
3.3.2 Spelling Errors 26
3.3.3 Grammar Errors 27
3.3.4 Spelling or Grammar Error? 28
3.3.5 Punctuation 31
3.4 Types of Analysis 32
3.5 Error Coding and Tools 34
3.5.1 Corpus Formats 34
3.5.2 CHAT-format and CLAN-software 34
4 Error Profile of the Data 37
4.1 Introduction 37
4.2 General Overview 37
4.3 Grammar Errors 41
4.3.1 Agreement in Noun Phrases 41
4.3.2 Agreement in Predicative Complement 50
4.3.3 Definiteness in Single Nouns 52
4.3.4 Pronoun Case 53
4.3.5 Verb Form 55
4.3.6 Sentence Structure 62
4.3.7 Word Choice 67
4.3.8 Reference 69
4.3.9 Other Grammar Errors 71
4.3.10 Distribution of Grammar Errors 72
4.3.11 Summary 77
4.4 Child Data vs. Other Data 77
4.4.1 Primary and Secondary Level Writers 77
4.4.2 Evaluation Texts of Proof Reading Tools 80
4.4.3 Scarrie's Error Database 85
4.4.4 Summary 88
4.5 Real Word Spelling Errors 89
4.5.1 Introduction 89
4.5.2 Spelling in Swedish 89
4.5.3 Segmentation Errors 91
4.5.4 Misspelled Words 94
4.5.5 Distribution of Real Word Spelling Errors 98
4.5.6 Summary 100
4.6 Punctuation 100
4.6.1 Introduction 100
4.6.2 General Overview of Sentence Delimitation 101
4.6.3 The Orthographic Sentence 103
4.6.4 Punctuation Errors 105
4.6.5 Summary 107
4.7 Conclusions 107
II Grammar Checking 111

5 Error Detection and Previous Systems 113
5.1 Introduction 113
5.2 What Is a Grammar Checker? 114
5.2.1 Spelling vs. Grammar Checking 114
5.2.2 Functionality 114
5.2.3 Performance Measures and Their Interpretation 115
5.3 Possibilities for Error Detection 117
5.3.1 Introduction 117
5.3.2 The Means for Detection 117
5.3.3 Summary and Conclusion 125
5.4 Grammar Checking Systems 128
5.4.1 Introduction 128
5.4.2 Methods and Techniques in Some Previous Systems 128
5.4.3 Current Swedish Systems 130
5.4.4 Overview of The Swedish Systems 134
5.4.5 Summary 142
5.5 Performance on Child Data 143
5.5.1 Introduction 143
5.5.2 Evaluation Procedure 143
5.5.3 The Systems' Detection Procedures 145
5.5.4 The Systems' Detection Results 146
5.5.5 Overall Detection Results 168
5.6 Summary and Conclusion 172

6 FiniteCheck: A Grammar Error Detector 173
6.1 Introduction 173
6.2 Finite State Methods and Tools 175
6.2.1 Finite State Methods in NLP 175
6.2.2 Regular Grammars and Automata 176
6.2.3 Xerox Finite State Tool 177
6.2.4 Finite State Parsing 180
6.3 System Architecture 184
6.3.1 Introduction 184
6.3.2 The System Flow 186
6.3.3 Types of Automata 189
6.4 The Lexicon 191
6.4.1 Composition of The Lexicon 191
6.4.2 The Tagset 193
6.4.3 Categories and Features 194
6.5 Broad Grammar 195
6.6 Parsing 196
6.6.1 Parsing Procedure 196
6.6.2 The Heuristics of Parsing Order 198
6.6.3 Further Ambiguity Resolution 201
6.6.4 Parsing Expansion and Adjustment 203
6.7 Narrow Grammar 205
6.7.1 Noun Phrase Grammar 205
6.7.2 Verb Grammar 210
6.8 Error Detection and Diagnosis 214
6.8.1 Introduction 214
6.8.2 Detection of Errors in Noun Phrases 215
6.8.3 Detection of Errors in the Verbal Head 216
6.9 Summary 216

7 Performance Results 219
7.1 Introduction 219
7.2 Initial Performance on Child Data 219
7.2.1 Performance Results: Phase I 219
7.2.2 Grammatical Coverage 220
7.2.3 Flagging Accuracy 223
7.3 Current Performance on Child Data 228
7.3.1 Introduction 228
7.3.2 Improving Flagging Accuracy 229
7.3.3 Performance Results: Phase II 232
7.4 Overview of Performance on Child Data 233
7.5 Performance on Other Text 237
7.5.1 Performance Results of FiniteCheck 237
7.5.2 Performance Results of Other Tools 240
7.5.3 Overview of Performance on Other Text 243
7.6 Summary and Conclusion 246

8 Summary and Conclusion 249
8.1 Introduction 249
8.2 Summary 249
8.2.1 Introduction 249
8.2.2 Children's Writing Errors 250
8.2.3 Diagnosis and Possibilities for Detection 251
8.2.4 Detection of Grammar Errors 253
8.3 Conclusion 255
8.4 Future Plans 256
8.4.1 Introduction 256
8.4.2 Improving the System 256
8.4.3 Expanding Detection 257
8.4.4 Generic Tool? 258
8.4.5 Learning to Write in the Information Society 258

Bibliography 260
Appendices 276
A Grammatical Feature Categories 279
B Error Corpora 281
B.1 Grammar Errors 282
B.2 Misspelled Words 293
B.3 Segmentation Errors 306
C SUC Tagset 313
D Implementation 315
D.1 Broad Grammar 315
D.2 Narrow Grammar: Noun Phrases 315
D.3 Narrow Grammar: Verb Phrases 318
D.4 Parser 319
D.5 Filtering 319
D.6 Error Finder 320
List of Tables

3.1 Child Data Overview 22
4.1 General Overview of Sub-Corpora 38
4.2 General Overview by Age 39
4.3 General Overview of Spelling Errors in Sub-Corpora 40
4.4 General Overview of Spelling Errors by Age 40
4.5 Number Agreement in Swedish 42
4.6 Gender Agreement in Swedish 42
4.7 Definiteness Agreement in Swedish 42
4.8 Noun Phrases with Proper Nouns as Head 44
4.9 Noun Phrases with Pronouns as Head 44
4.10 Noun Phrases without (Nominal) Head 45
4.11 Agreement in Partitive Noun Phrase in Swedish 45
4.12 Gender and Number Agreement in Predicative Complement 50
4.13 Personal Pronouns in Swedish 54
4.14 Finite and Non-finite Verb Forms 55
4.15 Tense Structure 56
4.16 Fa-sentence Word Order 63
4.17 Af-sentence Word Order 63
4.18 Distribution of Grammar Errors in Sub-Corpora 74
4.19 Distribution of Grammar Errors by Age 74
4.20 Examples of Grammar Errors in Teleman's Study 78
4.21 Examples of Grammar Errors from the Skrivsyntax Project 79
4.22 Grammar Errors in the Evaluation Texts of Grammatifix 81
4.23 Grammar Errors in Granska's Evaluation Corpus 82
4.24 General Error Ratio in Grammatifix, Granska and Child Data 83
4.25 Three Error Types in Grammatifix, Granska and Child Data 83
4.26 Grammar Errors in Scarrie's ECD and Child Data 86
4.27 Examples of Spelling Error Categories 90
4.28 Spelling Variants 91
4.29 Distribution of Real Word Segmentation Errors 91
4.30 Distribution of Real Word Spelling Errors in Sub-Corpora 99
4.31 Distribution of Real Word Spelling Errors by Age 99
4.32 Sentence Delimitation in the Sub-Corpora 103
4.33 Sentence Delimitation by Age 103
4.34 Major Delimiter Errors in Sub-Corpora 105
4.35 Major Delimiter Errors by Age 105
4.36 Comma Errors in Sub-Corpora 106
4.37 Comma Errors by Age 107
5.1 Summary of Detection Possibilities in Child Data 126
5.2 Overview of the Grammar Error Types in Grammatifix (GF), Granska (GR) and Scarrie (SC) 137
5.3 Overview of the Performance of Grammatifix, Granska and Scarrie 141
5.4 Performance Results of Grammatifix on Child Data 169
5.5 Performance Results of Granska on Child Data 169
5.6 Performance Results of Scarrie on Child Data 170
5.7 Performance Results of Targeted Errors 171
6.1 Some Expressions and Operators in XFST 178
6.2 Types of Directed Replacement 179
6.3 Noun Phrase Types 206
7.1 Performance Results on Child Data: Phase I 220
7.2 False Alarms in Noun Phrases: Phase I 224
7.3 False Alarms in Finite Verbs: Phase I 226
7.4 False Alarms in Verb Clusters: Phase I 227
7.5 False Alarms in Noun Phrases: Phase II 229
7.6 False Alarms in Finite Verbs: Phase II 231
7.7 False Alarms in Verb Clusters: Phase II 231
7.8 Performance Results on Child Data: Phase II 232
7.9 Performance Results of FiniteCheck on Other Text 237
7.10 Performance Results of Grammatifix on Other Text 240
7.11 Performance Results of Granska on Other Text 241
7.12 Performance Results of Scarrie on Other Text 242
List of Figures

3.1 Principles for Error Categorization 31
4.1 Grammar Error Distribution 73
4.2 Error Density in Sub-Corpora 76
4.3 Error Density in Age Groups 76
4.4 Three Error Types in Grammatifix (black line), Granska (gray line) and Child Data (white line) 84
4.5 Error Distribution of Selected Error Types in Scarrie 87
4.6 Error Distribution of Selected Error Types in Child Data 87
6.1 The System Architecture of FiniteCheck 185
7.1 False Alarms: Phase I vs. Phase II 233
7.2 Overview of Recall in Child Data 234
7.3 Overview of Precision in Child Data 235
7.4 Overview of Overall Performance in Child Data 236
7.5 Overview of Recall in Other Text 244
7.6 Overview of Precision in Other Text 244
7.7 Overview of Overall Performance in Other Text 245
Chapter 1

Introduction

1.1 Written Language in a Computer Literate Society

Written language plays an important role in our society. A great deal of our communication occurs by means of writing, which besides the traditional paper and pen is facilitated by the computer, the Internet and other devices such as the mobile phone. Word processing and sending messages via email are among the most common activities on computers. Other media that enable written communication are also becoming popular, such as webchat and instant messaging on the Internet, or text messaging (Short-Message-Service, SMS) via the mobile phone.1

The present doctoral dissertation concerns word processing on computers, in particular the linguistic tools integrated in such authoring aids. The use of word processors for writing in both educational and professional settings modifies the process, practice and acquisition of writing. With a word processor, it is not only easy to produce a text with a neat layout; it also supports the writer throughout the whole writing process. Text may be restructured and revised at any time during text production without leaving any trace of the changes that have been made. Text may be reused and a new text composed by cutting and pasting passages. Iconic material such as pictures2 (or even sounds) can be inserted, and linguistic aids can be used for proofreading a text.

Writing acquisition can also be enhanced by use of a word processor. For instance, focus on the more technical aspects, such as physically shaping letters with a pen, shifts toward the more cognitive processes of text

1 Studies of computer-mediated communication are provided by e.g. Severinson Eklundh (1994); Crystal (2001); Herring (2001). A recent dissertation by Hård af Segerstad (2002) explores especially how written Swedish is used in email, webchat and SMS.
2 Smileys or emoticons (e.g. :-) 'happy face') are increasingly used in computer-mediated communication.
production, enabling the writer to apply the whole language register. Writing on a computer generally enhances both the motivation to write and the willingness to revise or completely change a text (cf. Wresch, 1984; Daiute, 1985; Severinson Eklundh, 1993; Pontecorvo, 1997).

The status of written language in our modern information society has developed. In contrast to ancient times, writing is no longer reserved for just a small minority of professional groups (e.g. priests and monks, bankers, important merchants). In particular, the emergence of computers in writing has led to the involvement of new user groups besides today's writing professionals such as journalists, novelists and scientists. We write more nowadays in general, and the freedom of and control over one's own writing has increased. Texts are produced rapidly and are more seldom proofread by a careful secretary with knowledge of language. This is sometimes reflected in the quality and correctness of the resulting text (cf. Severinson Eklundh, 1995).

Linguistic tools that check mechanics, grammar and style have taken over the secretarial function to some degree and are usually integrated in word processing software. Spelling checkers and hyphenators, which check writing mechanics and identify violations in individual words, have existed for some time now. Grammar checkers, which recognize syntactic errors and often also violations of punctuation, word capitalization conventions, number and date formatting and other style-related issues, thus working above the word level, are a rather new technology, especially for small languages like Swedish. Grammar checking tools for languages such as English, French, Dutch, Spanish and Greek were being developed in the 1980s, whereas research on Swedish writing aids aimed at grammatical deviance started quite recently. In addition to the present work, there are three research groups working in this area.
The Department of Numerical Analysis and Computer Science (NADA) at the Royal Institute of Technology (KTH), with a long tradition of research in writing and authoring aids, is responsible for Granska. Development of this tool has occurred over a series of projects starting in 1994 (Domeij et al., 1996, 1998; Carlberger et al., 2002). The Department of Linguistics, Uppsala University was involved in an EU-sponsored project, Scarrie, between 1996 and 1999. The goal of this project was the development of language tools for Danish, Norwegian and Swedish (Sågvall Hein, 1998a; Sågvall Hein et al., 1999). Finally, the Finnish language engineering company Lingsoft Inc. developed Grammatifix. Initiated in 1997 and completed in 1999, this tool was released on the market in November 1998, and has been part of the Swedish Microsoft Office Package since 2000 (Arppe, 2000; Birn, 2000).

The three Swedish systems mainly use parsing techniques with some degree of feature relaxation and/or explicit error rules for detection of errors. Grammatifix and Granska are developed as generic tools and are tested on adult (mostly professional) texts. Scarrie's end-users are professional writers from newspapers and publishing firms.

1.2 Aim and Scope of the Study

The primary purpose of the present work is to detect grammar errors by means of linguistic descriptions of correct language use rather than descriptions of the structure of errors. The ideal would be to develop a generic method for detection of grammar errors in unrestricted text that could be applied to different writing populations displaying different error types, without the need to rewrite the grammars of the system. That is, instead of describing the errors made by different groups of writers, resulting in distinct sets of error rules, the same grammar set is used for detection. This approach of identifying errors in text without explicitly describing them contrasts with the other three Swedish grammar checkers. Using this method, we will hopefully cover many different cases of errors and minimize the risk of overlooking some of them.

We chose primary school children as the targeted population, a new group of users not covered by the previous Swedish projects. Children, as beginning writers, are in the process of acquiring written language, unlike adult writers, and will probably produce relatively more errors, and errors of a different kind, than adults. Their writing errors probably have more to do with competence than performance. Grammar checkers for this group have to have different coverage and concentrate on different kinds of errors. Further, the positive impact of computers on children's writing opens new opportunities for the application of language technology. The role of proofreading tools for educational purposes is a rather new application area, and this work can be considered a first step in that direction. Against this background, the main goal of the present thesis is handling children's errors and experimenting with positive grammatical descriptions using finite state techniques.
The work is divided into three subtasks: first, an overall error analysis of the collected children's texts; then an exploration of the nature of the errors and the possibilities for detecting them; and finally, implementation of detection for (some) grammatical error types. Here is a brief characterization of these three tasks:

I. Investigation of children's writing errors: The targeted data for a grammar checker can be selected either by intuitions about errors that will probably occur, or by directly looking at errors that actually occur. In the present work, the second approach of empirical analysis is applied. Texts from pupils at three primary schools were collected and analyzed for errors, focusing on errors above the word level, including grammar errors, spelling errors resulting in existing words, and punctuation. The main focus lies on grammar errors
as the basis for implementation. The questions that arise are: What grammar errors occur? How should the errors be categorized? What spelling errors result in lexicalized strings and are not captured by a spelling checker? What is the nature of these? How is punctuation used and what errors occur?

II. Investigation of the possibilities for detection of these writing errors: The nature of the errors will be explored along with available technology that can be applied in order to detect them. An interesting point is how the errors that are found are handled by the current systems. The questions that arise are: What is the nature of the error? What is the diagnosis of the error? What is needed to be able to detect the error? How are the grammar errors handled by the current Swedish grammar checkers, Grammatifix, Granska and Scarrie?

III. Implementation of the detection of (some) grammar errors: A subset of errors will be chosen for implementation, and the implementation will concern grammar checking to the level of detecting errors. Errors will obtain a description of the type of error detected. Implementation will not include any additional diagnosis or any suggestion of how to correct the error. The analysis will be shallow, using finite state techniques. The grammars will describe real syntactic relations rather than the structure of erroneous patterns. The errors will be revealed by the difference between grammars of varying accuracy, which, as finite state automata, can be subtracted from each other. Karttunen et al. (1997a) use this technique to find instances of invalid dates, and this is an attempt to apply their approach to a larger language domain.

The work on this grammar error detector started at the Department of Linguistics at Göteborg University in 1998, in the project Finite State Grammar for Finding Grammatical Errors in Swedish Text, and was a collaboration with the NADA group at KTH in the project Integrated Language Tools for Writing and Document Handling.
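The broad-minus-narrow subtraction idea can be illustrated in miniature. The sketch below is a hypothetical Python illustration, not the actual XFST-based FiniteCheck implementation: the tag format (TAG.GENDER.NUMBER), the toy feature values and all function names are invented for the example. A "broad" pattern accepts any determiner-(adjective-)noun sequence regardless of features, a "narrow" check additionally requires agreement, and an error is anything the broad pattern accepts but the narrow check rejects:

```python
import re

# Broad grammar: any DET (ADJ)* N sequence, features unconstrained.
BROAD_NP = re.compile(r"^DET\.\w+\.\w+ (ADJ\.\w+\.\w+ )*N\.\w+\.\w+$")

def narrow_np(tagged: str) -> bool:
    """Narrow grammar: accept only NPs whose tokens agree in gender and number."""
    if not BROAD_NP.match(tagged):
        return False
    features = [token.split(".")[1:] for token in tagged.split()]
    return all(f == features[0] for f in features)

def flag_np_error(tagged: str) -> bool:
    """Broad minus narrow: flag phrases that are NPs but violate agreement."""
    return bool(BROAD_NP.match(tagged)) and not narrow_np(tagged)

# "ett liten hus" (neuter article, common-gender adjective): agreement error
print(flag_np_error("DET.neu.sg ADJ.utr.sg N.neu.sg"))   # True
# "ett litet hus": fully agreeing, not flagged
print(flag_np_error("DET.neu.sg ADJ.neu.sg N.neu.sg"))   # False
```

In the finite state setting the same effect is obtained by compiling both grammars into automata and subtracting the narrow one from the broad one; the resulting automaton accepts exactly the ill-formed phrases, so both grammars remain positive descriptions and no explicit error rules need to be written.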
The present thesis describes both the initial development within this project and the continuation of it. The main contributions of this thesis concern the understanding of incorrect language use in primary school children's writing and the computational analysis of such incorrect text by means of correct language use, in particular:

- Collection of texts written by primary school children, both by hand and on a computer.

3 This project was sponsored by the HSFR/NUTEK Language Technology Programme and has its site at: http://www.nada.kth.se/iplab/langtools/
- Analysis of grammar errors, spelling errors and punctuation in the texts of primary school writers.
- Comparison of errors found in the present data with errors found in other studies on grammar errors.
- Comparison of the error types covered by the three Swedish grammar checkers.
- Performance analysis of the three Swedish grammar checkers on the present data.
- Implementation of a grammar error detector that derives/compiles error patterns rather than writing the error grammar by hand.
- Performance analysis of the detector on the collected data and a portion of other data.

1.3 Outline of the Thesis

The remaining chapters of the thesis fall into two parts.

Part I: The first part is devoted to a discussion of writing and an analysis of the collected data, and consists of three chapters. Chapter 2 provides a brief introduction to research on writing in general, writing acquisition and how computers influence writing, along with descriptions of previous findings on grammar errors, concluding with what grammar errors are to be expected in written Swedish. Chapter 3 gives an overview of the data collected and a discussion of error classification. Chapter 4 presents the error profile of the data. The chapter concludes with a discussion of the requirements for a grammar error detector for the particular subjects of this study.

Part II: The second part of the thesis concerns grammar checking and includes three chapters. Chapter 5 starts with a general overview of the requirements and functionalities of a grammar checker and what is required for the errors in the present data. The Swedish grammar checkers are described and their performance is checked on the present data. Chapter 6 presents the implementation of a grammar error detector that handles these errors, including a description of the finite state formalism. The techniques of finite state parsing are explained. Chapter 7 presents the performance of this tool. The thesis ends with a concluding summary (Chapter 8).
In addition, the thesis contains four appendices. Appendix A presents the grammatical feature categories
used in the examples of errors or when explaining the grammar of Swedish. Appendix B presents the error corpora, consisting of the grammar errors found in the present study (Appendix B.1), misspelled words (Appendix B.2) and segmentation errors (Appendix B.3). The tagset used is presented in Appendix C, and some listings from the implementation are given in Appendix D.
Part I

Writing
Chapter 2

Writing and Grammar

2.1 Introduction

Learning to write does not imply acquiring a completely new language (a new grammar), since at this stage (i.e. beginning school) a child often already knows the majority of the (general) grammar rules. Rather, learning to write is a process of learning the difference between written language and the already acquired spoken language. Consequently, the errors that one finds in the writing of primary school children are often due to their lack of knowledge of written language, and consist of attempts to reproduce spoken language norms as an alternative to the standard written norm, or of errors stemming from the parts of written language that have not yet been acquired. Further, even when the writer knows the standard norm, errors can occur either as the result of disturbances such as tiredness, stress, etc., or because the writer cannot manage to keep together complex content and meaning constructions (cf. Teleman, 1991a). Another source of errors is the aids we use for writing: computers also have an impact on our writing and may give rise to errors.

The main purpose of the present chapter is to see whether previous studies on writing can give some hint of what grammar errors are to be expected in the writing of Swedish children. It provides a survey of previous studies of grammar errors, as well as some background research on writing in general and some insights into what it means to learn to write and how computers influence our writing. First, a short review of research on writing is presented (Section 2.2), followed by a short explanation of what the acquisition of written language involves and how computers influence the way we write (Section 2.3). Previous findings on grammar errors in Swedish can be found in the following section, including studies of the writing of children and adolescents, adults and the disabled (Section 2.4).
2.2 Research on Writing in General

For a long period of time, written language was considered by many (beginning with e.g. de Saussure, 1922; Bloomfield, 1933) to be a transcription of spoken (oral) language and not as important as, or even inferior to, spoken language. A similar view is also reflected in the research on literacy, where studies on writing were very few in comparison with research on reading. A turning point at the end of the 1970s, described by many as the writing crisis (Scardamalia and Bereiter, 1986), brought an expansion of research on the teaching of native-language writing. During this period, more naturalistic methods for writing were propagated, i.e. learning to write by writing (Moffett, 1968), the writing situation in English schools was examined (e.g. Britton, 1982; Emig, 1982), and the focus of study shifted from judgments of products and more text-oriented research to the strategies involved in the process of writing (see Flower and Hayes, 1981). In Sweden, writing skills were studied by focusing on the written product, often related to the social background of the child. Research was devoted to spelling (e.g. Haage, 1954; Wallin, 1962, 1967; Dahlquist and Henrysson, 1963; Ahlström, 1964, 1966; Lindell, 1964) and to the writing of compositions in connection with standardized tests (e.g. Björnsson, 1957, 1977; Ljung, 1959). There are also studies concerning writing development in primary and upper secondary schools (e.g. Grundin, 1975; Björnsson, 1977; Hultman and Westman, 1977; Lindell et al., 1978; Larsson, 1984). During the latter half of the 1980s, research in Sweden took a new direction towards studies of writing strategies concerning writing as a process (e.g. Björk and Björk, 1983; Strömquist, 1987, 1989), the development of writing abilities with a focus on writing activities between children and parents (e.g. Liberg, 1990), and text analysis (e.g. Garme, 1988; Wikborg and Björk, 1989; Josephson et al., 1990).
This turning point was also reflected in education, with the introduction of process-oriented writing. Some research treated writing as a cognitive text-creating process, using video recordings of persons engaged in writing (e.g. Matsuhasi, 1982) or clinical experiments (e.g. Bereiter and Scardamalia, 1985). The use of computers in writing prompted studies on their influence on writing (e.g. Severinson Eklundh and Sjöholm, 1989; Severinson Eklundh, 1993; Wikborg, 1990), resulting in the development of computer programs that register and record writing activities (e.g. Flower and Hayes, 1981; Severinson Eklundh, 1990; Kollberg, 1996; Strömqvist, 1996).
2.3 Written Language and Computers

2.3.1 Learning to Write

Writing, like speaking, is primarily aimed at expressing meaning. The most evident difference between written and spoken language lies in the physical channel. Written language is a single-channelled monologue, using only the visual channel (the eye), with the addressee not present at the same time. It is a more relaxed, rather slow process, affording longer time for consideration and the possibility to edit and correct the end product. Speech as a dialogue is simultaneous and involves participants present at the same time, where all the senses can be used to receive information. It is a fast process with little time for consideration and difficulty in correcting the end product. The rules and conventions of written language are more restrictive than the rules of spoken language in the sense that there are constructions in spoken language regarded as incorrect in written language. Writing is, in general, standardized with little (dialectal) variation, in contrast to spoken language, which is dialectal and varied. Further, the acquisition of written and spoken language occurs under different conditions and in different ways. Writing is taught in school by teachers with specific training, whereas speaking is learned privately (in the family, from peers, etc.), without any planning of the process. When learning to speak, we learn the language; when learning to write, we already know the language (in its spoken form) (cf. Linell, 1982; Teleman, 1991b; Liberg, 1990). 1 Learning a written language means not only acquiring its more or less explicit norms and rules, but also learning to handle the overall writing system, including the more technical aspects, such as how to shape the letters, where the boundaries between words lie and how a sentence is formed, as well as acquiring the grammatical, discursive and strategic competence to convey a thought or message to the reader.
In other words, writing entails being able to handle the means of writing, i.e. letters and grammar rules, to arrange them to form words and sentences, and to use them in a variety of contexts and for different purposes. During this development, children may compose texts of different genres, but not necessarily apply the conventions of the writing system correctly. Children are quite creative and often use conventions in their own ways, for instance using periods between words to separate them instead of blank spaces (cf. Mattingly, 1972; Chall, 1979; Lundberg, 1989; Liberg, 1990; Pontecorvo, 1997; Håkansson, 1998).

1 For further, more extensive definitions of the differences between written and spoken language, see e.g. Chafe (1985); Halliday (1985); Biber (1988).
The above discussion leads to a view of learning to write as the acquisition of a complex system of communication with several components. Following Hultman (1989, p. 73), we can identify three aspects of writing:

1. the motor aspect: the movement of the hand when forming the letters or typing on the keyboard

2. the grammar aspect: the rules for spelling and punctuation, morphology and syntax at clause, sentence and text level

3. the pragmatic aspect: the use of writing for a purpose, to argue, tell, describe, discuss, inform, refer, etc. The text has to be readable, reflecting the meaning of words and the effect they have.

This thesis focuses on the grammar aspect, in particular on the syntactic relationships between words. Some aspects of spelling and punctuation are also covered. The text level is not analyzed here.

2.3.2 The Influence of Computers on Writing

The view on writing has changed: it is no longer interpreted as a linear activity consisting of independent and temporally sequenced phases, but rather considered a dynamic, problem-solving activity. According to Hayes and Flower (1980), writing as a cognitive process is influenced by the task environment (the external social conditions) and the writer's long-term memory, and includes the cognitive processes of planning (generating and organizing ideas, setting goals, and deciding what to include and what to concentrate on), translation (the actual production) and revision (evaluation of what has been written, proof-reading, writing out and publishing). This process-based approach, with the phases also referred to as prewriting, writing and rewriting, has been adopted in writing instruction in school (e.g. Graves, 1983; Calkins, 1986; Strömquist, 1993) and is also considered well suited to computer-assisted composition (Wresch, 1984; Montague, 1990). Writing on a computer makes text easy to structure, rearrange and rewrite.
Many studies report writers' decreased resistance to writing: they find it easier to start writing, and they can revise throughout the writing process, leave the text and come back to it later, and update and reuse old texts (e.g. Wresch, 1984; Severinson Eklundh, 1993). Studies of children's use of computers also show that children who use a word processor in school enjoy writing and editing activities more, considering writing on a computer to be much easier and more fun. They are more willing to revise and even completely
change their texts, and they write more in general (e.g. Daiute, 1985; Pontecorvo, 1997). The word processor affects the way we write in general. We usually plan less in the beginning when writing on a computer and revise more during writing. Thus, editing occurs during the whole process of writing and is not left solely to the final phase. In an investigation by Severinson Eklundh (1995) of twenty adult writers with academic backgrounds, more than 80% of all editing was performed during writing and not after. The main disadvantage reported is that it is hard to get an overall perspective of a text on the screen, which makes planning and revision more difficult and can in turn lead to texts of worse quality (e.g. Hansen and Haas, 1988; Severinson Eklundh, 1993). Rewriting and rearranging a text is easy to do in a word processor, for instance with copy and paste utilities, which may easily give rise to errors that are hard to discover afterwards, especially in a brief perusal. Words and phrases can be repeated, omitted or transposed. Sentences can be too long (Pontecorvo, 1997), and errors occur that are normally not found in native speakers' writing. The common claim is that writing in one's mother tongue normally results in error types that deviate from the public language norm, since most of the mother tongue's grammar is in place before we begin school (Teleman, 1979). There are studies that clearly show that the use of word processors leads to completely new error types, including some errors that were considered characteristic of second-language writers. For instance, morpho-syntactic (agreement) errors have been found to be quite common among native speakers in the studies of Bustamente and León (1996) and Domeij et al. (1996). These errors are connected to how we use the functions of a word processor and to the fact that revision is more local due to the limited view of the text on the screen (cf. Domeij et al., 1996; Domeij, 2003).
Concerning text quality, there are studies pointing out that the use of a word processor results in longer texts, both among children and adults. Some researchers claim that the quality of compositions improved when word processors were used (see e.g. Haas, 1989; Sofkova Hashemi, 1998). However, no reliable quality enhancement beyond increased text length is evident in any study. The effects of using a computer for revision are regarded by some as positive for both the mechanics and the content of writing, while others feel it promotes only surface-level revision, enhancing neither content nor meaning (see the surveys in Hawisher, 1986; Pontecorvo, 1997; Domeij, 2003).
2.4 Studies of Grammar Errors

2.4.1 Introduction

There are not many studies of grammar errors in written Swedish. Studies of adult writing are few, while research on children's writing development mostly concerns the early ages of three to six years and the development of spelling and the use of the period and other punctuation marks and conventions (e.g. Allard and Sundblad, 1991). The recent expansion in the development of grammar checking tools, however, contributes to this field. Below, studies are presented of grammar errors found in the writing of primary and upper secondary school children and adults, of the error types covered by current proofreading tools, and of the grammar errors found in the texts of adult writers used for the evaluation of these tools. Some of these studies are described in further detail and compared with the analysis of the children's texts gathered for the present thesis in Chapter 4 (Section 4.4).

2.4.2 Primary and Secondary Level Writers

During the 1980s, several projects investigated the language of Swedish school children as a contribution to the discussion of language development and language instruction (see e.g. the surveys in Östlund-Stjärnegårdh, 2002; Nyström, 2000). The writing of children in primary and upper secondary school was analyzed mostly with a focus on lexical measures of productivity and language use, in terms of analysis of vocabulary, parts-of-speech distribution, length of words, word variation and also content, relation to gender, social background and the grades assigned to the texts (e.g. Hersvall et al., 1974; Hultman and Westman, 1977; Lindell et al., 1978; Pettersson, 1980; Larsson, 1984). Then, when the traditional product-oriented view on writing switched to the new process-oriented paradigm, studies on writing concerned the text as a whole and as a communicative act (e.g. Chrystal and Ekvall, 1996, 1999; Liberg, 1999) and became more devoted to analysis of genre and referential issues (e.g.
Öberg, 1997; Nyström, 2000), relation to the grades assigned (e.g. Östlund-Stjärnegårdh, 2002) and modality (speech or writing) (e.g. Strömqvist et al., 2002). Quantitative analysis in this field still concerns lexical measures of variation, length, coherence, word order and sentence structure; very few studies note errors other than spelling or punctuation (e.g. Olevard, 1997; Hallencreutz, 2002). A study by Teleman (1979) shows examples (without quantitative measures) of both lexical and syntactic errors observed in the writing of children from, among others, the seventh year of primary school. He reports on errors in function words,
inflection with dialectal endings in nouns, dropping of the tense endings on verbs, and the use of nominative pronoun forms in place of accusative forms, as is often the case in spoken Swedish. Errors in definiteness agreement, missing constituents, reference problems, word order and tense shift are also exemplified, as well as observations of erroneous use of, or missing, prepositions in idiomatic expressions. Another study, by Hultman and Westman (1977), concerns the analysis of national tests written by third-year students in upper secondary school. The aim of the project Skrivsyntax ('Written Syntax') was to study writing practice in school from a linguistic point of view. The material included 151 compositions (88,757 words in total) on the subject Familjen och äktenskapet än en gång ('Family and marriage once more'). Vocabulary, distribution of word categories, syntax and spelling were studied and compared with adult texts, across the marks assigned to the texts, and between boys and girls. The study also included error analysis of punctuation, orthography, grammar, lexicon, semantics, stylistics and the functionality of the text. Among grammar errors, gender agreement errors were reported to be common, and relatively many errors in pronoun case after prepositions occurred. Errors in agreement between subject and predicative complement are also reported as rather frequent, as are word order errors, mostly in the placement of adverbials. Other examples include verb form errors, subject-related errors, reference, preposition use in idiomatic expressions and clauses with odd structure.

2.4.3 Adult Writers

There are few studies of adult writing in Swedish. Those that exist are mostly devoted to the writing process as a whole or to social aspects of it, with very little attention paid to the mechanics of writing.
However, the recent development of Swedish computational grammar checking tools, which requires an understanding of what error types such tools should treat, has contributed to this field. Deciding which types of errors occur, and should thus be handled by such an authoring aid, may be based on intuitive presuppositions about which rules could be violated, in addition to empirical analysis of text. More empirical evidence of grammar violations also comes from the evaluation of such tools, where a system is tested against a text corpus with hand-coded analysis of errors. There are three available grammar checkers for Swedish: Granska (Knutsson, 2001), Grammatifix (Birn, 2000) and Scarrie (Sågvall Hein et al., 1999). 2 Scarrie is explicitly devoted to professional writers of newspaper articles. The other two systems are not explicitly aimed at any special user groups, although their performance tests were carried out mainly on newspaper texts.

2 These tools are described in detail in Chapter 5.
Below, a survey is presented of studies of professional and non-professional writers and of adult disabled writers, of the grammar errors covered by the three Swedish grammar checkers, and of the grammar errors that occurred in the texts on which the performance of these systems was evaluated.

Professional and Non-professional Writers

Studies focusing on adult non-professional writing concern the analysis of crime reports (Leijonhielm, 1989), post-school writing development (Hammarbäck, 1989), and a socio-linguistic study of writing attitudes, i.e. what is written and who writes what at a local government office, regardless of writing conventions (Gunnarsson, 1992). Some typical features of non-proof-read adult prose at a government authority are reported in Göransson (1998), the only investigation that addresses (to some extent) grammatical structure. Göransson (1998) describes her immediate impressions when proof-reading texts written by her colleagues at a government authority, showing some typical features of this unedited adult prose. She examined reports, instructional texts, newspaper articles, formal letters, etc. The analysis distinguishes between high and low level errors. High level concerns the comprehensibility of the text, coherence and style, relevance for the context, the ability to see one's own text with the eyes of others, choice of words, etc. Low level errors cover grammar and spelling errors. Among the grammar errors she reports only reference problems, choice of preposition and agreement errors. Among studies of professional writers, the language consultant Gabriella Sandström (1996) analyzed editing at the Swedish newspaper Svenska Dagbladet, based on 29 articles written by 15 reporters. The original script, the edited version and the final version of the articles were analyzed. The analysis involved spelling, errors at the lexical and syntactic level, formation errors, punctuation and graphical errors.
The results showed that the journalists made most errors in punctuation, graphics and lexicon, and that most of these disappeared during the editing process. Among the lexical errors, Sandström mentions errors in idiomatic expressions and in the choice of prepositions. Syntax errors also seem to be quite common, but the article does not analyze the different kinds of syntax errors.
Adults with Writing Disabilities

Studies on the writing strategies of disabled groups were conducted within the project Reading and Writing Strategies of Disabled Groups, 3 including an analysis of grammar for dyslexic and deaf writers (Wengelin, 2002). The analysis of the writing of deaf adults included no frequency data and is less important for the present study, since it tends to reflect strategies found in second language acquisition. Adult dyslexics mostly show problems with the formation of sentences and frequent omission of constituents. Especially frequent were missing or erroneous conjunctions. Other errors concern agreement in the noun phrase or the form of noun phrases, verb form, tense shift within sentences and incorrect choice of prepositions. The marking of sentence boundaries and punctuation is the main problem of these writers.

Error Types in Proof Reading Tools

The error types covered by a grammar checker should, in general, include the central constructions of the language and, in particular, those which give rise to errors. These constructions should allow precise descriptions so that false alarms can be avoided. The selection of which error types to include also depends on the available technology and the possibility of detecting and correcting the error types (cf. Arppe, 2000; Birn, 2000). In the development of Grammatifix, the pre-analysis of existing error types in Swedish was based on linguistic intuition, personal observation and reference literature on Swedish grammar and writing conventions (Arppe, 2000). In the case of Granska, the pre-analysis involved analysis of empirical data such as newspaper texts and student compositions (Domeij et al., 1996; Domeij, 2003). In the Scarrie project, where journalists are the end-users, the pre-analysis stage consisted of gathering corrections made by professional proof-readers at the newspapers involved.
These corrections were stored in a database (the Swedish Error Corpora Database, ECD), which contains nearly 9,000 error entries, covering spelling, grammar, punctuation, graphic and style, meaning and reference errors. Arppe (2000) provides an overview of the types of errors covered by the Swedish tools and reports, in short, that all the tools treat errors in noun phrase agreement and verb forms in verb chains. Scarrie and Granska also treat errors in compounds, whereas Grammatifix has the widest coverage of punctuation and number formatting errors. He points out that the error classification in these tools is similar, but not exactly the same. The depth and breadth of the included error categories differ in the subsets of phrases, the level of syntactic complexity or the position of detection in the sentence. The tools may, for instance, detect errors in syntactically simple fragments, but fail with syntactically more complex structures. These factors are further explained and exemplified in Chapter 5, where I also compare the error types covered by the individual tools. Among the grammar errors in Scarrie's ECD, errors in noun phrase agreement, predicative complement agreement, definiteness in single nouns, verb subcategorization and choice of preposition are the most frequent error types.

3 More information about this project may be found at: http://www.ling.gu.se/~wengelin/projects/r&r.

Evaluation Texts of Proof Reading Tools

Other empirical evidence of grammar errors can be found in the evaluations of the three grammar checkers (Birn, 2000; Knutsson, 2001; Sågvall Hein et al., 1999). The performance of all the tools was tested on newspaper text, written by professional writers. Only the evaluation corpus of Granska also included texts written by non-professionals, represented by student compositions. In general, the corpora analyzed are dominated by errors in verb form, agreement in noun phrases, prepositions and missing constituents.

2.5 Conclusion

The main purpose of the present chapter was to investigate whether previous research reveals which grammar errors to expect in the writing of primary school children. Apparently, grammar in general has a very low priority in research on writing in Swedish. Grammar errors in children's writing have been analyzed at the upper level of primary school and in upper secondary school, and exist only as reports with some examples, without any particular reference to frequency. Some analyses have been performed on the writing of professional adult writers and in research on the writing of dyslexic and deaf adults, with quantitative data for the dyslexic group.
The only area that directly approaches grammar errors is the development of proofreading tools aimed particularly at grammar. These studies report on grammar errors in the writing of adults. Previous research thus presents no general characterization of grammar errors in children's writing. There are, however, a few indications that children, as beginning writers, make errors different from those of adult writers. Teleman's observations indicate the use of spoken forms that were not reported in the other studies. Some examples of errors in the Skrivsyntax project are evidently more related to the fact that the children have not yet mastered writing conventions (e.g. errors in the accusative
case of plural pronouns) rather than to slips of the pen (e.g. due to lack of attention). In general, all the studies report errors in agreement (both in the noun phrase and the predicative complement), verb form and the choice of prepositions in idiomatic expressions. Are these the central constructions in Swedish that give rise to grammar errors? This may be true for adult writers, but it is unclear for beginning writers. The analysis of grammar errors in the children's data collected for the present study is presented in Chapter 4, together with a comparison with the findings of the previous studies of grammar errors presented above.
Chapter 3

Data Collection and Analysis

3.1 Introduction

In this chapter we report on the data gathered for this study and the types of analysis performed on them. First, the data collection is presented and the different sub-corpora are described (Section 3.2). Then follows a discussion of the kinds of errors analyzed and how they are classified (Section 3.3). The types of analysis in the present study are presented in the subsequent section (Section 3.4), and a description of the error coding and the tools used for that purpose ends this chapter (Section 3.5).

3.2 Data Collection

3.2.1 Introduction

The main goal of this thesis is to automatically detect grammar errors in texts written by children. In order to explore what errors actually occur, texts on different topics written by different subjects were collected to build an underlying corpus for analysis, hereafter referred to as the Child Data corpus. The material was collected on three separate occasions and has served as a basis for other (previous) studies. The first collection consists of both hand written and computer written compositions on set topics by 18 children between 9 and 11 years old - the Hand versus Computer Collection. The second collection involves the same subjects; this time the children participated in an experiment and told a story about a series of pictures, both orally and in writing on a computer - the Frog Story Collection. The third collection comes from a project on
development of literacy and includes eighty computer written compositions by 10 and 13 year old children on set topics in two genres - the Spencer Collection. 1

Table 3.1 gives an overview of the whole Child Data corpus, including the three collections mentioned above, divided into five sub-corpora by the writing topics the subjects were given: Deserted Village, Climbing Fireman, Frog Story, Spencer Narrative, Spencer Expository. Further information concerns the age of the subjects involved, the number of compositions, the number of words, whether the children wrote by hand or on computer, and what writing aid was used.

Table 3.1: Child Data Overview

HAND VS. COMPUTER COLLECTION:
                    AGE      COMP  WORDS   TOPIC                                  WRITING AID
Deserted Village    9-11     18    7 586   'They arrived in a deserted village'   paper and pen
Climbing Fireman    9-11     18    4 505   Shown: a picture of a fireman          Claris Works 3.0
                                           climbing on a ladder

FROG STORY COLLECTION:
Frog Story          9-11     18    4 907   Story-retelling: 'Frog, where          ScriptLog
                                           are you?'

SPENCER COLLECTION:
Spencer Narrative   10 & 13  40    5 487   Narrative: Tell about a predicament    ScriptLog
                                           you had rescued somebody from, or
                                           you had been rescued from
Spencer Expository  10 & 13  40    7 327   Expository: Discuss the problems       ScriptLog
                                           seen in the video

TOTAL                        134   29 812

Altogether, 58 children between 9 and 13 years old wrote 134 papers, comprising a corpus of 29,812 words. Most of the papers were written on the computer. Only the first sub-corpus (Deserted Village) consists of 18 hand written compositions. The editor Claris Works 3.0 was used for 18 computer written texts. ScriptLog, a tool for experimental research on the on-line process of writing, was used for the remaining (98) computer written compositions. ScriptLog looks just like an ordinary word processor to the user, but in addition to producing the written text, it also logs all events on the keyboard, the screen position of these events and their temporal distribution. 2

1 Many thanks to Victoria Johansson and Sven Strömqvist, Department of Linguistics, Lund University, for sharing this collection of data.

This section proceeds with detailed descriptions of the three collections that form the corpus, with information about when and for what purpose the material was collected, the subjects involved, the tasks they were given and the experiments they took part in.

3.2.2 The Sub-Corpora

The Hand vs. Computer Collection

The first collection originates from a study on the computer's influence on children's writing, gathered in the autumn of 1996. The writing performance in hand written and computer written compositions by the same subjects was compared (see Sofkova, 1997). Results from this study showed both great individual variation among the subjects and similarities between the two modes, e.g. in the distribution of spelling and segmentation errors, as well as improved performance in the essays written on the computer, especially in the use of punctuation and capitals and in the number of spelling errors. The subjects were a group of eighteen children, twelve girls and six boys, between the ages of 9 and 11, all pupils at the intermediate level of a primary school. This school was picked because the children had some experience with writing on computers: computers had already been introduced in their instruction and pupils were free to choose to write on a computer or by hand. If they chose to write on a computer, they wrote directly on the computer, using the Claris Works 3.0 word processor. Other requirements were that the subjects should be monolingual and not have any reading or writing disabilities. The children wrote two essays - one by hand and one on the computer. At the beginning of this study, the children were already busy writing a composition, which now is part of the hand written material.
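As a sanity check on the corpus figures reported in Table 3.1, the sub-corpus counts can be summed with a short script. This sketch is purely illustrative and not part of the thesis toolchain; the per-collection numbers are copied from the table.

```python
# Compositions and word counts per sub-corpus, as reported in Table 3.1.
sub_corpora = {
    # name: (compositions, words)
    "Deserted Village":   (18, 7586),
    "Climbing Fireman":   (18, 4505),
    "Frog Story":         (18, 4907),
    "Spencer Narrative":  (40, 5487),
    "Spencer Expository": (40, 7327),
}

total_comp = sum(c for c, _ in sub_corpora.values())
total_words = sum(w for _, w in sub_corpora.values())

print(total_comp, total_words)  # 134 29812
```

The sums reproduce the totals stated in the table: 134 compositions and 29,812 words.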
They were given a heading for the hand written task: De kom till en övergiven by ('They arrived in a deserted village'). For the computer written task, pupils were shown a picture of a fireman climbing on a ladder. They were also told not to use the spelling checker when writing, in order to make the two tasks as comparable as possible.

2 A first prototype was developed in the project Reading and Writing in a Linguistic and a Didactic Perspective (Strömqvist and Hellstrand, 1994). An early version of ScriptLog developed for Macintosh computers was used for collecting the data in this thesis (Strömqvist and Malmsten, 1998). There is now also a Windows version (Strömqvist and Karlsson, 2002).
The Frog Story Collection

The second collection is a story-telling experiment and involves the same subjects as the Hand vs. Computer Collection. In April 1997, we invited the children to the Department of Linguistics at Göteborg University to take part in the experiment. They served as a control group in the research project Reading and Writing Strategies of Disabled Groups, which aims at developing a unified research environment for contrastive studies of reading and writing processes in language users with different types of functional disabilities. 3 The experiment included a production task, and the data were elicited both in written and spoken form (video-taped). A wordless picture story booklet, Frog, where are you? by Mercer Mayer (1969), was used: a cartoon-like series of 24 pictures about a boy, his dog and a frog that disappears. Each subject was asked to tell the story, picture by picture. At the beginning of the experiment the children were invited to look through the book to get an idea of the content. Then, the instruction was literally Kan du berätta vad som händer på de här bilderna? ('Can you tell what is happening in these pictures?'). Half of the children started by writing and then told the story, and half of them did the opposite. For the written task, the on-line process editor ScriptLog was used, storing all the writing activities.

The Spencer Collection

The Spencer Project on Developing Literacy across Genres, Modalities and Languages 4 lasted from July 1997 to June 2000. The aim was to investigate the development of literacy in both speech and writing. Four age groups (grade school students, junior high school students, high school students and university students) and seven languages (Dutch, English, French, Hebrew, Icelandic, Spanish and Swedish) were studied. Schools were picked from areas where one could expect few immigrants in the classes, and where the children had some experience with computers.
The subjects came from middle class, monolingual families and had no reading or writing disabilities. Another criterion was that at least one of the subject's parents had education beyond high school.
3 The project's directors are Sven Strömqvist and Elisabeth Ahlsén from the Department of Linguistics, Göteborg University. More information about this project may be found at: http://www.ling.gu.se/~wengelin/projects/r&r.
4 The project was funded by the Spencer Foundation Major Grant for the Study of Developing Literacy to Ruth Berman, Tel Aviv University, who coordinated the project. Each language/country involved has had its own contact person; for Swedish it was Sven Strömqvist from the Department of Linguistics at Lund University.
All subjects had to create two spoken and two written texts in two genres, expository and narrative. Each subject saw a short video (3 minutes long) containing scenes from a school day. After the video, the procedure varied depending on the order of genre and modality. 5 The topic for the narratives was to tell about an event in which the subject had rescued somebody, or had been rescued by somebody, from a predicament. They were asked to tell how it started, how it went on and how it ended. The topic for the expository text was to discuss the problems they had seen in the video and possibly suggest some solutions. They were explicitly asked not to describe the video. Written material for two age groups from the Swedish part of the study is included in the present Child Data: the grade school students (10 year olds) and the junior high school students (13 year olds). In total, 20 subjects from each age group were recruited. The texts the subjects wrote were logged in the on-line process editor ScriptLog.

3.3 Error Categories

3.3.1 Introduction

The texts under analysis contain a wide variety of violations of written language norms, on all levels: lexical, syntactic, semantic and discourse. The main focus of this thesis is to analyze and detect grammar errors, but first we need to establish what a grammar error is and what distinguishes a grammar error from, for instance, a spelling error. Punctuation is another category of interest, important for deciding how a grammar error detector should handle the syntactic segmentation of a text. The following section discusses the categorization of the errors found in the data and explains which errors are classified as spelling errors, as well as where the boundary lies between spelling and grammar errors. The error examples provided are glossed literally and translated into English. Grammatical features are placed within brackets following the word in the English gloss (e.g.
klockan 'watch [def]') (the different feature categories are listed in Appendix A). Occurrences of spelling violations are followed by the correct form within parentheses, preceded by a double right-arrow, both in the Swedish example and the English gloss (e.g. var (⇒ vad) 'was (⇒ what)').
5 There were four different orders in the experiment: Order A: Narrative spoken, Narrative written, Expository spoken, Expository written. Order B: Narrative written, Narrative spoken, Expository written, Expository spoken. Order C: Expository spoken, Expository written, Narrative spoken, Narrative written. Order D: Expository written, Expository spoken, Narrative written, Narrative spoken.
3.3.2 Spelling Errors

Spelling errors are violations of the orthographic norms of a language, such as insertion (e.g. errour instead of error), omission (e.g. eror), substitution (e.g. errer) or transposition (e.g. erorr) of one or more letters within the boundaries of a word, or omission of space between words (i.e. when words are written together) or insertion of space within a word (i.e. splitting a word into parts). Spelling errors may occur due to the subject's lack of linguistic knowledge of a particular rule (competence errors) or as typographical mistakes, when the subject knows the spelling but makes a motor coordination slip (performance errors). The difference between a competence and a performance error is not always easy to see in a given text. For example, the (nonsense) string gube deviates from the intended correct word gubbe 'old man' by the missing doubling of b and thus violates the consonant gemination rule for this particular word. The text in which the error occurs shows that this subject is (to some degree) familiar with this rule, applying consonant gemination to other words, which indicates that the error is likely to be a typo (i.e. a performance error) and that it occurred by mistake. On the other hand, the subject may not be aware that the rule applies to this particular word. 6 It is then more a question of insufficient knowledge and thus a competence error. Spelling errors often give rise to non-existent words (non-word errors), as in the example above, but they can also lead to an already lexicalized string (a real word error). 7 For example, in the sentence in (3.1), the string damen also violates the consonant doubling rule and deviates from the intended correct word dammen 'dam [def]' by omission of m. However, in this case the resultant string coincides with an existent word, damen 'lady [def]'.
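The four orthographic operations above can be illustrated with a small sketch that classifies a misspelled string against its intended form. This is not part of FiniteCheck; it is a hypothetical, simplified illustration that assumes the error involves exactly one edit operation.

```python
def edit_type(written, intended):
    """Classify a single-edit spelling error as insertion, omission,
    substitution or transposition (real errors may combine edits)."""
    if written == intended:
        return None
    # insertion: the written form has one extra letter
    if len(written) == len(intended) + 1:
        for i in range(len(written)):
            if written[:i] + written[i + 1:] == intended:
                return "insertion"
    # omission: the written form lacks one letter
    if len(written) + 1 == len(intended):
        for i in range(len(intended)):
            if intended[:i] + intended[i + 1:] == written:
                return "omission"
    if len(written) == len(intended):
        diffs = [i for i in range(len(written)) if written[i] != intended[i]]
        # substitution: exactly one letter differs
        if len(diffs) == 1:
            return "substitution"
        # transposition: two adjacent letters swapped
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and written[diffs[0]] == intended[diffs[1]]
                and written[diffs[1]] == intended[diffs[0]]):
            return "transposition"
    return "multiple"

# The four examples from the text, plus gube/gubbe and damen/dammen:
for w in ("errour", "eror", "errer", "erorr"):
    print(w, edit_type(w, "error"))
print(edit_type("gube", "gubbe"), edit_type("damen", "dammen"))
```

Note that the sketch says nothing about whether the result is a non-word or a real word; that requires a lexicon lookup, which is exactly where the distinction discussed in this section comes in.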
8 The error still concerns a single word, but differs from non-word errors in that the realization now influences not only the erroneously spelled string but also the surrounding context. The newly formed word completely changes the meaning of the sentence and gives rise to a sentence with a very peculiar meaning, in which a particular lady is not deep.

(3.1) Men damen (⇒ dammen) är inte så djup.
but lady [def] (⇒ dam [def]) is not that deep
'But the dam is not that deep.'

Homophones, words that sound alike but are spelled differently, are another example of a spelling error realized as a real word. The classical examples are the
6 The word gubbe 'old man' was used only once in the text.
7 Usually around 40% of all misspellings result in lexicalized strings (e.g. Kukich, 1992). The notion of non-word vs. real word spelling errors is terminology used in research on spelling (cf. Kukich, 1992; Ingels, 1996).
8 Consonant doubling is used to distinguish short and long vowels in Swedish.
words hjärna 'brain' and gärna 'with pleasure', which are often substituted in written production and, as carriers of different meanings, completely change the semantics of the whole sentence. Another category of words that may result in non-words or real words in writing are the alternative morphological forms in different dialects. For instance, a spoken dialectal variant of the standard final plural suffix -or on nouns, as in flicker 'girls' (the standard form is flick-or 'girls'), is normally not accepted in written form and is thus realized as a non-word in the written language. Other spoken forms, such as jag 'I', normally reduced to ja in speech, coincide with other existent words and form real word errors in writing. In this case ja is homonymous with the interjection (or affirmative) ja 'yes'. In neither case is it clear whether the spoken form is used intentionally as some kind of stylistic marker or spelled in this way due to competence or performance insufficiency, meaning either that the subject had not acquired the written norm or that a typographical error occurred. Spelling errors, then, are violations of characters (or spaces) in single, isolated words that form (mostly) non-words or real words, the latter causing ungrammaticalities in text.

3.3.3 Grammar Errors

Grammar errors violate (mostly) the syntactic rules of a language, such as feature agreement or the order or choice of constituents in a phrase or sentence, and thus concern a wider context than a single word. 9 Like spelling errors, a grammar error may occur due to the subject's insufficient knowledge of such language rules. The difference, however, is that when learning to write as a native speaker (as the subjects in this study do), only the written language norms that deviate from the already acquired (spoken) grammatical knowledge have to be learned.
As mentioned earlier, research reveals that native speakers make not only errors reflecting the norms of the group they belong to, as one might expect, but also other grammar errors that have been ascribed to the influence of computers on writing. That is, even a native speaker can make grammar errors when writing on a computer, due to rewriting or rearranging text. Again, the real cause of an error is not always clear from the text. For instance, in the noun phrase denna uppsatsen 'this [def] essay [def]' a violation of definiteness agreement occurs, since the demonstrative determiner denna 'this' normally requires the following noun to be in the indefinite form. In this case, the form denna uppsats 'this [def] essay [indef]' is the correct one (see Section 4.3.1). However, in certain regions of Sweden this construction is grammatical in speech. This
9 Choice of words may also lead to semantic or pragmatic anomaly.
means that this error appears to be a competence error: the subject is not familiar with the written norm and applies the acquired spoken norm. On the other hand, it could also be a typographical mistake, as would be the case if the subject first used a determiner like den 'the/that [def]', which requires the following noun to be in the definite form, and then changed the determiner to the demonstrative one but forgot to change the definite form of the subsequent noun to indefinite.

In earlier research, grammar errors have been divided along two lines. Some researchers characterize the errors by applying at this level the same operations as for orthographic rules: omissions, insertions, substitutions and transpositions of words. Feature mismatch is then treated as a special case of substitution (e.g. Vosse, 1994; Ingels, 1996). For instance, in the incorrect noun phrase denna uppsatsen 'this [def] essay [def]' the required indefinite noun is substituted by a definite noun. Word choice errors, such as incorrect verb particles or prepositions, are other examples of grammatical substitution. Word order errors occur as transpositions of words, i.e. all the correct words are present but their order is incorrect. Missing constituents in sentences concern omission of words, whereas redundant words concern insertion. Others separate feature mismatch from other error types and distinguish between structural errors, which include violations of the syntactic structure of a clause, and non-structural errors, which concern feature mismatch (e.g. Bustamante and León, 1996; Sågvall Hein, 1998a).

3.3.4 Spelling or Grammar Error?

As mentioned at the beginning of this section, writing errors occur at all levels, including lexicon, syntax, semantics and discourse. The nature of an error is sometimes obvious, but in many cases it is unclear how to classify it.
The final versions of the texts give very little hint about what was going on in the writer's mind at the time of text production. 10 Some kind of classification of writing errors is necessary, however, for detecting and diagnosing them. Consider for instance the sentence in (3.2), where a (non-finite) supine verb form försökt 'tried [sup]' is used as the main verb of the second sentence. In isolation the word is an existent word in Swedish, but syntactically a verb in the supine form is ungrammatical as the predicate of a main clause (see Section 4.3.5). This non-finite verb form has to be preceded by a (finite) temporal auxiliary verb (har försökt 'have [pres] tried [sup]' or hade försökt 'had [pret] tried [sup]'), or the form has to be exchanged for a finite verb form, such as the present (försöker 'try [pres]')
10 Probably some information could be gained from the log-files in the ScriptLog versions, but since not all data in the corpus are stored in that format, such an analysis has not been included in this thesis.
or the preterite (försökte 'tried [pret]'). With regard to the tense used in the preceding context, the last alternative, the preterite form, would be the best choice.

(3.2) Han tittade på hunden. Hunden försökt att klättra ner.
he looked [pret] at the-dog the-dog tried [sup] to climb down
'He looked at the dog. The dog tried to climb down.'

The problem of classification lies in the fact that although a single letter distinguishes the word from the intended preterite form, which could be considered an orthographic violation, the error is realized not as a new word; rather, another form of the intended word is formed. This error could occur as a result of editing, if the writer first used a past perfect tense (hade försökt 'had tried') and later changed the tense to the preterite (försökte 'tried') by removing the temporal auxiliary verb, but forgot to also change the supine form (försökt 'tried [sup]') to the correct preterite form. On the other hand, the preterite tense could have been intended by the subject from the start. Then it is rather a question of a (real word) spelling error: the subject intended from the beginning to write a preterite form, but intentionally or unintentionally omitted the final vowel -e, which happens to be a distinctive suffix for this verb.

In the next example (3.3), a gender agreement error occurs between the neuter determiner det 'the' and the common gender noun ända 'end', as a result of replacing enda 'only' with ända 'end'. The erroneous word is an existent word and differs from the intended word only in the single letter at the beginning (an orthographic violation). This is clearly a question of a spelling error, since the word does not form any other form of the intended word and is realized as a completely new word with a distinct meaning.

(3.3) Det ända (⇒ enda) jag vet om ...
the [neu] end [com] (⇒ only) I know about
'The only thing I know about ...'

In the grammar checking literature, the categorization of writing errors is primarily divided into word-level errors and errors requiring context larger than a word (cf. Sågvall Hein, 1998a; Arppe, 2000). Real word spelling errors were treated in Scarrie's Error Corpora Database as errors requiring wider context for recognition and were categorized in accordance with the means used for their detection. In other words, errors either belong to the category of grammar errors when they violate syntactic rules, or are otherwise assigned to a style, meaning and reference category (Wedbjer Rambell, 1999a, p.5). In this thesis, where grammar errors (syntactic violations) are the main focus, real word spelling errors will be classified as a separate category. This distinction is important for examination of the
real nature of such errors, particularly when presenting a diagnosis to the user. Such considerations are especially important when the user is a beginning writer. Obvious cases of spelling errors such as the one in (3.3) are treated as such, whereas the treatment of errors lying on the borderline between a spelling and a grammar error, as in (3.2), depends on:
- what type of new formation occurred (another form of the same lemma, or a new lemma)
- what type of violation occurred (a change in letter, morpheme or word)
- what level is influenced (lexical, syntactic or semantic)
These principles are primarily aimed at the unclear cases, but seem to be applicable to other real word violations as well. In fact, the majority of real word spelling errors form new words and violate semantics rather than syntax; just a few of them accidentally cause syntactic errors (see further Section 5.3.2). It is the ones that form other forms of the same lemma that are tricky. They are treated here as grammar errors, but for diagnosis it is important to bear in mind that they could also be spelling errors. Figure 3.1 shows a scheme for error categorization. All violations of the written norm will be categorized starting with whether the error is realized as a non-word or a real word. Non-words are always classified as spelling errors. Real word errors are then further considered with regard to whether they form other forms of the same lemma or whether new lemmas are created. In the case of the same lemma (as in (3.2)), errors are classified as grammar errors. When new lemmas are formed, syntactic or semantic errors occur. Here a distinction is made according to whether just a single letter is affected, categorizing the error as a spelling error, or a whole word was substituted, categorizing it as a grammar error. For errors realized as real words, the following principles for error categorization then apply: 11
(3.4)
(i).
All real word errors that violate a syntactic rule and result in another form of the same lemma are classified as grammar errors.
(ii). All real word errors resulting in new lemmas by a change of the whole word are classified as grammar errors.
(iii). All real word errors resulting in new lemmas by a change in (one or more) letter(s) are classified as spelling errors.
11 Homophones are excepted from principle (ii). They certainly form a new lemma by a change of the whole word, but they are related to how the word is pronounced and are thus considered spelling errors.
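The categorization principles in (3.4), together with the homophone exception of footnote 11, amount to a small decision procedure. The following sketch is only illustrative (the function name and boolean inputs are our own; the linguistic judgements themselves, such as lemma identity, must be supplied by the analyst):

```python
def categorize(nonword, same_lemma=False, whole_word=False, homophone=False):
    """Classify a writing error following the principles in (3.4).

    The caller supplies the judgements: whether the written string is a
    non-word, whether it is another form of the intended lemma, whether
    the whole word (rather than one or more letters) was exchanged, and
    whether the pair is a homophone."""
    if nonword:
        return "spelling error"   # non-words are always spelling errors
    if same_lemma:
        return "grammar error"    # principle (i)
    if homophone:
        return "spelling error"   # footnote 11: exception to principle (ii)
    if whole_word:
        return "grammar error"    # principle (ii): new lemma, whole word
    return "spelling error"       # principle (iii): new lemma, letter change

# (3.2) försökt for försökte: another form of the same lemma -> grammar error
print(categorize(False, same_lemma=True))
# (3.3) ända for enda: new lemma by a change of one letter -> spelling error
print(categorize(False, same_lemma=False, whole_word=False))
```

The ordering of the tests mirrors the scheme in Figure 3.1: the non-word/real-word split comes first, then lemma identity, and only for new lemmas does the letter-versus-word distinction matter.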
Figure 3.1: Principles for Error Categorization

For the above example (3.2), this means the following. Considering the word in isolation, försökt 'tried [sup]' is an existent word in Swedish. Considering its deviation from the intended preterite form, no new lemma is created; rather, another form of the same lemma is formed, one that happens to lack the final suffix, realized as a single vowel. Considering the context it appears in, a syntactic violation occurs, since the sentence has no finite verb. So, according to principle (i) for error categorization in (3.4), this error is classified as a grammar error: no new lemma was created, the required preterite form was simply replaced by a supine form of the same verb. In the case of (3.3), the error also involves a real word, but here a new lemma was created by substitution of a letter. According to principle (iii) in (3.4), the error is then considered a spelling error, since neither another form of the same lemma nor a substitution of the whole word occurred.

3.3.5 Punctuation

Research on sentence development and the use of punctuation reveals that children mark out entities that are content driven rather than syntactically driven (e.g. Kress, 1994; Ledin, 1998). They form larger textual units, for instance, by joining together sentences that are topically closely connected, according to Kress (1994). In speech, such sequences would be joined by intonation according to topic. An example
of such adjoined clauses is The boy I am writing about is called Sam he lived in the fields of Biggs Flat (Kress, 1994, p.84). Others use a strategy of linking sentences together with connectives like and, then, so instead of punctuation marks, which can result in sentences of great length, here called long sentences (see Section 4.6 for examples). As we will see later on in Chapter 5, the Swedish grammar checking systems are based on texts written by adults and are able to rely on punctuation conventions for marking syntactic sentences in their detection rules or for scanning a text sentence by sentence. In line with the discussion above, this is not possible with the present data, which consist of texts written by children: occurrences of adjoined and long sentences are quite probable. In other words, analysis of the use of punctuation is important to confirm that the subjects of the present study also mark larger units. Omissions of sentence boundaries are thus expected and have to be taken into consideration.

3.4 Types of Analysis

The analysis of the Child Data starts with a general overview of the corpus, including frequency counts of words, word types, and all spelling errors. The main focus is a descriptive, error-oriented study of all errors above the lexical level, i.e. all those that influence context. Only spelling errors resulting in non-words are excluded from this analysis. The error types included are:
1. Real word spelling errors - misspelled words and segmentation errors resulting in existent words.
2. Grammar errors - syntactic and semantic violations in phrases and sentences.
3. Punctuation - sentence delimitation and the use of major delimiters and commas.
The main focus lies on the second group, grammar errors. Real word spelling errors and grammar errors are listed as separate error corpora: see Appendix B.1 for grammar errors, Appendix B.2 for misspelled words and Appendix B.3 for segmentation errors.
Here all errors are represented with the surrounding context of the clause they appear in (in some cases larger parts are included, e.g. in the case of referential errors). Errors are indexed and categorized by error type and annotated with information about the possible correction (intended word) and the error's origin in the core data.
The analysis also includes descriptions of the overall distribution of error types and error density. Comparisons are made between errors found in the different sub-corpora and by age. Here it is important to bear in mind that the texts were gathered under different circumstances and that not all subjects participated in all the experiments (see Section 3.2). Error frequencies are related to different units depending on the error type. Spelling errors, which concern isolated words, are related to the total number of words. In the case of grammar errors, the best strategy would be to relate some error types to phrases, some to clauses or sentences and some to even bigger entities, in order to get an appropriate comparison measure. However, counting such entities is problematic, especially in texts that contain many structural errors. The best solution is to compare frequencies of the attested error types, which will reflect the error profile of the texts. The main focus in the analysis of the use of punctuation in this thesis is not the syntactic complexity of sentences, but rather whether the children mark larger units than syntactic sentences and whether they use sentence markers in wrong ways. The most intuitive procedure would be to compare the orthographic sentences, i.e. the actual markings made by the writers, with the ('correct') syntactic sentences. The main problem with such an analysis is that, in the case of long sentences, it will often be hard to decide where to draw the line, since they are for the most part syntactically correct. Several solutions for delimitation into syntactic sentences may be available. 12 The subjects' own orthographic sentences will instead be analyzed by length in terms of the number of words and by the occurrence of adjoined clauses. Further, erroneous use of punctuation marks will be accounted for.
Analysis of the use of connectives as sentence delimiters would certainly be appropriate here, but we leave this for future research. All error examples represent errors found in the Child Data corpus. The example format includes the error index in the corresponding error corpus (G for grammar errors (Appendix B.1), M for misspelled words (Appendix B.2), and S for segmentation errors (Appendix B.3)) and, as already mentioned, the text is glossed and translated into English, with grammatical features (see Appendix A) attached to words and spelling violations followed by the correct form within parentheses preceded by a double right-arrow (⇒).
12 The macro-syntagm (Loman and Jörgensen, 1971; Hultman and Westman, 1977) and the T-unit (Hunt, 1970) are other units of measure, more related to the investigation of sentence development and grammatical complexity in education-oriented research in Sweden and America, respectively.
3.5 Error Coding and Tools

3.5.1 Corpus Formats

In order to be able to carry out automatic analyses on the collected material, the hand written texts were converted to a machine-readable format and compiled with the computer written texts to form one corpus. All the texts were transcribed in accordance with the CHAT-format (see (3.5) below) and coded for spelling, segmentation and punctuation errors and some grammar errors. Other grammar errors were identified and extracted either manually or by scripts specially written for the purpose. Non-word spelling errors were corrected in the original texts in order to be able to test the texts in the error detector under development, which includes no spelling checker. The spelling checker in Word 2001 was used for this purpose. The original Child Data corpus now exists in three versions: the original texts in machine-readable format, a coded version in CHAT-format and a spell-checked version. The last version, free from non-words, was used as the basis for the manual grammar error analysis and as input to the error detector in progress and the other grammar checking tools that were tested (see Chapter 5).

3.5.2 CHAT-format and CLAN-software

The CHAT (Codes for the Human Analysis of Transcripts) transcription and coding format and the CLAN (Computerized Language Analysis) program are tools developed within the CHILDES (Child Language Data Exchange System) project (first conceived in 1981), a computerized exchange system for language data (MacWhinney, 2000). This software is designed primarily for transcription and analysis of spoken data. It is, however, practical to apply the format to written material in order to take advantage of the quantitative analyses that the tool provides. For instance, the current material includes a lot of spelling errors that can easily be coded, and the corresponding correct word may be added following the transcription format.
This means that not only the number of words, but also the correct number of word types may be included in the analysis. Analyses concerning, for instance, the spelling of words may also easily be extracted. In practice, conversion of a written text to CHAT-format involves the addition of an information field and the division of the text into units corresponding to speaker's lines, since the transcript format is adjusted to spoken material. The information field at the beginning of a transcript usually includes information on the subject(s) involved, the time and location of the experiment, the type of material coded, the type of analysis done, the name of the transcriber, etc. Speaker's lines in spoken
material correspond naturally to utterances. For the written material, we chose to use a finite clause as the corresponding unit, which means that every line must include a finite verb, except for imperatives and titles, which form their own speaker's lines. The whole transcript includes just one participant, as it is a monologue. The information field in the transcribed text example in (3.5) below, taken from the corpus, includes, in accordance with the CHAT-format, all the lines at the beginning of the text starting with the @-sign. Lines starting with *SBJ: correspond to the separate clauses in the text. Comments can be inserted in brackets in speaker's lines, e.g. [+ tit] indicating that the line corresponds to the title of the text. The intended word is given in brackets following a colon, e.g. & [: och] 'and'. Relations to more than one word are indicated by the < and > signs, enclosing the whole segment, e.g. <över jivna> [: övergivna] 'abandoned'. Other signs and codes can be inserted in the transcription. 13

(3.5)
@Begin
@Participants: SBJ Subject
@Filename: caan09mhw.cha
@Age of SBJ: 9
@Birth of SBJ: 1987
@Sex of SBJ: Male
@Language: Swedish
@Text Type: Hand written
@Date: 10-NOV-1996
@Location: Gbg
@Version: spelling, punctuation, grammar
@Transcriber: Sylvana Sofkova Hashemi
*SBJ: de kom till en överjiven [: övergiven] by [+ tit]
*SBJ: vi kom över molnen jag & [: och] per på en flygande gris
*SBJ: som hete [: hette] urban.
*SBJ: då såg jag nåt [: något]
*SBJ: som jag aldrig har set [: sett].
*SBJ: en ö som var helt <i jen täkt> [: igentäckt] av palmer
*SBJ: & [: och] i miten [: mitten] var en by av äkta guld.
*SBJ: när vi kom ner.
*SBJ: så gick vi & [: och] titade [: tittade].
*SBJ: vi såg ormar spindlar krokodiler ödler [: ödlor] & [: och] anat [: annat].
*SBJ: när vi hade gåt [: gått] en lång bit så sa [: sade] per.
*SBJ: vi <vi lar> [: vilar] oss.
*SBJ: per luta [: lutade] sig mot en.
*SBJ: palmen vek sig
*SBJ: & [: och] så åkte vi ner i ett hål.
*SBJ: sen [: sedan] svimag [: svimmade jag].
*SBJ: när jag vakna [: vaknade].
*SBJ: satt jag per & [: och] urban mit [: mitt] i byn.
*SBJ: vi gick runt & [: och] titade [: tittade].
*SBJ: alla hus var <över jivna> [: övergivna].
13 Further information about this transcription format and coding, including manuals for download, may be found at: http://childes.psy.cmu.edu/.
*SBJ: då sa [: sade] per.
*SBJ: vi har hitat den <över jivna> [: övergivna] byn.
*SBJ: & [: och] när vi kom hem så vakna [: vaknade] jag
*SBJ: & [: och] alt [: allt] var en dröm.
*SBJ: slut
@End
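One practical benefit of the CHAT codes is that the correction annotations in a transcript like (3.5) can be processed automatically. The following sketch (not one of the scripts actually used for the corpus; names and regular expression are our own) extracts (written, intended) pairs from a speaker's line, handling both single tokens and <...> multi-word segments:

```python
import re

# Matches either a <multi word> segment or a single token, followed by
# a correction code [: intended form], as in the CHAT sample above.
CODE = re.compile(r'(?:<([^>]+)>|(\S+))\s*\[:\s*([^\]]+)\]')

def corrections(line):
    """Extract (written, intended) pairs from one *SBJ: line."""
    return [(m.group(1) or m.group(2), m.group(3))
            for m in CODE.finditer(line)]

line = "*SBJ: en ö som var helt <i jen täkt> [: igentäckt] av palmer"
print(corrections(line))  # [('i jen täkt', 'igentäckt')]
```

Applied over a whole transcript, such pairs make it straightforward to count coded spelling errors and to recover the corrected word forms for the word type counts mentioned in Section 3.5.2. The CLAN software itself provides richer, standard tooling for the same purpose.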
Chapter 4

Error Profile of the Data

4.1 Introduction

This chapter describes the empirical analysis of the collected data, starting with a general overview (Section 4.2), followed by sections describing the actual error analysis and the distribution of errors in the data. The error analysis starts with descriptions of grammar errors (Section 4.3), the main focus, and continues with analyses of real word spelling errors (Section 4.5) and punctuation (Section 4.6). The section on grammar errors concludes with a comparison of the error distribution in the analyzed data with grammar errors found in the other data already discussed in Chapter 2 (Section 4.4).

4.2 General Overview

The Child Data, 29,812 words in total, consist of 134 compositions written by 58 children. 1 Further information on the corpus is provided here, along with a discussion of the size of the sub-corpora, the average length of individual texts and word variation. Also described here is the overall impression of the texts in terms of writing errors, as well as the nature of the spelling errors (both non-words and real words).

Text Size and Word Variation

The different sub-corpora are divided by topic of the written tasks (see Table 4.1). The first three were written by 18 subjects. The last two, belonging to the Spencer project, involved 40 children each. In terms of the total number of words, Deserted
1 The composition of the Child Data is described in Chapter 3 (see Section 3.2).
Village and the Spencer Expository texts are the largest sub-corpora (in bold face) and the Climbing Fireman corpus is the smallest one. In total, the average text size is 222.5 words. This corresponds to a rather short text, approximately 20 lines of typed text or nearly half a page. Only the texts of Deserted Village (in bold face) are on average twice as long as the other texts. The Spencer-project texts are the shortest ones.

Table 4.1: General Overview of Sub-Corpora

CORPUS               TEXTS    WORDS   WORDS/TEXT   WORD TYPES
Deserted Village        18    7 586        421.4        1 610
Climbing Fireman        18    4 505        250.3        1 040
Frog Story              18    4 907        272.6          763
Spencer Narrative       40    5 487        137.2        1 085
Spencer Expository      40    7 327        183.2        1 021
TOTAL                  134   29 812        222.5        3 373

The reason for this difference in text length probably lies in the degree of free writing and in the use of, and familiarity with, the writing aid. The texts of the Deserted Village corpus were produced in the subjects' own everyday environment, in the classroom; time was not limited, and they wrote by hand. The texts of Climbing Fireman were also written in a familiar environment with relatively unrestricted time demands, but these were written on a computer. Although computers had been introduced and used previously by the subjects, they may still have felt unfamiliar with their use. The Frog Story texts are slightly longer than the Climbing Fireman texts, but the higher number of words was probably elicited by the experiment, in which the subjects were required to write text for 24 pictures. The Spencer-project texts are also of a more experimental nature, produced in an environment not familiar to the subjects, with more restrictions on time, and written by means of a previously unknown text editor (ScriptLog). Next, let us consider word variation. 3,373 word types were found in the whole corpus.
The Frog Story texts have the smallest number of word types, not surprisingly, since the scope of word variation is largely determined by the pictures of the story the children were supposed to tell. Among the other sub-corpora, the Deserted Village corpus has the highest word variation, whereas the other three each contain around 1,000 word types. Table 4.2 shows the texts grouped by age. We find that the sub-corpus of the 9 year olds is almost the same size as all the texts written by the 10 year olds, although it consists of less than half as many compositions. The 9 year old children produced on average three times as many words per subject (854) as
Table 4.2: General Overview by Age

AGE        SUBJECTS   TEXTS    WORDS   WORDS/SUBJECT   WORD TYPES
9-years           8       24    6 832           854.0        1 270
10-years         24       52    6 837           284.9        1 356
11-years          6       18    8 012         1 335.3        1 629
13-years         20       40    8 131           406.6        1 279
TOTAL            58      134   29 812           514.0        3 373

the 10 year olds, who wrote the shortest texts in the whole corpus. The sub-corpora of the 11 and 13 year olds are of similar size and are more than a thousand words larger than those of the younger children. The 11 year olds wrote the longest texts in the whole corpus (1,335.3 words per subject), which is almost five times more than for the shortest texts of the 10 year olds. There is, in other words, much variation in the average length of text, and especially the 11 year olds distinguish themselves by their much longer texts.(2) Word variation measured in the number of word types is slightly higher for the 11 year olds. The other age groups each contain around 1,300 word types.

(2) For the time being, no standard deviation was computed.

Overall Impression and Spelling Errors

The first thing one observes when reading the texts by the children involved in this study is the high number of spelling errors and split compounds, the rare use of capitals at the beginning of sentences, and the unconventional use of punctuation delimiters to mark sentence boundaries. The children literally write as they speak. They use a great deal of direct speech and many spoken word forms. The different writing errors above the lexical level are presented and discussed in the subsequent sections. In this section, the sub-corpora and age groups are discussed and compared with respect to the total number of spelling errors (both non-words and real words). Most of the errors concern misspelled words, i.e. words with one or more spelling errors, represented by 2,422 words (8.1%) in total (see the last two columns in Table 4.3 below). Segmentation errors are four times less frequent, with 377 words (1.3%) written apart (splits) and 240 words (0.8%) written together (run-ons).

Among the different sub-corpora (Table 4.3), the most misspelled words, splits and run-ons are found in the hand-written texts of the Deserted Village corpus.
The Deserted Village corpus and the Frog Story corpus have the highest rates of spelling errors, 15.6% and 14.3% respectively, of the total number of words in the respective sub-corpora (last row in the table). The texts of the Spencer project, which were much shorter, include around 5% spelling errors, two to three times lower than in the other three sub-corpora.

Considering the age differences (Table 4.4), as expected, most of the errors occurred in the texts of the youngest, the 9 year olds, with 1,475 errors (21.6%) in total. Only the number of splits is higher in the texts of the 11 year olds. The oldest, the 13 year olds, made five times fewer errors. The group of 11 year olds has a very high number of spelling errors, 813 (10.1%), in comparison to the texts by the 10 year olds, which include 459 (6.7%) spelling errors.

Table 4.3: General Overview of Spelling Errors in Sub-Corpora

                   Deserted   Climbing   Frog    Spencer     Spencer
ERROR TYPE          Village    Fireman   Story   Narrative   Expository    TOTAL     %
Misspelled Words        924        422     568         209          299    2 422   8.1
Splits                  146         69      93          37           32      377   1.3
Run-ons                 113         26      39          32           30      240   0.8
TOTAL                 1 183        517     700         278          361    3 039
%                      15.6       11.5    14.3         5.1          4.9            10.2

Table 4.4: General Overview of Spelling Errors by Age

ERROR TYPE         9-years   10-years   11-years   13-years    TOTAL     %
Misspelled Words     1 242        356        602        222    2 422   8.1
Splits                 129         69        148         31      377   1.3
Run-ons                104         34         63         39      240   0.8
TOTAL                1 475        459        813        292    3 039
%                     21.6        6.7       10.1        3.6            10.2

According to Pettersson (1989, p. 164), children in the second year of primary school (9 years old) make on average 13 spelling errors per 100 words, which is much less than our 9 year olds, who make almost 22. By the eighth year (14 years old), the number decreases to four errors, which seems to hold true for our 13 year olds. Last-year students at upper secondary school make on average 1 spelling error per 100 words.
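The percentages in Tables 4.3 and 4.4 are straightforward to recompute from the raw counts. A minimal sketch (counts transcribed from the tables above; the function name is the editor's own, not part of the thesis):

```python
# Raw counts transcribed from Tables 4.1, 4.3 and 4.4 above.
TOTAL_WORDS = 29_812

errors_by_type = {"misspelled": 2_422, "splits": 377, "run_ons": 240}
words_by_age = {"9": 6_832, "10": 6_837, "11": 8_012, "13": 8_131}
errors_by_age = {"9": 1_475, "10": 459, "11": 813, "13": 292}

def rate(count: int, words: int) -> float:
    """Errors per 100 words, rounded to one decimal as in the tables."""
    return round(100 * count / words, 1)

total_errors = sum(errors_by_type.values())             # 3 039 errors
print(rate(errors_by_type["misspelled"], TOTAL_WORDS))  # 8.1
print(rate(total_errors, TOTAL_WORDS))                  # 10.2
print({age: rate(n, words_by_age[age]) for age, n in errors_by_age.items()})
# {'9': 21.6, '10': 6.7, '11': 10.1, '13': 3.6}
```

The recomputed rates agree with the printed percentages, including the 21.6 errors per 100 words for the 9 year olds referred to in the comparison with Pettersson (1989).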
Summary

The texts in Child Data are on average no longer than half a page, with the exception of the hand-written Deserted Village texts, which are on average double that size. The length differences are greater across ages: the 10 year olds wrote the shortest texts on average, whereas the texts written by the 11 year olds are almost five times longer. Word variation is much lower in the Frog Story corpus than in the other corpora. In the whole corpus, 10% of all words are misspelled or wrongly segmented, and the highest concentrations of these errors are found in the texts of Deserted Village, Frog Story, and the 9 year olds. Splits are also quite common in the 11 year olds' texts.

4.3 Grammar Errors

Previous research and analyses of grammar (reported in Section 2.4) suggest that Swedish writers in general make errors in agreement (both in the noun phrase and in the predicative complement), in verb form, and in the choice of prepositions in idiomatic expressions. The writing of children at primary school also includes dialectal inflections on words, dropped endings, and substitution of nominative for accusative case in pronouns. This section presents the types of grammar errors in the present corpus of primary school writers and investigates whether the same types of errors occur, and whether, or how much, spoken language plays a role in their writing. Each error type is discussed and exemplified, introduced by a description of the structure of the relevant phrase types in Swedish, so that a reader who does not know Swedish will be able to understand why something is classified as an error. The number of errors is summarized in Section 4.3.10, along with a discussion of the relative frequency of the different error types in total and across sub-corpora and ages. All the errors are listed in Appendix B.1.
The grammar error types of this analysis are further compared to the errors found in some of the previous studies of grammar errors in the subsequent section (Section 4.4).

4.3.1 Agreement in Noun Phrases

Noun Phrase Structure and Agreement in Swedish

A noun phrase in Swedish consists of a head, normally a noun, a proper noun or a (nominal) pronoun. In addition, prenominal and/or postnominal determiners and modifiers may occur. The attributes come in a certain order and must agree with the head in number, gender, definiteness and case.
Swedish distinguishes between singular (unmarked) and plural (normally a suffix) in the number system, and number agreement is governed by the noun's grammatical number:

Table 4.5: Number Agreement in Swedish

SINGULAR                      PLURAL
min bok                       mina böcker
'my book'                     'my [pl] books [pl]'
ingen byxa                    inga byxor
'no trousers'                 'no [pl] trousers [pl]'

Gender is represented by two categories, common and neuter. Many animate nouns are further categorized according to sex, masculine or feminine (unmarked). Gender agreement is only found in the singular and is not visible in the plural.

Table 4.6: Gender Agreement in Swedish

         SINGULAR                     PLURAL
COMMON   en gammal bil                några gamla bilar
         'an old car'                 'some old cars'
NEUTER   ett gammal-t hus             några gamla hus
         'an old house'               'some old houses'

Definiteness marking is quite complicated and is one of the factors in Swedish grammar that cause problems. The indefinite form is unmarked, whereas the definite form is (mostly) doubly marked, both by prenominal attributes and by a noun suffix. For adjectives (and participles) there are two different forms, normally called strong and weak forms. The strong form is used in indefinite noun phrases and in predicative use. The weak form of adjectives is used in definite noun phrases. The weak form is the same in all genders and numbers, except optionally when the noun denotes a male person.(3) The plurals of the strong and weak forms coincide.

Table 4.7: Definiteness Agreement in Swedish

                     INDEFINITE              DEFINITE
SINGULAR  COMMON     en bok                  bok-en
                     'a book'                'book [def]'
                     en gammal bok           den gaml-a bok-en
                     'an old book'           'the old [wk] book [def]'
                     en gammal man           den gaml-e mann-en
                     'an old man'            'the old [masc] man [def]'
          NEUTER     ett gammalt hus         det gaml-a hus-et
                     'an old house'          'the old [wk] house [def]'
PLURAL               gaml-a hus              de gaml-a hus-en
                     'old [wk] houses'       'the old [wk] houses [def]'

(3) Notice that the masculine gender is only optional, which means that a noun phrase of the form den gaml-a mann-en 'the old [wk] man [def]' is correct as well.
Finally, case in the nominal system is represented by the (unmarked) nominative and the genitive, which uses the suffix -s (personal pronouns are also declined for accusative case; see further under pronouns, Section 4.3.4).

The basic constituent order in a noun phrase is determiner-adjective-noun, e.g. ett stort hus 'a big house', det stora huset 'the big house'. The co-occurrence patterns of definiteness marking can be divided into three different types (Cooper, 1986, p. 34):(4)

1. Definite noun phrase, which reflects the double definiteness marking and requires definite prenominal attributes and a definite noun:

   DET[+DEF] ADJ[+DEF] N[+DEF]
   den röd-a bil-en           'this/the red car'
   de två röd-a bilar-na      'this/the two red cars'
   den här röd-a bil-en       'this red car'
   de här röd-a bilar-na      'these red cars'

2. Indefinite noun phrase, which requires indefinite prenominal attributes and an indefinite noun:

   DET[-DEF] ADJ[-DEF] N[-DEF]
   en röd bil                 'a red car'
   någon röd bil              'some red car'
   inga röda bilar            'no red cars'

3. Mixed noun phrase, which requires definite prenominal attributes and an indefinite noun. This type applies to demonstrative pronouns, possessive attributes and some relative clauses.

   DET[+DEF] ADJ[+DEF] N[-DEF]
   Demonstrative pronouns:
   denna röd-a bil            'this red car'
   dessa röd-a bilar          'these red cars'
   Possessive attributes:
   firmans röd-a bil          'the firm's red car'
   deras röd-a bil            'their red car'
   Relative clause:
   den röd-a bil (som) han köpte igår
   'the red car (that) he bought yesterday'

(4) Cooper defines these types in terms of existent determiner types that require either definite or indefinite adjectives and nouns.
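The three licensed co-occurrence patterns lend themselves to a simple tabular check. The following is an illustrative sketch only, by the editor, not the FiniteCheck grammar described elsewhere in the thesis; the "+"/"-" feature encoding is an assumption made for the example:

```python
# Illustrative sketch of the three definiteness co-occurrence patterns
# for Swedish DET-ADJ-N phrases (not the actual FiniteCheck rules).
# Definiteness of each word is encoded as "+" or "-".

LICENSED = {
    ("+", "+", "+"),  # 1. definite NP:   den röd-a bil-en
    ("-", "-", "-"),  # 2. indefinite NP: en röd bil
    ("+", "+", "-"),  # 3. mixed NP:      denna röd-a bil, firmans röd-a bil
}

def np_definiteness_ok(det: str, adj: str, noun: str) -> bool:
    """True if the DET-ADJ-N definiteness combination is licensed."""
    return (det, adj, noun) in LICENSED

print(np_definiteness_ok("+", "+", "+"))  # den gamla bilen  -> True
print(np_definiteness_ok("+", "+", "-"))  # denna gamla bil  -> True (mixed)
print(np_definiteness_ok("+", "-", "-"))  # *den gammal bil  -> False
```

Any combination outside the three licensed triples, such as a definite determiner with an indefinite adjective, falls through as a potential agreement violation of the kind analyzed below.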
The optional prenominal attributive adjectives can be recursively stacked, as in (4.1a). Numerals as quantifying attributes occur in both definite (4.1b) and indefinite noun phrases (4.1c).

(4.1)
a. en ny röd bil
   'a new red car'
b. de två röda bilarna
   'the two red cars'
c. två röda bilar
   'two red cars'

A proper noun as the head of a noun phrase behaves (almost) like a noun in a definite noun phrase. Proper nouns are inherently definite and uncountable. Most commonly, the proper noun occurs on its own, without any modifiers, as in the first example in Table 4.8, but prenominal attributes may occur, as shown in the other examples (Teleman et al., 1999, Part 3:56):

Table 4.8: Noun Phrases with Proper Nouns as Head

DET        ADJ         N
                       Peter          'Peter'
           lilla       Karin          'little Karin'
den        snälla      Anna           'the good/kind Anna'
den där    tråkiga     Karl           'that boring Karl'
min        söta        Maria          'my sweet Maria'
en         ångerfull   Karl-Erik      'a regretful Karl-Erik'

Pronouns as heads of a noun phrase normally occur without modifiers, although pronouns with relative clauses are quite common (see further in Teleman et al., 1999):

Table 4.9: Noun Phrases with Pronouns as Head

ADJ       PRO
          jag       'I'
hela      jag       'all of me'
båda      ni        'both of you'
hela      den       'all of it'
själva    hon       'she herself'
A noun phrase need not have a noun (or pronoun) as head. In this case, an adjective normally occurs in that position. Noun phrases consisting of only a determiner also exist. The structure of the (in)definite noun phrase is the same as in a noun phrase with a noun as head. Table 4.10 gives an overview of noun phrases without (nominal) heads.

Table 4.10: Noun Phrases without (Nominal) Head

DEFINITE NOUN PHRASE
DET            ADJ
denne                              'this one'
den            andra               'the other one'
den där        nye väntande        'the new waiting one'
många          andra               'many other'
det            bästa               'the best'

INDEFINITE NOUN PHRASE
DET            ADJ
någon                              'someone'
en             annan               'another one'
allt           roligt              'all (that is) fun'

One further type of noun phrase will be relevant in this thesis, namely the partitive phrase, which consists of a quantifier, the preposition av 'of' and a definite noun phrase. The quantifier agrees in gender with the noun phrase (Teleman et al., 1999, Part 3:69):

Table 4.11: Agreement in Partitive Noun Phrases in Swedish

COMMON   en av cyklarna        'one [com] of the-bicycles [com]'
         ingen av filmerna     'none [com] of the-movies [com]'
NEUTER   ett av träden         'one [neu] of the-trees [neu]'
         inget av äpplena      'none [neu] of the-apples [neu]'

Agreement Errors in Definiteness

Definiteness agreement was violated in eight noun phrases, and errors occurred in all three noun phrase types. Definite noun phrases contained three errors, all located in the head. In all instances the head noun is in the indefinite form, lacking the definite suffix, as in (4.2a). In (4.2b) we see the correct form of the definite noun phrase, with both the definite determiner/article and the definite suffix on the noun.
(4.2) (G1.1.2)
a. En gång blev den hemska pyroman utkastad ur stan.
   one time was the [def] awful [wk] pyromaniac [indef] thrown-out from the-city
   'Once the awful pyromaniac was thrown out of the city.'
b. den hemska pyroman-en
   the [def] awful [wk] pyromaniac [def]

One of these three erroneous noun phrases is ambiguous in its context (see (4.3)), providing yet another correction possibility. The intended noun phrase could be definite as in (4.3b), or indefinite as in (4.3c).

(4.3) (G1.1.3)
a. Jag såg på ett TV program där en metod mot mobbing var att sätta mobbarn på den stol och andra människor runt personen och då fråga varför.
   I saw on a TV program where a method against bullying was to put the-bullyier on the [def] chair [indef] and other people around the person and then ask why
   'I saw on a TV program where a method against bullying was to put the bullyier on the chair, and other people around the person, and then ask why.'
b. den stolen
   the [def] chair [def]
c. en stol
   a [indef] chair [indef]

There were three errors in definite noun phrases with indefinite head (type 3), involving possessive and demonstrative attributes. In all cases, the head noun is in the definite form, with a (superfluous) definite suffix, as in (4.4a). The most obvious correction is to change the form of the noun to indefinite, as in (4.4b), but it could also be that the possessive determiner is superfluous, making the bare definite noun, as in (4.4c), more correct.
(4.4) (G1.1.4)
a. Pär tittar på sin klockan och det var tid för familjen att gå hem.
   Pär looks at his [gen] watch [def] and it was time for the-family to go home
   'Pär looks at his watch. It was time for the family to go home.'
b. sin klocka
   his [gen] watch [indef]
c. klockan
   watch [def]

A violation involving a demonstrative pronoun, presented in (4.5), occurred probably due to the subject's regional origin. Nouns modified by denna 'this' occur in definite form in some regional dialects.

(4.5) (G1.1.6)
a. Nu när jag kommer att skriva denna uppsats-en så kommer jag ha en rubrik om några problem och...
   now when I will to write this [def] essay [def] so will I have a title about some problems and
   'Now when I write this essay, I will have a heading about some problems and...'
b. denna uppsats
   this [def] essay [indef]

Two errors occurred in indefinite noun phrases and once more concerned the head noun being in definite form, as in (4.6). Two corrections are possible here as well: changing the form of the head noun as in (4.6b), or removing the determiner as in (4.6c).

(4.6) (G1.1.7)
a. Men senare ångrade dom sig, för det var en räkningen på deras lägenhet.
   but later regretted they themselves for it was a [indef] bill [def] on their apartment
   'But later they regretted it, because it was a bill for their apartment.'
b. en räkning
   a [indef] bill [indef]
c. räkningen
   bill [def]
Gender Agreement Errors

Agreement errors in gender occurred in definite, indefinite and partitive noun phrases, and show up as a mismatch between the gender of the article and the rest of the phrase, or as violations of the semantic gender of the adjective. One article disagreement occurred in an indefinite noun phrase, shown in (4.7a), and one in a partitive noun phrase (G1.2.2).

(4.7) (G1.2.1)
a. Pojken fick en grodbarn.
   the-boy got a [com] frog-child [neu]
   'The boy got a frog baby.'
b. ett grodbarn
   a [neu] frog-child [neu]

Two errors were related to semantic gender, where masculine gender was wrongly used in the adjectival attributes of definite noun phrases. In one case, the masculine gender is used together with a plural noun (see (4.8a)).

(4.8) (G1.2.4)
a. nasse blev arg han gick och la sig med dom andre syskonen.
   Nasse became angry he went and lay himself with the [pl] other [masc] siblings [pl]
   'Nasse got angry. He lay down with his brothers and sisters.'
b. dom andra syskonen
   the [pl] other [pl] siblings [pl]

The second instance of semantic gender mismatch is more a question of asymmetry between the adjectives involved (see (4.9a)). The first adjective in the noun phrase is declined for masculine gender (hemsk-e 'awful [masc]') and the second uses the unmarked form (ful-a 'ugly [def]'). Either both should be in the masculine form (as in (4.9b)) or both should have the unmarked form (as in (4.9c)).
(4.9) (G1.2.3)
a. det va den hemske fula troll karlen (→ trollkarlen) tokig som...
   it was the [def] awful [wk,masc] ugly [wk] troll man [def] (→ magician [def]) Tokig that
   'It was the awful ugly magician Tokig that...'
b. den hemske fule trollkarlen
   the [def] awful [wk,masc] ugly [wk,masc] magician [def]
c. den hemska fula trollkarlen
   the [def] awful [wk] ugly [wk] magician [def]

Number Agreement Errors

Three noun phrases violated number agreement. One concerned a definite attribute in a definite noun phrase (see (4.10a)). It seems that the required plural determiner de 'the [pl]' has been replaced by the singular definite determiner det 'the [sg]'. It could also be a question of an (un)intentional addition of the character -t, which would make it a spelling error rather than a grammar error. But since a syntactic violation occurred and no new lemma was formed, the error is classified as a grammar error and not as a real-word spelling error.

(4.10) (G1.3.1)
a. Den där scenen med det tre tjejerna tyckte jag att de var taskiga som går ifrån den tredje tjejen.
   the there scene with the [sg] three girls [pl] thought I that they were mean that go from the third girl
   'I thought that in the scene with the three girls, they were mean to leave the third girl.'
b. de tre tjejerna
   the [pl] three girls [pl]

The other two errors concern the head noun of a partitive attribute, as shown in (4.11a). In both instances, the noun is in the singular definite form instead of the required plural definite form. Both errors were made by the same subject. This realization points more clearly to a typographical error. The determiner and the partitive preposition were probably inserted into the text afterwards, since the singular definite form that this error brings about is not at all part of the correct non-elliptic noun phrase (see (4.11b)), but may function perfectly well as a noun phrase on its own.
(4.11) (G1.3.2)
a. Alla männen och pappa gick in i ett av huset.
   all the-men and daddy went in into one of house [sg,def]
   'All the men and daddy went into one of the houses.'
b. ett (hus) av husen
   one (house [indef]) of houses [pl,def]

4.3.2 Agreement in Predicative Complement

Introduction

A predicative complement is part of a verb phrase and specifies features of the subject or the object. Adjective phrases, participles and noun phrases are the typical realizations. The predicative complement differs from other parts of a verb phrase in that it agrees in gender and number (in the case of a noun phrase, only in number) with the corresponding subject or object it refers to, as shown in Table 4.12.

Table 4.12: Gender and Number Agreement in the Predicative Complement

SINGULAR  COMMON   boken är gammal       'the-book [com] is old [com]'
          NEUTER   huset är gammal-t     'the-house [neu] is old [neu]'
PLURAL             husen är gaml-a       'the-houses [pl] are old [pl]'

The predicative normally combines with copula verbs (vara 'be', bli 'be/become', förbli 'remain'), naming verbs (e.g. heta 'be called', kallas 'be called'), raising verbs (e.g. verka 'seem', förefalla 'seem', tyckas 'seem'), and other similar verb categories (Teleman et al., 1999, Part 3:340).

Gender Agreement Errors

Violations of gender agreement were rare: altogether, two errors of this type occurred. One concerned an adjective in complement position and the other a past participle form. The adjective error occurred with neuter gender, as shown below in (4.12a).
(4.12) (G2.1.1)
a. Då börja Urban lipa och sa: Mitt hus är blöt.
   then start Urban blubber and said my [neu] house [neu] is wet [com]
   'Then Urban started to blubber and said: My house is wet.'
b. Mitt hus är blött.
   my [neu] house [neu] is wet [neu]

Here the neuter gender subject is connected to the adjective blöt 'wet [com]' in common gender. The error could also be classified as a spelling error with omission of the final double consonant, but since the result is another form of the same adjective and a syntactic violation occurs, the error is classified as a grammar error.

Number Agreement Errors

In the case of number agreement, there was one error involving singular number and two errors involving plural number. As in (4.13a), the sentence structures that include number violations in the predicative complement are in general rather complex, and the distance between the head and the modifier is not restricted to a single verb. In this case, it seems to be a question of a lack of linguistic competence, since all three adjectives lack the plural ending.

(4.13) (G2.2.3)
a. Själv tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var (→ vad) tjejernas metoder är.
   self think I that the-boys' methods [pl] are more open [sg] and honest [sg] but also more mean [sg] than what the-girls' methods are
   'I myself think that the boys' methods are more open and honest, but also more mean, than the girls' methods are.'
b. killarnas metoder är mer öppna och ärliga men också mer elaka
   the-boys' methods [pl] are more open [pl] and honest [pl] but also more mean [pl]
4.3.3 Definiteness in Single Nouns

Introduction

The grammatical violations in this section concern single nouns as the only constituents of a noun phrase. Bare singular nouns are (normally) ungrammatical without an article. The noun must be in the definite form or preceded by an article, as in (4.14b) or (4.14d). The example sentences in (4.14a) and (4.14c) are (normally) ungrammatical in Swedish, although they may occur, for instance, as newspaper headlines.

(4.14)
a. Polis arresterade studenten.
   policeman arrested the-student
b. Polisen/En polis arresterade studenten.
   the-policeman/a policeman arrested the-student
c. Polisen arresterade student.
   the-policeman arrested student
d. Polisen arresterade studenten/en student.
   the-policeman arrested the-student/a student

There are, however, grammatical sentences which include bare singular nouns. The acceptability of such sentences depends, according to Cooper (1984), on the lexical choice. Thus, changing the noun or the verb may influence the grammaticality of a sentence:

(4.15)
a. Det är jobbigt att inte se bil.
   it is hard to not see car [indef]
b. Det är jobbigt att inte ha bil.
   it is hard to not have car [indef]

Bare definite nouns are often used as an anaphoric device, referring to an entity that has already been introduced or is well known in the speech situation. The noun is then in the definite form, as in (4.16) below.

(4.16)
a. Ta (den) nya bilen.
   take (the) new car [def]
b. (den) gamle kungen
   (the) old king [def]
c. (den) tredje gången
   (the) third time [def]
Errors in Definiteness in Single Nouns

There were six cases of definiteness errors in single nouns, all realized as indefinite nouns. One instance from the corpus is shown in (4.17). Here the topic is introduced by an indefinite noun phrase (en ö 'an [indef] island [indef]') in the first sentence, but in the following sentence, instead of the expected definite noun that would indicate a continuation of this topic, we find a single indefinite noun (ö 'island [indef]'), lacking the definite suffix.

(4.17) (G3.1.3)
a. Jag såg en ö. Vi gick till ö.
   I saw an island. we went to island [indef]
   'I saw an island. We went to island.'
b. ön
   island [def]

4.3.4 Pronoun Case

Features of Pronouns

Personal pronouns in Swedish are declined for nominative, genitive and accusative case (see Table 4.13 below). Third person singular inanimate pronouns have the same form in both subject and object position. In the plural, the nominative-accusative distinction de-dem is only used in writing. It is not used in speech, where both forms are pronounced dom in the standard language. This spoken form is (increasingly) used in some types of informal writing.(5)

Errors in Pronoun Case

All five errors in pronoun case concern the nominative case being used in object position. Two cases involved errors in the accusative case of the pronoun han 'he', probably due to regional influence,(6) e.g.:

(4.18) (G4.1.5)
a. bara för man inte vill vara med han
   just for one not want be with he [nom]
   'just because one doesn't want to be with him'
b. honom
   him [acc]

(5) Purists recommend, however, keeping the distinction de-dem, and that dom should be used only for rendering spoken language (Teleman et al., 1999, Part 2:270).
(6) In certain dialects, han 'he' is also the object form.
Table 4.13: Personal Pronouns in Swedish

                        NOMINATIVE     ACCUSATIVE     GENITIVE
SINGULAR
 1ST PERSON             jag 'I'        mig 'me'       min 'my'
 2ND PERSON             du 'you'       dig 'you'      din 'yours'
 3RD PERSON ANIMATE
   MALE                 han 'he'       honom 'him'    hans 'his'
   FEMALE               hon 'she'      henne 'her'    hennes 'hers'
 3RD PERSON INANIMATE
   COMMON               den 'it'       den 'it'       dess, dens 'its'
   NEUTER               det 'it'       det 'it'       dess 'its'
PLURAL
 1ST PERSON             vi 'we'        oss 'us'       vår, vårt 'ours [com], [neu]'
 2ND PERSON             ni 'you'       er 'you'       er, ert 'yours [com], [neu]'
 3RD PERSON
   WRITTEN              de 'they'      dem 'them'     deras 'theirs'
   SPOKEN               dom 'they'     dom 'them'     deras 'theirs'

The rest concerned plural pronouns, as in (4.19). As mentioned above, the distinction between the nominative form de 'they' and the accusative form dem 'them' occurs only in writing. In speech, dom is used in both cases. A scan of the writing profiles of all subjects showed that most of them use only the spoken form. For that reason, these errors were counted only where the subject used an incorrect written form, and not just the spoken form.

(4.19) (G4.1.1)
a. bilarna bromsade så att det blev svarta streck efter de.
   the-cars braked so that it became black lines after they [nom]
   'The cars braked so there were black lines after them.'
b. dem
   them [acc]
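The nominative-accusative pairs of Table 4.13 can be read off as a lookup table for suggesting corrections of the type seen in (4.18) and (4.19). A toy sketch by the editor, not FiniteCheck's actual mechanism:

```python
# Nominative -> accusative mapping for Swedish personal pronouns
# (from Table 4.13). A toy correction lookup; den/det and the spoken
# form "dom" make no subject/object distinction and are left unchanged.
NOM_TO_ACC = {
    "jag": "mig", "du": "dig", "han": "honom", "hon": "henne",
    "vi": "oss", "ni": "er", "de": "dem",
}

def object_form(pronoun: str) -> str:
    """Suggested object (accusative) form; identity if no distinction."""
    return NOM_TO_ACC.get(pronoun, pronoun)

print(object_form("han"))  # honom  (the correction in example (4.18))
print(object_form("de"))   # dem    (the correction in example (4.19))
print(object_form("dom"))  # dom    (spoken form, no written distinction)
```

Note that such a lookup only supplies the replacement form; deciding that a pronoun actually stands in object position requires the syntactic analysis discussed in the implementation chapters.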
4.3.5 Verb Form

Verb Core Structure

A verb phrase consists of a verbal head that can form a verb phrase on its own or be combined with modifiers and appropriate complements. In this description no attention is paid to the complements, just the actual core of the verb phrase. First, the types of verbs (finite and non-finite) are described, followed by a presentation of the simple vs. compound tense structures; finally, the infinitive phrase is described.

Verbs are divided into finite and non-finite. A sentence must contain at least one verb in finite form to be considered grammatically correct. In Swedish, there are three finite verb forms (present, preterite and imperative) and four non-finite verb forms (infinitive, supine, present participle and past participle).

Table 4.14: Finite and Non-finite Verb Forms

FINITE                            NON-FINITE
Imperative: jaga 'hunt'           Infinitive: att jaga 'to hunt'
Present: jagar 'hunt/hunts'       Future: ska jaga 'will hunt'
Preterite: jagade 'hunted'        Perfect: har jagat 'have hunted [sup]'
                                  Present participle: den jagande 'the hunting'
                                  Past participle: är jagad 'is hunted'

Among the non-finite verbs, the infinitive and the supine occur as main verbs in combination with a modifying (finite) auxiliary verb (see Future and Perfect, respectively, in Table 4.14 above). The infinitive form also occurs in infinitive phrases preceded by the infinitive marker att 'to'. The present and past participle forms have more adjectival characteristics and function as attributes in a noun phrase or in predicative position after a copula verb.

A core verb phrase may consist of one single finite verb, forming a simple tense construction, or of a sequence of two or more verbs, composed of one finite verb plus a number of non-finite verbs, forming a kind of compound tense (see Table 4.15 below).
Compound tense structures, i.e. sequences of two or more verbs, are usually referred to as verb chains or verb clusters and generally consist of some kind of auxiliary verb followed by the main (non-finite) verb. In Swedish, both temporal and modal auxiliary verbs occur in verb cluster constructions.
Table 4.15: Tense Structure

SIMPLE STRUCTURE
Present:                   Katten jagar möss.            'The cat chases mice.'
Preterite:                 Katten jagade möss.           'The cat chased mice.'

COMPOUND STRUCTURE
Future:                    Katten ska jaga möss.         'The cat will [pres] chase [inf] mice.'
Perfect:                   Katten har jagat möss.        'The cat has [pres] chased [sup] mice.'
Past perfect:              Katten hade jagat möss.       'The cat had [pret] chased [sup] mice.'
Future perfect:            Katten ska ha jagat möss.     'The cat shall [pres] have [inf] chased [sup] mice.'
Secondary future perfect:  Katten skulle ha jagat möss.  'The cat would [pret] have [inf] chased [sup] mice.'

Verb clusters with temporal auxiliary verbs in general follow two patterns: one expressing the past tense, with the main verb in the supine (here only the auxiliary ha 'have' is used), and one for the future tense, with the main verb in the infinitive. In subordinate clauses, the finite temporal forms har 'has/have [pres]' or hade 'had [pret]' are often omitted in the perfect and past perfect,(7) and the verb core then consists only of the supine verb form (examples from Ljung and Ohlander, 1993, p. 99):

(4.20)
a. Han säger att han redan (har) gjort det.
   he says that he already (has) done that
   'He says that he has done that already.'
b. Han sade att han ofta (hade) sett dem.
   he said that he often (had) seen them
   'He said that he had often seen them.'

Also, the temporal infinitive ha 'have' in the secondary future perfect can be omitted irrespective of sentence type. In these cases, a past tense modal auxiliary is followed directly by a supine form (Teleman et al., 1999, Part 3:272):

(7) The omission is most common in writing, up to 80% (Teleman et al., 1999, Part 4:12), but occurs more and more in speech as well (Teleman et al., 1999, Part 3:272).
(4.21)
a. Nu blev det inte så illa som det kunde (ha) blivit.
   now became it not so bad as it could (have) become [sup]
   'Now it did not turn out as bad as it could have been.'
b. ... fastän det borde (ha) skett för länge sedan.
   although it should (have) happened for long ago
   '... although it should have happened a long time ago.'

A verb in the infinitive form is treated as part of an infinitive phrase preceded by the infinitive marker att 'to', which is necessary in certain contexts and optional in others. Auxiliary verbs combine with bare infinitives (as shown and discussed above), thus lacking the infinitive marker, as in (4.22a). An exception is the temporal komma 'will', which requires the infinitive marker, as in (4.22b) (Teleman et al., 1999, Part 3:572):

(4.22)
a. Hon kan spela schack.
   she can play chess
   'She can play chess.'
b. Hon kommer att spela schack.
   she will to play chess
   'She will play chess.'

The bare infinitive is also used in nexus constructions, as in (4.23) (Teleman et al., 1999, Part 3:597):

(4.23)
Han ansåg tiden vara mogen.
he considered the-time be ripe
'He found the time to be ripe.'

Many main verbs take either a noun phrase or an infinitive phrase as complement (Teleman et al., 1999, Part 3:570, 596). With some main verbs, the infinitive marker is optional (Teleman et al., 1999, Part 3:597). The tendency to omit the infinitive marker is higher if the infinitive phrase directly follows the verb (Teleman et al., 1999, Part 3:598):

(4.24)
a. Vi slutade spela.
   we stopped play
   'We stopped playing.'
b. Vi slutade avsiktligt att spela.
   we stopped deliberately to play
   'We deliberately stopped playing.'
Infinitive phrases are also found in subject position (4.25):

(4.25) Att få segla jorden runt hade alltid lockat honom.
       to get sail earth around had always tempted him
       'He had always wanted to get to sail around the world.'

Finite Main Verb Errors

The use of non-finite verb forms as finite verbs, producing sentences that lack a finite main verb, is the most common error type in Child Data. Errors of this kind concern both present and past tense. Most of them (87) occurred in the past tense, as in (4.26a), and concern regular weak verbs ending in -a in the base form, lacking the appropriate past tense ending. Nine errors occurred in the present tense, as in (4.27a); these primarily concern regular weak verbs ending in -a, in addition to some strong verbs.

(4.26) (G5.2.45)
       a. På natten vakna [untensed] jag av att brandlarmet tjöt.
          in the-night wake I from that the-fire-alarm howled
       b. vaknade 'woke' [pret]
          'In the night I woke up from the fire-alarm going off.'

(4.27) (G5.1.2)
       a. När hon kommer ner undrar hon varför det lukta [untensed] så bränt och varför det låg en handduk över spisen.
          when she comes down wonders she why it smell so burnt and why it lay a towel over the-stove
       b. luktar 'smells' [pres]
          'When she comes down, she wonders why it smells so burnt and why a towel was lying over the stove.'

The most probable cause of this recurrent error is that in spoken Swedish regular weak verbs ending in -a may lack the past tense suffix and sometimes also the present tense suffix. For example, the past form vaknade 'woke' [pret] is pronounced either as [va:knade] or reduced to [va:kna], which then coincides with the infinitive and imperative form vakna 'to wake', as in the erroneous sentence (4.26a) above.
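Because the dominant error pattern is so regular (a weak verb's base form in -a standing where a tensed verb is required), it can be caricatured in a few lines. The sketch below is illustrative only: the three-verb lexicon, the token-level view, and the assumption that the clause is narrated in the past tense are stand-ins, not the thesis's actual method.

```python
# Sketch: flag a regular weak -a verb (identical to its infinitive and
# imperative) used where a finite past-tense verb is expected.
# The lexicon and the narrative_past context test are toy assumptions.

# Regular weak -a verbs: base form mapped to the past tense in -ade.
WEAK_VERBS = {"vakna": "vaknade", "lukta": "luktade", "leta": "letade"}

def suggest_past(word):
    """If word is an untensed weak verb, return its past-tense form."""
    return WEAK_VERBS.get(word)

def flag_untensed(tokens, narrative_past=True):
    """Flag base-form weak verbs in a clause told in the past tense.
    Returns a list of (position, erroneous form, suggested form)."""
    flags = []
    for i, tok in enumerate(tokens):
        past = suggest_past(tok)
        if narrative_past and past:
            flags.append((i, tok, past))
    return flags

# Example (4.26a): "På natten vakna jag ..." -> suggest "vaknade".
tokens = "på natten vakna jag av att brandlarmet tjöt".split()
print(flag_untensed(tokens))  # [(2, 'vakna', 'vaknade')]
```

A real checker cannot rely on a narrative-past flag, of course; deciding that a finite verb is expected at that position is exactly the structural problem the finite-state grammars in later chapters address.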
In addition to the above errors in the form of the finite main verb, two instances involved strong verbs, both realized in the (non-finite) infinitive form. One error occurred in the present tense and one, exemplified in (4.28), in the past.

(4.28) (G5.2.100)
       a. Nästa dag så var en ryggsäck borta och mera grejer försvinna [inf]
          next day so was a rucksack gone and more things disappear
       b. försvann 'disappeared' [pret]
          'The next day a rucksack had gone and more things disappeared.'

There were also two occurrences of a supine verb form used as the predicate of a main sentence. Recall that the supine may occur on its own as predicate in subordinate clauses (see above). These errors occurred in main clauses; both involved the same lemma and were committed by the same subject. One of these error instances has already been discussed in Section 3.3 (example (3.2) on p. 29). The other is exemplified and discussed below:

(4.29) (G5.2.88)
       a. det låg [pret] massor av saker runtomkring jag försökt [sup] att kom (→ komma) till fören
          it lay lots of things around I tried to came (→ come) to the-prow
       b. försökte 'tried' [pret]
          'There were a lot of things lying around. I tried to go to the prow.'

The sentence jag försökt att kom till fören 'I tried [sup] to go to the prow' in isolation suggests that an auxiliary verb is simply missing in front of the supine form, i.e. hade försökt 'had tried'. However, the past tense predicate of the preceding sentence suggests that, for consistency, the predicate of the subsequent sentence should also be in the past tense. It may be that the subject believes this word is spelled without the final vowel -e. The reason why this case is considered a grammar error is that the form used is another existing form of the intended lemma; thus, according to principle (i) in (3.4), it is a grammar error (see Section 3.3).
Finally, ten error instances concerned past participle forms in the finite verb position, as in (4.30), all lacking the final -e in the preterite suffix.
(4.30) (G5.2.92)
       a. dom letad [past part] överallt
          they searched everywhere
       b. letade 'searched' [pret]
          'They searched everywhere.'

These past participle forms could be due to the alphabetical pronunciation of the final letter (the letter d is pronounced [de] in Swedish). Following the classification principles in (3.4), these errors are considered grammar errors, since another existing form is used in place of the intended one.8

Verb Cluster Errors

Grammar errors in verb clusters affect the form of the (non-finite) main verb and involve omission of auxiliary verbs. Main verb errors may result in a sequence of finite verbs and thus violate the rule of one finite verb per clause. One error instance involved the secondary future perfect, which requires a supine form, as in (4.31a), where the main verb is realized as the past tense form of the intended verb. The cause of the error cannot be determined, but an interesting observation is that the erroneous verb form is followed by a preposition beginning with the vowel i, which is part of the omitted supine ending, indicating a possible assimilation of these sounds.

(4.31) (G6.1.7)
       a. Jag skrattade och undrade hur tromben skulle [pret] ha [inf] kom [pret] igenom det lilla hålet.
          I laughed and wondered how the-tornado would have came through the small hole
       b. skulle would [pret] ha have [inf] kommit come [sup]
          'I laughed and wondered how the tornado would have come through the small hole.'

Other errors in the main verb of a verb cluster concerned structures requiring an infinitive verb form, as in (4.32a), where the modal auxiliary ska 'will' is followed by a verb in the present tense, blir 'becomes'.

8 Some of the participle forms, like pratad 'told' [past part], are not lexicalized in Swedish, but are quite possible to form in accordance with Swedish grammar rules. They are included in the present analysis since they were not detected as non-words by the spelling checker in Word.
(4.32) (G6.1.1)
       a. Men kom ihåg att det inte ska [pres] blir [pres] någon riktig brand
          but remember that it not will becomes some real fire
       b. ska will [pres] bli become [inf]
          'But remember that there will not be a real fire.'

There were two cases of an omitted auxiliary verb. Both concerned the temporal verb ha 'to have', and the predicate of the main sentence then consisted of only a supine verb form:

(4.33) (G6.2.2)
       a. men pappa frågat [sup] mig om jag ville följa med.
          but daddy asked me if I wanted follow with
       b. hade had [pret] frågat asked [sup] OR frågade asked [pret]
          'but daddy has asked me if I wanted to come along.'

Infinitive Phrase Errors

In this category we find errors in the verb form following the infinitive marker, and omission of the infinitive marker after the auxiliary verb komma 'will'. Constructions with main verbs that combine with an infinitive phrase as complement have not been included. As we will see later (Section 5.5), there are constructions where the language itself is uncertain as to whether the infinitive marker should be used. In general, the infinitive marker tends to disappear more and more, so it is not entirely clear which of these cases should be classified as errors. Four verb form errors occurred where, instead of the required (non-finite) infinitive, we find the (finite) imperative, as in (4.34), or the present form, as in (4.35), after the infinitive marker.
(4.34) (G7.1.2)
       a. glöm inte att stäng [imp] dörren
          forget not to close the-door
       b. att stänga to close [inf]
          'don't forget to close the door'

(4.35) (G7.1.1)
       a. Men hunden klarar att inte slår [pres] sig
          but the-dog manages to not hits himself
       b. att inte slå to not hit [inf]
          'But the dog manages not to hit himself.'

Three cases concerned an omitted infinitive marker in the context of the temporal auxiliary verb komma 'will', which (as explained above) differs from the other auxiliary verbs in requiring the infinitive marker:

(4.36) (G7.2.3)
       a. Nu när jag kommer att skriva denna uppsatsen så kommer jag ha en rubrik om några problem och vad man kan göra för att förbättra dom.
          now when I will to write this the-essay so will I have a title about some problems and what one can do for to improve them
       b. kommer will jag I att ha to have
          'Now when I write this essay, I will have a heading about some problems and what one can do to improve them.'

The error example (4.36) is all the more interesting in that att 'to' is used in the first construction, kommer att skriva 'will write', but omitted in the subsequent one.

4.3.6 Sentence Structure

Introduction

The errors in this category concern word order, phrases or clauses lacking obligatory constituents, reduplication of the same word, and constructions with redundant constituents.
The finite verb is normally considered the core of a sentence and is surrounded by its complements (e.g. subject, direct and indirect object, adverbials). The distribution of these complements is defined both syntactically (the verb's construction scheme) and semantically (the roles the different actants play in the sentence). The verb thus governs the structure of the whole sentence: which constituents are included, where they are placed and which roles they play. In addition, the position of sentence adverbials plays an important role. Sentences in Swedish display two types of word order. Main clause order is characterized by the finite verb before the adverbial (dubbed fa-sentence in Teleman et al. (1999, Part 4:7)), presented in Table 4.16.9 Subordinate clause word order is characterized by the adverbial before the finite verb (dubbed af-sentence in Teleman et al. (1999, Part 4:7)), presented in Table 4.17. In addition to the distinct word orders of main and subordinate clauses, traditional grammar also distinguishes between basic word order, where the subject precedes the predicate (example sentence 2 in Table 4.16 and both sentences in Table 4.17), and inverted word order, where the subject follows the predicate (example sentences 1 and 3 in Table 4.16).

Table 4.16: Fa-sentence Word Order

   INITIAL FIELD  MIDDLE FIELD                        FINAL FIELD
   Initiation     Finite Verb  Subject  Adverbial*    Rest of VP
1. Nu             skulle       Per      nog inte      vilja träffa någon.
   now            would        Per      probably not  like to meet someone
2. Per            skulle                nog inte      vilja träffa någon nu.
   Per            would                 probably not  like to meet someone now
3. Vem            skulle       Per      nog inte      vilja träffa nu?
   who            would        Per      probably not  like to meet now

Table 4.17: Af-sentence Word Order

   INITIAL FIELD  MIDDLE FIELD                          FINAL FIELD
   Initiation     Subject  Adverbial*    Finite Verb    Rest of Verb Phrase
1. eftersom       Per      nog inte      skulle         vilja träffa någon nu
   because        Per      probably not  would          like to meet someone now
2. vem            Per      nog inte      skulle         vilja träffa nu
   who            Per      probably not  would          like to meet now

9 Conjunctions that coordinate main or subordinate clauses are not included in the scheme. The asterisk in the tables indicates that more constituents of this kind are possible.
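The fa-sentence scheme in Table 4.16 amounts to a verb-second constraint: in a main clause, at most one constituent occupies the initial field before the finite verb. The rough Python sketch below states this procedurally over a pre-chunked clause; the constituent chunking and the SUBJ/ADVL/FIN tags are assumptions made for the illustration, not the representation used in FiniteCheck.

```python
# Sketch of checking the fa-sentence (verb-second) constraint:
# at most one constituent may precede the finite verb of a main clause.
# Input is assumed to be pre-chunked into (text, tag) constituents.

def violates_v2(constituents):
    """constituents: list of (text, tag) pairs for one main clause.
    Returns True if more than one constituent precedes the finite verb."""
    for position, (_, tag) in enumerate(constituents):
        if tag == "FIN":          # first finite verb found
            return position > 1   # more than one constituent before it
    return False                  # no finite verb: not judged here

# Example (4.37a): "Jag den dan gjorde inget bättre." has both a subject
# and a time adverbial before the finite verb and is flagged.
bad = [("Jag", "SUBJ"), ("den dan", "ADVL"),
       ("gjorde", "FIN"), ("inget bättre", "OBJ")]
good = [("Den dan", "ADVL"), ("gjorde", "FIN"),
        ("jag", "SUBJ"), ("inget bättre", "OBJ")]
print(violates_v2(bad))   # True
print(violates_v2(good))  # False
```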
Word Order Errors

Word order errors concern transposition of sentence constituents, violating the fa-sentence or af-sentence word order constraints. Only five sentences with incorrect word order were found. The error example (4.37a) violates the fa-sentence word order, since two constituents, a subject and a time adverbial, precede the finite verb; the finite verb is expected in the second position of the sentence. The sentence can be corrected in two ways: either it is introduced by the subject, with the time adverbial placed last, as in (4.37b), or it starts with the time adverbial, with the subject directly after the finite verb, as in (4.37c).

(4.37) (G8.1.3)
       a. Jag den dan gjorde inget bättre.
          I the day did nothing better
       b. Jag gjorde inget bättre den dan.
       c. Den dan gjorde jag inget bättre.
          'I didn't do anything better that day.'

Redundancy Errors

As mentioned above, the type and number of constituents in a sentence are governed by the main verb. Any addition of other constituents influences the whole complement distribution, both syntactically and semantically. Words were duplicated directly (five occurrences), as in (4.38a) below, with the reduplicated word in the same position as the intended one:

(4.38) (G9.1.3)
       a. många som mobbar har har det oftast dåligt hemma
          many that bully have have it most-often bad at-home
       b. många som mobbar har det oftast dåligt hemma
          'Many that bully most often have it bad at home.'

Four occurrences involved duplication at a distance, i.e. the same word occurring elsewhere in the sentence. In example (4.39a) the subject jag 'I' is repeated after the verb, as if indicating inverted word order:
(4.39) (G9.1.7)
       a. jag fick jag hjälp med det.
          I got I help with it
       b. jag fick hjälp med det.
          'I got help with it.'

The example in (4.40a) involves a case where the writer has fronted not only the object det 'that' but also the verb particle åt 'for', which also occurs in its normal position after the verb. Either the fronted verb particle can be removed, as in (4.40b), or the one following the verb, as in (4.40c).

(4.40) (G9.1.8)
       a. Åt det går det nog inte att gör (→ göra) så mycket åt.
          about that goes it probably not to do [pres] (→ do [inf]) so much about
       b. Det går det nog inte att gör så mycket åt.
       c. Åt det går det nog inte att gör så mycket.
          'About this not much can probably be done.'

In four cases, added words disturbed the sentence structure, being redundant in the complement structure. In the following example, the pronoun det 'it' is redundant and plays no role in the sentence:10

10 There is also a word order error here: the constituents bara kan 'just can' should be switched, see G8.1.2 in Appendix B.1.
(4.41) (G9.2.2)
       a. för då kan man inte något ting bara kan gå på stan det då fattar hjärna ingenting
          cause then can one not some thing just can go to the-city it then understands brain nothing
       b. för då kan man inte något ting, bara gå på stan. Då fattar hjärna ingenting.
          'because then one cannot anything, just go to the city. Then the brain doesn't understand anything.'

Missing Constituents

Altogether 44 sentences were incomplete in the sense that one or more obligatory constituents were missing. Omission of the noun in subject position is the most frequent error type in this category (10 occurrences), e.g.:

(4.42) (G10.1.8)
       a. När man tror att man har kompisar blir ledsen när man bara går där ifrån
          when one thinks that one has friends becomes sad when one just goes there from
       b. blir man ledsen becomes one sad
          'When someone thinks that he has friends, he is sad when people just leave from there.'

Missing prepositions are quite common (11 occurrences):

(4.43) (G10.6.4)
       a. Hunden hoppade ner ett getingbo.
          the-dog jumped down a wasp-nest
       b. i 'into'
          'The dog jumped into a wasp's nest.'
Some occurrences of missing verbs were also found:

(4.44) (G10.4.3)
       a. Jag tycker att det har med uppfostran om man nu ger hon/han den saken som man tappade eller inte
          I believe that it has with upbringing if one now gives she/he the thing that one lost or not
       b. att göra 'to do'
          'I believe that it has to do with your upbringing whether you give back the thing he/she lost or not.'

Here is an example of a missing subjunction:

(4.45) (G10.7.4)
       a. till exempel den här killen gör så igen så...
          for instance the here boy does so again so
       b. om 'if'
          'for instance if this boy does so again, then...'

Other omissions involve pronouns, infinitive markers, adverbs and some fixed expressions, as in:

(4.46) (G10.8.4)
       a. sen levde vi lyckliga våra dagar
          then lived we happy our days
       b. i alla 'in all'
          'Then we lived happily ever after.'

4.3.7 Word Choice

This error category concerns words replaced by other words that semantically violate the sentence structure. The substitutions mostly involve words of the same category, but changes of category also occur. Most involve prepositions and particles, but we also find some adverbs, infinitive markers, pronouns and other classes. In (4.47a) we see an example of an erroneous verb particle. Here the verb att vara lika 'to be alike' requires the particle till 'to' in combination with the noun phrase sättet 'the-manner', not på 'on' as the writer uses.
(4.47) (G11.1.7)
       a. vi var väldigt lika på sättet alltså vi tyckte om samma saker
          we were very like on the-manner in-other-words we were-fond of same things
       b. lika till sättet like to the-manner
          'We were very alike in our manner. In other words, we were fond of the same things.'

The choice of prepositions is also problematic. In (4.48a) the preposition ur 'out of', which describes a completely different action than the required av 'off', was used.

(4.48) (G11.1.2)
       a. Vi sprang allt vad vi orkade ner till sjön och slängde ur oss kläderna.
          we ran all what we could down to the-lake and threw out-of us the-clothes
       b. slängde av oss threw off us
          'We ran as fast as we could down to the lake and threw off our clothes.'

Five errors concerned the conjunction och 'and' used in the position of an infinitive marker. This error is speech related. In Swedish, the pronunciation of the infinitive marker att [at] 'to' is often reduced to [å], as is that of the conjunction och [ock] 'and', i.e. both att 'to' and och 'and' are often pronounced [å]. As a consequence, these two forms and their syntactic roles can be mixed up in writing, as in the next example (4.49a).

(4.49) (G11.3.1)
       a. det var onödigt och skrika pappa
          it was unnecessary and scream daddy
       b. att 'to'
          'It wasn't necessary to scream, daddy.'

The choice between the adverbs vart 'whither' and var 'where' caused trouble for two subjects in three occurrences; an example is given in (4.50a). This may also be a dialectal matter, since in certain regions vart has the same distribution as var 'where'.
(4.50) (G11.2.2)
       a. Men vart ska jag bo?
          but whither will I live
       b. var 'where'
          'But where will I live?'

Blends of fixed expressions also occurred. In the following example, the writer mixes up the expressions så mycket jag kunde 'as much as I could' and allt vad jag var värd 'for all I was worth':

(4.51) (G11.5.3)
       a. jag sprang så fort så mycket jag var värd
          I ran so fast so much I was worth
       b. allt vad jag var värd all what I was worth
          'I ran as fast as I could, for all I was worth.'

Other word choice errors concerned pronouns, adjectives and nouns.

4.3.8 Reference

Reference in Swedish

Pronouns are used to refer to something already mentioned in the text (anaphoric reference) or something present in the utterance situation (deictic reference). The pronoun then correlates with the noun it refers to and has to agree with it in number and gender.

Reference Errors

The referential violations found concern only anaphoric reference, referring back to the previous text, both within the same clause and in a larger context. The errors were of two types: cases where the pronoun did not agree (six occurrences) and cases where the referent changed (two occurrences). Of the agreement cases, four errors concerned wrong number, as in (4.52a), and two were related to gender, as in (4.53a).
(4.52) (G12.1.1)
       a. Nästa dag gick dem upp till en grotta där fick dem var sin korg med saker i. Lena fick en kattunge för manen hade många djur. Och Alexander fick ett spjut. sen gav den [sing] sej iväg när de gått och gått så hände något...
          next day went they up to a cave there got they each his/her basket with things in Lena got a kitten because the-man had many animals and Alexander got a spear then went it self away when they went and went so happened something
       b. de 'they'
          'The next day they went to visit a cave. There they each got a basket with things in it. Lena got a kitten, because the man had many animals. And Alexander got a spear. Then it went away. When they went and went, something happened...'

(4.53) (G12.1.5)
       a. Vad heter din mamma? Det stod helt still i huvudet. vad var det han [masc] hette nu igen?
          what is-called your mother [fem] it stood completely still in the-head what was it he was-called now again
       b. hon 'she'
          'What is your mother's name? It was completely still in my head. What was he called now again?'

In two cases, a shift between direct quote and narrative occurred. In one such error, in (4.54a), the writer is first involved in the situation, referred to as vi 'we', and then suddenly, in the subsequent sentence, the pronoun changes to ni 'you' [pl], switching the focus from the writer as part of a group to other people.

(4.54) (G12.2.1)
       a. spring ut nu vi har besökare när ni kom ut...
          run out now we have visitors when you [pl] came out
       b. vi 'we'
          'Run out, we have visitors! When we came out...'
4.3.9 Other Grammar Errors

One error instance includes an adverb used as an adjective:

(4.55) (G13.1.2)
       a. När jag var liten [adj] mindre
          when I was small smaller
       b. lite a little [adv] mindre smaller
          'When I was a little smaller...'

Finally, three cases could not be classified at all. The sentences had very strange structure: either single words were incomprehensible or the whole sentence did not make any sense. In some cases this could be a question of several sentences being run together, in which case the sentences are incomplete and/or lack any marking of sentence boundaries. During the analysis, some errors involving sequence of tense were also discovered. These are not targeted in the present analysis and are left for future work.
4.3.10 Distribution of Grammar Errors

As discussed in the presentation of error types (Section 3.4), the units by which the frequency of grammar errors could be estimated differ from type to type and are also difficult to count in text containing errors. For that reason, error frequencies are compared between error types, and total numbers of errors are related to the total number of words.

Overall Error Distribution

In the whole corpus of 29,812 words, 262 instances of grammar errors were found, corresponding to 8.8 errors per 1,000 words. The different errors are summarized in Table 4.18, grouped by sub-corpora, and in Table 4.19, by age. The total error distribution is also illustrated in Figure 4.1 below. The most recurrent grammar problem concerns the form of the finite main verb, which lacks its tense ending (42%). This problem seems characteristic of this particular age group, whose writing is close to spoken language. Most of these errors are found in the Deserted Village corpus (44) and among the 9 year olds (72). The Frog Story texts also contain quite a high number of such errors; the remaining corpora include around 10 such errors each. Missing constituents is the second largest error category (16.8%). These errors appear mostly among the older children, perhaps because their text structure is more developed and complex than that of the younger children. Among the sub-corpora, the Spencer Expository texts include most of these errors (20). Erroneous word choice, dominated by errors in the choice of prepositions and verb particles, is the third most frequent category, representing 10.7% (28) of all grammar errors, and seems to be spread evenly among both sub-corpora and age groups. Agreement errors in the noun phrase and extra words inserted into sentences are also quite frequent (5.7% and 5.0% respectively).
Agreement errors are quite equally spread in the corpora and occur most among the 9 year olds and 11 year olds. Redundancy errors display a similar distribution to that of the missing constituents, more errors were found among the older children and the Spencer Expository texts contain most errors of this kind. Other grammar error categories represent less than 4% each of all the grammar errors. Eight agreement errors in predicative complement occurred, mostly among the 13 year old subjects and in the Spencer Expository texts. The six definiteness errors were made only by 9 year olds and 11 year olds. Pronoun case errors occurred five times, found only in the texts of 10 year olds and 13 year olds, probably
because they were the only ones who made the written distinction between the nominative and accusative plural pronouns (de 'they' vs. dem 'them'). Seven cases of an erroneous verb form after an auxiliary verb occurred, mostly in the writing of 11 year olds and in the Deserted Village corpus. All errors but one in the verb-form-in-infinitive-phrase category were made by 11 year olds. Omission of the infinitive marker after the auxiliary verb komma 'will' was rare: only three cases occurred, among the 13 year olds in the Spencer Expository texts. Eight referential errors occurred, mostly in the Deserted Village corpus and in the texts by 9 year olds. Five word order errors were found, distributed evenly among sub-corpora and ages.

Figure 4.1: Grammar Error Distribution
Table 4.18: Distribution of Grammar Errors in Sub-Corpora

                                          SUB-CORPORA
ERROR TYPE                    Deserted  Climbing  Frog   Spencer    Spencer
                              Village   Fireman   Story  Narrative  Expository  TOTAL     %
Agreement in NP                   5        4        2       -           4         15     5.7
Agreement in PRED                 -        2        -       -           6          8     3.1
Definiteness in single nouns      3        1        2       -           -          6     2.3
Pronoun Case                      1        1        -       -           3          5     1.9
Finite Verb                      44       13       34      10           9        110    42.0
Verb form after Vaux              3        1        1       -           2          7     2.7
Vaux Missing                      -        -        -       -           2          2     0.8
Verb form after inf. marker       2        -        1       -           1          4     1.5
Inf. marker Missing               -        -        -       -           3          3     1.1
Word Order                        1        -        1       -           3          5     1.9
Redundancy                        1        2        1       3           6         13     5.0
Missing Constituents              7        2        8       7          20         44    16.8
Word Choice                       9        5        2       3           9         28    10.7
Reference                         3        1        2       2           -          8     3.1
Other                             3        -        -       1           -          4     1.5
TOTAL                            82       32       54      26          68        262     100
Errors/1,000 Words             10.8      7.1     11.0     4.7         9.3        8.8

Table 4.19: Distribution of Grammar Errors by Age

                                              AGE
ERROR TYPE                    9-year  10-year  11-year  13-year  TOTAL     %
Agreement in NP                  5       2        6        2       15     5.7
Agreement in PRED                1       1        1        5        8     3.1
Definiteness in single nouns     3       -        3        -        6     2.3
Pronoun Case                     -       3        -        2        5     1.9
Finite Verb                     72      11       14       13      110    42.0
Verb form after Vaux             1       1        3        2        7     2.7
Vaux Missing                     1       -        -        1        2     0.8
Verb form after inf. marker      -       -        3        1        4     1.5
Inf. marker Missing              -       -        -        3        3     1.1
Word Order                       1       3        1        -        5     1.9
Redundancy                       1       5        3        4       13     5.0
Missing Constituents             3      13       13       15       44    16.8
Word Choice                     10       4        6        8       28    10.7
Reference                        3       -        3        2        8     3.1
Other                            1       -        2        1        4     1.5
TOTAL                          102      43       58       59      262     100
Errors/1,000 Words            14.9     6.3      7.2      7.3      8.8
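The summary figures in Tables 4.18 and 4.19 can be recomputed from the per-type totals as a consistency check. The sketch below uses only numbers reported above; the corpus size is the 29,812 words stated earlier in this section.

```python
# Recompute the overall error rate and per-type percentages from the
# TOTAL column of Tables 4.18/4.19.

totals = {
    "Agreement in NP": 15, "Agreement in PRED": 8,
    "Definiteness in single nouns": 6, "Pronoun Case": 5,
    "Finite Verb": 110, "Verb form after Vaux": 7, "Vaux Missing": 2,
    "Verb form after inf. marker": 4, "Inf. marker Missing": 3,
    "Word Order": 5, "Redundancy": 13, "Missing Constituents": 44,
    "Word Choice": 28, "Reference": 8, "Other": 4,
}
CORPUS_WORDS = 29_812  # total size of Child Data

total_errors = sum(totals.values())
rate = total_errors / CORPUS_WORDS * 1000          # errors per 1,000 words
share = {t: n / total_errors * 100 for t, n in totals.items()}

print(total_errors)                    # 262
print(round(rate, 1))                  # 8.8
print(round(share["Finite Verb"], 1))  # 42.0
```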
Distribution Among Sub-Corpora

In Table 4.18 we summarize the grammar errors found in the separate sub-corpora. Most of the grammar errors occurred in the Deserted Village corpus (82), followed by the Spencer Expository texts (68). However, if we relate the number of errors to the size of each sub-corpus, the Frog Story and Deserted Village corpora have the highest rates, with 11.0 and 10.8 errors per 1,000 words, respectively. The Spencer Narrative texts included only 26 grammar errors in total, corresponding to only 4.7 errors per 1,000 words. As regards the frequency of the various error types (see Figure 4.2), Frog Story and Deserted Village are distinguished from the other sub-corpora by a much higher frequency of finite verb errors, with seven and six such errors per 1,000 words, respectively; in the other sub-corpora the rate is half that or less. Other error types occur at most 1.6 times per 1,000 words. All the sub-corpora are dominated by errors in the finite verb, except for the Spencer Expository texts, where missing constituents are the most frequent error type and finite verb errors the second most frequent. Agreement errors in the predicative complement are only found in the Climbing Fireman texts and the Spencer Expository corpus. Further, the errors in the Spencer Narrative texts are spread over a much smaller number of error types.

Distribution Among Ages

Looking at grammar errors by age (Table 4.19), we find that most of the grammar errors occur in the texts of the youngest, the 9 year olds (102), and fewest in the texts of the 10 year olds (43). Error density varies from 14.9 errors per 1,000 words for the 9 year olds to 6.3 errors for the 10 year olds. The 11 year olds and 13 year olds have very similar densities of 7.2 and 7.3 errors, respectively.
The separate error types and their densities are presented in Figure 4.3. Finite verb form errors are most characteristic of the 9 year olds, who produced about five times as many such errors as the other age groups. In the other age groups, finite verb errors and missing constituents are together the most frequent errors. Word choice errors also rank highly in all age groups. Errors in agreement with the predicative complement are concentrated in the texts of the 13 year olds. Apart from the finite verb form errors of the 9 year olds, no error type occurs more than twice per 1,000 words in any age group.
Figure 4.2: Error Density in Sub-Corpora

Figure 4.3: Error Density in Age Groups
4.3.11 Summary

In total, 262 grammar errors were found in Child Data, corresponding to an average of 8.8 errors per 1,000 words. The most common errors concern the form of the finite verb, missing obligatory constituents, word choice and agreement in noun phrases. Errors are most frequent in the Frog Story and Deserted Village corpora and among the 9 year olds.

4.4 Child Data vs. Other Data

In this section, the grammar errors found in Child Data are compared with the studies of grammar errors discussed in Chapter 2 (Section 2.4). Only the analyses of children's writing at school and the studies of adult writing from the grammar checking projects are included. It turned out to be very difficult to compare error types across the other studies, since they either did not report much data or classified errors differently, without giving enough information on exactly which errors were included. The object of this part of the analysis is to investigate the similarities and differences between the error types found in children and in other writers, in order to see which grammar errors to concentrate on in the development of a grammar checker aimed at children.

4.4.1 Primary and Secondary Level Writers

Teleman's study and the analysis from the Skrivsyntax project are the two analyses of children's writing that report on grammar errors at the syntactic level. The reports do not provide quantitative analyses of error type frequency; instead, the types of errors are reported and, in some cases, exemplified.

Teleman's Examples

Teleman's study (Teleman, 1979) includes examples of writing errors in texts by children from the seventh year of primary school (14 years old). The examples are mostly listed as fragments taken out of context, though some are presented with their surrounding context. Many of the examples concern word choice or are of a content-related nature.
Among the grammar errors (Table 4.20),11 Teleman (1979) lists examples of errors in pronoun case, verb form, definiteness agreement, missing constituents (mostly a missing subject), reference errors, word order

11 The column with the correct forms of the exemplified errors contains my own suggestions; Teleman (1979)'s examples are listed without any suggested corrections.
and tense shift. Other errors concerned incorrect use of idiomatic expressions, missing prepositions, and the use of the conjunction och 'and' instead of the infinitive marker att 'to'. The influence of spoken language is evident in many of the examples. Tense endings on verbs are dropped; accusative forms of pronouns are not used, in particular the pronunciation-like form dom ('they' or 'them') is used instead of the nominative (de) and accusative (dem) forms, which, as mentioned earlier, are only distinguished in writing. The use of the conjunction och 'and' instead of the infinitive marker att 'to' also indicates influence of spoken language. Dialect influence occurs in the example of definiteness agreement, with the determiner denna 'this' followed by a definite noun. All the error types that Teleman found (except one) occurred in our Child Data corpus as well: only the case of two supine verbs following each other was not found. However, Child Data contained additional error types, such as verb form errors other than dropped tense endings on finite verbs, erroneous word choices other than prepositions or conjunctions in place of the infinitive marker, and occurrences of superfluous constituents.

Table 4.20: Examples of Grammar Errors in Teleman's Study

ERROR TYPE            ERROR                                        CORRECT FORM
Pronoun form          dom 'they' [spoken form]                     de, dem 'they' [nom], 'them' [acc]
                      han, hon 'he, she'                           honom, henne 'him' [acc], 'her' [acc]
Verb form             fråga 'ask' [inf]                            frågade 'asked' [pret]
Double supine         fått sålt 'got [sup] sold [sup]'             fått sälja 'got [sup] sell [inf]'
Agreement in NP       denna bilen 'this car [def]'                 denna bil 'this car [indef]'
Agreement in PRED     hennes förslag... förefaller mig             orealistiskt [neu]
                      orealistisk 'her suggestion [neu]...
                      appears to me unrealistic [com]'
Missing constituents  Tog med honom till polisen. 'took him        (subject missing)
                      along to the-police'
Reference             polisen... de 'the-policeman... they'        han 'he'
Word order            ett till fall 'a more case'                  ett fall till 'a case more'
Tense shift           Då förstod Majsan varför han har varit       hade varit [past perf]
                      rädd. 'then understood [pret] Majsan why
                      he has been [perf] afraid'
Choice of or missing  bet på repet 'bit on the-rope'               bet i repet 'bit in the-rope'
prepositions          fråga vissa saker 'ask some things'          fråga om 'ask about'
och instead of att    få lov och göra något 'get permission        att göra 'to do'
                      and do something'
Error Profile of the Data 79

Skrivsyntax

Among the seven error types distinguished in the error analysis of the Skrivsyntax project on the writing of third-year students of upper secondary school (Hultman and Westman, 1977, p. 230), grammar errors were the most frequent. From the whole corpus of 88,757 words, 1,157 words were classified as grammar errors. According to Hultman and Westman (1977), gender agreement errors were common and relatively many examples of errors in pronoun case after a preposition occurred in these texts. Errors in agreement between subject and predicative complement occurred quite frequently. Word order errors were also reported, mostly in the placement of adverbials. Other examples include verb form errors, errors in idiomatic phrases (the majority concern prepositions), subject-related errors, and clauses with odd structure. Some examples of these grammar errors are displayed in Table 4.21.

Table 4.21: Examples of Grammar Errors from the Skrivsyntax Project

ERROR TYPE            | EXAMPLE                                                   | CORRECT FORM
Gender agreement      | bland det mest intolerabla och kortsynta formen på samlevnad 'among the [neu] most intolerant and short-sighted form [com,def] of married life' | den ... formen 'the [com] ... form [com,def]'
Agreement in PRED     | barnet är van 'the child [neu,def] is used-to [com]'      | vant 'used-to [neu]'
Pronoun case          | för alla de som 'for all they [nom] that'                 | dem 'them [acc]'
                      | hjälpa de som 'help they [nom] that'                      | dem 'them [acc]'
Verb form             | Naturligtvis måste båda typerna av äktenskap finns 'of course must both types of marriage exists [pres]' | måste ... finnas 'must ... exist [inf]'
                      | Hon har inte kunna frigöra sig 'she has not be-able [inf] free herself' | har ... kunnat 'has ... been-able [sup]'
Word order            | Ett äktenskap kräver att två personer skall älska bara varandra hela livet ut 'a marriage demands that two people shall love only each-other whole the-life out' | bara skall älska varandra 'only shall love each-other'
Idiomatic expressions | löftet till trohet 'the-promise to fidelity'              | om 'about'
                      | grundtanken till äktenskapet 'the-fundamental-idea to marriage' | i 'in'
Other errors mentioned concern the structure of sentences and include, for instance, the omission of the infinitive marker att 'to', main clause word order in subordinate clauses, and sub-categorization of verbs. Reference errors are also observed and are considered quite common in the material. Some tense problems occurred. The error types encountered in Skrivsyntax give a general indication of the decreasing influence of spoken language on writing compared to earlier ages. The only examples that may contradict this statement are errors in the use of the subject form of the pronoun de 'they' in object position or in certain expressions after prepositions (where it should be dem 'them'). Verb form errors, on the other hand, include only erroneous use of existing written forms, with no dropped tense-endings being reported. These errors, and errors in the choice of preposition, gender agreement, verbs and word order, were also found in Child Data. Omission of the infinitive marker with certain verbs was only analyzed in the context of the verb komma 'will' in the present study. Further, constituent structure seems to be more complex than in the texts of Child Data, resulting in errors where the agreeing elements are separated by more words and are thus harder for the writer to discover, e.g. the gender agreement error in bland det mest intolerabla och kortsynta formen på samlevnad.

Conclusion

Although the Teleman and Skrivsyntax studies cannot be considered completely representative of the two age groups, and despite a time span of more than twenty years between those studies and the present one, the error types that occur in children's writing are persistent. The writing of the primary school children in Teleman's study shows similarities to Child Data mostly in the use of spoken forms. These types of errors seem to be (almost) non-existent in secondary level writers.
Since no counts or other indications of error frequency beyond word examples are given, the relative frequency and distribution of errors remain unclear.

4.4.2 Evaluation Texts of Proof-Reading Tools

As already mentioned, the evaluation studies carried out as part of the development of the three Swedish grammar checking tools report on grammar errors found primarily in the writing of professional adult writers. Here, we look at the errors reported in two such studies and compare them to the grammar errors found in Child Data.
Error Profiles of the Evaluation Texts

The performance test of Grammatifix reporting the ratio of detected errors (recall) was based on a newspaper corpus of 87,713 words (Birn, 2000).(12) The material included in total 127 grammar errors, summarized in Table 4.22 below.(13) Among the error types, Other agreement errors contained complements, postmodifiers and anaphoric pronouns (i.e. reference errors), and the category Missing or superfluous endings consisted of e.g. genitive, passive or adverb endings. Verb form errors included mostly errors in verb clusters. It is not clear which types of errors belong to the category of Sentence structure errors, or what is included under the Other category (see further Birn, 2000, p. 39).

Table 4.22: Grammar Errors in the Evaluation Texts of Grammatifix

ERROR TYPE                      NO.    %
Agreement in noun phrase         22   17.3%
Other agreement errors            9    7.1%
Verb form                        28   22.0%
Choice of preposition            26   20.5%
Missing or superfluous endings   21   16.5%
Sentence structure                8    6.3%
Word order                        3    2.4%
Other                            10    7.8%
TOTAL                           127   100%

Four error types clearly dominate: errors in verb form, choice of preposition, agreement in noun phrase, and missing or superfluous endings. Other types occurred at most ten times.

In Knutsson (2001), an evaluation of Granska's proof-reading tool is reported, based on a text corpus of 201,019 words. The collection included texts of different genres, mostly news articles of various kinds, some official texts, popular science articles and student papers. The analysis concerned grammar, punctuation and some spelling errors. Table 4.23 below is a summary of the grammar errors (see further Knutsson, 2001, p. 143). The relative frequency of the error types was recalculated.

(12) Precision of the system, i.e. how good the system is at avoiding false alarms, was tested on a corpus of 1,000,504 words. It is not clear whether this corpus consists of different newspaper texts or whether there was an overlap with the texts tested for recall of the system. According to the author, only the recall corpus was pre-analyzed manually for grammar errors (see further Birn, 2000).

(13) Birn (2000) also reports 8 instances of splits. They are not included here, since that type belongs to the spelling error category.
The error classification in Granska's corpus is more similar to the classification adopted in the present thesis. The category of Verb form errors, however, does not specify the different sub-categories. Altogether, 272 grammar errors occurred in this evaluation corpus. Granska's corpus, although more than double the size, and the evaluation texts of Grammatifix display almost the same error rate: 1.35 and 1.45 errors per 1,000 words, respectively. Most errors were erroneous verb forms, followed in frequency by agreement errors in noun phrases and missing constituents. Some errors occurred in predicative complement agreement and pronoun form. The remaining types occurred fewer than ten times each.

Table 4.23: Grammar Errors in Granska's Evaluation Corpus

ERROR TYPE                     NO.    %
Definiteness in single nouns     4    1.5%
Agreement in noun phrase        69   25.4%
Agreement in pred. compl.       16    5.9%
Verb form                       89   32.7%
Pronoun form                    14    5.1%
Reference                        1    0.4%
Choice of preposition           11    4.0%
Word order                       8    2.9%
Missing word                    56   20.6%
Redundant word                   4    1.5%
TOTAL                          272   100%

Comparison with Child Data

The most obvious difference between the grammar errors from the evaluation texts and Child Data is the error rate relative to the size of the corpora. Although the Child Data corpus is the smallest, its total number of errors is almost the same as that in Granska's evaluation texts. Errors in Child Data are six times more frequent, with almost 9 errors per 1,000 words, than in the evaluation texts, which have an error density of less than 1.5 errors per 1,000 words (see Table 4.24).
Table 4.24: General Error Ratio in Grammatifix, Granska and Child Data

                        GRAMMATIFIX   GRANSKA   CHILD DATA
Number of words              87,713   201,019       29,812
Number of errors                127       272          262
Errors/1,000 words             1.45      1.35          8.8

As we have seen, error classification varies between the projects, making a comparison of all error types impossible. Verb form errors, noun phrase agreement, missing constituents (in Granska) and erroneous choice of preposition (in Grammatifix) are the four most common error types, with frequencies in the range of 20% to 30% each. Recall that errors in Child Data are less evenly spread among the various types of errors. They are clearly dominated by errors in (finite) verb forms (42%), followed by missing constituents at half that frequency (16.8%). Erroneous choice of words is the third most common grammar error (10.7%). Agreement errors in noun phrase occurred in 15 cases (5.7%).

Relating the errors in noun phrase agreement, verb form and choice of preposition reported by all groups to the size of the corpora, as presented in Table 4.25 below, we get a rough picture of error frequency for these three error types in comparison to Child Data. The corresponding error types selected from the Child Data corpus include all the errors in agreement in noun phrases and only the preposition-related errors in the word choice category. Three error categories were selected as representative for verb form errors: finite main verb, verb form after auxiliary verb, and verb form after infinitive marker.

Table 4.25: Three Error Types in Grammatifix, Granska and Child Data

                           GRAMMATIFIX          GRANSKA            CHILD DATA
ERROR TYPE                 No.  Errors/1,000    No.  Errors/1,000  No.  Errors/1,000
Agreement in noun phrase    22      0.25         69      0.34       15      0.50
Verb form                   28      0.32         89      0.44      112      3.76
Choice of preposition       26      0.30         11      0.05       10      0.34

Table 4.25 is also rendered as a graph in Figure 4.4 below.
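The densities in Tables 4.24 and 4.25 all follow from one normalization: errors per 1,000 running words. A minimal sketch of that calculation (the dictionary names and the `per_thousand` helper are my own; the counts are copied from the tables above):

```python
# Corpus sizes (words) and total error counts, copied from Table 4.24.
words = {"Grammatifix": 87_713, "Granska": 201_019, "Child Data": 29_812}
errors = {"Grammatifix": 127, "Granska": 272, "Child Data": 262}

def per_thousand(n_errors: int, n_words: int) -> float:
    """Error density: errors per 1,000 running words."""
    return 1000 * n_errors / n_words

for name in words:
    print(f"{name}: {per_thousand(errors[name], words[name]):.2f} errors/1,000 words")

# The same normalization yields the per-type densities in Table 4.25,
# e.g. for verb form errors (counts from the table):
verb_form = {"Grammatifix": 28, "Granska": 89, "Child Data": 112}
child = per_thousand(verb_form["Child Data"], words["Child Data"])    # ~3.76
granska = per_thousand(verb_form["Granska"], words["Granska"])        # ~0.44
```

The loop reproduces the bottom row of Table 4.24 (1.45, 1.35 and 8.79, the last rounded to 8.8 in the table), and the ratio of `child` to `granska` makes the roughly eightfold difference in verb form error density explicit.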
These figures show that the children made more errors than the adult writers in all three error types. The difference is marginal for errors in noun phrase agreement and choice of preposition. For verb form errors, however, the difference is roughly eightfold: the children made almost four such errors per 1,000 words, compared to less than 0.5 for the adults.
The distribution of errors over the three error categories is the same for Child Data and Granska, with the fewest errors in choice of preposition and the most in verb form. In the Grammatifix corpus, erroneous choice of preposition is quite frequent, with almost the same rate as in Child Data, while errors in noun phrase agreement are few.

Figure 4.4: Three Error Types in Grammatifix (black line), Granska (gray line) and Child Data (white line)

Conclusion

The error classifications in the projects differ, making comparison on a more detailed level impossible. The overall error rate reveals similar values for the adult corpora, whereas errors are considerably more frequent in Child Data. A comparison of the three most common error types in the adult corpora with the same types in Child Data displays a considerable difference in the frequency of verb form errors, whereas the difference is not as substantial for the other two types. Although not all error types could be compared, this observation indicates that there is a difference not only in the overall error rate, but also in the types of errors.
4.4.3 Scarrie's Error Database

As mentioned in Section 2.4, corrections by professional proof-readers at two Swedish newspapers were gathered into a Swedish Error Corpora Database (ECD) in the Scarrie project. This database now contains nearly 9,000 error entries. In total, 1,374 of these errors were classified as grammar errors, corresponding to approximately 16% of all errors (Wedbjer Rambell et al., 1999).

Error Profile of the Error Database

The error classification in the ECD is very refined; the division of error types is based primarily on the type of phrase involved rather than the violation type. As Wedbjer Rambell et al. (1999) state, noun phrase errors are the most frequent, followed by verb sub-categorization problems, errors in prepositional phrases and problems within verb clusters. Within the noun phrase category, agreement errors are the most common error type (27.8%), followed by definiteness in single nouns (22.3%) and case errors (14.2%). Verb valence, the second largest grammar problem category, includes problems with the infinitive phrase as the most frequent (24.7%); moreover, over 90% of all verb valence errors concern the infinitive marker att 'to' (one third occur after the verb komma 'will'). Choice of preposition and missing preposition are the top error subtypes in the prepositional phrase category (36% and 26.6%, respectively). Finally, in verb clusters, the most common errors involve an auxiliary verb followed by the infinitive (33.3%), main verbs in the finite form (30.6%) and a temporal auxiliary verb followed by the supine (18.0%).

Comparison to Child Data

The fine division of error types and the on-line availability of Scarrie's ECD make a more extensive and precise comparison of the studies possible. In total, eleven error types are compared with the errors in Child Data, presented in Table 4.26.
The error types missing auxiliary verb and missing infinitive marker, which were quite rare, are not included, nor are all the word choice or Other category errors. The large size of the newspaper corpus in Scarrie (approximately 70,000,000 words) results in a ratio of 0.009 errors per 1,000 words. In the Child Data corpus, the ratio is 8 errors per 1,000 words for the listed error types. The large gap in error density is obvious; the further analysis will therefore compare how frequent the errors are across the selected categories and show what types of errors characterize the corpora.
Table 4.26: Grammar Errors in Scarrie's ECD and Child Data

                                    SCARRIE        CHILD DATA
ERROR TYPE                        NO.     %       NO.     %
Agreement in noun phrase          176   25.7%      15    6.4%
Agreement in pred. compl.          48    7.0%       8    3.4%
Definiteness in single nouns       68    9.9%       6    2.6%
Pronoun form                       21    3.1%       5    2.1%
Finite verb form                   34    5.0%     110   46.8%
Verb form after Vaux               57    8.3%       7    3.0%
Verb form after inf. marker         4    0.6%       4    1.7%
Word order                         57    8.3%       5    2.1%
Missing or redundant word         132   19.2%      57   24.3%
Choice of preposition              76   11.1%      10    4.3%
Reference                          13    1.9%       8    3.4%
TOTAL                             686   100%      235   100%
Errors/1,000 words               0.009             7.8

Figure 4.5 shows the relative error frequency of the selected error types in Scarrie's corpus and Figure 4.6 the corresponding frequencies in the Child Data corpus. The main difference is that the top error type for Child Data, errors in the finite verb form, is not a very common error in Scarrie's corpus. The other three top error types in Child Data and the three top error types in Scarrie are represented by the same categories, but in a slightly different order. In Scarrie's corpus, noun phrase agreement errors are the most frequent, followed by missing and redundant constituents and then choice of preposition. In Child Data, agreement errors in noun phrase are much less frequent than omission or addition of words in sentences, but erroneous choice of preposition is also the least frequent of these three categories. Errors in verb forms overall have a much lower frequency in Scarrie's corpus. Errors in verb form after an auxiliary verb are the fifth most common error type in Scarrie's corpus and the most frequent among the verb errors; errors in finite verb form are even less frequent, and errors in verb form after an infinitive marker are quite rare. In Child Data, errors in verb form after an auxiliary verb are much less frequent than in the finite verb, the most common error. Errors in verb form after an infinitive marker are also rare.
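The percentage columns in Table 4.26 are simply each category's count over its column total. A small sketch under that assumption (the `share` helper and the variable names are my own; the counts are from the table):

```python
# Selected counts from Table 4.26 (column totals: Scarrie 686, Child Data 235).
scarrie = {"Agreement in noun phrase": 176,
           "Finite verb form": 34,
           "Missing or redundant word": 132}
child = {"Agreement in noun phrase": 15,
         "Finite verb form": 110,
         "Missing or redundant word": 57}
SCARRIE_TOTAL, CHILD_TOTAL = 686, 235

def share(count: int, total: int) -> float:
    """Relative frequency of an error type, in percent (one decimal)."""
    return round(100 * count / total, 1)

# Finite verb form: marginal in Scarrie, dominant in Child Data.
print(share(scarrie["Finite verb form"], SCARRIE_TOTAL))  # 5.0
print(share(child["Finite verb form"], CHILD_TOTAL))      # 46.8
```

The same normalization reproduces every percentage cell in the table, which makes the contrast explicit: the finite verb form accounts for 5.0% of the Scarrie errors but 46.8% of the Child Data errors.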
As already mentioned, agreement errors in noun phrases have a higher relative frequency in Scarrie's ECD than in Child Data. Agreement errors in predicative complement position seem to be slightly more common in Scarrie's texts, and likewise definiteness errors in bare nouns.
Figure 4.5: Error Distribution of Selected Error Types in Scarrie

Figure 4.6: Error Distribution of Selected Error Types in Child Data
There were few word order errors in Child Data. These seem more common in Scarrie's ECD, being as common there as errors in verb form after an auxiliary verb. The opposite holds for reference errors, which were quite rare in Scarrie's texts and more common in Child Data. Pronoun form errors display a similar distribution in both corpora.

Conclusion

Comparison of error frequency over the selected error types in the two corpora shows both differences and similarities. The largest difference lies in the verb form errors: in Scarrie's texts, verbs following an auxiliary verb are the main problem, whereas in Child Data it is the finite verb form, the most common error in the whole corpus. Other differences concern word order and definiteness in bare nouns, more common in Scarrie's corpus, and reference errors, more common in Child Data. Agreement errors in predicative complements seem to be slightly more common in Scarrie's corpus. Some of the differences could be circumstantial, due to the difference in the size of the corpora, but hardly those in the most common error types. Child Data's profile is characterized by errors in finite verb form and omissions or additions of words. Scarrie's texts are dominated by errors in noun phrase agreement and omission or addition of words. Agreement errors in noun phrases are the third most common error type in Child Data. Errors in choice of preposition and pronoun form show similar frequency distributions in the two corpora.

4.4.4 Summary

The nature of grammar errors in Child Data is more similar to the errors found in Teleman's primary school children than to those of the secondary level writers of the Skrivsyntax project. The differing error classifications in the grammar checking projects made deeper analysis difficult. Errors are, in general, more frequent in Child Data, but a closer look at three error types indicates that for some error types the difference is marginal, whereas for others children make many more errors.
A fine-grained comparison with selected error types from Scarrie's ECD confirms this difference, showing different error frequency distributions in certain error sub-types. On the other hand, apart from finite verb form errors, the most common error types in Scarrie's corpus are also the most frequent in Child Data.
4.5 Real Word Spelling Errors

4.5.1 Introduction

This section is devoted to spelling errors that form existing words. These errors are particularly interesting from the computational point of view, because they normally require analysis of context larger than a word and are most often not discovered by a traditional spelling checker developed for the detection of errors in isolated words. Since this error category is not the main focus of the present study, the analysis aims at providing an overall impression of what errors occur and what grammatical consequences the new word formations create, rather than a detailed analysis of the spelling error types. First, the spelling violation types that are typical in Swedish are presented (Section 4.5.2), followed by an analysis of segmentation errors (Section 4.5.3) and misspelled words (Section 4.5.4). The total number of errors and their distribution is discussed at the end of this section (Section 4.5.5).

4.5.2 Spelling in Swedish

As mentioned in the classification of error categories in Chapter 3 (Section 3.4), spelling errors are violations of the orthographic norms of a language. In Swedish, these errors concern operations on letters and the segmentation of words. Compounds in Swedish are always written as one word. Since compounding is such a productive category, compounds are often a source of erroneous segmentation. They are most often spelled apart, forming more than one word, but the opposite occurs as well, when words are written together as if they were a compound. Other spelling violations occur when letters in words are missing, are replaced by other letters, are moved to other positions in the word, or when extra letters appear. Apart from these basic operations, Swedish has consonant gemination, often a cause of spelling errors (cf. Nauclér, 1980). Words can differ simply in single versus double consonants and have completely separate meanings, as in glas 'glass' and glass 'ice-cream'.
The spelling errors in this study are divided first into segmentation errors and misspellings. Segmentation errors are further divided into writing words apart, i.e. erroneous separation of compound elements (splits), and writing words together, i.e. erroneous combination of words into a compound (run-ons). The error taxonomy of misspellings is based on the four basic error types of omission, insertion, substitution and transposition usually applied in research on spelling (e.g. Kukich, 1992; Vosse, 1994), extended with two additional categories related to consonant doubling. The spelling taxonomy thus consists of two main categories, with segmentation errors divided into two sub-categories and misspellings into six sub-categories:

1. Segmentation errors:
   (a) splits - a word written apart, with a space in between
   (b) run-ons - words written together as one

2. Misspellings:
   (a) omission - a letter is missing
   (b) double consonant omission - single consonant instead of double consonant
   (c) insertion - an extra letter is added
   (d) double consonant insertion - double consonant instead of single consonant
   (e) substitution - a letter is replaced by another letter
   (f) transposition - two or more letters have changed positions

A word can be in violation of just one such spelling operation on letters or spaces, or several spelling violations may occur. The categories are exemplified in Table 4.27 below. All the errors in the table are real word spelling errors found in the current corpus, some with multiple violations. First the error category is presented, followed by an example of it and its correct form. The last column in the table gives the error index in the corresponding Appendix where the error instance(s) may be found (misspelled words from Appendix B.2, with indices starting in M, and segmentation errors from Appendix B.3, with indices starting in S).
Table 4.27: Examples of Spelling Error Categories

ERROR TYPE                        | ERROR                          | CORRECT WORD                  | INDEX
SINGLE ERRORS:
Split                             | djur affär 'animal store'      | djuraffär 'animal-store'      | S1.1.28
Run-on                            | tillslut 'close'               | till slut 'eventually'        | S8.1.3-12
Omission                          | bror 'brother'                 | beror 'depends'               | M4.2.1
Double omission                   | koma 'coma'                    | komma 'to come'               | M4.2.33-36
Insertion                         | örn 'eagle'                    | ön 'the island'               | M1.1.51
Double insertion                  | matt 'faint'                   | mat 'food'                    | M1.2.3
Substitution                      | bi 'bee'                       | by 'village'                  | M1.1.9-11
Transposition                     | förts 'been taken'             | först 'first'                 | M6.4.1-2
MULTIPLE ERRORS:
Split and double omission         | brand manen 'fire mane'        | brandmannen 'fire-man'        | S1.1.21-22
Substitution and split            | kran kvistar 'tap twigs'       | grankvistar 'fir-twigs'       | S1.1.59
Double omission and substitution  | fören 'the stem'               | förrän 'until'                | M8.1.1-4
Omission and double insertion     | tupp 'rooster'                 | stup 'precipice'              | M1.1.46
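The single-violation sub-categories above can be operationalized as a small classifier over (error, correct) word pairs. The sketch below is illustrative only: it is not part of FiniteCheck, the function name is my own, and the doubling test is a simplification (it fires on any repeated letter, not only consonants):

```python
def classify(error: str, correct: str) -> str:
    """Assign a single-violation misspelling to one of the six misspelling
    sub-categories; returns 'multiple/other' when no single letter
    operation maps `correct` onto `error`."""
    le, lc = len(error), len(correct)
    if le == lc - 1:  # one letter of `correct` is missing in `error`
        for i in range(lc):
            if correct[:i] + correct[i + 1:] == error:
                doubled = (i > 0 and correct[i - 1] == correct[i]) or \
                          (i < lc - 1 and correct[i + 1] == correct[i])
                return "double consonant omission" if doubled else "omission"
    elif le == lc + 1:  # `error` contains one extra letter
        for i in range(le):
            if error[:i] + error[i + 1:] == correct:
                doubled = (i > 0 and error[i - 1] == error[i]) or \
                          (i < le - 1 and error[i + 1] == error[i])
                return "double consonant insertion" if doubled else "insertion"
    elif le == lc:
        diffs = [i for i in range(le) if error[i] != correct[i]]
        if len(diffs) == 1:
            return "substitution"
        if diffs and sorted(error) == sorted(correct):
            return "transposition"  # letters moved, none added or lost
    return "multiple/other"

# Single-error examples from Table 4.27:
print(classify("bror", "beror"))   # omission
print(classify("koma", "komma"))   # double consonant omission
print(classify("matt", "mat"))     # double consonant insertion
print(classify("bi", "by"))        # substitution
print(classify("förts", "först"))  # transposition
```

A real implementation would restrict the doubling test to consonants and handle the combined cases, such as tupp/stup in Table 4.27, which this sketch simply flags as 'multiple/other'.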
Some spoken forms in Swedish are accepted as spelling variants and will not be counted as errors in this analysis. They are listed in Table 4.28 below.

Table 4.28: Spelling Variants

SPOKEN FORM   WRITTEN EQUIVALENT
dom           de 'they'
sen           sedan 'then'
sa            sade 'said'
la            lade 'laid'
nån           någon 'someone'
nåt           något 'something'
nåra          några 'some [pl]'
nånstans      någonstans 'somewhere'
sån           sådan 'such [com]'
sånt          sådant 'such [neu]'
såna          sådana 'such [pl]'
våran         vår 'ours [com]'
vårat         vårt 'ours [neu]'
mej           mig 'me [acc]'
dej           dig 'you [acc]'
sej           sig 'him/her/itself [acc]'
stan          staden 'city [def]'
dan           dagen 'day [def]'

4.5.3 Segmentation Errors

The different types of segmentation errors are listed in Table 4.29, together with the number of different word types and how many were misspelled. Splits are further divided according to the part-of-speech they concern. The distribution of segmentation errors over the sub-corpora and among participant ages is discussed in Section 4.5.5.

Table 4.29: Distribution of Real Word Segmentation Errors

CATEGORY        NUMBER   WORD TYPES   MISSPELLED
RUN-ONS             13            4            0
SPLITS:
  Nouns            126           90            6
  Adjectives        49           37            0
  Pronouns           5            2            1
  Verbs              8            8            0
  Adverbs           53           21            5
  Prepositions       2            2            0
  Conjunctions       3            1            0
TOTAL SPLITS       246          160           12
Very few real word spelling errors occurred as words written together (run-ons), since these most often result in non-words. The cases that formed an existing word included just four word types. The most recurrent real word run-on was the prepositional phrase till slut 'eventually' which, when written together, forms the verb tillslut 'close', see (4.56):

(4.56) (S8.1.12)
a. Vi åkte tillslut på bio.
   [we went close to cinema]
   'We went eventually to the cinema.'
b. till slut 'eventually'

Splits, on the other hand, are usually realized as real words, since they are composed of two (or more) lemmas. As seen in Table 4.29, most of the splits concern noun compounds. In six cases, these were also misspelled, resulting in real words, as in (4.57). Here, the compound brandmännen 'the firemen' is split, while a vowel substitution occurs in the second part of the compound. Both parts are thus realized as lexicalized strings, which then slip through a spellchecker unnoticed:

(4.57) (S1.1.23)
a. brand menen ryckte ut och släckte elden.
   [fire the-harms turned out and put-out the-fire]
   'The firemen turned out and put out the fire.'
b. brandmännen 'the firemen'

Two instances among the noun splits were not compounds, as for instance in (4.58) below, where the definite suffix is separated from the noun stem:

(4.58) (S1.1.118)
a. ni får gärna bo hos oss under tid en ni inte har nåt att bo i.
   [you [pl] may gladly live at us during time [definite suffix] you [pl] not have something to live in]
   'You are welcome to live at our place during the time you don't have anywhere to live.'
b. tiden 'the-time'

Also, adjectives are quite often split, with the parts realized as existing words. A recurrent error (27 occurrences) is the segmentation of the modifying intensifier
jätte 'giant' as in (4.59). This is supposed to be written together (see Teleman et al., 1999, Part 2:185-188).

(4.59) (S2.1.18)
a. då blev jag jätte glad
   [then became I giant happy]
   'Then I was extremely happy.'
b. jätteglad 'extremely happy'

Splits in adverbs are recurrent as well, often concerning certain words, as seen in the number of word types. Some of them were also misspelled, as for instance in (4.60), where ändå 'anyway' is split and the first part includes a vowel substitution and is realized as the indefinite determiner en 'a':

(4.60) (S5.1.46)
a. men olof var glad en då
   [but Olof was happy a then]
   'But Olof was happy anyway.'
b. ändå 'anyway'

Eight cases concerned split verbs. One of these included a morphological split, where the past tense suffix was separated from the verb stem:

(4.61) (S4.1.7)
a. Han ring de till mig sen och sa samma sak.
   [he call [pret] to me afterwards and said same thing]
   'He called me afterwards and said the same thing.'
b. ringde 'called'

Also, some splits in pronouns, prepositions and conjunctions occurred. Among the conjunctions, three cases of the conjunction eftersom 'because' were segmented:

(4.62) (7.1.1)
a. Efter som han frös och...
   [after that he was-cold and]
   'Because he was cold and...'
b. eftersom 'because'
All these segmentation errors resulting in real words are presented in Appendix B.3. They are classified first by the type of violation that occurred and then by part-of-speech.

4.5.4 Misspelled Words

In general, multiple misspellings occurred in just a few cases; most of the words involved single violations. Substitution and double consonant omission are the most frequent spelling violations. Nouns, pronouns and verbs are the categories most frequently affected. Certain types of words seem to be more problematic than others regarding spelling. For instance, there is real confusion concerning the spelling of the pronoun de 'they'. Recall that this pronoun is pronounced [dom], as is the accusative form dem 'them'; both forms can also be spelled dom, an accepted spelling variant. In sixteen cases, four subjects used the accusative form dem 'them' as in (4.63a):

(4.63) (M3.1.49) 16 occurrences, 4 subjects
a. Dem hade ett privatplan
   [them had a private-plane]
   'They had a private plane.'
b. De 'they'

Two children substituted the vowel in the pronoun; as a consequence, it was realized as the noun dam 'lady', as in (4.64a):

(4.64) (M3.2.13) 14 occurrences, 2 subjects
a. dam bodde i en by
   [lady lived in a village]
   'They lived in a village.'
b. dom/de 'they'

Another confusion exists between the pronouns det 'it' and de 'they'. In speech, det is usually reduced to [de], thus coinciding with the plural pronoun de 'they' in writing. In 33 cases, 15 subjects used de instead of det 'it':
(4.65) (M3.1.20) 33 occurrences, 15 subjects
a. ja men nu är de läggdags sa mormor
   [yes but now is they bed-time said grandmother]
   'Yes, but now it is time to go to bed, grandmother said.'
b. det 'it'

The opposite occurred in nine cases, where six subjects wrote the singular det 'it' instead of the plural pronoun de 'they':

(4.66) (M3.1.4) 9 occurrences, 6 subjects
a. Det kom till en övergiven by
   [it came to an abandoned village]
   'They came to an abandoned village.'
b. De 'they'

Other rather recurrent spelling errors concern the pronoun vad 'what', the adverb var 'where', the infinitive verb form vara 'to be' and the past form of the same verb, var 'was/were', all of which can be pronounced [va]. First, the forms are often erroneously substituted for one another. In six cases, the form var is used instead of the correct pronoun vad 'what', as in:

(4.67) (M3.6.22) 6 occurrences, 4 subjects
a. Men var är det för ljud?
   [but where is it for sound]
   'But what is it for sound?'
b. vad 'what'

Then in eight cases the form vad is used instead of the past verb form var 'was/were':

(4.68) (M4.6.8) 8 occurrences, 3 subjects
a. Hans älsklingsfärg vad grön.
   [his favourite-colour what green]
   'His favourite colour was green.'
b. var 'was'
Two children also used vad for the adverb form var 'where' in three cases:

(4.69) (M3.6.25) 3 occurrences, 2 subjects
a. Hjälp det brinner vad nånstans.
   [help it burns what somewhere]
   'Help! Fire! Whereabouts?'
b. var 'where'

Further, these words are also realized as the corresponding (reduced) pronunciation form va, which in turn coincides with the interjection va 'what' in writing. Most of these cases concerned the past verb form var 'was/were', as in:

(4.70) (M4.5.4) 33 occurrences, 8 subjects
a. Klockan va ungefär 12 när jag vaknade
   [the-watch what approximately 12 when I woke]
   'The time was about 12 when I woke up.'
b. var 'was'

Some cases included the infinitive verb form vara 'to be':

(4.71) (M4.5.39) 8 occurrences, 5 subjects
a. dom vill inte va kompis med han/hon.
   [they want [pres] not what friend with he/she]
   'They don't want to be friends with him/her.'
b. vill inte vara 'want [pres] not be [inf]'

Here is an example of the use of the adverb var 'where' reduced to va:

(4.72) (M6.5.3) 3 occurrences, 1 subject
a. sen undra han va dom bodde
   [then wonder he what they lived]
   'Then he wondered where they lived.'
b. var 'where'
Two instances of va corresponded to the pronoun vad 'what', as in:

(4.73) (M3.5.4) 2 occurrences, 1 subject
a. Madde vaknade av mitt skrik, hon fråga va det var för nåt.
   Madde woke from my shout she ask what it was for something
   'Madde woke up from my shout. She asked what was wrong.'
b. vad 'what'

Other spellings related to spoken reduction concerned the pronoun jag 'I', normally pronounced [ja], which, when written as pronounced, coincides with ja 'yes'. Three instances of the use of jag as ja occurred:

(4.74) (M3.5.3) 3 occurrences, 2 subjects
a. Vilken fin klänning ja har
   what pretty dress yes have
   'What a pretty dress I have.'
b. jag 'I'

Also, five instances concern the conjunction och 'and', usually pronounced as [å], which in writing coincides with the noun å 'river':

(4.75) (M8.1.11)
a. Vi bor i samma hus jag och Kamilla å hennes hund.
   we live in same house I and Kamilla river her dog
   'We live in the same house me and Kamilla and her dog.'
b. och 'and'

All these misspelled words resulting in real words are listed in Appendix B.2. They are classified first by the part-of-speech of the intended word and then by the part-of-speech of the realized word. The types of spelling violations that occur are noted in the margin.
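The homophone confusions catalogued above (va/var/vad/vara, ja/jag, å/och, de/det) define the confusion sets a context-sensitive checker would have to target. The following sketch is purely illustrative and is not drawn from the FiniteCheck implementation; the sets are compiled from the examples in this section, and a real checker would of course need context to choose among the candidates:

```python
# Illustrative confusion sets for the Swedish real-word homophones
# discussed above (hypothetical helper, not part of FiniteCheck).
CONFUSION_SETS = {
    "va":  {"var", "vad", "vara"},  # reduced [va] form
    "var": {"vad", "vara"},         # 'where/was' written for 'what' / 'to be'
    "vad": {"var"},                 # 'what' written for 'was/were'
    "ja":  {"jag"},                 # 'yes' written for 'I'
    "å":   {"och"},                 # 'river' written for 'and'
    "de":  {"det"},                 # 'they' written for 'it'
    "det": {"de"},                  # 'it' written for 'they'
}

def candidate_corrections(word):
    """Return the alternative real-word forms a checker should consider."""
    return sorted(CONFUSION_SETS.get(word.lower(), set()))

print(candidate_corrections("va"))   # ['vad', 'var', 'vara']
print(candidate_corrections("hus"))  # [] -- not a known confusable
```

Such a table only enumerates candidates; deciding which member of a set is intended in a given sentence is exactly the context-dependent problem that separates grammar checking from isolated-word spell checking.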
4.5.5 Distribution of Real Word Spelling Errors

From the examples above, it is clear that the children's spelling is quite unstable. In general there is a high degree of confusion as to which form to write in which context, and many spoken forms are used. The totals of misspelled words, splits and run-ons are summarized in Table 4.30 below, where the texts are divided into sub-corpora, and in Table 4.31, where the texts are grouped by age. The errors are divided further into non-words and real words, and the relative frequency of errors compared to the total number of words is presented. As already discussed in the general overview in Section 4.2, all spelling errors (i.e. both non-word and real word) amount to 10.2% of all words. Most common are misspelled words, followed by splits, which are more recurrent than run-ons. The same distribution applies for real word spelling errors. In total (the last column in the last row in the tables), these amount to 2.3% of all words, roughly a third of the rate of non-word spelling errors (7.9%). In other words, real word spelling errors amount to 29% of all spelling errors. 14 Real word spelling errors are also dominated by misspelled words (1.5%). Splits are more common as real words (0.8%, in comparison to non-word splits at 0.4%), whereas run-ons are almost non-existent as real words (0.04%). Most of the misspelled words realized as real words occur in the Deserted Village corpus and among the 9-year olds. Real word splits are also most frequent in the Deserted Village corpus, closely followed by the Frog Story corpus. By age, the texts of 11-year olds contained most of the erroneous splits (non-word splits are most common among 9-year olds). Real word run-ons are very rare, so not much can be said about their distribution in sub-corpora or by age group.

14 Recall that the corresponding rate Kukich (1992) refers to is: 40% of all misspellings result in lexicalized strings.
Table 4.30: Distribution of Real Word Spelling Errors in Sub-Corpora

                    Deserted  Climbing  Frog   Spencer    Spencer
ERROR TYPE          Village   Fireman   Story  Narrative  Expository  TOTAL
MISSPELLED WORDS:
  non-word            743       351      484     173        239       1 990
  %                   9.8       7.8      9.9     3.2        3.3         6.7
  real word           181        71       84      36         60         432
  %                   2.4       1.6      1.7     0.7        0.8         1.4
SPLITS:
  non-word             48        28       32      14          9         131
  %                   0.6       0.6      0.7     0.3        0.1         0.4
  real word            98        41       61      23         23         246
  %                   1.3       0.9      1.2     0.4        0.3         0.8
RUN-ONS:
  non-word            108        25       37      28         29         227
  %                   1.4       0.6      0.8     0.5        0.4         0.8
  real word             5         1        2       4          1          13
  %                  0.07      0.02     0.04    0.07       0.01        0.04
TOTAL:
  non-word            899       404      553     215        277       2 348
  %                  11.9       9.0     11.3     3.9        3.8         7.9
  real word           284       113      147      63         84         691
  %                   3.7       2.5      3.0     1.1        1.1         2.3

Table 4.31: Distribution of Real Word Spelling Errors by Age

ERROR TYPE          9-year  10-year  11-year  13-year  TOTAL
MISSPELLED WORDS:
  non-word             994      292      524      180  1 990
  %                   14.5      4.3      6.5      2.2    6.7
  real word            248       64       78       42    432
  %                    3.6      0.9      1.0      0.5    1.4
SPLITS:
  non-word              71       18       35        7    131
  %                    1.0      0.3      0.4      0.1    0.4
  real word             58       51      113       24    246
  %                    0.8      0.7      1.4      0.3    0.8
RUN-ONS:
  non-word             102       32       58       35    227
  %                    1.5      0.5      0.7      0.4    0.8
  real word              2        2        5        4     13
  %                   0.03     0.03     0.06     0.05   0.04
TOTAL:
  non-word           1 167      342      617      222  2 348
  %                   17.1      5.0      7.7      2.7    7.9
  real word            308      117      196       70    691
  %                    4.5      1.7      2.4      0.9    2.3
4.5.6 Summary

Real word spelling errors are three times less frequent than non-word spelling errors in the Child Data corpus. Misspelled words are the most common type of error, reflecting a clear spelling confusion for some word types. Splits are, in general, more common as real word errors, the opposite being the case for run-ons. Most errors occurred in the Deserted Village corpus and among the 9-year olds, but the 11-year olds made most of the erroneous segmentation errors (splits).

4.6 Punctuation

4.6.1 Introduction

Beginning writers, as mentioned in Chapter 3 (Section 3.4), usually use punctuation marks to delimit larger textual units than syntactic sentences, joining for instance (main) clauses together without any conjunctions. The main purpose of the present analysis of punctuation is to investigate the erroneous use of punctuation, manifested both as omissions, which give rise to joined sentences, and as substitutions and insertions. The length of the orthographic sentences marked by the subjects, and especially the number of (main) clauses joined in them without conjunctions (adjoined clauses), will give us a picture of how often sentence boundaries are omitted and to what degree sentences correspond to syntactic sentences. Analysis of erroneous use of end-of-sentence punctuation and commas will reveal in what other places one might expect them. Orthographic sentences are taken to be sequences of words that start with a capital letter and end in a major delimiter (cf. Teleman, 1974). Also included in that category are sequences that do not completely follow the writing conventions of a capital letter at the beginning and a major delimiter at the end, but indicate the writer's intention of such marking. These include sentences ending in a major delimiter followed by a small letter, or the opposite, where the major delimiter is missing but the beginning of the next sentence is indicated by a capital.
Within the orthographic sentence, occurrences of main clauses attached to a main clause without a conjunction are counted as adjoined clauses (cf. Näslund, 1981; Ledin, 1998). These reveal whether or not the writer joins syntactic sentences into larger units, in other words omitting sentence boundaries. The analysis of punctuation is important for decisions on how to handle texts written by children computationally. Do they delimit their text in syntactic sentences? Are there any other units they delimit instead? What is then the nature of such delimitation? How frequently are sentences joined together and sentence boundaries omitted?
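The working definition of an orthographic sentence given above can be read as a simple segmentation rule. The sketch below is only an illustrative reading of that definition, not the procedure used in the study: it cuts the text after each major delimiter and keeps any delimiter-less tail as a segment of its own, which is how half-marked children's text would surface:

```python
import re

def orthographic_sentences(text):
    """Split text into orthographic sentences: sequences of words up to and
    including a major delimiter (. ! ?). Stretches without a final delimiter
    are kept as a trailing segment. Note this sketch does not also split at
    bare capital letters, which the tolerant definition above would allow."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(orthographic_sentences("Det brinner! Brandkåren kom. då börja Urban lipa"))
# ['Det brinner!', 'Brandkåren kom.', 'då börja Urban lipa']
```

Even this crude rule makes the practical problem visible: in text where boundaries are omitted, the last "sentence" simply absorbs everything that follows, so a checker cannot trust such segments to be syntactic units.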
4.6.2 General Overview of Sentence Delimitation

Content-related, rather than syntactic, marking of text is also evident in the texts in this study. In the following example (4.76), written by a nine-year-old, most of the sentence boundaries correspond to syntactic units and are delimited in accordance with the writing conventions, using capital letters at the beginning and major delimiters at the end. Two adjoined clauses can be observed in the third and the fifth sentences, joining main sentences together without conjunctions. Two vertical bars indicate where one would expect a major delimiter between the adjoined clauses (spelling or other errors are ignored in the English version). 15

(4.76) Den brinnande makan
Det var en gång en pojke som hette Urban. En dag tänkte Urban göra varma makor. Då hände en grej som inte får hända huset brann upp för att makan hade tat eld. Då kom Urban ut med brinnande kalsingar och sa: Det brinner!!!!!!!!!!!!!!!!!!!!!! Brandkåren kom och spola ner huset då börja Urban lipa och sa : Mitt hus är blöt.

The burning sandwich
There was once a boy who was called Urban. One day Urban planned to make hot sandwiches. Then a thing happened that should not happen. The house burnt down because the sandwich started to burn. Then Urban came out with burning underwear and said: Fire! The fire-brigade came and hosed down the house. Then Urban started to blubber and said: My house is wet.

In other texts, punctuation marks are used to delimit larger units, as in the following text (4.77), written by a ten-year-old:

(4.77) Den där scenen med dammen som tappade sedlarna tycker jag att den där flickan måste vara fattig så att hon tar sedlarna. Den där scenen med det tre tjejerna tyckte jag att de var taskiga som går ifrån den tredje tjejen det tycker jag att tjejen tar upp det på mötet med fröken och sedan tar fröken upp det på de andra tjejernas möte med fröken det kan hjälpa ibland.
That scene with the lady that lost the money, I think that that girl must be poor so it is her who takes the money. That scene with the three girls, I thought that they were mean when they left the third girl. I think that the girl will take that up at the meeting with the teacher and then the teacher will take it up at the other girls meeting with the teacher. That can help sometimes. In this text, only two full stops occur. The first delimitation concerns a single sentence, correctly initiated by a capital letter and terminated by a full stop. The 15 The exemplified text represents the spell-checked versions, where the non-word misspellings have been corrected (see further in Section 3.5).
sentence is quite long, however, and commas could facilitate reading. The second full stop terminates a whole paragraph that consists of at least three sentences. Some texts did not include any delimiters or other indicators of sentence boundaries at all, as in (4.78), also written by a ten-year-old. Again, vertical bars indicate the missing punctuation marks.

(4.78) så här börja det jag var på mitt land och bada då var jag liten plötsligt kom en snok i för sig så hugger inte snokar i vatten men jag blev alla fall jätte rädd för jag kunde inte simma då och snoken jagade mig längre och längre ut då ko min bror med en gummi båt och tog upp mig då blev jag jätte glad

It started like this. I was in the country and went for a swim. I was little then. Suddenly a grass snake came. Actually grass snakes do not bite in the water, but I was very scared, because I could not swim then and the grass snake chased me further and further out. Then my brother came with a rubber-boat and lifted me up. Then I was very happy.

In the following text (4.79), written by an eleven-year-old, we see examples of long sentences, where several clauses are put together either by inserting conjunctions or as adjoined clauses. Especially the first orthographic sentence is quite long, consisting of three sentences joined by the conjunction och 'and', followed by three adjoined clauses. Conjunctions are marked in boldface and omitted sentence boundaries are indicated by two vertical bars:

(4.79) Ljus
Det var en gång en pojke som hette Karl och gillade att leka med elden och en dag började det brinna i en hö-skulle ute på landet och den stackars pojken var bakom elden som hade sträckt ut sig tio meter bakom hö-skullen då kom det ett åskmoln och blixten slog ner i ladugården som tog eld kale som blev jätte rädd och sprang till närmaste hus som låg 9 kilometer bort det tog en timme att koma ditt och då ringde han fel numer av bara farten.
När han kom fram skrek han i örat på brand männen att det brann på Macintosh vägen 738c och brand menen rykte ut och släkte elden. SLUT

Light
There was once a boy who was called Karl and liked to play with fire and one day a fire started in a hayloft out in the country and the poor boy was behind the fire that had spread ten meters behind the hayloft. Then came a thundercloud and the lightning struck in the cowshed that caught fire. Kalle who became very scared and ran to the nearest house that was 9 kilometers away. It took an hour to get there and then he called the wrong number because he was in such a rush. When he got through he yelled in the ear of the fire-men that there was fire at Macintosh Road 738c and the fire-men turned out and put out the fire. END

It is a typical pattern in the whole Child Data corpus that sentences are put together to build larger units, either as adjoined clauses, where sentences follow each other without any conjunctions, or as long sentences built with conjunctions
as in the above text (4.79) or in the example below (4.80), written by a nine-year-old:

(4.80) på morgonen när vi vakna och jag skulle gå ut att hämta cyklarna märkte jag att vi inte va på toppen av berget utan i en by jag väckte pappa och skrek att han Va för tung och att vi åkt ner från berget och åkt så långt att vi inte visste va vi va.

In the morning when we woke up and I was about to go out to get the bicycles, I noticed that we were not on the top of the mountain but in a village. I woke Daddy up and yelled that he was too heavy and that we had fallen down from the mountain and fallen so far that we didn't know where we were.

4.6.3 The Orthographic Sentence

In order to investigate more closely how sentence delimitation is used and to what extent it corresponds to syntactic sentences, we analyze the length of orthographic sentences and the number of adjoined clauses. In Tables 4.32 and 4.33 we present the number of orthographic sentences and their length in number of words, along with the number of adjoined clauses and their frequency per 1,000 words.

Table 4.32: Sentence Delimitation in the Sub-Corpora

                     ORTHOGRAPHIC  ORTHOGRAPHIC     ADJOINED  ADJOINED CLAUSES/
CORPUS               SENTENCES     SENTENCE LENGTH  CLAUSES   1,000 WORDS
Deserted Village        422            18.0           298         39.3
Climbing Fireman        408            11.0            75         16.6
Frog Story              536             9.2            70         14.3
Spencer Narrative       313            17.5            98         17.9
Spencer Expository      392            18.7            73         10.0
TOTAL                 2 071            14.4           614         20.6

Table 4.33: Sentence Delimitation by Age

                     ORTHOGRAPHIC  ORTHOGRAPHIC     ADJOINED  ADJOINED CLAUSES/
AGE                  SENTENCES     SENTENCE LENGTH  CLAUSES   1,000 WORDS
9-years                 476            14.4           216         31.6
10-years                487            14.0           122         17.8
11-years                651            12.3           210         26.2
13-years                457            17.8            66          8.1
TOTAL                 2 071            14.4           614         20.6
The average length of an orthographic sentence was 14.4 words. The shortest sentences are found in the Frog Story and Climbing Fireman corpora. Among the age groups, orthographic sentence length is very similar; only the 13-year olds have a greater average length. Although this measure does not reveal anything about what units are actually delimited, there seems to be a tendency for mean sentence length to increase with age. Additional analysis is needed in order to reveal whether the increase in length of orthographic sentences with age is because children become worse at delimiting sentences or because their sentences have a more complex structure (presumably the latter). In comparison, the primary school children in the study by Ledin (1998, p.21) obtained a similar length of orthographic sentences for the younger children, 12.9 words, although the older children had on average 10.0 words, which contradicts the hypothesis. Also, the orthographic sentence length for adults in the study by Hultman and Westman (1977) averaged 14.7 words, 16 whereas secondary level students had longer sentences, with an average of 16.8 words. The frequency of adjoined clauses reflects how often (main) sentences are joined and sheds more light on the nature of text delimitation. A common hypothesis is that adjoined clauses become less frequent with age, often being considered a phenomenon related to primary school writers (Ledin, 1998). This seems to hold for our data too. The 13-year olds in the present study had four times fewer adjoined clauses per 1,000 words than the 9-year olds. The other two age groups also put quite a large number of clauses together without conjunctions. In the sub-corpora, adjoined clauses are four times more frequent in the hand-written texts of Deserted Village than in the Spencer Expository corpus. The average value is 20 adjoined clauses per 1,000 words in the whole corpus.
In comparison, in the Ledin (1998, p.25) study, the younger primary school children had 10.2 adjoined sentences per 1,000 words overall, but 28.9 in narrative writing. The older children had on average 8.2 adjoined sentences per 1,000 words. In a study by Näslund (1981) (reported in Ledin (1998)), final year primary school children had on average 9.0 adjoined sentences per 1,000 words and upper secondary students 5.1. Not surprisingly, the analysis showed that sentence length increases with age, whereas the number of adjoined clauses decreases with age. Although the analysis did not identify what other units are marked, it indicates clearly that the younger children more often join sentences together into larger units.

16 The average value is based on the orthographic sentence length of adult texts in five genres (see Hultman and Westman, 1977, p.223).
4.6.4 Punctuation Errors

Errors related to the use of major delimiters, summarized in Tables 4.34 and 4.35, concern omission of sentence boundaries (Omission) and extra delimiters (Insertion) in front of a subordinate clause or a conjunction, periods placed in lists and adjective phrases, or periods put at other syntactically incorrect places in a sentence.

Table 4.34: Major Delimiter Errors in Sub-Corpora

                                  Deserted  Climbing  Frog   Spencer    Spencer
ERROR TYPE                        Village   Fireman   Story  Narrative  Expository  TOTAL    %
Omission                            310        75      116     109         82        692   92.6
Insertion in front of a subclause     9         9       12       1         16         47    6.3
Insertion other                       4         2        1       –          1          8    1.1
TOTAL                               323        86      129     110         99        747

Table 4.35: Major Delimiter Errors by Age

ERROR TYPE                        9-years  10-years  11-years  13-years  TOTAL    %
Omission                            264      134       220        74      692   92.4
Insertion in front of a subclause    16        6        12        15       49    6.5
Insertion other                       2        1         4         1        8    1.1
TOTAL                               282      141       236        90      749

The most common error is the omission of sentence end-markers, often in the case of adjoined clauses. In (4.81) we see an example of a period inserted between a subordinate clause and its main clause:

(4.81) Medan Oliver sprang. Hade Erik vekt en uggla som nu jagade honom.
       while Oliver ran had Erik woken a owl that now chased him
       'While Oliver ran, Erik had woken up an owl that now chased him.'

Some cases of a period being placed in enumerations occurred, as in (4.82):

(4.82) Där nere i det höga gräset låg. Dalmatinen Tess. Grisen kalle knorr Hammstern Hilde ödlan Graffitti katten fillipa och...
       there down in the high grass lay the-dalmatian Tess the-pig Kalle Knorr the-hamster Hilde the-lizard Graffitti the-cat Fillipa and
       'Down there in the high grass lay the Dalmatian Tess, the pig Kalle Knorr, the hamster Hilde, the lizard Graffitti, the cat Fillipa and...'
Further, the erroneous use of the comma was analyzed, but only when syntactic violations occurred or when commas were omitted in enumerations. Commas were, in general, very rare, and when used were often misplaced. Commas occurred in front of a conjunction in an enumeration, as in (4.83):

(4.83) De hade med sig: ett spritkök, ett tält och massa mat, några kulgevär, och ammunition m.m
       they had with themselves a spirit-stove a tent and a-lot-of food some rifles and ammunition etc
       'They had with them a spirit-stove, a tent and lots of food, some rifles and ammunition, etc.'

In some instances a comma was placed in front of a finite verb:

(4.84) Linda, brukade ofta vara i stallet.
       Linda used-to often be in the-stable
       'Linda often used to be in the stable.'

Often a comma was used where one would expect a full stop:

(4.85) Nasse kunde inte sova, plötsligt hörde Nasse nån som öppnade dörren.
       Nasse could not sleep suddenly heard Nasse someone that opened the-door
       'Nasse could not sleep. Suddenly Nasse heard someone open the door.'

Error frequencies are summarized in Tables 4.36 and 4.37 below. Error types include a missing comma in enumerations or adjective phrases (Omission), an extra comma in front of a conjunction, in an enumeration or in other cases (Insertion), and commas being used instead of a major delimiter to mark a sentence boundary (Substitution).

Table 4.36: Comma Errors in Sub-Corpora

              Deserted  Climbing  Frog   Spencer    Spencer
ERROR TYPE    Village   Fireman   Story  Narrative  Expository  TOTAL    %
Omission         41        2       10       3          3          59   33.5
Insertion         5       13        1       4          7          30   17.0
Substitution      5       22        2      30         28          87   49.4
TOTAL            51       37       13      37         38         176
Table 4.37: Comma Errors by Age

ERROR TYPE    9-years  10-years  11-years  13-years  TOTAL    %
Omission         22        5        28        4        59   33.5
Insertion        12        8         5        5        30   17.0
Substitution     16       15        12       44        87   49.4
TOTAL            50       28        45       53       176

Overall, commas were mostly placed at sentence boundaries or were omitted. In the Deserted Village corpus commas were mostly omitted, whereas in the other texts they were often used to mark a sentence boundary. 9-year olds and 11-year olds tend to omit commas, whereas 13-year olds use commas mostly to mark sentence boundaries.

4.6.5 Summary

The delimitation of text varies both by age and by corpus, and indicates clearly that younger children, especially, often join clauses into larger units. Orthographically, the 13-year olds form the longest units with the smallest number of adjoined clauses. Most adjoined clauses occur among the youngest group, the 9-year olds, and in the hand-written corpus of Deserted Village. The erroneous use of major delimiters is mostly represented by omission, or by insertion in front of subordinate clauses, lists, etc. Commas are mostly missing or are used to mark sentence boundaries.

4.7 Conclusions

All the grammar errors that were expected as typical for Swedish writers, including noun phrase agreement, predicative complement agreement, verb form and the choice of prepositions in idiomatic expressions, are represented in Child Data, but not all are very frequent. Especially frequent are errors in verb form, mostly in the finite main verb (other verb form errors were much less frequent). Errors in predicative complement agreement are not very common, whereas noun phrase agreement errors are more frequent. Erroneous choice of preposition is included in the category of word choice errors, represented by ten occurrences. More characteristic for this population are, besides the omission of tense-endings on finite verbs, the omission of obligatory constituents in sentences and word choice errors.
Some impact of spoken language on writing is reflected (again) in finite verb forms and pronoun forms, and also in some cases of dialect forms within the noun phrase.
Comparison with grammar errors in other studies shows, not surprisingly, most similarities with the writing of primary school children. In comparison to adult writers, there are differences both in how frequent errors are and in error distribution. Grammar errors in Child Data are much more frequent than among adult writers, with approximately 5 to 8 errors per 1,000 words for children and 1 error per 1,000 words for adults. Errors in verb form, noun phrase agreement, missing or redundant words and choice of preposition are the most common error types for all populations, including the Child Data population. The difference lies in the error frequency distribution. A closer look at the different sub-types of the verb form category shows that the discrepancy is due to the frequent dropping of tense-endings on finite verbs in the Child Data. Such errors are not very common in the newspaper articles of the Scarrie corpus, where errors in verbs after an auxiliary verb are the most common verb error. The grammar error profile of Child Data and its comparison with adult writers thus suggests not only the inclusion of the four central grammar error types in a grammar checker for primary school writers, but the treatment of errors in finite verb form in particular. Another observation, more related to error correction, is that in many cases more than one solution is possible, a fact exemplified in the analysis. Also, at the lexical level, spoken forms are common. The spelling of many word forms indicates confusion as to which form should be used in which context. Among real words, misspelled words were most common, followed by splits, which in general were more common as real words. Run-ons as real words were very rare. The overall spelling error frequency seems to be representative for the age group. Errors in punctuation are mostly represented by omission; there are also cases where marking is put at syntactically incorrect places.
There was quite a high frequency of adjoined clauses, especially among the younger children, indicating that subjects join syntactic units into larger units and do not delimit text in (only) syntactic sentences. The analysis does not reveal what other larger units are selected instead, if any. On the other hand, this observation clearly indicates that a grammar checker cannot rely on sentence marking conventions and consider capitals or sentence delimiters as real markings of the beginning or end of a syntactic sentence. We should be aware that the marking of sentence boundaries might be omitted in texts written by children, or even misplaced. The following conclusions can then be drawn from the analysis of Child Data for further work on the development of a grammar error detector for primary school children:
- include at least detection of errors in verb form (especially the finite verb), agreement in the noun phrase, redundancy and missing constituents, and some word choice errors (such as the use of prepositions),
- be aware that there may be more than one solution for correcting an error,
- do not rely on the use of capitals or sentence delimiters as indicators of syntactic sentence boundaries; rather, be aware that sentence marking can be missing or misplaced and several (main) clauses can be joined together.
Part II Grammar Checking
Chapter 5

Error Detection and Previous Systems

5.1 Introduction

Constructing a system that will provide the user with grammar checking requires not only analysis of what error types are to be expected, but also an understanding of what possibilities there are to detect and correct an error. In the previous chapter, an analysis was presented of the grammar errors found in texts written by children, and the central errors for this group of users were identified. The purpose of this chapter is to explore the second requirement and analyze the errors in terms of how they can be detected. The questions that arise are: Which errors can be detected by means of syntactic analysis, and which require other levels of analysis? How much of the text needs to be examined in order to find a given error? Can it be traced within a sequence of two or three words, a clause, a sentence or a wider context? I will also investigate available technologies and establish: Which grammar errors are covered by the current Swedish grammar checkers? Where do they succeed and where do they fail on Child Data? The chapter starts with a description of the requirements and functionalities of a grammar checker and the performance it has to achieve (Section 5.2), followed by an analysis of the possibilities for detecting the errors in Child Data (Section 5.3). Then some grammar checking systems are described, paying special attention to Swedish tools (Section 5.4), followed by a performance test of the Swedish systems on Child Data (Section 5.5). Conclusions are presented in the last section (Section 5.6).
5.2 What Is a Grammar Checker?

5.2.1 Spelling vs. Grammar Checking

Writing aids for spelling, hyphenation, or grammar and style are part of today's authoring software. Spelling and hyphenation modules were the first proofing tools developed. They are traditionally built to handle errors in single isolated words. Grammar checkers are a fairly new technology, aiming not only at syntactic correction, as one would expect from their name, but often also including correction of graphical conventions and style, such as punctuation, word capitalization, number and date formatting, word choice and idiomatic expressions. Thus, whereas a spelling checker detects and handles errors at word-level, all detection of errors that is dependent on the surrounding context has been moved up to the level of grammar checking (cf. Arppe, 2000; Sågvall Hein, 1998a). 1 The various proofing tools exist either as separate modules developed by different companies that can be attached to an editor (e.g. Microsoft proofing tools are delivered by different suppliers), or as spelling and grammar checkers integrated into a single system (see further in Section 5.4).

5.2.2 Functionality

Proofing tools, in general, give those involved in the process of writing support in the rather tedious, time-consuming stage of revision (or rewriting), 2 and are helpful in finding the types of errors humans easily overlook (cf. Vosse, 1994). Their functionality can be defined in terms of detection, diagnosis and correction (or suggestion for correction) of errors. Identifying incorrect words and phrases is the most obvious task of a grammar checker. The position of an error in the text can be located either by marking exactly the area where the error is, or by marking the error together with its surrounding context (e.g. marking only the erroneous noun vs. marking the whole noun phrase). Detection of an error can be enough feedback for the user, if the user understands what went wrong.
Diagnosis of the error is important when the user needs an explanation, especially if the tool handles several related error types. In the long run, diagnosis is of real use to every user in order to promote understanding of the error marked (see Domeij, 1996; Knutsson, 2001). Finally, presenting one (or more) suggestions for revision of the error can enhance a user's understanding of the problem, in addition to providing an easy way to correct the error.

1 Proofing tools without syntactic correction, correcting only style and graphical conventions, also exist (cf. Domeij, 2003, p.14).
2 Recall that editing activities on a computer usually occur during the whole process of writing and not only at the end. The writer may switch several times between writing phases, see Section 2.3.2.
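The three functionalities — detection, diagnosis and suggestion — can be thought of as fields of a single error flag that a checker hands back to the editor. The record below is a hypothetical sketch of such an interface (the field names and the example values are my own illustration, not drawn from any of the systems discussed):

```python
from dataclasses import dataclass, field

@dataclass
class ErrorFlag:
    """One flagged error: where it is (detection), what it is (diagnosis),
    and how it might be fixed (suggestions). All names are illustrative."""
    start: int                              # character offset of marked span
    end: int                                # end offset (e.g. the whole NP)
    error_type: str                         # e.g. "NP agreement"
    diagnosis: str                          # explanation shown to the user
    suggestions: list = field(default_factory=list)

# A hypothetical flag for a gender-agreement error in a noun phrase:
flag = ErrorFlag(
    start=0, end=6,
    error_type="NP agreement",
    diagnosis="Determiner and noun disagree in gender.",
    suggestions=["ett hus"],
)
print(flag.error_type, flag.suggestions)
```

Marking the span via offsets leaves the choice open between flagging just the erroneous word and flagging its whole phrase, the design alternative mentioned above.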
The functionalities of such a system must be achieved with high precision. Systems should not mark correct strings as incorrect. A system that detects many errors but also marks a large amount of correct text as erroneous can be regarded more negatively by a user than a system that detects fewer errors but makes fewer false predictions (cf. Birn, 2000).

5.2.3 Performance Measures and Their Interpretation

Performance Measures

Within the field of information extraction and information retrieval, the measures of recall, precision and F-value have been developed for measuring the effectiveness of algorithms (van Rijsbergen, 1979). Recall measures the proportion of targeted items that are actually extracted by a system, also referred to as coverage. Precision measures the proportion of correctly extracted information, also referred to as accuracy. The overall performance of a system can be measured by the F-value, which balances recall and precision. When, for instance, recall and precision have approximately the same value, the F-value equals their mean. The main attributes by which the performance of a grammar checker is evaluated are likewise related to its effectiveness and functionality. The attributes for evaluation of writing tools have been discussed and developed within the TEMAA (A Testbed Study of Evaluation Methodologies: Authoring Aids) (Manzi et al., 1996) and EAGLES (Expert Advisory Group on Language Engineering Standards) (EAGLES, 1996) projects, with respect to a product's design specifications and user requirements. They consist of recall, which in this case estimates how many of the targeted errors are actually detected by the system (i.e. grammatical coverage), and precision, which measures the proportion of real errors among those detected and reveals how good a system is at avoiding false alarms (i.e. flagging accuracy).
The higher the coverage and accuracy of the system, the better. A third attribute of proofing tools concerns suggestion adequacy, which relates to the system's suggestions for correction. These validation parameters usually vary depending on the system's own strategies (Paggio and Underwood, 1998; Paggio and Music, 1998). The exact definitions of the evaluation measures used in the present study are presented in Section 5.5.
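Under these definitions, the measures can be computed directly from counts of true positives (real errors flagged), false negatives (errors missed) and false positives (false alarms). A minimal sketch, with purely illustrative counts:

```python
def recall(tp, fn):
    """Proportion of targeted errors actually detected (coverage)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Proportion of real errors among all flags (flagging accuracy)."""
    return tp / (tp + fp)

def f_value(r, p):
    """Harmonic mean of recall and precision (balanced F)."""
    return 2 * r * p / (r + p)

# Illustrative counts: 40 errors flagged correctly, 10 missed, 10 false alarms
r, p = recall(40, 10), precision(40, 10)
print(round(r, 2), round(p, 2), round(f_value(r, p), 2))  # 0.8 0.8 0.8
```

Note that when recall and precision coincide, as here, the F-value equals their common value, as stated above.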
Chapter 5. Methods and Interpretation of Evaluation

Besides the above-mentioned measures, the whole method of evaluation and the interpretation of results are important. A system's performance can be evaluated against an error corpus consisting of a collection of (sentence) samples with the errors targeted by the system (e.g. Domeij and Knutsson, 1999; Paggio and Music, 1998). More recently, tests with text corpora have also been made (e.g. Birn, 2000; Knutsson, 2001; Sågvall Hein et al., 1999) that contain both erroneous (ungrammatical) and correct (grammatical) word sequences. The capability of a system to handle correct text is better tested with the latter method, where the proportion of grammatical text is higher. At least three factors may influence the outcome of an evaluation of a system's performance: the kinds of syntactic constructions present in the evaluation sample, the number of errors in them, and who the writer was (beginner, student, professional, second language learner, etc.). Different text genres and different degrees of writing skill may display different syntactic coverage, which in turn influences the possibility of occurrence of an error type. The size of the corpus needed for evaluation can depend on the error frequency in a writing population or on the type of error evaluated. As discussed in Section 4.4, adults in the analyzed corpora made on average one grammatical error per 1,000 words. In order to cover a satisfactory quantity of syntactic constructions and errors in them, the evaluation corpus must therefore be quite large. Grammar errors in the children's corpus are on average eight times more frequent than in the adults', which means that a smaller corpus will probably be sufficient for evaluation on this population. Thus, different populations of writers can place different requirements on what is needed for evaluation.
Similarly, the frequency of different error types varies: some error types are more common than others. For instance, a larger corpus is probably needed to cover errors in word order than errors in noun phrase agreement, which are in general more frequent. The method used and the factors that may influence the outcome of an evaluation have to be taken into consideration when interpreting results, especially in a comparison between systems. The evaluated text genre, the size of the corpus, the error type and the nature of the writer should all be related.
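The relation between error frequency and required corpus size can be made concrete with a rough back-of-the-envelope calculation, using the rates reported above (about one error per 1,000 words for adults, eight times that for the children); the target error count is an arbitrary illustration:

```python
def words_needed(target_errors, errors_per_1000_words):
    """Approximate corpus size needed to observe a given number of errors."""
    return target_errors * 1000 / errors_per_1000_words

# Rates from Section 4.4: adults ~1 error / 1,000 words; children ~8x that.
adult_corpus = words_needed(100, 1)  # words needed to expect ~100 errors
child_corpus = words_needed(100, 8)
print(int(adult_corpus), int(child_corpus))  # 100000 12500
```

The eightfold error rate thus shrinks the corpus needed for the same expected error count by the same factor, which is the point made above about children's data.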
5.3 Possibilities for Error Detection

5.3.1 Introduction
Current grammar checking systems are restricted to a small set of all possible writing errors. Not all possible syntactic structures are covered, and many errors above the single word level cannot be found without semantic or even discourse interpretation (cf. Arppe, 2000). In this section I discuss which errors in Child Data can be found by means of syntactic analysis and which require higher levels of analysis, such as semantics or discourse analysis. Where syntactic analysis is sufficient, I then examine how much context is required for detection: whether the error can be identified locally, by analysis restricted to word sequences (i.e. partial parsing), or whether analysis of complete clauses and/or sentences is necessary (i.e. full parsing). The different error types will be classified in accordance with both the previous methods of classification (see Section 3.3.3) and the error taxonomy that was used to distinguish real word spelling errors from grammar errors (see Section 3.3.4). That is, errors will be divided according to whether they are structural errors that violate the syntactic structure of a clause, or non-structural errors concerning feature mismatch; whether new lemmas or other forms of the same lemma are formed; and finally, whether words are omitted, inserted, substituted or transposed. Further, the violation types will be considered in relation to the means that must be used for their detection. A previous analysis of this kind was provided within the Scarrie project, with the assumption that, in general, partial parsing can be used to handle non-structural errors whereas other methods should be applied for structural errors (in the Scarrie project, local error rules were used). They also identified error types that could not be handled by either of those two methods.
The study further reports on the problem with this division, since many errors could be handled by both methods (see Wedbjer Rambell, 1999c). The discussion in the analysis below is brief, referring to previously discussed examples in the analysis of errors in Chapter 4 or directly to the index numbers in the error corpora presented in Appendix B. The section concludes with a summary of detection possibilities for errors in Child Data. The summary will serve as a specification for the final part of the implementation described in Chapter 6.

5.3.2 The Means for Detection

Agreement in Noun Phrases
Detection of agreement errors in noun phrases requires a context of precisely the noun phrase, and the errors can thus in general be detected by noun phrase parsing.
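As a rough illustration of what such noun phrase parsing has to check, the sketch below matches determiner, adjective and noun on gender, number and definiteness. The token representation and the tiny lexicon are invented for the example and stand in for the system's real lexical analysis:

```python
# Feature bundles: (gender, number, definiteness); invented toy lexicon.
LEXICON = {
    "en":    ("det",  ("com", "sg", "indef")),
    "ett":   ("det",  ("neu", "sg", "indef")),
    "stor":  ("adj",  ("com", "sg", "indef")),
    "stort": ("adj",  ("neu", "sg", "indef")),
    "hus":   ("noun", ("neu", "sg", "indef")),
    "bil":   ("noun", ("com", "sg", "indef")),
}

def compatible(f1, f2):
    """Two feature bundles agree if no specified feature clashes."""
    return all(a is None or b is None or a == b for a, b in zip(f1, f2))

def np_agrees(words):
    """Check pairwise agreement of all words in a candidate noun phrase."""
    feats = [LEXICON[w][1] for w in words]
    return all(compatible(feats[i], f) for i, f in enumerate(feats)
               for f in feats[i + 1:])

print(np_agrees(["ett", "stort", "hus"]))  # True  (all neuter sg indef)
print(np_agrees(["en", "stort", "bil"]))   # False (gender clash on 'stort')
```

A real checker would of course obtain the feature bundles from the morphological lexicon, keep all readings of ambiguous words, and, as discussed below, add features such as semantic masculine gender.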
All noun phrase errors are non-structural, and in Child Data they are concentrated in one constituent, realized as another form of the intended lemma. Syntactically, most of the noun phrases follow one of the three noun phrase types (see Section 4.3.1), and three cases are in the partitive form. The feature sets have to include, besides definiteness, number and grammatical gender, also definitions of the semantic masculine gender in adjectives. In that case, not only agreement with the noun has to be fulfilled, but also a requirement of consistent use. That is, in one case (G1.2.3; see (4.9) on p. 49) a (masculine) noun is modified by two adjectives, one of which has the masculine weak form while the other has the common gender weak form. Both adjectives should follow one of the patterns, i.e. either semantic or grammatical gender. Further, the feature mismatch in partitive noun phrases concerns not only the agreement between the quantifier and the noun, but also the number of the head noun (e.g. G1.3.2; see (4.11) on p. 50). Another important point is the correct interpretation of spelling variants. For instance, the errors in G1.2.2 and G1.2.4 (see (4.8) on p. 48) include the determiner de 'the [pl]' spelled as the allowed variant dom, which in turn is homonymous with the noun dom 'judgment/verdict'. It is important that the lexicon of the system contains this information.

Agreement in Predicative Complement
In order to detect errors in agreement between the subject or object of a sentence and its complement, a context larger than a noun phrase is required. The errors are non-structural, realized as other forms of the same lemma, and can still be handled by partial parsing that identifies the parts that have to agree, i.e. the noun phrase, the verb types used in such constructions and the modifying adjective phrase. In Child Data, these errors concern agreement mismatch between the subject and an adjective or participle as the predicative complement.
Syntactically, many of the subject noun phrases include embedded clauses (often with other predicates) that increase the complexity and the distance between the subject and the predicative complement, and probably require more elaborate analysis. Further, in G2.2.3 (see (4.13) on p. 51) several predicative complements are coordinated; detection of all of them requires analysis of coordination. Finally, we have the case of G2.2.6 (see (5.1) below), where the head noun syskon is ambiguous between the singular reading 'sibling [sg]' and the plural 'siblings [pl]', which complicates analysis.
(5.1) (G2.2.6)
nasse är en gris som har massor av syskon. nasse är skär. Men nasses syskon är smutsig
Nasse is a pig that has lots of siblings [pl]. Nasse is pink. But Nasse's sibling(s) [neu, sg/pl] is/are dirty [com, sg]
'Nasse is a pig that has a lot of brothers and sisters. Nasse is pink. But Nasse's brothers and sisters are dirty.'

Identifying the subject, the copula verb and the adjective, syskon är smutsig, is enough to signal that an error in predicative complement agreement has occurred. However, the diagnosis can fail if the noun is interpreted only as singular. The tool would then signal that a mismatch in gender occurred, suggesting a change of the adjective to the form smutsigt 'dirty [neu, sg]'. But if the author refers to massor av syskon 'lots of siblings', then the noun should be interpreted as plural, and the checker should instead indicate a number mismatch and suggest the plural form smutsiga 'dirty [pl]'. In any case, the soundest solution is to offer both corrections, due to the ambiguous nature of the noun, and let the user decide.

Definiteness in Single Nouns
Definiteness errors in single nouns in Child Data are represented by bare singular nouns (e.g. ö 'island' in (5.2a)) that lack the definite suffix (i.e. ön 'island [def]') and thus form another form of the intended lemma (see also (4.17) on p. 53). Considered as non-structural errors, they could then be detected by means of partial parsing. Marking bare singular nouns as ungrammatical can also be helpful for finding instances where, instead of a missing suffix, the indefinite article is missing, that is, if the noun phrase in the first sentence in (5.2a) were represented only as in (5.2b) (such errors were not found in Child Data). However, there are also cases where bare singular nouns are grammatical. 3

(5.2) (G3.1.3)
a. Jag såg en ö. Vi gick till ö.
   I saw an island. We went to island [indef]
   'I saw an island. We went to island.'
b. Jag såg ö.
   I saw island [indef]
   'I saw island.'

In order to decide whether a bare singular noun is ungrammatical due to omission of the article or the noun suffix, or whether it is grammatical, a context wider than a sentence

3 Bare singular nouns can be grammatical in one context (e.g. ha bil 'have car') and ungrammatical in another (e.g. se bil 'see car'); see further Section 4.3.3.
is needed, in addition to some kind of lexical or semantic analysis, in order to see whether the noun was or was not introduced/specified earlier, or whether the construction is grammatical (i.e. lexicalized).

Pronoun Case
Pronoun case errors in Child Data concern the accusative case of pronouns and are realized as other forms of the same lemma, that is, the nominative case form is used instead of the accusative. These errors concern feature mismatch and are classified as non-structural errors. However, exactly as in the case of agreement errors in the predicative complement, a more complex syntactic analysis is required to identify the requirements on certain positions in a clause. One hint for identifying these can be a preposition preceding the pronoun, which would then require only partial parsing. Three such errors in Child Data consist of a nominative pronoun preceded by a preposition (e.g. G4.1.5; see (4.18) on p. 53).

Verb Errors
Errors in verb form can be located directly at the verbal core, consisting of one single finite verb, a sequence of two or more verbs, or a verb preceded by an infinitive marker. They can be both structural (an auxiliary verb is missing) and non-structural (another form of the verb was used). All verb errors should be detectable by means of partial parsing. Optional constituents such as adverbs, noun phrases, and coordination of verbs should be taken into consideration. The errors in finite verb form found in Child Data in many cases coincide with the imperative form of these verbs (see e.g. G5.2.45 in (4.26) on p. 58). The imperative as a finite verb form must be distinguished from the infinitive verb form in order to be able to detect such errors in finite verbs. Errors in verbal chains are represented in Child Data by two finite verbs in a row (e.g. ska blir 'will [pres] become [pres]'; (4.32) on p. 61), in one case with an embedded infinitive as secondary future perfect (i.e.
skulle ha kom 'would [pret] have [inf] came [pret]'; (4.31) on p. 60). They also occur as a bare supine in a main clause, lacking the auxiliary verb (e.g. G6.2.2; see (4.33) on p. 61). All such errors can be detected by parsing just the verbal cluster. In the case of missing auxiliary verbs, the crucial point is to be sure that the omission occurs in a main clause, which requires identification of the type of clause. Errors in infinitive phrases concern infinitive markers followed by a verb in finite form (e.g. att stäng 'to close [imp]'; (4.34) on p. 62), or a missing infinitive marker with the auxiliary verb komma 'will' (e.g. G7.2.3; see (4.36) on p. 62). Both these error types can be located by partial parsing. In the case of an omitted
infinitive marker in the context of the auxiliary verb komma 'will', it is important not to confuse it with the main verb komma 'come'.

Word Order
All word order errors are structural errors, involving transposition of sentence constituents. In general, detection of word order errors requires identification of the main verb and analysis of either the preceding or the following constituents, which in turn requires identification of the beginning and end of a sentence. In theory, some errors in the placement of adverbials can be traced by partial parsing, for instance in certain subordinate clauses. In Child Data, punctuation and capitalization conventions are often not followed and sentences may be joined together (see Section 4.6). This means that word order analysis cannot rely on such conventions until we find some other way to locate sentence boundaries. In addition, the word order errors found in Child Data are rather complex, involving for instance more than one initial constituent before the finite verb in a main clause (see Section 4.3.6). The possibility of successfully locating word order errors in Child Data by a technique as simple as partial parsing is therefore minimal.

Redundancy
Redundancy errors also represent structural errors, manifested as insertions of superfluous constituents into sentences. Immediate repetition of words (e.g. G9.1.3; see (4.38) on p. 64) should be possible to detect by means of partial parsing. Occurrences of repeated constituents at different places in a given sentence (e.g. G9.1.7; see (4.39) on p. 65) would require analysis of the complement structure, often of the whole sentence. The same applies to new constituents being inserted (e.g. G9.2.2; see (4.41) on p. 66).

Missing Constituents
Sentences lacking a constituent also represent structural errors. Some of them may be detected by partial parsing, but most require more complex analysis.
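The immediate-repetition case among the redundancy errors can be sketched as a trivial scan over adjacent tokens (the example sentence is invented; a real system would also have to exempt legitimate doublings, e.g. emphatic repetition):

```python
def repeated_words(tokens):
    """Return the positions where a word is immediately repeated."""
    return [i for i in range(len(tokens) - 1)
            if tokens[i].lower() == tokens[i + 1].lower()]

# Invented example with a doubled word ('gick gick')
tokens = "sen gick gick vi hem".split()
print(repeated_words(tokens))  # [1]
```

Repetitions of whole constituents at a distance, by contrast, fall outside such a local scan, which is why the text assigns them to complement-structure or full-sentence analysis.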
Among the errors in Child Data, discovering a missing subject or object would require analysis of the complement structure of the main verb, which means that such information must be stored somewhere (e.g. in the lexicon of the system). Finding an omission of a finite verb requires a search for a finite verb in the sentence, assuming that it is not an exclamation, a title, or another construction without finite verbs. Finding omissions of particles or prepositions requires knowledge of the verb's
sub-categorization frame, or of the structure of fixed expressions. Other types require not only syntactic analysis but also semantics and/or world knowledge, as in (5.3), where the negation on the main verb is missing.

(5.3) (G10.5.1)
a. tuni hade jätte ont i knät men hon ville sluta för det.
   Tuni had great pain in knee but she wanted stop for that
   'Tuni had much pain in her knee, but she did not want to stop because of that.'
b. men hon ville inte sluta för det.
   but she wanted not stop for that

Word Choice
Word choice errors, as substitutions of constituents, also represent structural errors. These errors are realized as completely new words with a meaning distinct from the intended one, i.e. new lemmas. Some of them can probably be solved by storing, for instance, information on the use of particles and prepositions with certain verbs (e.g. G11.1.2; see (4.48) on p. 68), or on word usage in fixed expressions (e.g. G11.1.7; see (4.47) on p. 68), in the dictionary. Others will probably require analysis of semantics or even world knowledge before they can be detected, like the one in (5.4).

(5.4) (G11.6.3)
a. Jag tittade på Virginia som torkade av sin näsa som var blodig på tröjarmen.
   I looked at Virginia that wiped off her nose that was bloody on jumper-arm
   'I looked at Virginia who wiped her bloody nose on the sleeve of her jumper.'
b. tröjärmen
   jumper-sleeve

Reference
Referential issues concern structural violations, as substitutions of constituents realized as new lemmas. All the errors in Child Data concerned anaphoric reference. Reference errors are in general discourse oriented. Anaphoric reference requires identification of the antecedent that agrees with the subsequent pronoun. The antecedent may be in the preceding sentence, but it could also be farther away. Partial parsing techniques can probably be used for identifying antecedents; the crucial problem is how far back in the discourse to search for them.
Real Word Spelling Errors
Spelling errors resulting in existing words always form new lemmas, that is, they are realized as completely new words. They mostly violate structural requirements, as substitutions of constituents, but can also accidentally cause non-structural violations, for instance agreement errors in noun phrases. The majority of such misspellings slip through any syntactic analysis, resulting in syntactically correct strings. For instance, an error resulting in a word of the same part of speech as the intended word, as in (5.5a), will be very hard to track down without any semantic information. In this example, the written word coincides not only with the part of speech of the intended word but also with the intended inflection. The intended word is presented in (5.5b):

(5.5) (M1.1.33)
a. den här gamla manen har tagit hand om oss.
   the [def] here old mane [def] has taken hand about us
   'This old man took care of us.'
b. mannen
   man [def]

Moreover, words resulting in other parts of speech are hard to trace syntactically. In (5.6a) a pronoun becomes a verb in the supine form, which will not be detected without an additional level of analysis, because a supine verb form following the preceding auxiliary verbs is syntactically correct:

(5.6) (M3.3.10)
a. den killen eller tjejen måste ha nått problem
   the boy [def] or girl [def] must have reached [sup] problem
   'the boy or girl must have some problem'
b. nåt
   some

Only a few real word spelling errors in Child Data cause syntactic violations and can to some extent be detected by means of syntactic analysis. Here is an example of a pronoun realized as a noun, subsequently forming a noun phrase with an agreement error in gender and definiteness:
(5.7) (M2.2.3)
a. det här brevet är det ända jag kan ge dig idag.
   the [def] here letter is the [neu, def] end [com, indef] I can give you today
   'This letter is the only one I can give you today.'
b. det enda
   the only

Here is an example of a pronoun becoming a verb, whereby three verbs in a row appear in the sentence: first the two correctly spelled verbs, forming a grammatical verb cluster, and then the misspelled pronoun, forming a passive past verb form (5.8a). In this case, the feature structure of the verb cluster is violated and the error can be detected by partial parsing.

(5.8) (M3.3.8)
a. jag fick låna hanns mobiltelefon.
   I could borrow was-managed cell-phone
   'I could borrow his cell-phone.'
b. hans
   his

In (5.9a), the predicate of the sentence forms a noun, and the error could be detected as a sentence lacking a finite verb:

(5.9) (M4.2.32)
a. då ko min bror
   then cow my brother
   'then came my brother'
b. kom
   came

Splits mostly violate complement conditions. For instance, the split in (5.10a) will be analyzed as two successive noun phrases:

(5.10) (S1.1.16)
a. En brand man klättrade upp till oss.
   a fire man climbed up to us
   'A fire-man climbed up to us.'
b. brandman
   fire-man

Splits can also violate agreement, as in (5.11a), where the first part of the split has a gender different from the second part, which results in the article (en 'a [com]')
and the first part of the split (djur 'animal [neu]') not agreeing. The correct form is shown in (5.11b):

(5.11) (S1.1.28)
a. Desere jobbade i en djur affär
   Desere worked in a [com] animal [neu] store [com]
   'Desere worked in an animal-store.'
b. en djuraffär
   a [com] petshop [com]

Punctuation at Sentence Boundaries
Erroneous use of punctuation to mark sentence boundaries probably requires full parsing to detect, or at least analysis of the complement structure following the main verb. For instance, in order to detect the missing boundary in (5.12a) (indicated by a dash), the system has to know that the verb gilla 'like' is transitive and thus combines with only one object, so that it cannot also take the pronoun dom 'they' as a complement. That is, just locating the arguments following the verb, marked in boldface in the example, with the diagnosis 'too many complements', signals that something is wrong with the sentence. The correct form is presented in (5.12b).

(5.12)
a. Vissa i filmen gillade inte varann - dom bråkade och lämnade vissa utanför.
   some in the-movie liked not each-other - they quarrelled and left some outside
   'Some (people) in the movie did not like each other. They quarrelled and left some (people) out.'
b. Vissa i filmen gillade inte varann. Dom bråkade och lämnade vissa utanför.
   some in the-movie liked not each-other. they quarrelled and left some outside

5.3.3 Summary and Conclusion
In accordance with the above discussion, it is clear that only some errors in Child Data can be detected by partial syntactic analysis alone; most of the errors require a higher level of analysis, full parsing or even discourse analysis. The error types, their classification according to the violations they cause, and comments on the possibility of detection are summarized in Table 5.1 below.
Errors requiring only partial parsing for detection (in bold face in the table) concern (mostly) non-structural errors, including noun phrase agreement, verb form errors and some structural errors such as omissions within a verb core. Further, some pronoun case errors, constrained for instance by preceding constituents (e.g. a preposition), could be traced by partial parsing. In addition, some word order errors would in general be possible to detect by means of partial parsing, but since those found in Child Data display rather high complexity, the possibility of detection is minimal without more elaborate analysis. Finally, repeated words (among the redundancy errors) could be detected by partial parsing.

Table 5.1: Summary of Detection Possibilities in Child Data

ERROR TYPE                    | ERROR CLASS    | VIOLATION                | COMMENT
GRAMMAR ERRORS:
Agreement in NP               | non-structural | substitution: other form | partial parsing
Agreement in PRED             | non-structural | substitution: other form | complex partial parsing
Definiteness in single nouns  | non-structural | substitution: other form | partial parsing and discourse
                              | structural     | omission                 | partial parsing and discourse
Pronoun case                  | non-structural | substitution: other form | some by partial parsing OR complex partial parsing
Finite Verb Form              | non-structural | substitution: other form | partial parsing
Verb Form after Vaux          | non-structural | substitution: other form | partial parsing
Vaux Missing                  | structural     | omission                 | partial parsing
Verb Form after inf. marker   | non-structural | substitution: other form | partial parsing
Inf. marker Missing           | structural     | omission                 | partial parsing
Word order                    | structural     | transposition            | some by partial parsing
Redundancy                    | structural     | insertion                | some by partial parsing OR full parsing
Missing Constituents          | structural     | omission                 | at least complement structure
Word Choice                   | structural     | substitution: new lemma  | full parsing + semantics and world knowledge
Reference                     | structural     | substitution: new lemma  | discourse analysis
OTHER:
Real Word Spelling Errors     | structural     | substitution: new lemma  | full parsing + semantics and world knowledge
                              | non-structural | substitution: new lemma  | partial parsing
Missing Sentence Boundary     | structural     | omission                 | at least complement structure
Two of the non-structural error types (in italics in the table) require a more complex partial parsing and the specification of a larger context in order to be detectable: agreement errors in the predicative complement and pronoun case errors. Definiteness errors in single nouns could also in general be detected by partial parsing, but (probably) require discourse analysis in order to be diagnosed correctly. The remaining grammar errors are all structural and require at least analysis of complement structure, full parsing of sentences, or even discourse analysis. In many cases semantic and/or world knowledge interpretation is also required. Among the real word spelling errors, very few can be traced by syntactic means alone; most of them need semantics or even world knowledge in order to be identified. Missing sentence boundaries often cause syntactic violations in verb subcategorization. In conclusion, this summary suggests that not only non-structural errors can be detected by means of partial parsing, but also some structural violations. The division really depends on whether or not the error is located within a certain delimited portion of text. For instance, some of the omission violations located in certain types of phrases can be detected by means of partial parsing (e.g. a missing auxiliary verb). The clearest candidates for detection by partial parsing are the agreement errors in noun phrases and the errors located in verbs (i.e. concerning verb form and omission of a verb or infinitive marker). These are also among the most frequent (central) errors in Child Data and invite implementation, as I will show in Chapter 6. Among the other most frequent error types in Child Data, redundant constituents in clauses can probably be detected only when words are repeated directly.
Other types of extra constituents inserted into clauses, omissions of words, and word choice errors are structural errors that require more complex analysis and cannot be detected by partial parsing alone.
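For the verb errors singled out above as clear candidates, partial parsing can be as simple as scanning a tagged verbal cluster for two finite forms in a row (the ska blir type). The toy tag set and lexicon below are invented for the illustration and stand in for the system's real lexical analysis:

```python
# Toy morphological lexicon: word -> set of possible tags (invented sample).
TAGS = {
    "ska":    {"aux.pres"},
    "blir":   {"verb.pres"},
    "bli":    {"verb.inf"},
    "har":    {"aux.pres"},
    "kommit": {"verb.sup"},
}

FINITE = {"aux.pres", "verb.pres"}  # finite tags in this toy tag set

def finite_finite(cluster):
    """Flag a verbal cluster whose first two verbs are both finite,
    e.g. ska blir 'will become' instead of ska bli."""
    t1, t2 = TAGS[cluster[0]], TAGS[cluster[1]]
    return bool(t1 & FINITE) and bool(t2 & FINITE)

print(finite_finite(["ska", "blir"]))    # True  -> error flagged
print(finite_finite(["ska", "bli"]))     # False -> grammatical
print(finite_finite(["har", "kommit"]))  # False -> grammatical
```

Since the check inspects only the verbal cluster itself, it is an instance of exactly the kind of local, partial analysis that the summary recommends for these error types.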
5.4 Grammar Checking Systems

5.4.1 Introduction
After the analysis in the previous section of the possibilities for detecting the errors in Child Data, the question arises as to which error types are already covered by current technologies, and with what success. As pointed out in Section 5.2, research on and development of grammar checking techniques is rather recent; it started around the 1980s with products mainly for English 4 but also for other languages, e.g. French (Chanod, 1993), 5 Dutch (Vosse, 1994), Czech (Kirschner, 1994), and Spanish and Greek (Bustamente and León, 1996). In the case of Swedish, the development of grammar checkers did not start until the latter half of the 1990s, with several independent projects. Grammatifix, developed by the Finnish company Lingsoft AB, was introduced on the Swedish market in November 1998, and since 2000 it has been part of the Swedish Microsoft Office package (Arppe, 2000; Birn, 2000). Granska is a grammar checking prototype developed by the research group of the Department of Numerical Analysis and Computer Science (NADA) at the Royal Institute of Technology (KTH) in Stockholm (Carlberger and Kann, 1999). Another Swedish prototype was developed at the Department of Linguistics at Uppsala University, between 1996 and 1999, within the EU project Scarrie (Sågvall Hein, 1998a; Sågvall Hein et al., 1999). 6 This section continues with a short review of the methods and techniques used in some non-Swedish systems (Section 5.4.2). Then follows an overview of the Swedish approaches to grammar checking (Section 5.4.3) and a discussion of the techniques used in these systems, the error types covered, and their reported performance (Section 5.4.4).

5.4.2 Methods and Techniques in Some Previous Systems
Many of the grammar checking systems on the market are commercial products, and technical documentation is often minimal or even absent.
One exception is the grammar checking system Critique (known until 1984 as Epistle) (Ravin, 1993;

4 For instance, Perfect Grammar, integrated in Word for Windows 2.0 in late 1991, and Grammatik 5, part of WordPerfect for Windows 5.2 and Word for Mac 5.0 in 1992, were among the first on the market (see further Vernon, 2000).
5 Vanneste (1994) compared the utilities of other French products: Grammatik (French), Hugo Plus and GramR.
6 Skribent (http://www.skribent.info/) and Plita (Domeij, 1996, 2003) are other proofing tools on the Swedish market; they include detection of violations of graphical conventions and style, but no syntactic error detection.
Richardson, 1993), developed within the Programming Language for Natural Language Processing (PLNLP) project (Jensen et al., 1993b). 7 This project aimed at the development of a large-scale natural language processing system covering not only syntax but also the various levels of semantics, discourse and pragmatics. 8 During the project, the PLNLP formalism was used in several domains of natural language applications. Besides the text-critiquing system, applications targeting, for instance, machine translation, sense disambiguation via on-line dictionaries, and analysis of the conceptual structure of paragraphs as units of thought were developed. English was the main language, but languages such as Japanese, French, German, Italian and Portuguese were also involved (Jensen et al., 1993b). Critique is based on the English parser (PEG) of this system (Jensen, 1993), utilizing the PLNLP formalism of Augmented Phrase Structure Grammar (ACFG) 9 (implemented in Lisp) and producing a complete analysis for all sentences (even ungrammatical ones) on the basis of the most likely parse (Heidorn, 1993). Thus, in order to be able to detect errors, the syntactic analysis in PEG was developed so that not only grammatical sentences but all sentences obtained an analysis. This was achieved by relaxing rules when parsing failed on the first try, or by a parse-fitting procedure identifying the head and its constituents (e.g. in fragments) (see further Jensen, 1993; Jensen et al., 1993a; Ravin, 1993). The system targets about 25 grammar error types and 85 stylistic weaknesses. The grammar errors are divided into five error categories: number agreement, pronoun case, verb form, punctuation, and confusion/contamination of expressions (Ravin, 1993, pp. 68-70). Critique was planned to be developed for other languages besides English, and a French version now also exists (Chanod, 1993).
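The rule-relaxation idea can be illustrated in miniature: parse with all constraints first, and on failure retry with one constraint relaxed at a time; the relaxed constraint then names the error. Everything below (the constraint set and the toy "parse" over a subject-verb pair) is an invented stand-in for PEG's actual machinery:

```python
# Toy constraints over a (subject, verb) pair; each returns True if satisfied.
CONSTRAINTS = {
    "number agreement": lambda subj, verb: subj["num"] == verb["num"],
    "finiteness":       lambda subj, verb: verb["finite"],
}

def parse(subj, verb):
    """Return (ok, diagnosis): strict parse first, then relax one rule at a time."""
    if all(c(subj, verb) for c in CONSTRAINTS.values()):
        return True, None
    for name in CONSTRAINTS:
        others = [c for n, c in CONSTRAINTS.items() if n != name]
        if all(c(subj, verb) for c in others):
            return True, name  # parse succeeds with this one rule relaxed
    return False, "no single relaxation fits"

he = {"num": "sg"}
run_pl = {"num": "pl", "finite": True}  # schematically, 'he run'
print(parse(he, run_pl))  # (True, 'number agreement')
```

The relaxed-rule name doubles as the error diagnosis, which is the core of why relaxation-based systems can both accept and critique ill-formed input.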
The insight gained in the PLNLP project from providing an analysis of all sentences seems to have influenced other grammar formalisms, such as Constraint Grammar (Karlsson et al., 1995) and Functional Dependency Grammar (Järvinen and Tapanainen, 1998). The methods of rule relaxation and parse fitting also had an impact on the development of other (Swedish) grammar checking systems. Another quite well documented and frequently cited project is the Dutch system CORRie (Vosse, 1994). It applies the same idea of analyzing ill-formed as well as well-formed sentences, using an augmented context-free grammar for

7 The development of Critique was done in collaboration with IBM and was later taken over by Microsoft. The tool is now used as a module for English grammar checking in Microsoft Word (cf. Jensen et al., 1993b; Domeij, 2003).
8 Mostly syntax and semantics are covered by the system, but approaches involving analysis of discourse and pragmatics have also been targeted.
9 The ACFG is considered more effective than a plain CFG, since features and restrictions on them can be associated directly with the corresponding categories/symbols, resulting in a considerably decreased number of rules.
that purpose. The system aimed primarily at correcting spelling errors resulting in other existing words, but included analysis of misspellings, compounds, spelling of idiomatic expressions and hyphenation. CORRie's parser and its formalism inspired the development of the proofing tools developed in the Scarrie project (see below).

5.4.3 Current Swedish Systems

There are at present three known proofing tools for Swedish aimed at syntactic error detection: Grammatifix, the grammar and style module that has been part of Swedish Microsoft Word since 2000; the Granska prototype under development at NADA, KTH; and the ScarCheck prototype developed at the Department of Linguistics at Uppsala University in the Scarrie project. For each system I describe below the architecture, the different error types covered, the technique used for grammar checking (to the extent that information is available) and the system's reported performance.

Grammatifix

Lingsoft's[10] commercial product Grammatifix was introduced on the Swedish market in November 1998 and has since 2000 been part of Microsoft Word. Parts of this proof-reading tool are based on research and technology from the 1980s, when work on a morphological surface-parser had started. The work on error detection rules began in 1997 (Arppe, 2000). The lexical analysis in Grammatifix is based on the morphological analyzer SWETWOL, designed according to the principles of two-level morphology (Karlsson, 1992) and utilizing a lexicon of about 75,000 word types. At this non-disambiguated lexical-lookup stage, each word may obtain more than one reading. The part-of-speech assignment is to a large extent disambiguated at the next level of analysis, by application of the Swedish Constraint Grammar (SWECG) (Birn, 1998),[11] a surface-syntactic parser applying context-sensitive disambiguation rules (Arppe et al., 1998).
As Birn (2000) points out, full disambiguation is not a goal, since the targeted text contains grammar errors. Errors are detected by partial parsing: the tags @ERR and @OK are first assigned to all strings, and then error detection rules, defined in the same manner as the constraint grammar rules used for syntactic disambiguation, are applied, with negative conditions often related to just portions of a sentence. These error rules select the tag @ERR when an error occurs. The error

[10] Lingsoft's homepage is http://www.lingsoft.fi/
[11] Birn (1998) gives a short presentation of the formalism. The CG formalism was originally developed by Karlsson (1990). Karlsson et al. (1995) give a description of the basic principles and the CG formalism.
detection component consists of 659 error rules and a final rule that applies the tag @OK to the remaining words (Birn, 2000). Relaxation is included in the error detection rules and not in the phrase construction rules, so the system regards certain word sequences as phrases despite grammar errors in them (Arppe et al., 1998). Grammatical errors are viewed by this system as violations of formal constraints between morphosyntactic categories (Arppe et al., 1998). Two types of constraints are distinguished: intra-phrasal, e.g. phrase-internal agreement, and inter-phrasal, e.g. constituent order in a clause. Grammatifix not only detects errors, but also provides a diagnosis with an explanation of the error and a suggestion for correction when possible. The tool addresses 43 error types, of which 26 concern grammar, 14 punctuation and formatting, and 3 stylistic issues. The grammar error types include agreement errors in noun phrases and subject complements, errors in pronoun form after preposition, errors in verbs, in word order and others (Arppe et al., 1998; Arppe, 2000). The grammar error types are listed and compared to the types in the other Swedish systems in Section 5.4.4. The linguistic performance of the system was tested separately for precision and recall, based on corpora of different size from the newspaper Göteborgs-Posten (Birn, 2000, pp. 37-39). For precision, the newspaper corpus consisted of 1,000,504 words and resulted in a precision rate of 70% (374 correct alarms and 160 false alarms). The analysis of recall was based on a text extract of 87,713 words and resulted in an overall recall rate of 35%, including also error types not covered by the tool (135 errors in the text and 47 errors detected). Counting only the error types targeted by Grammatifix, the recall is 85% (55 errors in the text and 47 errors detected).[12]
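For concreteness, these measures follow the standard definitions of precision and recall; a minimal sketch (the function names are mine, the figures are those reported by Birn, 2000):

```python
# Precision and recall as used in the reported evaluations
# (the function names are mine; the figures are from Birn, 2000).
def precision(correct_alarms, false_alarms):
    """Share of all alarms that flag a genuine error."""
    return correct_alarms / (correct_alarms + false_alarms)

def recall(correct_alarms, errors_in_text):
    """Share of the errors in the text that are flagged."""
    return correct_alarms / errors_in_text

print(round(precision(374, 160) * 100))  # 70: 374 correct vs. 160 false alarms
print(round(recall(47, 135) * 100))      # 35: all error types in the text
print(round(recall(47, 55) * 100))       # 85: only error types targeted by the tool
```

Note that the same count of correct alarms (47) yields two different recall figures depending on whether all errors in the text or only the targeted error types are taken as the denominator.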
The Granska Project

The proof-reading tool Granska is being developed at the Department of Numerical Analysis and Computer Science, KTH (the Royal Institute of Technology) in Stockholm. The first prototype was developed in 1995, running under Unix. Then followed a more elaborate version with a graphical interface for the Windows operating system. This version included detection of agreement errors in noun phrases. The current version of Granska is a completely new program written from scratch, starting in 1998, in the project Integrated language tools for writing and document handling.[13] Granska is an integrated system that provides spelling and grammar

[12] The error profile of the corpus used for analysis of Grammatifix's grammatical coverage (recall) is reported in Chapter 4, Section 4.4.
[13] See more about the project at: http://www.nada.kth.se/iplab/langtools/
checking that run at the same time and can be tested in a simple web-interface.[14] The system recognizes and diagnoses errors and suggests corrections when possible. Granska combines probabilistic and rule-based methods, where specific error rules and locally applied rules detect ungrammaticalities in free text. The underlying lexicon includes 160,000 word forms, generated from the tagged Stockholm-Umeå Corpus (SUC) (Ejerhed et al., 1992) of 1 million words and completed with word forms from SAOL (Svenska Akademiens Ordlista, 1986). The lexical analyzer applies Hidden Markov Models based on the statistics of word and tag occurrences in SUC. Each word obtains one tag with part-of-speech and feature information. Unknown words are analyzed with probabilistic word-ending analysis (Carlberger and Kann, 1999). A rule matching system analyses the tagged text, searching for grammatical violations defined in the detection rules, and produces an error description and a correction suggestion for the error. When needed, additional help rules are applied more locally, used as context conditions in the error rules. Other, accepting rules handle correct grammatical constructions in order to prevent error rules from applying to them, i.e. to avoid false alarms (Knutsson, 2001). Granska's rule language is partly object-oriented, with a syntax resembling C++ or Java, and is meant to be applied not only to grammar checking, but also to partial parsing such as identification of phrase and sentence boundaries. Further, with Granska it is possible to search and edit directly in the text, e.g. changing the tense of verbs or moving constituents within a sentence. The tagging result may also be improved when the guess is wrong, so that new tagging of a certain text area may be applied (see further Knutsson, 2001). The rule collection of the system consists of approximately 600 rules (Domeij et al., 1998) divided into three main categories: orthographic, stylistic and grammatical rules.
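To make the rule-matching idea concrete, the sketch below applies an invented error rule, with a local adverb-skipping condition loosely analogous to a help rule, to input that has already been disambiguated to one tag per word. The tag names and the rule logic are illustrative only and do not reproduce Granska's actual rule language:

```python
# Illustrative sketch only: an error rule over disambiguated tags
# (one tag per word, as in Granska); tag names and rule logic are
# invented for this example, not taken from Granska.
def find_inf_marker_errors(tagged):
    """Return positions of present-tense verbs following the infinitive
    marker, allowing intervening adverbs (a local 'help rule')."""
    errors = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "inf.marker":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "adv":  # skip adverbs
                j += 1
            if j < len(tagged) and tagged[j][1] == "verb.pres":
                errors.append(j)
    return errors

sent = [("han", "pn"), ("har", "verb.pres"), ("lovat", "verb.sup"),
        ("att", "inf.marker"), ("slår", "verb.pres"), ("Turkiet", "pm")]
print(find_inf_marker_errors(sent))  # [4]: the finite verb after 'att'
```

The point of the sketch is the division of labor: the tagger commits to one analysis per word, and the error rule then only needs to state a local pattern over those tags.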
Half of the rules detect grammar errors, including noun phrase and complement agreement, errors in pronoun form after preposition, errors in verbs, errors in prepositions in fixed expressions, word order and other errors (Domeij and Knutsson, 1999; Knutsson, 2001). The grammar error types are listed and compared to the types covered by the other Swedish systems in Section 5.4.4. A validation test of Granska is reported in Knutsson (2001, pp. 141-150), based on a corpus of 201,019 words, and shows an overall performance of 52% recall and 53% precision (418 errors in the texts, 216 correct alarms and 197 false alarms). In this text sample, including both published texts written by professional writers and student papers,[15] Granska is best at detecting errors in verb form with

[14] Granska's Internet demonstrator is located at: http://www.nada.kth.se/theory/projects/granska/demo.html
[15] The error profile of the validated corpus of Granska was already reported in Chapter 4, Section 4.4.
a recall of 97% and precision of 83%, and agreement errors in noun phrases with a recall of 83% and precision of 44%.

The Scarrie Project

Within the framework of the EU-sponsored project Scarrie,[16] prototypes of proof-reading tools for the Scandinavian languages Danish, Norwegian and Swedish were developed. The project ran from December 1996 to February 1999. WordFinder Software AB[17] was the coordinator of the project, and the Department of Linguistics at Uppsala University and the newspaper Svenska Dagbladet were the other Swedish partners. Interface and packaging were outside the project and were planned to be taken care of by WordFinder after the project's completion. Professional writers at work in particular newspaper and publishing firms were the intended users. The Swedish version of the prototype provides both spelling and grammar checking run at the same time, searching through the text sentence by sentence. The system recognizes and diagnoses errors, giving information about error type and error span. No suggestions for correction are given.[18] The system lexicon is based on a corpus of 220,000 newspaper articles published in 1995 and 1996 in the Swedish newspapers Svenska Dagbladet (SvD) and Uppsala Nya Tidning (UNT). The SvD/UNT corpus consists of more than 70 million tokens and 1.5 million word types. The resulting lexical database, ScarrieLex, consists of a one-word lexicon of 257,136 single word forms and a multi-word lexicon of 4,899 phrases (Povlsen et al., 1999). The spelling module is based on the Dutch software CORRie (Vosse, 1994) (see Section 5.4.2), whereas the grammar checking module ScarCheck was developed as new software (Sågvall Hein, 1998b; Starbäck, 1999).[19] The grammar checker is based on a previously developed parser, the Uppsala Chart Parser (UCP), a procedural, bottom-up parser applying a longest-path strategy (Sågvall Hein, 1981, 1983).[20]
[16] The Scarrie project homepage: http://fasting.hf.uib.no/~desmedt/scarrie/
[17] The homepage of WordFinder Software AB is http://www.wordfinder.com
[18] A demonstrator of the Scarrie prototype is located at: http://stp.ling.uu.se/~ljo/scarrie-pub/scarrie.html
[19] The spelling and grammar checking in the Danish and Norwegian prototypes is solely based on the Dutch software CORRie (Vosse, 1994).
[20] The original version of the chart parser was first implemented in Common Lisp (see Carlsson, 1981) and then converted to C. The resulting Uppsala Chart Parser Light (UCP Light) (see Weijnitz, 1999) is a smaller and faster version at the cost of less functionality, starting at the syntax level and requiring morphologically analyzed input. UCP Light is used in the web-demonstrator (Starbäck,
The parsing strategy for erroneous input is based on constraint relaxation in the context-free phrase structure rules and application of local error rules (Wedbjer Rambell, 1999b). The grammar is in other words underspecified to a certain level, allowing feature violations and parsing of ungrammatical word sequences. The local error rules are part of the same grammar and are applied to the result of the partial parse. Alternative parses are weighted, yielding the best parse. A chart-scanner collects and reports on errors (Sågvall Hein, 1999). ScarCheck targets more than thirty error types concerning grammar, including agreement errors in noun phrases and complements, errors in verb phrases and verb valence errors, errors in conjunctions, pronoun case, word order and others (Sågvall Hein et al., 1999). Again, the different grammar error types are listed and compared to the errors of the other two Swedish systems in Section 5.4.4. The performance evaluation of the grammar checking system was based on a newspaper corpus of 14,810 words, with an overall recall of 83.3% and precision of 76.9% (first run). Six grammar errors occurred in the corpus, represented by errors in noun phrase, verb phrase and word order (Sågvall Hein et al., 1999).[21]

5.4.4 Overview of the Swedish Systems

Detection Approaches

The approaches for detection of errors in unrestricted text differ in the Swedish systems, not only in the technology used, which varies from chart-based methods in Scarrie and application of constraint grammar in Grammatifix to probabilistic and rule-based methods in Granska, but also in the way that strategies are applied. Grammatifix and Granska identify erroneous patterns by partial analysis, whereas Scarrie produces a full analysis for both grammatical and ungrammatical sentences. Grammatifix leaves ambiguity resolution to the syntactic level and applies relaxation in error rules in order to be able to parse erroneous phrases.
Granska disambiguates starting at the lexical level, assigning only one morphosyntactic tag to each word and then applying explicit error rules in the search for errors, including locally applied rules and rules to avoid marking grammatically correct word sequences as ungrammatical. Scarrie parses ungrammatical input implicitly by relaxation of the parsing rules (not in error rules, as Grammatifix does) and explicitly by additional error rules applied locally to the parsing result. Common to all the tools is that they define (wholly or to some extent) explicit error rules describing the nature of the error they search for. Furthermore,

1999). (Email correspondence with Leif-Jöran Olsson, Department of Linguistics, Uppsala University, 21/11/01.)
[21] Also two errors in splits are reported.
the tools either proceed with error detection sentence by sentence, requiring recognition of sentence boundaries, or they rely in their rules on, for instance, capitalization conventions, searching for words beginning with capital letters (cf. Birn, 2000).

The Coverage of Error Types

In this section I present the different grammar error types covered by Grammatifix, Granska and Scarrie and the similarities and/or differences between the systems' selections of error types. Table 5.2 (p. 137) shows the results of this analysis, based on the available error specifications of the different projects[22] and completed with personal observations from tests run with these tools. For every listed error type, an example sentence from the projects' error specifications (if present) was chosen to exemplify the targeted error. The source of this example is listed in the last column of the table. A similar analysis is discussed in Arppe (2000),[23] where he concludes that the selection of error types targeted by the Swedish grammar checking tools is quite similar in many aspects. Differences occur in the subsets of errors or some specializations. The analysis in the present thesis shows that all the tools check for errors in noun phrase agreement concerning definiteness, number and gender in both the form of the noun and the adjective. They also detect errors in the agreement between the quantifier/pronoun and noun in partitive noun phrases and in the masculine form of the adjective. Violations of number and gender agreement with the predicative complement are also included in all three tools, and so is pronoun case, which all tools check in the context after certain prepositions. Also, the same kinds of word order errors are covered by all the tools, except that Scarrie also checks for inversion in the main clause. Errors in verbs were the group that was most difficult to compare, because the detection approaches differ in some aspects.
The tools all check for occurrences of finite verbs (too many, missing or no predicate at all) and the form of non-finite verbs (after an auxiliary verb or infinitive marker). Only Grammatifix does not check for finite verbs after an infinitive marker. They check further for a missing or extra inserted infinitive marker in the context of main verbs. They also look for more

[22] Grammatifix: Arppe et al. (1998); Arppe (2000) and the specification in Word 2001; Granska: Domeij and Knutsson (1998, 1999) and the Internet demo: http://www.nada.kth.se/theory/projects/granska/demo.html; Scarrie: Sågvall Hein et al. (1999) and examples listed in the Internet demo: http://stp.ling.uu.se/~ljo/scarrie-pub/scarrie.html.
[23] The present comparison is independent of the analysis reported in Arppe (2000). He also compared the punctuation and stylistic error types.
style-oriented errors in the use of passive verbs (double passive or passive after certain verbs) and the supine form (double supine or supine without ha 'have'). Scarrie also checks if a supine form is used in place of an imperative. All the tools check for the use of the superlative form möjligast 'most possible' in combination with an adjective. Some other differences concern errors in the use of prepositions: Grammatifix and Granska detect errors in the harmony of prepositions in certain contexts, while only Granska checks preposition use in idiomatic expressions. Further, Granska checks tense harmony within a sentence. Double negation is not targeted by Scarrie. Granska and Scarrie also detect missing subject errors. Granska also checks more stylistic issues such as contamination of expressions and tautology, which are not included in the table.[24]

[24] Splits and run-ons are also targeted by some of these tools, but since these are not syntactic errors they were not included in this comparison.
Table 5.2: Overview of the Grammar Error Types in Grammatifix (GF), Granska (GR) and Scarrie (SC). The comparison was done on 08/10/01 and revisited on 30/10/02. X indicates observations from the error specifications, (x) indicates my own observations; the marks are given in the order GF, GR, SC, and the source of each example is given in square brackets.

NOUN PHRASE:
- Definiteness agreement (X, X, X): Det är i samhällets utvecklingen bort från detta som Arbetsdomstolen inte hängt med. 'It is in the society's [poss] development [def] away from this that the Labour Court has not kept up.' [GF]
- Number agreement (X, X, X): Natten bär sin skuggor. 'The night carries its [sg] shadows [pl].' [SC]
- Gender agreement (X, X, X): En eventuellt segerfest får vänta. 'A [com] possible [neu] victory-party [com] has to wait.' [SC]
- Gender agreement, quantifier and noun (X, X, (x)): Ett av de gula blommorna hade slagit ut. 'One [neu] of the yellow flowers [com] had come out.' [GF]
- Gender agreement, masculine form of adjective (X, (x), (x)): Då frestade han ditt kött och sände dig den rödhårige kvinnan. 'Then he tempted your flesh and sent you the red-haired [masc] woman.' [GF]

PREDICATIVE COMPLEMENT:
- Number agreement (X, X, X): Tävlingen blev väldigt besvärliga. 'The competition [sg] became very difficult [pl].' [SC]
- Gender agreement (X, X, X): Då hade läget i byn redan blivit outhärdlig för gruppen. 'At that point the situation [neu] in the village had already become unbearable [com] for the group.' [GF]

PRONOUN:
- Case after preposition (X, X, X): Vi sjöng för de. 'We sang for they [nom].' [GF]

VERBS:
- Verb form after auxiliary verb (X, (x), X): Hur trygghet inte längre kan var statisk utan ligga i förnyelsen, utvecklingen och förändringen. 'How safety cannot any longer be [pres] static but lie in renewal, development and change.' [SC]
- Verb form after infinitive marker (-, (x), X): Han har lovat att i alla fall skall slå Turkiet. 'He has promised that in any case will [pres] beat Turkey.' [SC]
- Number of finite verbs (X, (x), X): I Ryssland är betalar nästan ingen någon skatt. 'In Russia almost no one is [pres] pays [pres] any tax.' [GF]
- Missing finite verb (X, X, X): Det bli viktigt. 'That will-be [inf] important.' [GF]
- Missing verb (X, X, X): Ingen koll. 'No control.' [GR]
- Missing infinitive marker (X, X, X): Vi kommer spela en låt av Ebba Grön. 'We will play a song by Ebba Grön.' [GR]
- Extra infinitive marker (X, (x), X): Sverige började att klassa kärnkraftsincidenter enligt den internationella standarden. 'Sweden started to classify nuclear incidents in accordance with the international standard.' [SC]
- Supine instead of imperative (-, -, X): Betänkt också de anläggningskostnader som tillkommer. 'Consider [sup] also the construction costs that will be added.' [SC]
- Supine without ha (X, X, (x)): De kunde fått bilderna på begravningsgästerna från danska polisen. 'They could get pictures of the funeral guests from the Danish police.' [GF]
- Double supine (X, X, X): Vi hade velat sett en större anslutningstakt, säger Dennis. 'We had wanted [sup] seen [sup] a greater rate of joining, says Dennis.' [GF]
- Double passive (X, X, X): Saken har försökts att tystas ner. 'The thing has been tried [pass] to be quietened [pass] down.' [GF]
- S-passive after certain verbs (X, (x), X): Huset ämnar byggas. 'The house intends to be built [pass].' [SC]
- Tense harmony (-, X, -): Jag höll mig inne tills stormen har bedarrat. 'I kept [pret] myself inside until the storm has abated [perf].' [GR]

PREPOSITIONS:
- Wrong preposition in fixed expressions (-, X, -): med utgångspunkt från 'with starting-point from' [GR]
- Preposition harmony with two-part conjunctions (X, (x), -): Det är utbildning som idag inte erbjuds vare sig i Lund eller Malmö. 'It is education that today is not offered either in Lund or Malmö.' [GF]

WORD ORDER:
- Placement of adverb/negation (X, X, X): Man kan tro inte sina öron. 'One can believe not one's ears.' [SC]
- Word order in subordinate interrogative clause (X, X, X): Jag undrar vad gör de unga männen i Finland. 'I wonder what do the young men in Finland do.' [GF]
- Word order in main clause with inversion (-, -, X): Nu man kan testa de kommande versionerna av programvaran. 'Now one can try the future versions of the program.' [SC]

OTHER:
- Missing subject (-, (x), (x)) [SC]
- Missing infinitive marker with preposition (X, X, X): Jag klarar av gå. 'I can manage walk.' [GF]
- Repeated words (X, -, -): (No example given in the specification.)
- Double negation (X, (x), -): Det kan bli svårt att få jobb om man inte har varken pengar eller familj att stöda en. 'It can be hard to get work if one does not have neither money nor family to support one.' [GF]
- Construction möjligast + adjective (X, (x), (x)): Hon körde med möjligast stora snabbhet. 'She drove with the most possible great speed.' [GF]
So far, the comparison has concerned the different types of errors covered, but the truth is that the detection of errors also depends on the syntactic complexity defined for the separate error types. For instance, detection of errors in the verb form after an infinitive marker can differ depending on whether other (optional) constituents are inserted between the infinitive marker and the verb. In (5.13), all the sentences violate the rule of a required infinitive verb form after an infinitive marker. In (5.13a) and (5.13b) the targeted verb is preceded by an adverbial realized as a prepositional phrase, which disturbed both Granska and Scarrie in the detection of this error.[25]

(5.13) Alarm (Granska / Scarrie):
a. Han har lovat att i alla fall skall slå Turkiet.
   he has promised to in-any-case will [pres] beat [inf] Turkey
   'He has promised to will beat Turkey in any case.'   (No / No)
b. Han har lovat att i alla fall vill slå Turkiet.
   he has promised to in-any-case wants [pres] beat [inf] Turkey
   'He has promised to wants beat Turkey in any case.'   (No / No)
c. Han har lovat att skall slå Turkiet.
   he has promised to will [pres] beat [inf] Turkey
   'He has promised to will beat Turkey.'   (No / Yes)
d. Han har lovat att vill slå Turkiet.
   he has promised to wants [pres] beat [inf] Turkey
   'He has promised to wants beat Turkey.'   (Yes / Yes)
e. Han har lovat att slår Turkiet.
   he has promised to beats [pres] Turkey
   'He has promised to beat Turkey.'   (Yes / Yes)

The error is detected only when the verb follows directly after the infinitive marker, in the cases (5.13d) and (5.13e). In the sentence in (5.13c) the verb also

[25] The errors are not detected even if simple adverbials such as inte 'not', aldrig 'never' or sen 'later' are inserted.
follows directly after the infinitive marker, but Granska does not detect it as an error although the verb is tagged as a verb in present tense form. Another example of how important syntactic coverage is for error detection is shown in (5.14), where Scarrie had problems detecting the agreement error between the subject and the adjective form in the predicative complement, due to a possessive modifier of the head noun in the subject in (5.14b). Granska detects both errors, but Grammatifix does not react at all.

(5.14) Scarrie's diagnosis:
a. Hus är vacker.
   house [pl, neu] is beautiful [sg, com]
   'House is beautiful.'   (wrong number in the adjective in predicative complement)
b. Mitt hus är vacker.
   my [sg, neu] house [sg, neu] is beautiful [sg, com]
   'My house is beautiful.'   (no reaction)

In conclusion, the three Swedish systems cover both grammatical and more style-oriented errors. The coverage is similar in many aspects. In relation to the most common errors in Child Data, they all cover the non-structural errors that are, as discussed in the previous section, restricted to certain delimited text patterns. The structural errors that require more complex analysis are included only to a small extent. They all detect the same errors in noun phrase agreement and most of the errors in verb form. Exceptions are the verb form errors after an infinitive marker, which are not included in Grammatifix; errors concerning the use of the supine verb form instead of the imperative are only included in Scarrie, while tense harmony is only checked by Granska. Errors in finite verb form, which were the most frequent error type in Child Data, are (probably) covered by the Missing finite verb category that all the tools cover. Among the errors of redundant or missing constituents in clauses, only Grammatifix checks for repeated words. All the tools check for a missing infinitive marker in the context of a preceding preposition. Granska and Scarrie also detect missing subjects.
Other categories of redundant or missing constituents in clauses are not covered. Word choice errors are covered only by Granska, to the extent of prepositions in fixed expressions. Other types are not included. As discussed in the previous section, structural errors of this kind in general require more complex analysis in order to be identified, except when they are limited to certain parts that can be delimited clearly (e.g. in a verb cluster).
The present overview of error types covered by these tools does not reveal the actual grammatical coverage and precision of detection. As shown above, there is a question of the extent of error coverage, since for instance the insertion of optional constituents or the presence/absence of certain constituents influenced whether or not an error was identified. I provide a test of these tools' performance directly on Child Data, which is reported in the subsequent Section 5.5.

Performance

All the systems were validated for the linguistic functionalities they provide, as reported above in the descriptions of the separate projects (Section 5.4.3) and summarized in Table 5.3 below. The validation tests carried out by the developers are based on corpora of different size and composition, and different sets of errors were found. As discussed in Section 5.2.3, the size and genre of the evaluated texts and the writers' experience may influence the outcome of such analysis, and the results should be interpreted carefully. The size and composition of the tested texts influence which error-prone syntactic constructions occur, and should also be related to how frequent errors are in the tested population.

Table 5.3: Overview of the Performance of Grammatifix, Granska and Scarrie

TOOL         CORPUS                            SIZE        RECALL   PRECISION
Grammatifix  newspaper articles                   87,713      35%       -
             newspaper articles                1,000,504       -       70%
Granska      newspaper articles, official        201,019      52%      53%
             texts, student papers
Scarrie      newspaper articles                   14,810      83%      77%

Grammatifix and Scarrie were tested solely on newspaper texts written by professional writers, which is probably sufficient in the case of Scarrie, since it was developed for professional writers. On the other hand, Grammatifix, as a module in a word processor not aimed at any special group, should be tested on texts of different genres written by different populations.
Granska was evaluated on texts of different genres, consisting of published newspaper and popular science articles, official texts and student compositions. This corpus is more balanced and perhaps better reflects the real performance of the system. In addition, certain types of errors that dominate in the corpus depending on the genre are reported (Knutsson, 2001).
Further, a fairly large amount of data is needed in order to be able to test a reasonable number of errors. The validation corpus used for Scarrie was small in this respect, including only six of the defined errors and yielding quite high rates in both recall and precision. In the case of Granska, the corpus is much bigger and, as discussed, better balanced. Grammatifix used the largest corpus for the test of precision and a smaller corpus for the test of recall, and obtained the lowest recall. Since Grammatifix is a commercial product with high expectations on precision, its error coverage was probably cut down. This means that the system probably would be able to detect more errors and receive a better recall rate than the current 35%, but where that would lower precision by increasing the number of false flaggings, detection of those unsafe error types is not included and the errors remain undetected. The recall rates of the systems vary from 35% to 83% and the precision rates lie between 53% and 77%. Evaluation of individual error types is only reported for Granska, with the best results for verb form errors and agreement errors in noun phrases.

5.4.5 Summary

The Swedish approaches to grammar checking apply techniques for searching (more or less) explicitly for ungrammaticalities in text. Errors are found either by looking for specific patterns in certain contexts in the text that match the defined error rules, or by using selections in a relaxed parse by a chart-scanner. The approaches seem to depend on how finely or broadly a specific error type is defined, so that the same error is not overlooked in other contexts. The choice of which types of errors are detected is based on a more or less ambitious analysis of errors in writing, often for a certain group of writers (e.g. professional writers, writers at work). However, the risk remains that some other type of error in the same pattern may be overlooked.
The coverage of error types is very similar between the systems. Performance was evaluated separately on different text data, so the results are hard to compare.
5.5 Performance on Child Data

5.5.1 Introduction

Having examined which error types are covered by the current Swedish systems Grammatifix, Granska and Scarrie, their performance will now be tested on the Child Data corpus. Recall that the error frequency in texts written by children differs from that of the adult writers targeted by the Swedish grammar checkers, and also that the error distribution is (slightly) different in Child Data. Testing the tools' performance on Child Data is crucial in view of their handling of text with a higher error density and of a (slightly) different kind than they were designed for. The discussion in the previous section of the error types covered by these systems points out that many of the errors in Child Data are targeted. Among the most common error types in Child Data, all (or most) of the error types related to verb form and agreement in noun phrases are targeted by the tools, and some (quite few) of the errors concerning redundant or missing constituents in clauses and word choice errors, a group of errors that needs more elaborate and complex analysis for detection (see the discussion in Section 5.3). The tools are not, however, designed in the first place to detect errors in children's texts and will most probably perform worse on these texts. The question is how low the performance will be, where exactly the tools will fail, and what consequences the results have for Child Data. This section continues with a description of the evaluation procedure (Section 5.5.2) and the individual systems' detection procedures (Section 5.5.3). Then the detection results on Child Data are presented type by type (Section 5.5.4). Finally, a summary of the results and a discussion of overall performance is presented (Section 5.5.5).
5.5.2 Evaluation Procedure As discussed in Section 5.2.3, evaluation of authoring tools normally concerns detection, diagnosis and correction functionalities, either on single sentences or on whole text samples. For investigating how good a system is at detecting targeted errors, sentence samples usually will do, but a corpus is better for measuring how good a system is overall. In my analysis, the whole Child Data corpus in its spell-checked version was used as input, free from non-word spelling errors (see Section 3.3 for discussion of how this was achieved), since the main purpose of the evaluation is to see the checkers' performance in the detection of grammar errors. The Child Data corpus
represents texts that are new to all three systems and a writing population not explicitly covered by any of them. Since not all the systems give suggestions for correction, the present performance test will only analyze detection and diagnosis performance. Detection performance is investigated in terms of the number of correct and false alarms. Correct alarms include all detected errors, divided further according to whether a correct or an incorrect diagnosis was made. False alarms are divided further into detections of correct word sequences diagnosed as errors, and detections that happen to include other error categories than grammar errors, e.g. a spelling error, a split, or a sentence boundary. To exemplify, the agreement mismatch between the common gender determiner en 'a [com]' and the neuter gender compound noun stenhus 'stone-house [neu]' in (5.15a) concerns the gender form of the determiner, which is a correct alarm with a correct diagnosis. Identifying this noun phrase segment but classifying it as an error in number agreement, as in (5.15b), would instead be considered a correct alarm with an incorrect diagnosis. That is, the erroneous segment is correctly detected, but the analysis of what type of error it concerns is wrong. The example in (5.15c) represents a false alarm, where the correct (grammatical) form of the noun phrase was detected and diagnosed as an error in gender agreement. Finally, in (5.15d) we see an example of a false alarm that includes a segmentation error (not a grammar error). The noun in the noun phrase is split, and the determiner and the first part of the split noun are identified as a grammar error with an agreement violation in gender. Such instances of grammatically correct text selected as ungrammatical due to a split, spelling error, etc. are classified as false alarms with other error.
I've chosen to separate these detections from the real false alarms, since they represent text fragments that are not entirely free from errors, although the errors are of a different nature than grammatical/syntactic ones. These findings can be interesting since, as Knutsson (2001) points out, such an alarm could be a hint to writers who can see that the actual error lies in the split noun. It could, however, also give rise to a new error if the user chooses to change the gender of the determiner and writes: en sten hus 'a [com] stone [com] house [neu]'.
(5.15) ALARM / DIAGNOSIS / CLASS OF ALARM
a. en stenhus 'a [com] stone-house [neu]' / gender agreement error / correct alarm with correct diagnosis
b. en stenhus 'a [com] stone-house [neu]' / number agreement error / correct alarm with incorrect diagnosis
c. ett stenhus 'a [neu] stone-house [neu]' / gender agreement error / false alarm
d. ett sten hus 'a [neu] stone [com] house [neu]' / gender agreement error / false alarm with other error
The set of all detected errors is then represented by all correct alarms with correct or incorrect diagnosis, and the set of false alarms consists of false flaggings without any error and false flaggings containing other errors than grammatical ones. The systems' grammatical coverage (recall) and flagging accuracy (precision) have been calculated in accordance with the following definitions:
(5.16) a. recall = (correct alarms / all errors) * 100
b. precision = (correct alarms / (correct alarms + false alarms)) * 100
I will also consider the overall performance of the systems expressed as F-value, a combined measure of recall and precision. The F-value is calculated as presented in (5.17), where the β parameter has the value 1, since both recall and precision are equally important in this analysis. (The parameter β takes different values depending on whether precision is more important (β > 1) or recall is of greater value (β < 1); when both are equally important, β = 1.)
(5.17) F-value = ((β² + 1) * recall * precision) / (β² * (recall + precision))
5.5.3 The Systems' Detection Procedures Grammatifix Grammatifix is included as a module in Microsoft Word, working alongside a spell checking module. The user may choose to disregard grammar checking and just check the text for spelling, or include both checkers. The tool then checks the text sentence by sentence, first for spelling and then for grammar. Further adjustments of grammar checking are possible: the user may choose among the different
error types defined in Grammatifix (including style, punctuation and formatting errors as well as grammar errors) and also set the maximum length of a sentence in number of words. The tool also provides a report on the text's readability, including the counts of tokens, words, sentences and paragraphs. The mean score of these is computed, providing a readability index. One diagnosis of the error is always given, and usually a suggestion for correction. Granska The web-based demonstrator of Granska includes no interactive mode, and spelling and grammar are checked independently, based on the tagging information. The user may choose a presentation format for the result that includes either all sentences with comments on spelling and grammar, or only the erroneous sentences. Further adjustments include the choice to display error corrections, the result of tagging, and whether a newline is interpreted as the end of a sentence or not. The last setting is quite important for children's writing, where punctuation is often absent or not used properly and the use of newlines is also arbitrary, i.e. a newline in the middle of a sentence is not unusual. In some cases Granska also yields more than one suggestion for error correction, and there is a possibility of constructing new detection rules. Long stretches of text without any punctuation or newlines (usual in children's writing) are probably hard for the tool to handle, since it just rejects such text without any output. Scarrie The web demonstrator of Scarrie does not include an interactive mode either. Individual sentences (or a longer text) can be entered, with requirements on end-of-sentence punctuation. Both spelling and grammar are checked and the results of detection are displayed at the same time. Errors are highlighted and a diagnosis is displayed in the status bar. The system gives no suggestions for correction.
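Before turning to the results, the alarm counting and the measures defined in (5.16) and (5.17) can be made concrete with a short script. This is an illustrative sketch only, not part of any of the tools; the function names are my own, and the example counts are taken from the Grammatifix figures for noun phrase agreement reported in Section 5.5.4 (eight correct alarms, i.e. seven with correct and one with incorrect diagnosis, 20 false alarms, 15 errors in total).

```python
def recall(correct_alarms, all_errors):
    """(5.16a): percentage of all real errors that the system flags."""
    return correct_alarms / all_errors * 100

def precision(correct_alarms, false_alarms):
    """(5.16b): percentage of all alarms that are correct alarms."""
    return correct_alarms / (correct_alarms + false_alarms) * 100

def f_value(r, p, beta=1.0):
    """(5.17): combined measure; beta = 1 weights recall and precision equally."""
    return (beta**2 + 1) * r * p / (beta**2 * (r + p))

# Grammatifix on noun phrase agreement in Child Data
# (counts as reported in Section 5.5.4):
r = recall(8, 15)       # about 53%, as reported
p = precision(8, 20)    # about 29%, as reported
f = f_value(r, p)       # combined performance, roughly 37
```

Note that a correct alarm with an incorrect diagnosis still counts as a correct alarm in both measures; only the diagnosis analysis distinguishes the two.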
5.5.4 The Systems' Detection Results In this section I present the results of the systems' performance on Child Data. For every error type I first present to what extent the errors are explicitly covered according to the systems' specifications, and then I proceed system by system and present the detection results for the particular error type, discussing which errors were actually detected and which were incorrectly diagnosed, the characteristics of errors that were not found, and false alarms. A short conclusion ends every error
type presentation. Exemplified errors from Child Data refer either to previously discussed samples or directly to the index numbers of the error corpus presented in Appendix B.1. A system's diagnosis is presented exactly as given by the particular system. All detection results are summarized and the overall performance is presented in Section 5.5.5. Agreement in Noun Phrases Most of the errors in Child Data concern definiteness in the noun and gender or number in the determiner of the noun phrase, errors that, according to the error specifications, are explicitly covered by all three tools. They all also check for errors in the masculine gender form of adjectives and for agreement between the quantifier and the noun in partitive constructions. The latter type as found in Child Data concerns the form of the noun rather than the form of the quantifier (see (4.11) on p.50). Grammatifix detected seven errors in definiteness and gender agreement. One of the errors in the masculine form of the adjective was only detected in part and was given a wrong diagnosis. The error concerns inconsistency in the use of adjectives (previously discussed in (4.9) on p.49): either both adjectives should carry the masculine gender form or both should have the unmarked form. The error detection by Grammatifix is exemplified in (5.18), where we see that, due to the split noun, the error was diagnosed as a gender agreement error between the common gender determiner den 'the [com]' and the first part of the split troll 'troll [neu]', which is neuter. An interesting observation is that when the split noun is corrected and forms the correct word trollkarlen 'magician [com,def]', Grammatifix does not react and the error in the adjectives is not discovered. Grammatifix only checks whether the masculine form of an adjective occurs together with a non-masculine noun, but not consistency of use, as in this error sample.
(5.18) ALARM: det va den hemske fula troll karlen Tokig som...
'it was the [com,def] awful [masc,wk] ugly [wk] troll [neu,indef] man [com,def] Tokig that...'
('It was the awful ugly magician Tokig that...')
GRAMMATIFIX'S DIAGNOSIS: Check the word form den 'the [com,def]'. If a determiner modifies a noun with neuter gender, e.g. troll 'troll', the determiner should also have neuter gender: det 'the [neu,def]' (→ trollkarlen 'magician [com,def]')
In general, simple constructions with a determiner and a noun are detected, whereas more complex noun phrases were missed. Three errors in the definiteness form of the noun were overlooked (G1.1.1, G1.1.2 - see (4.2) p.46, G1.1.3 - see
(4.3) p.46). Concerning gender agreement, one error involving the masculine form of an adjective was missed (G1.2.4 - see (4.8) p.48). None of the errors in number agreement were detected: one with a determiner error (G1.3.1 - see (4.10) p.49) and two with partitive constructions (G1.3.2 - see (4.11) p.50, G1.3.3). Grammatifix made altogether 20 false alarms, 16 of which involved other error categories, mostly splits (12 false alarms), such as the one in (5.19):
(5.19) ALARM: det var ett stort sten hus
'it was a [neu] big [neu] stone [com] house [neu]'
('It was a big stone-house.')
GRAMMATIFIX'S DIAGNOSIS: Check the word form ett 'a [neu]'. If a determiner modifies a noun with common gender, e.g. sten 'stone', the determiner should also have common gender: en 'a [com]'
The overall performance for Grammatifix's detection of errors in noun phrase agreement then amounts to 53% recall and 29% precision. Granska detected six errors in definiteness and two in gender agreement, one of them in a partitive noun phrase (G1.2.2). In three cases, where the error concerned the definiteness form of the noun, Granska instead suggested changing the determiner (and adjective), correcting G1.1.7 as den räkningen 'the [def] bill [def]' instead of en räkning 'a [indef] bill [indef]' (see (4.6) p.47). The same happened for error G1.1.8, where en kompisen 'a [indef] friend [def]' is corrected to den kompisen 'the [def] friend [def]', and the opposite for G1.1.2, where the definite determiner and adjective in den hemska pyroman 'the [def] awful pyromaniac [indef]' are changed to indefinite forms instead of changing the form of the noun to definite (see (4.2) p.46). Two errors in definiteness agreement (G1.1.1, G1.1.3 - see (4.3) p.46), none of the errors in the masculine form of adjectives (G1.2.3 - see (4.9) p.49, G1.2.4 - see (4.8) p.48) and all errors in number agreement were left undiscovered by Granska. Grammatical coverage for this error type then results in 53% recall.
Granska produced 25 false alarms, of which 17 included other error categories, with splits as the most represented (9 false alarms), resulting in a precision rate of 24%, slightly lower than Grammatifix's. Scarrie detected six errors in definiteness agreement, one in gender agreement in a partitive noun phrase, two in the masculine form of adjectives and one in number agreement. In the case of number agreement, the error in det tre tjejerna 'the [sg] three girls [pl]' (G1.3.1 - see (4.10) p.49) is incorrectly diagnosed as an error in the noun instead of in the determiner.
Exactly as Grammatifix did, Scarrie detected the error in G1.2.3 due to the split noun and gave the same incorrect diagnosis (see (5.18) above). The missed errors include two errors in definiteness of the noun, one with a possessive determiner (G1.1.4 - see (4.4) p.47) and one with an indefinite determiner (G1.1.7 - see (4.6) p.47). One error concerned gender agreement between an incorrect determiner and a compound noun (G1.2.1 - see (4.7) p.48). Finally, two errors in the number of the noun in partitive constructions were not detected (G1.3.2 - see (4.11) p.50, G1.3.3). Many false alarms occurred (133), and 50 of them concerned other error categories, mostly splits (33 false alarms), as in (5.20):
(5.20) ALARM: han tittade i ett jord hål
'he looked into a [neu] ground [com] hole [neu]'
('He looked into a hole in the ground.')
SCARRIE'S DIAGNOSIS: wrong gender
Others involved spelling errors (10 false alarms), as in (5.21), where the pronoun vad 'what' is written as var and interpreted as the pronoun 'each', which does not agree in number with the following noun.
(5.21) ALARM: Själv tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var (→ vad) tjejernas metoder är.
'self think I that the-boys' methods [pl] are more open and honest but also more mean than each [sg] (→ what) the-girls' methods [pl] are'
('I think myself that the boys' methods are more open and honest but also more mean than the girls' methods are.')
SCARRIE'S DIAGNOSIS: wrong number
Some false flaggings also concerned sentence boundaries (7 false alarms), as in (5.22):
(5.22) ALARM: pojken gick till fönstret och ropade på grodan men vad dumt hunden har fastnat i burken där grodan var.
'the-boy went to the-window and shouted at the-frog but what silly [neu] the-dog [com] had stuck in the-pot where the-frog was'
('The boy went to the window and shouted at the frog, but how silly, the dog got stuck in the pot where the frog was.')
SCARRIE'S DIAGNOSIS: wrong form in adjective
But mostly, ambiguity problems occurred (83 false alarms), as in (5.23a) and (5.23b):
(5.23) a. ALARM: dessutom luktade det saltgurka.
'besides smelled it/the [neu] pickle-gherkin [com]'
('Besides it smelled like pickled gherkin.')
SCARRIE'S DIAGNOSIS: wrong gender
b. ALARM: Jag trampade rakt på den och skar upp hela min vänstra fot.
'I walked right on it and cut up whole my left [pl,def] foot [sg,indef]'
SCARRIE'S DIAGNOSIS: wrong number
The coverage for this error type in Scarrie is 67%, but the high number of false alarms results in a very low precision value of only 7%. In conclusion, only Scarrie detected more than half of the errors in noun phrase agreement, but at the cost of many false alarms. Grammatifix and Granska displayed similarities in the detection of this error type, detecting almost the same errors, and their false alarms are also not that many. Scarrie's coverage differs from that of the other tools, and the high number of false alarms considerably decreased its precision score for this error type. All tools failed to find the erroneous forms in the head nouns of the partitive noun phrases (G1.3.2 - see (4.11) p.50, G1.3.3), which are most likely not defined in the grammars of these systems.
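As a quick consistency check of the figures just reported (a hypothetical calculation of my own, not part of the evaluation itself), Scarrie's counts for noun phrase agreement can be plugged into the definitions in (5.16). The total of 15 errors is inferred from the reported percentages, not stated directly in this passage:

```python
# Scarrie on noun phrase agreement in Child Data:
# 6 definiteness + 1 gender (partitive) + 2 masculine adjective + 1 number
correct_alarms = 6 + 1 + 2 + 1     # = 10 detected errors
false_alarms = 133                 # reported number of false alarms
all_errors = 15                    # inferred: 10 detected out of 15 gives 67% recall

recall = correct_alarms / all_errors * 100                          # ~66.7, reported as 67%
precision = correct_alarms / (correct_alarms + false_alarms) * 100  # ~7.0, reported as 7%
```

The same arithmetic reproduces Grammatifix's reported 53% recall and 29% precision from its eight correct and 20 false alarms.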
Agreement in Predicative Complement All the tools cover errors in both number and gender agreement in predicative complements. In Child Data, however, these errors occur in most cases in rather complex phrase structures, so at most three detections can be expected. Grammatifix detected only one instance of all the agreement errors in predicative complement (G2.2.6) and yielded an incomplete analysis of this particular error. It failed because only the context of a single sentence is taken into consideration: due to ambiguity of the noun between a singular and a plural form, Grammatifix diagnosed this error as gender agreement, whereas it should suggest the plural form instead, which is clear from the preceding context (see (5.1) and the discussion on detection possibilities in Section 5.3, p.119). Grammatifix thus obtained a very low recall (13%) for this error type. Three false alarms (one with a split) result in a precision value of 25%. The three simple constructions with agreement errors in the predicative complement were all detected by Granska (G2.1.1 - see (4.12) p.51, G2.2.3 - see (4.13) p.51, G2.2.6 - see (5.1) p.119). In the case of G2.2.6, discussed above, the plural alternative is suggested. In error G2.2.3, the predicative complement includes a coordinated adjective phrase with errors in all three adjectives. Granska detected the first part:
(5.24) ALARM: Själv tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var (→ vad) tjejernas metoder är.
'self think I that the-boys' [pl] methods [pl] are more open [sg] and honest [sg] but also more mean [sg] than var (→ what) the-girls' methods are'
('I think myself that the boys' methods are more open and honest but also more mean than the girls' methods are.')
GRANSKA'S DIAGNOSIS: If öppen 'open [sg]' refers to metoder 'methods [pl]', this is an agreement error: killarnas metoder är mer öppna 'the boys' [pl] methods [pl] are more open [pl]'
Granska thus obtained a coverage value of 38% for this error type; with 5 false alarms (including one with a split and one with a spelling error) the precision rate is also 38%. In the case of Scarrie, no errors in predicative complement agreement were detected; only 13 false flaggings occurred, which leaves this category with no results
for recall or precision. The false alarms occurred due to incorrectly chosen segments, as in the following examples. In (5.25a) we have a compound noun phrase, where only the second part is considered and interpreted as a singular noun that does not agree with the plural adjective phrase as its predicative complement. In (5.25b) the verb pratade 'spoke [pret]' is interpreted as a plural past participle form and is considered as not agreeing with the preceding singular noun hon 'she [sg]'.
(5.25) a. ALARM: Han och hans hund var mycket stolta över den.
'he and his dog [sg] was/were very proud [pl] over it'
('He and his dog were very proud of it.')
SCARRIE'S DIAGNOSIS: wrong number in adjective in predicative complement
b. ALARM: då sa jag till dom och våran lärare att hon blev mobbad och efter det så pratade läraren med dom som mobbade henne och då slutade dom med det
'then said I to them and our teacher that she [sg] was harassed and after that so spoke [pl] the-teacher with them that harassed her and then stopped they with that'
('Then I said to them and our teacher that she was harassed, and after that the teacher spoke to them that harassed her and then they stopped with that.')
SCARRIE'S DIAGNOSIS: wrong number in adjective in predicative complement
In conclusion, only Granska detected at least the simplest forms of agreement errors in predicative complements. The other tools had problems with selecting correct segments, especially Scarrie with its high number of false alarms. Pronoun Form Errors All three tools check explicitly for pronoun case errors after certain prepositions. Three of the four error instances in Child Data are preceded by a preposition. Grammatifix found two errors in the form of the pronoun in the context of different prepositions (G4.1.1 - see (4.19) p.54, G4.1.3). No false flaggings occurred. Granska found three errors in the context of the prepositions efter 'after' and med 'with' (G4.1.1 - see (4.19) p.54, G4.1.4, G4.1.5 - see (4.18) p.53), which gives a
recall rate of 60%. However, many false alarms (24) occurred, involving conjunctions interpreted as prepositions (17 flaggings) or prepositions at a sentence boundary where punctuation is missing (5 flaggings), resulting in a very low precision value of 11%. In (5.26a) we see an example of a false alarm with the conjunction för 'because', and in (5.26b) one with a preposition ending a sentence followed by a personal pronoun as the subject of the next sentence:
(5.26) a. ALARM: Vi skulle åka in i hamnen för hon skulle berätta något för sin mamma.
'we would go in into the-port for she [nom] would tell something for her mother'
('We would go into the port because she was going to tell her mother something.')
GRANSKA'S DIAGNOSIS: Erroneous pronoun form, use object form: för henne 'for her [acc]'
b. ALARM: ... och jag kom då tänka på den byn vi va (→ var) i jag berätta (→ berättade) om byn och dom sa att det va (→ var) deras by.
'and I came then think at the village we was (→ were) in I [nom] tell (→ told) about the-village and they said that it was their village'
('... and I came to think of the village we were in. I told about the village and they said that it was their village.')
GRANSKA'S DIAGNOSIS: Erroneous pronoun form, use object form: i mig 'in me [acc]'
Scarrie also found three error instances (G4.1.1 - see (4.19) p.54, G4.1.3, G4.1.4), all with different prepositions. False flaggings occurred here too due to ambiguity problems, as for example in (5.27) and (5.28).
(5.27) ALARM: Jag gick och gick tills jag hörde Pappa skrika kom kom
'I walked and walked until I heard daddy scream come come'
('I walked and walked until I heard daddy scream: Come! Come!')
SCARRIE'S SUGGESTION: wrong form of pronoun
(5.28) a. ALARM: Erik frågade om han kunde få ett barn.
'Erik asked if OR about he could get a child'
('Erik asked if he could get a child.')
SCARRIE'S SUGGESTION: wrong form of pronoun
b. ALARM: Tänk om jag bott hos pappa.
'think if OR about I lived with daddy'
('Think if I lived with daddy.')
SCARRIE'S SUGGESTION: wrong form of pronoun
Scarrie obtains a recall of 60%, but with 17 false alarms it attains a precision rate of only 15% for errors in pronoun case. In conclusion, as seen in the above examples, the tools search for errors in the pronoun form after certain types of prepositions, but due to ambiguity in these words they fail more often than they succeed in detecting these errors. Finite Verb Form Errors Errors in finite verbs concern non-inflected verb forms, which is also the most common error found in Child Data. All of the tools search for missing finite verbs in sentences, and judging from the examples in the error specifications, it seems that they detect exactly this type of error. Grammatifix detected very few instances of sentences lacking a finite verb. Altogether four such errors were recognized, and in one of them Grammatifix suggested correcting another verb. In total, seven false alarms occurred, detecting verbs after an infinitive marker, as in (5.29), or after an auxiliary verb, as in (5.30).
(5.29) ALARM: dom la sig ner för att ta skydd under natten
'they lay themselves down for to take [inf] shelter during the-night'
('They lay down to take shelter during the night.')
GRAMMATIFIX'S DIAGNOSIS: The sentence seems to lack a tense-inflected verb form. If such a construction is necessary, you can try to change ta 'take'.
(5.30) ALARM: det kan ju bero på att föräldrarna inte bryr sig dom kanske inte ens vet att man har prov för dom lyssnar ju inte på sitt barn för en del kan behöva hjälp av sina föräldrar
'it can of-course depend on that the-parents not care themselves they maybe not even know that one has test for they listen of-course not to their children for a part can need help from their parents'
('It can depend on that the parents do not care. They probably do not even know that you have a test, because they do not listen to their child, because some can need help from their parents.')
GRAMMATIFIX'S DIAGNOSIS: The sentence seems to lack a tense-inflected verb form. If such a construction is necessary, you can try to change behöva 'need'.
It seems that Grammatifix cannot cope with longer sentences. For instance, when the example in (5.30) is broken down into shorter sentences starting from det kan ju bero på..., the error marking is not highlighted anymore. Since many errors with non-finite verbs as the predicates of sentences occurred in Child Data, Grammatifix obtains a low recall value of 4%. False alarms were relatively few, which gives a precision rate of 36%. Granska also checks for errors in clauses where a finite verb form is missing. It detected altogether nine errors in verbs lacking tense endings, resulting in a recall of just 8%. Nine false flaggings occurred, mostly with imperatives, which gives a precision score of 44%. Some other alarms concerned exclamations, such as Grodan! 'Frog!' or Tyst! 'Silence!', or fragment clauses where no verb was used (29 alarms). These are excluded from the present analysis. Scarrie explicitly checks verb forms in the predicate of a sentence and detected 17 errors in Child Data with two diagnoses - wrong verb form in the predicate, or no inflected predicative verb. Altogether, 13 false flaggings occurred due to the marking of correct finite verbs. One false alarm included a split, as shown below in (5.31).
Scarrie has the best result of the three systems for this error type, with 15% recall and 57% precision.
(5.31) ALARM: Han ring de till mig sen och sa samma sak.
'he call [pret] to me later and said same thing'
('He phoned me later and said the same thing.')
SCARRIE'S DIAGNOSIS: wrong verb form in predicate
In conclusion, the tools succeeded in detecting at most 17 cases of errors in finite verb form. The tools have a very low coverage rate for this frequent error type; the worst detection rate is Grammatifix's, the best Scarrie's. Verb Form after Auxiliary Verb All the tools include detection of errors in the verb form after auxiliary verbs. In Child Data, only one of these erroneous verb clusters included an inserted adverb and one occurred in a coordinated verb. Grammatifix does not find any of these errors. Four instances of erroneous verb form after an auxiliary verb were detected by Granska. The remaining three, which were not detected, are presented in (5.32) and concern G6.1.2, a coordinated verb, in (5.32a), G6.1.5, a verb with a preceding adverb, in (5.32b), and G6.1.6, an auxiliary verb followed by a verb in imperative form, in (5.32c).
(5.32) a. Ibland får man bjuda på sig själv och låter henne/honom vara med!
'sometimes can [pres] one offer [inf] on oneself and let [pres] her/him be with'
('Sometimes one can make a sacrifice and let him/her take part.')
b. han råkade bara kom emot getingboet
'he happened [pret] just came [pret] against the wasp-nest'
('He just happened to come across the wasp's nest.')
c. Det är något som vi alla nog skulle gör om vi inte hade läst på ett prov.
'it is something that we all probably would [pret] do [imp] if we not had read to a test'
('This is something that we all probably would do if we had not been studying for a test.')
Five false alarms occurred at sentence boundaries. In (5.33a) we see an example where the end of a preceding direct-speech clause is not marked, so the final verb is selected together with the main verb of the subsequent clause. Similarly, in (5.33b) the verb cluster ending a clause whose boundary is not marked is selected together with the adverb and the initial main verb of the subsequent clause.
(5.33) a. ALARM: Jo, det kanske han kan sa pappa.
'no that maybe he can [pres] said [pret] Daddy'
('No, maybe he can, said Daddy.')
GRANSKA'S DIAGNOSIS: unusual with verb form sa 'said [pret]' after modal verb kan 'can [pres]': kan säga 'can [pres] say [inf]'
b. ALARM: precis när dom skulle börja så hörde dom en röst
'just when they would [pret] start [inf] so heard [pret] they a voice'
('Just when they were about to begin, they heard a voice.')
GRANSKA'S DIAGNOSIS: unusual with verb form hörde 'heard [pret]' after modal verb skulle 'would [pret]': skulle börja så ha hört 'would [pret] start [inf] so have [inf] heard [sup]' or skulle börja så höra 'would [pret] start [inf] so hear [inf]'
Granska's performance rates are 57% recall and 44% precision. Scarrie detected only one error in verb form after an auxiliary verb in Child Data (G6.1.6 - see (5.32c) above) and made altogether nine false flaggings. Two false alarms occurred at sentence boundaries, one of them on the same instance as Granska's, see (5.33a) above. Scarrie ends up with a performance result of 14% recall and 10% precision. In conclusion, Granska detects more than half of the verb errors after auxiliaries, but the performance of the other tools is very low, detecting either none or just one such error. Missing Auxiliary Verb All the tools check explicitly for supine verb forms occurring without the infinitive auxiliary ha 'have'. It is not clear whether they also check for omission of the finite forms of the auxiliary verb in front of a bare supine. In Swedish, the bare supine is only used in subordinate clauses (see Section 4.3.5). Two errors with a bare supine form in main clauses were found in Child Data. Grammatifix did not find these two errors. Instead, Grammatifix suggested insertion of the auxiliary verb ha 'have' in constructions between an auxiliary verb and a supine verb form. This is rather a stylistic correction and is not part of the present analysis.
Altogether, nine such suggestions were made of the kind given below:
(5.34) ALARM: jag skulle ätit för en kvart sen
'I should [pret] eaten [sup] for a quarter ago'
('I should have eaten a quarter of an hour ago.')
GRAMMATIFIX'S DIAGNOSIS: Consider the word ätit 'eaten [sup]'. A verb such as skulle 'should [pret]' combines in polished style with ha 'have [inf]' + supine rather than with a bare supine: skulle ha ätit 'should [pret] have [inf] eaten [sup]'
The same happened with Granska: no errors were detected, and the suggestions made were for insertion of the auxiliary ha 'have' in front of supine forms preceded by auxiliary verbs. Seven such flaggings occurred, as in (5.35), and two flaggings were false and occurred at sentence boundaries.
(5.35) ALARM: Jag måste svimmat.
'I must [pret] fainted [sup]'
('I must have fainted.')
GRANSKA'S DIAGNOSIS: unusual with verb form svimmat 'fainted [sup]' after the modal verb måste 'must [pret]': måste ha svimmat 'must [pret] have [inf] fainted [sup]'
Scarrie did find one of the error instances with a missing auxiliary verb in Child Data (G6.2.1). Eight other detections involved the same stylistic issue as for the other tools, suggesting insertion of ha 'have' between an auxiliary verb and a supine verb form, as in:
(5.36) ALARM: de kunde berott på att dom gillade samma tjej
'it could [pret] depend [sup] on that they liked same girl'
('It could have been because they liked the same girl.')
SCARRIE'S DIAGNOSIS: wrong verb form after modal verb
In conclusion, just one of the two missing auxiliary verb errors in Child Data was found, by Scarrie. The systems pay more attention to the stylistic issue of omitted ha 'have' with supine forms, pointing out that the supine verb form should not stand alone in formal prose.
Verb Form in Infinitive Phrase Granska and Scarrie search for erroneous verb forms following an infinitive marker and should not have problems finding these errors in Child Data, where only one instance included an adverb splitting the infinitive. Granska identified three errors in verb form after an infinitive marker, missing only the one with an adverb between the parts of the infinitive (G7.1.1 - see (4.35) p.62). This problem of syntactic coverage was already discussed in Section 5.4.4, in the examples in (5.13), where it also emerged that Granska does not take adverbs into consideration. Altogether six false alarms occurred. Granska's overall performance rates are 75% recall and 33% precision. Scarrie detected one of the errors in Child Data, where the infinitive marker is followed by a verb in imperative form instead of the infinitive: att gör 'to do [imp]' (G7.1.4). Also, one false flagging occurred, shown in (5.37), where it seems that the system misinterpreted the conjunction för att 'because' as the infinitive marker att 'to':
(5.37) ALARM: så jag sa att hon skulle ta det lite lugnt för annars så kan hon skada sig och det är ju inte så bra.
'so I said that she should take it little easy for otherwise so can [pres] she hurt [inf] herself and it is of-course not so good'
('So I said that she should take it easy a little because otherwise she might hurt herself, and that is of course not so good.')
SCARRIE'S DIAGNOSIS: inflected verb form after att 'to'
In conclusion, Granska finds all but one of the errors, missing one due to insufficient syntactic coverage, and also makes quite a few false flaggings. Scarrie has difficulties with this error type and Grammatifix does not target it at all. Missing Infinitive Marker with Verbs All the tools check explicitly for both missing and extra inserted infinitive markers. Three errors with a missing infinitive marker occurred in Child Data, in the context of the auxiliary verb komma 'will'.
As presented in Section 4.3.5, certain main verbs also take an infinitive phrase as complement; some of these lack the infinitive marker and start to behave as auxiliary verbs, which normally do not combine with an
infinitive marker and only take bare infinitives as complement. This development is currently in progress in Swedish, which suggests that these constructions should rather be treated as stylistic issues. Grammatifix did not find the three errors in Child Data with omitted infinitive markers with the auxiliary verb komma 'will' (see example (4.36) p.62). In seven cases, the tool instead suggested removing the infinitive marker with the verbs börja 'begin' and tänka 'think', e.g.:

(5.38) a. ALARM: Jag och Virginia började att berätta om tromben och den övergivna byn
I and Virginia started [pret] to tell [inf] about the-tornado and the abandoned the-village
'Virginia and I started to tell about the tornado and the abandoned village.'
GRAMMATIFIX'S DIAGNOSIS: Check the words att 'to' and berätta 'tell [inf]'. If an infinitive is governed by the verb började 'started [pret]', the infinitive should not be preceded by att 'to': började berätta 'started [pret] tell [inf]'.

b. ALARM: 4 hus och 5 affärer var gjorda i ordning av gumman som hade tänkt att göra museum av den gamla staden
4 houses and 5 shops were done in order by the-old-lady who had [pret] thought [sup] to make [inf] museum of the old the-city
'4 houses and 5 shops were tidied up by the old lady who had planned to make a museum of the old city.'
GRAMMATIFIX'S DIAGNOSIS: Check the words att 'to' and göra 'make [inf]'. If an infinitive is governed by the verb tänkt 'thought [sup]', the infinitive should not be preceded by att 'to': tänkt göra 'thought [sup] make [inf]'.

Granska detected all three omitted infinitive markers in the context of the auxiliary verb komma 'will'. In this case also six false flaggings occurred, concerning the same verb used as a main verb, e.g.:
(5.39) ALARM: han kommer och klappar alla på handen utan en kille. undra ( undrar) hur han känner sig då?
he comes [pres] and pats all on the-hand except a boy. wonder [inf] ( wonder [pres]) how he feels himself then
'He comes and pats everybody's hand except one boy. (I) wonder how he feels then?'
GRANSKA'S DIAGNOSIS: kommer 'will' without att 'to' before verb in infinitive.

In two cases, Granska also suggested insertion of the infinitive marker with the verbs fortsätta 'continue' and prova 'try'. In nine cases, it wanted to remove the infinitive marker with the verbs börja 'begin', försöka 'try', sluta 'stop' and tänka 'think'. Scarrie detected two of the three missing infinitive marker errors with the verb komma 'will' found in Child Data. Quite a large number of false alarms (13) occurred with the verb used as a main verb, as in (5.40), where så is ambiguous between the conjunction 'so' or 'and' and a verb reading 'sow'. The precision rate is then only 13%.

(5.40) ALARM: men kom nu så går vi hem
but come now so/sow go we home
'But come now and we'll go home.'
SCARRIE'S DIAGNOSIS: att 'to' missing.

In five cases, Scarrie suggested removal of the infinitive marker in the context of the verbs börja 'begin', fortsätta 'continue' and sluta 'stop'. In conclusion, whereas both Granska and Scarrie performed well, Grammatifix did not succeed in tracing any of the errors with omitted infinitive markers with the auxiliary verb komma 'will'. Overall, all the tools suggested both omission and insertion of infinitive markers with certain main verbs. In some cases they agree, but there are also cases where one system suggests removal of the infinitive marker and another suggests insertion.
A clear indication of confusion in the use or omission of the infinitive marker showed up when Granska suggested inserting the infinitive marker in the verb sequence fortsätta leva 'continue live', as shown in (5.41a), whereas in (5.41b) Scarrie suggested removing it in the same verb sequence. This clearly indicates that the issue should be classified as a matter of style and not as a pure grammar error.
(5.41) a. ALARM: när jag dog 1978 i cancer återvände jag hit för att fortsätta leva mitt liv här
when I died 1978 of cancer returned I here for to continue [inf] live [inf] my life here
'When I died in 1978 of cancer, I returned here to continue living my life here.'
GRANSKA'S DIAGNOSIS: fortsätta att leva 'continue to live'.

b. ALARM: Vi fortsatte att leva som en hel familj i vårt nya hus här i Göteborg.
we continued [pret] to live [inf] as a whole family in our new house here in Göteborg
'We continued to live as a whole family in our new house here in Göteborg.'
SCARRIE'S DIAGNOSIS: fortsatte leva 'continued live'.

Word Order Errors

All three tools check for the position of adverbs (or negation) in subordinate clauses and for constituent order in interrogative subordinate clauses. Scarrie also checks for word order in main clauses with inversion. The word order errors found in Child Data are all quite complex, and none of the tools succeeded in detecting this type of error. However, false flaggings of correct sentences occurred. Grammatifix made 15 false alarms when checking word order; one involved a split word and three occurred at clause boundaries. A false flagging involving a clause boundary is presented in (5.42a), where Grammatifix regarded the adverb hem 'home' as wrongly placed between verbs. This problem is complicated not only by the second verb initiating a subsequent clause, but also by the fact that not all adverbs can precede verbs. Another false flagging is presented in (5.42b), where Grammatifix checked for adverbs placed after the main verb in the expected subordinate clause, but here main clause word order is found in the indirect speech construction. 28

28 Main clause word order occurs when the clause expresses the speaker's or the subject's opinion or beliefs.
(5.42) a. ALARM: När vi kom hem undra ( undrar) mamma självklart vart vi varit...
when we came home wonder [inf] ( wonder [pres]) mother of-course where we been
'When we came home, mother of course wondered where we had been.'
GRAMMATIFIX'S DIAGNOSIS: Check the placement of hem 'home'. In a subclause an adverb is not usually placed between the verbs. Placement before the finite verb is often suitable.

b. ALARM: killen i luren sa att han kommer genast
the-guy in the-receiver said that he comes immediately
'The guy in the receiver said that he would come immediately.'
GRAMMATIFIX'S DIAGNOSIS: Check the placement of genast 'immediately'. In a subclause a sentential adverb is placed by rule before the finite verb: genast kommer 'immediately comes'.

In (5.43) the sentence is erroneously marked as a word order error in the placement of negation. The problem, however, concerns the choice of the (explanative) conjunction för att 'since/due to', which combines with main clauses and is more typical of spoken Swedish (Teleman et al., 1999, Part2:730). This conjunction corresponds to för 'due to/in order to' in writing and then coordinates only main clauses. It is often confused with the causal subjunction för att 'because/with the intention of', which is used only with subordinate clauses and then requires adverbs to be placed before the main verb (Teleman et al., 1999, Part2:736).

(5.43) ALARM: ...då sa han ja för att han ville inte berätta för fröken att han var ensam
then said he yes for that he wanted not tell to the-teacher that he was alone
'... then he said yes, because he did not want to tell the teacher that he was alone.'
GRAMMATIFIX'S DIAGNOSIS: Check the placement of inte 'not'. In a subclause a sentential adverb is placed by rule before the finite verb: inte ville 'not wanted'.
All 15 of Granska's flaggings were false, either interpreting conjunctions as subjunctions, as in (5.44a), or not taking indirect speech into consideration, as in (5.44b), where the subject's opinion is expressed by main clause word order and not by the subordinate clause word order assumed by the tool.
(5.44) a. ALARM: ... men den gick av så jag hade bara lite gips kvar.
but it went off so I had just little plaster left
'... but it broke off so I only had a little plaster left.'
GRANSKA'S DIAGNOSIS: Word order error, erroneous placement of adverb in subordinate clause: bara hade 'just had'.

b. ALARM: då tycker jag att det var inte hans fel utan deras.
then believe I that it was not his fault but theirs
'Then I think that it was not his fault but theirs.'
GRANSKA'S DIAGNOSIS: Word order error, erroneous placement of adverb in subordinate clause: inte var 'not was'.

Scarrie's 11 diagnoses were also false, mostly of the type of the subject taking the position of the verb, as in (5.45a), along with cases of interpreting conjunctions as subjunctions, as in (5.45b):

(5.45) a. ALARM: Då vi kom till min by. Trillade jag av brandbilen för det var en guppig väg.
when we came to my village fell I off the-fire-car for it was a bumpy road
'When we arrived in my village, I fell off the fire-engine because the road was bumpy.'
SCARRIE'S DIAGNOSIS: the subject in the verb position.

b. ALARM: dom kanske inte ens vet att man har prov för dom lyssnar inte på sitt barn...
they maybe not even know that one has test for they listen not at their child
'They probably do not even know that you have a test, because they do not listen to their child...'
SCARRIE'S DIAGNOSIS: the inflected verb before sentence adverbial in subordinate clause.

In conclusion, word order errors were hard to find due to their inner complexity. The tools seem to apply rather straightforward approaches that resulted in many false flaggings.

Redundancy

According to the error specifications, only Grammatifix searches for repeated words and should therefore be able to at least detect errors with doubled words.
Grammatifix identified the five errors with duplicated words immediately following each other. The number of false alarms is quite high (18 occurrences). One example is given below:

(5.46) ALARM: Var var den där överraskningen.
where was the there the-surprise
'Where was that surprise?'
GRAMMATIFIX'S DIAGNOSIS: doubled word.

No other superfluous elements were detected, so the system ends up with a performance rate of 38% in recall and 23% in precision.

Missing Constituents

All three tools search for sentences with omitted verbs or infinitive markers, also in the context of a preceding preposition. Grammatifix did not find any missing verbs, but detected the only error with a missing infinitive marker in front of an infinitive verb after certain prepositions (G10.3.1), shown in (5.47).

(5.47) a. Efter ha sprungit igenom häckarna två gånger så vilade vi lite...
after have [inf] run [sup] through the-hurdles two times then rested we little
'After twice running through the hurdles, we rested a little.'
b. Efter att ha sprungit...
after to have [inf] run [sup]

Six false alarms occurred for this error type, mostly when the adverb tillbaka 'back' was split, as shown in (5.48). The problem is that the split word results in a preposition till 'to' and the verb baka 'bake'.

(5.48) ALARM: inget kvack kom till baka
no quack came to bake
'No quack came back.'
GRAMMATIFIX'S DIAGNOSIS: Check the word baka 'bake'. If an infinitive is governed by a preposition it should be preceded by att 'to'.

In the case of omitted verbs, Granska checked only for occurrences of single words such as Slut. 'End.' or sentence fragments such as Tom grå och tyst. 'Empty
grey and silent.' or Inte ens pappa. 'Not even daddy.' The program further suggested that the error might be a title: "Verb seems to be missing in the sentence. If this is a title it should not end with a period." Altogether, 25 sentences were judged to be missing a verb and 12 false alarms occurred. None of the errors listed in Child Data were detected by Granska. This particular error type is not included in the present performance analysis. Granska also checks for missing subjects. Two cases concerned short sentence fragments and two were false flaggings, like the one in (5.49) below.

(5.49) ALARM: Hade alla 7 vandrat förgäves?
had all 7 walked in-vain
'Had all seven walked in vain?'
GRANSKA'S DIAGNOSIS: a subject seems to be missing in the sentence.

Scarrie also checks for missing subjects and successfully detected the error G10.1.5, shown in (5.50). The other three flaggings were false. In the case of a missing infinitive marker in constructions where a preposition precedes an infinitive phrase, six false flaggings occurred. Like Grammatifix, Scarrie marks erroneous splits homonymous with prepositions (see (5.48) above).

(5.50) a. man försöker att lära barnen att om fuskar med t ex prov då...
one tries to teach the-children that if cheat with e.g. test then
'One tries to teach children that if (they) cheat on e.g. a test then...'
b. om de fuskar med ett prov då...
if they cheat with a test then

In conclusion, many of the omitted constituents are not covered by these tools and result mostly in false flaggings. Grammatifix successfully detected a missing infinitive marker preceded by a preposition and Scarrie detected a missing subject.

Other Errors

Among other error types, all the tools also check whether a sentence has too many finite verbs. Grammatifix succeeded in finding three instances of unmarked sentence boundaries. In three cases, false flaggings occurred, listed in (5.51).
Two such flaggings concerned ambiguity between a verb and a pronoun, and the one in (5.51c) involved a spelling error that resulted in a verb. These alarms are not part of the system's performance test, since such errors were not the target of this analysis.
(5.51) a. ALARM: Han undrade var de var någonstans
he wondered where they were somewhere
'He wondered where they were?'
GRAMMATIFIX'S DIAGNOSIS: Check the word forms undrade 'wondered' and var 'where/was'. It seems as if the sentence might have too many finite verbs.

b. ALARM: Var var den där överraskningen.
where was the there the-surprise
'Where was that surprise?'
GRAMMATIFIX'S DIAGNOSIS: Check the word forms var 'where/was' and var 'where/was'. It seems as if the sentence might have too many finite verbs.

c. ALARM: Pojken blev red ( rädd)
the-boy became rode ( afraid)
'The boy became afraid.'
GRAMMATIFIX'S DIAGNOSIS: Check the word forms blev 'became' and red 'rode'. It seems as if the sentence might have too many finite verbs.

Granska checks for occurrences of other finite verbs after the copula verb vara 'be'. In Child Data, however, the only detections were false flaggings (8 occurrences), mostly due to homonymy between the verb and the adverb var 'where' (5 occurrences). Three false alarms occurred because of spelling errors, as in (5.52a), or at sentence boundaries, as in (5.52b):

(5.52) a. ALARM: Pojken blev red ( rädd)
the-boy became [pret] rode [pret] ( afraid)
'The boy became afraid.'
GRANSKA'S DIAGNOSIS: it is unusual to have a verb after the verb blev 'became [pret]'.

b. ALARM: som tur var landade jag på skyddsnätet på brandbilen
as luck was [pret] landed [pret] I on the-safety-net on the-fire-engine
'Luckily I landed on the safety-net on the fire-engine.'
GRANSKA'S DIAGNOSIS: it is unusual to have a verb after the verb var 'was [pret]'.

Scarrie also checks for occurrences of two finite verbs in a row, but provides a diagnosis of a possible sentence boundary as well. Eight sentence boundaries were found and eight false markings occurred, often due to lexical ambiguity, as in (5.53). Also in Scarrie's case, these alarms are not included in the analysis.
(5.53) ALARM: Men sen kom en tjej som visste vem jag var för hon...
but then came a girl that knew who I was [pret] for/lead [imp] she
'But then came a girl that knew who I was, because she...'
SCARRIE'S DIAGNOSIS: two inflected verbs in predicate position or a sentence boundary.

Finally, Scarrie checks noun case, where the genitive form of proper nouns is suggested in constructions of a proper noun followed by a noun. All of these resulted in false flaggings, due to part-of-speech ambiguity, e.g.:

(5.54) ALARM: Men på morgonen när Erik såg att hans groda var försvunnen.
but in the-morning when Erik [nom] saw that his frog was disappeared
'But in the morning when Erik saw that his frog had disappeared.'
SCARRIE'S DIAGNOSIS: basic form instead of genitive.

5.5.5 Overall Detection Results

In accordance with the error specifications of the systems, none of the Swedish tools detects errors in definiteness in single nouns or reference, and only Grammatifix checks for repeated words among redundancy errors. Missing constituents are checked only when a verb, subject or infinitive marker is missing. Word choice errors, represented by prepositions in idiomatic expressions, are checked by Granska. The detection results on Child Data, discussed in the previous section, are summarized in Tables 5.4, 5.5 and 5.6 below. Among the most frequent error types in Child Data, represented by errors in finite verbs, missing constituents, word choice errors, agreement in noun phrase and redundant words, Grammatifix succeeded in finding errors in four of these types, Scarrie in three of them and Granska in two categories. All the tools were best at finding errors in noun phrase agreement, with a recall rate between 53% and 67% and precision between 7% and 37%. For the most common error, finite verb form, all obtained very low coverage, with recall between 4% and 15% and precision between 36% and 57%.
Grammatifix succeeded in finding all the repeated words among the redundancy errors and one occurrence of a missing constituent. Scarrie also found one missing constituent. No word choice errors were found by Granska. Other error types in Child Data occurred less than ten times each, so no general conclusions can be drawn on how the tools performed on those.
Table 5.4: Performance Results of Grammatifix on Child Data

GRAMMATIFIX: ERROR TYPE | ERRORS | CORRECT ALARM (Correct Diagnosis, Incorrect Diagnosis) | FALSE ALARM (No Error, Other Error) | PERFORMANCE (Recall, Precision, F-value)
Agreement in NP 15 | 7 1 | 4 16 | 53% 29% 37%
Agreement in PRED 8 | 1 | 2 1 | 13% 25% 17%
Definiteness in single nouns 6 | | | 0%
Pronoun case 5 | 2 | | 40% 100% 57%
Finite Verb Form 110 | 3 1 | 5 2 | 4% 36% 7%
Verb Form after Vaux 7 | | | 0%
Vaux Missing 2 | | | 0%
Verb Form after inf. marker 4 | | | 0%
Inf. marker Missing 3 | | | 0%
Word order 5 | | 11 4 | 0% 0%
Redundancy 13 | 5 | 16 1 | 38% 23% 29%
Missing Constituents 44 | 1 1 | 6 | 5% 25% 8%
Word Choice 28 | | | 0%
Reference 8 | | | 0%
Other 4 | | | 0%
TOTAL 262 | 18 4 | 38 30 | 8% 24% 12%

Table 5.5: Performance Results of Granska on Child Data

GRANSKA: ERROR TYPE | ERRORS | CORRECT ALARM (Correct Diagnosis, Incorrect Diagnosis) | FALSE ALARM (No Error, Other Error) | PERFORMANCE (Recall, Precision, F-value)
Agreement in NP 15 | 5 3 | 8 17 | 53% 24% 33%
Agreement in PRED 8 | 3 | 3 2 | 38% 38% 38%
Definiteness in single nouns 6 | | | 0%
Pronoun case 5 | 3 | 24 | 60% 11% 19%
Finite Verb Form 110 | 8 1 | 8 1 | 8% 50% 14%
Verb Form after Vaux 7 | 4 | 5 | 57% 44% 50%
Vaux Missing 2 | | 2 | 0% 0%
Verb Form after inf. marker 4 | 3 | 6 | 75% 33% 46%
Inf. marker Missing 3 | 3 | 6 | 100% 33% 50%
Word order 5 | | 15 | 0% 0%
Redundancy 13 | | | 0%
Missing Constituents 44 | | 2 | 0% 0%
Word Choice 28 | | | 0%
Reference 8 | | | 0%
Other 4 | | | 0%
TOTAL 262 | 29 4 | 79 20 | 13% 25% 17%
Table 5.6: Performance Results of Scarrie on Child Data

SCARRIE: ERROR TYPE | ERRORS | CORRECT ALARM (Correct Diagnosis, Incorrect Diagnosis) | FALSE ALARM (No Error, Other Error) | PERFORMANCE (Recall, Precision, F-value)
Agreement in NP 15 | 8 2 | 83 50 | 67% 7% 13%
Agreement in PRED 8 | | 12 1 | 0% 0%
Definiteness in single nouns 6 | | | 0%
Pronoun case 5 | 3 | 17 | 60% 15% 24%
Finite Verb Form 110 | 16 1 | 13 | 15% 57% 24%
Verb Form after Vaux 7 | 1 | 7 2 | 14% 10% 12%
Vaux Missing 2 | 1 | | 50% 100% 67%
Verb Form after inf. marker 4 | 1 | 1 | 25% 50% 33%
Inf. marker Missing 3 | 2 | 13 | 67% 13% 22%
Word order 5 | | 11 | 0% 0%
Redundancy 13 | | | 0%
Missing Constituents 44 | 1 | 4 5 | 2% 10% 3%
Word Choice 28 | | | 0%
Reference 8 | | | 0%
Other 4 | | | 0%
TOTAL 262 | 33 3 | 161 58 | 14% 14% 14%

Overall performance figures in detecting the errors in Child Data show that Grammatifix did not detect many of the verb errors at all and has the lowest recall. Scarrie, on the other hand, detects the most errors of the three, but has a high number of false flaggings. Errors in agreement with a predicative complement were hard to find in general, even in cases where the subject and the predicate were adjacent; more complex structures would obviously pose even more of a problem for the tools. Even when errors were found in these constructions, the tools often gave an incorrect diagnosis. Among the false flaggings, quite a few involved errors other than grammatical ones. The overall performance of the tools including all error types when applied to Child Data ends up at a recall rate of 14% at most, and a precision rate between 14% and 25%. Grammatifix detected the fewest errors and had the fewest false alarms, but its quite low recall rate leads to the lowest F-value of 12%. Granska found slightly more errors and had more false flaggings, obtaining the best F-value of 17%. Scarrie performed best of the tools in grammatical coverage, but at the cost of many false alarms, giving an F-value of 14%.
In Table 5.7 the overall performance of the systems is presented for the errors they target specifically, excluding the zero-results. Observe that the F-values are slightly higher due to increased recall. Precision rates remain the same. 29

Table 5.7: Performance Results of Targeted Errors

TOOL | ERRORS | CORRECT ALARM | FALSE ALARM | RECALL | PRECISION | F-VALUE
Grammatifix | 166 | 22 | 68 | 13% | 24% | 17%
Granska | 174 | 33 | 97 | 19% | 25% | 22%
Scarrie | 170 | 36 | 214 | 21% | 14% | 17%

The performance tests on published adult texts and some student papers provided by the developers of these tools (see Table 5.3 on p.141) show on average much higher validation rates for those texts, with an overall coverage between 35% and 85% and precision between 53% and 77%. Granska proves to be best at detecting errors in verb form in the adult text data evaluated by the developers, with a recall rate of 97%. Verb form errors are mostly represented by errors in finite verb form in Child Data, where Granska obtained a recall of 8%. Other types of verb errors occurred less than ten times, which makes the performance result uncertain. For agreement errors in noun phrases, the second best category of Granska when tested on adult texts, Granska obtained much better results and detected at least half of the errors, with a recall of 53%. Since the error frequency is much higher in texts written by children, the size of the Child Data corpus can be considered satisfactory and safe for evaluation, at least for the most frequent error types. This performance test shows that the three Swedish tools, designed in the first place for adult writers, in general have difficulty detecting errors in texts such as Child Data. As indicated in some examples, this is not only due to insufficient error coverage of the defined error types in the systems. The structure of the texts may also be a cause of certain errors not being detected or being erroneously marked as errors.
Sometimes different results were obtained when sentences were split into smaller units.

29 Grammatifix: redundancy includes 5 errors in doubled word; missing constituents are counted as infinitive marker (1) and verb (5). Granska: missing verb (5), choice of preposition (10). Scarrie: missing subject (10), missing infinitive marker (1).
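As a side note on how the figures in Tables 5.4 through 5.7 are computed, the recall, precision and F-value columns follow the standard definitions: recall is correct alarms over true errors, precision is correct alarms over all alarms, and the F-value is their harmonic mean. A minimal sketch in Python, using Granska's row in Table 5.7 as the worked example:

```python
def recall(correct_alarms, total_errors):
    # proportion of the true errors that the tool flagged
    return correct_alarms / total_errors

def precision(correct_alarms, false_alarms):
    # proportion of the tool's alarms that were genuine
    return correct_alarms / (correct_alarms + false_alarms)

def f_value(r, p):
    # harmonic mean of recall and precision
    return 2 * r * p / (r + p) if r + p else 0.0

# Granska's row in Table 5.7: 174 targeted errors, 33 correct alarms,
# 97 false alarms.
r = recall(33, 174)
p = precision(33, 97)
print(round(r * 100), round(p * 100), round(f_value(r, p) * 100))  # 19 25 22
```

The printed figures reproduce the 19% recall, 25% precision and 22% F-value reported for Granska in Table 5.7.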
5.6 Summary and Conclusion

From the above analyses it is clear that among the grammar errors found in Child Data, all non-structural errors and some types of structural errors should be possible to detect by syntactic analysis and partial parsing, whereas other errors require more complex analysis or wider context. Among the central error types in Child Data, errors in finite verb form and agreement errors in noun phrases could be handled by partial parsing, which I will show in Chapter 6. The other more frequent errors, such as missing constituents, word choice errors and redundant words forming new lemmas, require deeper analysis. Furthermore, some real-word spelling errors might be detected if they violate syntax. Missing punctuation at sentence boundaries requires analysis of at least the predicate's complement structure. All the errors in Child Data except definiteness in single nouns and reference seem to be more or less covered by the Swedish tools, considering their error specifications. The performance results show that agreement errors in noun phrases are the error type best covered, whereas errors in finite verb forms, in relation to their frequency, obtained a very low recall in all three systems. Grammatifix had difficulty in general detecting any errors concerning verbs. Granska performed best in this respect. Overall, all the tools detect few errors in Child Data and the precision rate is quite low. It is not clear how many of the missed errors depended on insufficient syntactic coverage and how many depended on the complexity of the sentences in Child Data. That is, all three tools rely on the sentence as the unit of analysis, but sentences in Child Data do not always correspond to syntactic sentences. They often include adjoined clauses or quite long sentences (see Section 4.6). These tools are not designed to handle such complex structures.
In conclusion, many of the errors in Child Data that can be handled by partial parsing are detected at a rate of not more than 60% by the Swedish grammar checkers. Errors in finite verb form obtained quite low results and are the type of error that needs the most improvement, especially since they are the most common error in Child Data.
Chapter 6 FiniteCheck: A Grammar Error Detector

6.1 Introduction

This chapter reports on automatic detection of some of the grammar errors discussed in Chapter 4. The challenge of this part of the work is to exploit correct descriptions of the language, instead of describing the structure of errors, and to apply finite state techniques to the whole process of error detection. The implemented grammar error detector FiniteCheck identifies grammar errors using partial finite state methods, identifying syntactic patterns through a set of regular grammar rules (see Section 6.2.4). Constraints are used to reduce alternative parses or adjust the parsing result. There are no explicit error rules in the grammars of the system, in the sense that no grammar rules state the syntax of erroneous (ungrammatical) patterns. The rules of the grammar are always positive and define the grammatical structure of Swedish. The only error-related constraint is the context of the error type. The present grammar is highly corpus-oriented, based on the lexical and syntactic circumstances displayed in the Child Data corpus. Ungrammatical patterns are detected by adopting the same method that Karttunen et al. (1997a) use for extraction of invalid date expressions, presented in Section 6.2.4. In short, potential candidates for grammatical violations are identified through a broad grammar that overgenerates and also accepts invalid (ungrammatical) constructions. Valid (grammatical) patterns are defined in another, narrow grammar, and the ungrammaticalities among the selected candidates are identified as the difference between these two grammars. In other words, the strings selected
174 Chapter 6. by the rules of the broad grammar that are not accepted by the narrow grammar are the remaining ungrammatical violations. The current system looks for errors in noun phrase agreement and verb form, such as selection of finite and non-finite verb forms in main and subordinate clauses and infinitival complements. Errors in the finite verb form in the main verb were the most natural choice for implementation since these are the most frequent error type in the Child Data corpus, represented by 110 error instances (see Figure 4.1 on p.73). Moreover, verb form errors are possible to detect using partial parsing techniques (see Section 5.3.3). Inclusion of errors in the finite main verb motivated expansion of this category to include other errors related to verbs, with addition of other types of finite verb errors and errors in non-finite verb forms. Errors in noun phrase agreement were among the five most frequent error types. In comparison to other writing populations this type of error might be considered as one of the central error types in Swedish (see Section 4.7). Furthermore, noun phrase errors are limited within the noun phrase and can most likely be detected by partial parsing (see Section 5.3). The other errors among the five most common error types in Child Data, including word choice errors and errors with extra or missing constituents, are not locally restricted in this way and will certainly require a more complex analysis. The development of the grammar error detector started with the project Finite State Grammar for Finding Grammatical Errors in Swedish Text (1998-1999). It was part of a larger project Integrated Language Tools for Writing and Document Handling in collaboration with the Numerical Analysis and Computer Science Department (NADA) at the Royal Institute of Technology (KTH) in Stockholm. 1 The project group in Göteborg consisted of Robin Cooper, Robert Andersson and myself. 
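The subtraction idea can be illustrated with a minimal sketch in Python, under hedged assumptions: the toy patterns below are hypothetical and operate on raw word strings with regular expressions rather than on FiniteCheck's actual automata and lexicon. A broad pattern accepts any determiner-noun sequence regardless of gender agreement, a narrow pattern accepts only the agreeing combinations, and the difference between the two (broad matches rejected by the narrow grammar) yields exactly the ill-formed phrases:

```python
import re

# Toy vocabulary: den/det are determiners, bilen 'the car' is common
# gender (takes den), huset 'the house' is neuter (takes det).

# Broad grammar: any determiner followed by any noun (overgenerates,
# accepting both valid and invalid combinations).
BROAD = r"(den|det)\s+(bilen|huset)"

# Narrow grammar: only the gender-agreeing combinations are valid.
NARROW = r"den\s+bilen|det\s+huset"

def flag_np_errors(text):
    """Return NP candidates matched by the broad grammar but rejected
    by the narrow one -- the 'difference' between the two grammars."""
    errors = []
    for m in re.finditer(BROAD, text):
        if not re.fullmatch(NARROW, m.group(0)):
            errors.append(m.group(0))
    return errors

print(flag_np_errors("jag såg det bilen vid den bilen och den huset"))
# -> ['det bilen', 'den huset']
```

Only positive rules appear in either pattern: the narrow grammar describes valid Swedish, and the errors fall out as the set difference, with no error anticipation needed.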
In the description of the system I will cover the whole system and its functionality, in particular my own contributions: mainly a first version of the lexicon, expansion of the grammar and its adjustment to the present corpus data of children's texts, disambiguation and other adjustments to parsing results, as well as evaluation of and improvements to the system's flagging accuracy. The work of the other two members concerns primarily the final version of the lexicon, optimization of the tagset, the basic grammar and the system interface. I will not discuss their contributions in detail but will refer to the project reports when relevant. The chapter proceeds with a short introduction to finite state techniques and parsing (Section 6.2). The description of FiniteCheck starts with an overview of the system's architecture, including short presentations of the different modules (Section 6.3). Then follows a section on the composition of the lexicon, with a description of the tagset and the identification of grammatical categories and features (Section 6.4). Next, the overgenerating broad grammar set is presented (Section 6.5), followed by a section on parsing (Section 6.6). The chapter then proceeds with a presentation of the narrow grammar of noun phrases and the verbal core (Section 6.7) and the actual error detection (Section 6.8). The chapter concludes with a summary (Section 6.9). Performance results of FiniteCheck are presented in Chapter 7.

1 The project was sponsored by the HSFR/NUTEK Language Technology Programme. See http://www.ling.gu.se/sylvana/fsg/ for the methods and goals of our part of the project.

6.2 Finite State Methods and Tools

6.2.1 Finite State Methods in NLP

Finite state technology as such has been used since the emergence of computer science, for instance for program compilation, hardware modeling or database management (Roche, 1997). Finite state calculus is considered in general to be powerful and well-designed, providing flexible, space- and time-efficient engineering applications. However, in the domain of Natural Language Processing (NLP) finite state models were considered efficient but somewhat inaccurate, often resulting in applications of limited size. Other formalisms such as context-free grammars were preferred and considered more accurate than finite state methods, despite difficulties in reaching reasonable efficiency. Thus, grammars approximated by finite state models were considered more efficient and simpler, but at the cost of a loss of accuracy. Improvements in the mathematical properties of finite state methods and re-examination of their descriptive possibilities enabled the emergence of applications for a variety of NLP tasks, such as morphological analysis (e.g. Karttunen et al., 1992; Clemenceau and Roche, 1993; Beesley and Karttunen, 2003), phonetic and speech processing (e.g. Pereira and Riley, 1997; Laporte, 1997), and parsing (e.g.
Koskenniemi et al., 1992; Appelt et al., 1993; Abney, 1996; Grefenstette, 1996; Roche, 1997; Schiller, 1996). In this section the finite state formalism is described, along with possibilities for the compilation of such devices (Section 6.2.2). Next, the Xerox compiler used in the present implementation is presented (Section 6.2.3). The techniques of finite state parsing are then explained, along with a description of a method for extracting invalid input from unrestricted text that plays an important role in the present implementation (Section 6.2.4).
6.2.2 Regular Grammars and Automata
Adopting finite state techniques in parsing means modeling the syntactic relations between words using regular grammars 2 and applying finite state automata to recognize (or generate) the corresponding patterns defined by such a grammar. A finite state automaton is a computer model representing the regular expressions defined in a regular grammar. It takes a string of symbols as input, executes some operations in a finite number of steps and halts, with the outcome interpreted, depending on the grammar, as the machine either accepting or rejecting the input. It is defined formally as a tuple consisting of a finite set of symbols (the alphabet), a finite set of states with a unique initial state, a number of intermediate states and final states, and finally a transition relation defining how to proceed between the different states. 3 Regular expressions represent sets of simple strings (a language) or sets of pairs of strings (a relation) mapping between two regular languages, upper and lower. Regular languages are represented by simple automata and regular relations by transducers. Transducers are bi-directional finite state automata, which means for example that the same automaton can be used for both analysis and generation. Several tools for the compilation of regular expressions exist. AT&T's FSM Library 4 is a toolbox designed for building speech recognition systems and supports the development of phonetic, lexical and language-modeling components. The compiler runs under UNIX and includes about 30 commands to construct weighted finite-state machines (Mohri and Sproat, 1996; Pereira and Riley, 1997; Mohri et al., 1998). FSA Utilities 5 is another compiler, developed primarily for experimental purposes in applying finite-state techniques in NLP.
The tool is implemented in SICStus Prolog and provides the possibility to compile new regular expressions from the basic operations, thus extending the set of regular expressions handled by the system (van Noord and Gerdemann, 1999). The compiler used in the present implementation is the Xerox Finite-State Tool, one of the Xerox software tools for computing with finite state networks, described further in the subsequent section.
2 Regular grammars are also called type-3 in the classification introduced by Noam Chomsky (Chomsky, 1956, 1959).
3 See e.g. Hopcroft and Ullman (1979); Boman and Karlgren (1996) for exact formal definitions of finite state automata. A gentle introduction is presented in Beesley and Karttunen (2003).
4 The homepage of AT&T's FSM Library: http://www.research.att.com/sw/tools/fsm/
5 The homepage of FSA Utilities: http://www.let.rug.nl/~vannoord/fsa/
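The formal definition of an automaton given above can be sketched in a few lines of code. The following is a minimal illustration only (not the XFST implementation), with a hypothetical machine accepting the language a b* c:

```python
# A minimal deterministic finite state automaton, following the formal
# definition above: an alphabet, states with a unique initial state,
# a set of final states, and a transition relation.
# The example machine is hypothetical: it accepts the language  a b* c.

TRANSITIONS = {          # (state, symbol) -> next state
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
INITIAL, FINALS = 0, {2}

def accepts(string):
    """Run the automaton; accept iff it halts in a final state."""
    state = INITIAL
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False             # no transition defined: reject
        state = TRANSITIONS[(state, symbol)]
    return state in FINALS

print(accepts("abbc"))   # True:  a b* c
print(accepts("ab"))     # False: halts in a non-final state
```

The transition relation here is a partial function, so the machine is deterministic; a transducer would map pairs of symbols instead of single ones.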
6.2.3 Xerox Finite State Tool
Introduction
Xerox research developed a system for computing with and compiling finite-state networks, the Xerox Finite State Tool (XFST). 6 The tool is a successor to two earlier interfaces: IFSM, created at PARC by Lauri Karttunen and Todd Yampol in 1990-92, and FSC, developed at RXRC by Pasi Tapanainen in 1994-95 (Karttunen et al., 1997b). The system runs under UNIX and is supplemented with an interactive interface and a compiler. Finite state networks of simple automata or transducers are compiled from regular expressions and can be saved to a binary file. The networks can also be converted to Prolog format.
The Regular Expression Formalism
The metalanguage of regular expressions in XFST includes a set of basic operators such as union (or), concatenation, optionality, ignoring, iteration, complement (negation), intersection (and), subtraction (minus), crossproduct and composition, and an extended set of operators such as containment, restriction and replacement. The notational conventions for the part of the regular expression formalism in XFST that is used in the present implementation, including the operators and atomic expressions, are presented in Table 6.1 (cf. Karttunen et al., 1997b; Beesley and Karttunen, 2003). Uppercase letters such as A here denote regular expressions. For a description of the syntax and semantics of these operators see Karttunen et al. (1997a). The replacement operators play an important role in the present implementation and are further explained below.
6 Technical documentation and demonstration of the XFST can be found at: http://www.rxrc.xerox.com/research/mltt/fst/
Table 6.1: Some Expressions and Operators in XFST

ATOMIC EXPRESSIONS
epsilon symbol (the empty string)            0
any (unknown) symbol, universal language     ?, ?*

UNARY OPERATIONS
iteration: zero or more (Kleene star)        A*
iteration: one or more (Kleene plus)         A+
optionality                                  (A)
containment                                  $A
complement (not)                             ~A

BINARY OPERATIONS
concatenation                                A B
union (or)                                   A | B
intersection (and)                           A & B
ignoring                                     A / B
composition                                  A .o. B
subtraction (minus)                          A - B
replacement (simple)                         A -> B

Replacement Operators
The original version of the replacement operator was developed by Ronald M. Kaplan and Martin Kay in the early 1980s and was applied to phonological rewrite rules implemented by finite state transducers. Replacement rules can be applied in an unconditional version or constrained by context or direction (Karttunen, 1995, 1996). Simple (unconditional) replacement has the format UPPER -> LOWER, denoting a regular relation (Karttunen, 1995): 7

(RE6.1) [ NO_UPPER [UPPER .x. LOWER] ]* NO_UPPER;

For example, the relation [a b c -> d e] 8 maps the string abcde to dede. Replacement may start at any point and include alternative replacements, making these transducers non-deterministic and yielding multiple results. For example, a transducer represented by the regular expression in (RE6.2) produces four different results (axa, ax, xa, x) for the input string aba, as shown in (6.1) (Karttunen, 1996).
7 NO_UPPER corresponds to ~$[UPPER - []].
8 Lower-case letters, such as a, represent symbols. Symbols can be unary (e.g. a, b, c) or symbol pairs (e.g. a:x, b:0) denoting relations (i.e. transducers). The identity relation, where a symbol maps to the same symbol as in a:a, is ignored and thus written as a.
(RE6.2) a b | b | b a | a b a -> x

(6.1)  a b a     a b a     a b a     a b a
         -         ---     ---       -----
       a x a     a x       x a       x

Directionality and the length of the replacement can be constrained by the directed replacement operators. The replacement can start from the left or from the right, choosing the longest or the shortest replacement. Four types of directed replacement are defined (Karttunen, 1996):

Table 6.2: Types of Directed Replacement
                longest match   shortest match
left-to-right   @->             @>
right-to-left   ->@             >@

Now, applying the same regular expression as above with the left-to-right, longest-match replacement operator, as in the regular expression in (RE6.3), yields just one solution for the string aba, as shown in (6.2).

(RE6.3) a b | b | b a | a b a @-> x

(6.2)  a b a
       -----
         x

Directed replacement is defined as a composition of four relations that are composed in advance by the XFST compiler. The advantage is that the replacement takes place in one step, without any additional levels or symbols. For instance, the left-to-right longest-match replacement UPPER @-> LOWER is composed of the following relations (Karttunen, 1996):

(6.3) Input string .o. Initial match .o. Left-to-right constraint .o. Longest-match constraint .o. Replacement

With these operators, transducers that mark (or filter) patterns in text can be constructed easily. For instance, strings can be inserted before and after a string
that matches a defined regular expression. For this purpose a special insertion symbol ... is used on the right-hand side to represent the string found matching the left-hand side: UPPER @-> PREFIX ... SUFFIX. Following an example from Karttunen (1996), a noun phrase that consists of an optional determiner (d), any number of adjectives a* and one or more nouns n+ can be marked using the regular expression in (RE6.4), mapping dannvaan into [dann]v[aan], as shown in (6.4). Thus, the expression compiles to a transducer that inserts brackets around maximal instances of the noun phrase pattern.

(RE6.4) (d) a* n+ @-> %[ ... %]

(6.4)  dann     v   aan
       ----         ---
       [dann]   v   [aan]

The replacement can be constrained further by a specific context, both on the left and on the right of a particular pattern: UPPER @-> LOWER || LEFT _ RIGHT (see Karttunen, 1995, for further variations). Furthermore, the replacement can be parallel, meaning that multiple replacements are performed at the same time (see Kempe and Karttunen, 1996). For instance, the regular expression in (RE6.5) denotes a constrained parallel replacement, where the symbol a is replaced by the symbol b and at the same time the symbol b is replaced by c. Both replacements occur at the same time, and only if the symbols are preceded by the symbol x and followed by the symbol y. Applying this automaton to the string xaxayby yields the string xaxbyby, and applying it to the string xbybyxa yields xcybyxa, as presented in (6.5).

(RE6.5) a -> b, b -> c || x _ y

(6.5)  xaxayby     xbybyxa
       xaxbyby     xcybyxa
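The bracketing behavior of (RE6.4) can be approximated with ordinary regular expressions, since left-to-right scanning with greedy quantifiers mimics left-to-right longest match for this pattern. A sketch, not the XFST machinery:

```python
import re

# Approximate the XFST marking transducer  (d) a* n+ @-> %[ ... %]
# with a Python regex: re.sub scans left to right, and greedy
# quantifiers pick the longest match here, mimicking @->.
noun_phrase = re.compile(r"d?a*n+")

def mark(text):
    # \g<0> copies the matched string, like the ... insertion symbol
    return noun_phrase.sub(r"[\g<0>]", text)

print(mark("dannvaan"))   # [dann]v[aan]
```

The equivalence holds only for patterns like this one; in general, backtracking regex engines and the longest-match transducer semantics can differ.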
instance, parts of rules that are similar can be represented only once, reducing the whole set of rules; an automaton can be transformed so that each state has only one outgoing transition per symbol (determinization); and an automaton can be reduced to a minimal number of states (minimization). Moreover, one can create bi-directional machines, where the same automaton is used for both parsing and generation. Applications of finite state parsing are used mostly in the fields of terminology extraction, lexicography and information retrieval for large-scale text. The methods are partial in the sense that the goal is not the production of complete syntactic descriptions of sentences, but rather the recognition of various syntactic patterns in a text (e.g. noun phrases, verbal groups).
Parsing Methods
Many finite-state parsers adopt the chunking techniques of Abney (1991) and collect sets of pattern rules into finite ordered sequences of levels, so-called cascades, where the result of one level is the input to the next level (e.g. Appelt et al., 1993; Abney, 1996; Chanod and Tapanainen, 1996; Grefenstette, 1996; Roche, 1997). The parsing procedure over a text tagged for parts-of-speech usually proceeds by marking the boundaries of adjacent patterns, such as noun or verbal groups; then the nominal and verbal heads within these groups are identified. Finally, patterns between non-adjacent heads are extracted, identifying syntactic relations between words within and across group boundaries. For this purpose, finite state transducers are used. The automata are applied both as finite state markers, which introduce extra symbols such as surrounding brackets into the input (as exemplified in the previous section), and as finite state filters, which extract and label patterns. Usually a combination of non-finite state methods and finite state procedures is applied, but the whole parser can be built as a finite state system (see further Karttunen et al., 1997a).
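The cascading idea, where the output of one pattern level feeds the next, can be sketched in miniature. The following is a hypothetical two-level cascade over a part-of-speech-tagged string, not any particular parser; the word/TAG format and the tag names are illustrative assumptions:

```python
import re

# A miniature two-level cascade in the style described above: level 1
# brackets noun groups over a part-of-speech-tagged string; level 2 runs
# on level 1's output and brackets verb groups.  The tagged format
# word/TAG and the tagset (dt, jj, nn, ab, vb) are illustrative only.
LEVELS = [
    # noun group: optional determiner, any adjectives, one noun
    ("ng", re.compile(r"(?:\S+/dt )?(?:\S+/jj )*\S+/nn")),
    # verb group: verb, optional adverb, optional second verb
    ("vg", re.compile(r"\S+/vb(?: \S+/ab)?(?: \S+/vb)?")),
]

def cascade(tagged):
    for name, pattern in LEVELS:
        # each level rewrites the previous level's output
        tagged = pattern.sub(lambda m: f"[{name} {m.group(0)}]", tagged)
    return tagged

print(cascade("den/dt stora/jj bilen/nn kan/vb inte/ab springa/vb"))
# [ng den/dt stora/jj bilen/nn] [vg kan/vb inte/ab springa/vb]
```

Each level is a marking step; a real cascade would add further levels for heads and syntactic relations.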
The first application of finite state transducers to parsing was a parser developed at the University of Pennsylvania between 1958 and 1959 (Joshi and Hopely, 1996). 9 The parser is essentially a cascade of finite state transducers, and its parsing style resembles Abney's chunking parser (Abney, 1991). Syntactic patterns were constructed using subcategorization frames and local grammars to recognize simple NPs, PPs, AdvPs, simple verb clusters and clauses. All of the modules of the parser, including dictionary look-up and part-of-speech disambiguation, are finite state computations, except for the module for the recognition of clauses.
9 The original version of the parser is presented in Joshi (1961). Up-to-date information about the reconstructed version of this parser - Uniparse - can be accessed from: http://www.cis.upenn.edu/~phopely/tdap-fe-post.html.
Besides Abney's chunking approach (Abney, 1991, 1996), the constructive finite state parsing of collections of syntactic patterns and local grammars, others use this technique to locate noun phrases (or other basic phrases) in unrestricted text (e.g. Appelt et al., 1993; Schiller, 1996; Senellart, 1998). Further, Grefenstette (1996) uses this technique to mark syntactic functions such as subject and object. Other approaches to finite-state parsing start from a large number of alternative analyses and, through the application of constraints in the form of elimination or restriction rules, reduce the alternative parses (e.g. Voutilainen and Tapanainen, 1993; Koskenniemi et al., 1992). These techniques have also been used for the extraction of noun phrases or other basic phrases (e.g. Voutilainen, 1995; Chanod and Tapanainen, 1996; Voutilainen and Padró, 1997). Salah Ait-Mokhtar and Jean-Pierre Chanod constructed a parser that combines the constructive and reductionist approaches. The system defines segments by constraints rather than patterns: the constraints mark potential beginnings and ends of phrases, and replacement transducers insert the phrase boundaries. Incremental decisions are made throughout the whole parsing process, but at each step linguistic constraints may eliminate or correct some of the previously added information (Ait-Mokhtar and Chanod, 1997). In the case of Swedish, finite state methods have been applied on a small scale to lexicography and information extraction. A Swedish regular expression grammar was implemented early at Umeå University, parsing a limited set of sentences (Ejerhed and Church, 1983; Ejerhed, 1985). More recently, a cascaded finite state parser, Cass-Swe, was developed for the syntactic analysis of Swedish (Kokkinakis and Johansson Kokkinakis, 1999), based on Abney's parser. Here the regular expression patterns are applied in cascades, ordered by complexity and length, to recognize phrases.
The output of one level in a sequence is used as input to the subsequent level, starting from tagging and syntactic labeling and proceeding to the recognition of grammatical functions. The grammar of Cass-Swe was semi-automatically extracted from written text by the application of probabilistic methods, such as mutual information statistics, which allow the exclusion of incorrect part-of-speech n-grams (Magerman and Marcus, 1990), and by looking at which function words signal boundaries between phrases and clauses.
Discrimination of Input
One parsing application using finite state methods, presented by Karttunen et al. (1997a), aims at the extraction not only of valid expressions, but also of invalid patterns occurring in free text due to errors and misprints. The method is applied to date expressions, and the idea is simply to define two language sets - one that overgenerates and accepts all date expressions, including dates that do not exist, and one that defines only correct date expressions. The language of invalid dates is then obtained by subtracting the more specific language from the more general one. Thus, by distinguishing the valid date expressions from the language of all date expressions, we obtain the set of expressions corresponding to invalid dates, i.e. those dates not accepted by the language set of valid expressions. To illustrate, the definitions in Karttunen et al. (1997a) express date expressions from January 1, 1 to December 31, 9999 and are represented by a small finite state automaton (13 states, 96 arcs) that accepts date expressions consisting of a day of the week, a month and a date with or without a year, or a combination of the two, as defined in (RE6.6a) (SP is a separator consisting of a comma and a space, i.e. ", "). The parser for that language, presented in (RE6.6b), is constrained by the left-to-right, longest-match replacement operator, which means that only the maximal instances of such expressions are accepted. However, this automaton also accepts dates that do not exist, such as April 31, which exceeds the maximum number of days for the month. Other problems concern leap days and the relationship between the day of the week and the date. A new language is defined by intersecting constraints on invalid types of dates with the language of date expressions, as presented in (RE6.6c). 10 This much larger automaton (1346 states, 21006 arcs) accepts only valid date expressions, and again a transducer marks the maximal instances of such dates, see (RE6.6d).

(RE6.6) a. DateExpression = Day | (Day SP) Month Date (SP Year)
b. DateExpression @-> %[ ... %]
c. ValidDate = DateExpression & MaxDaysInMonth & LeapDays & WeekDayDates
d. ValidDate @-> %[ ... %]

As the authors point out, it may be of use to distinguish valid dates from invalid ones, but in practice we also need to recognize the invalid dates due to errors and misprints in real text corpora. For this purpose we do not need to define a new language that reveals the structure of invalid dates. Instead, we make use of the already defined languages of all date expressions DateExpression and valid dates ValidDate, and obtain the language of invalid dates by subtracting one language set from the other: [DateExpression - ValidDate].
10 For more detail on the separate definitions of constraints see Karttunen et al. (1997a).
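Because a finite language is simply a set of strings, the subtraction idea can be illustrated with ordinary sets. A toy sketch with hypothetical mini-languages standing in for DateExpression and ValidDate:

```python
# Toy illustration of the subtraction method: both "grammars" are
# positive descriptions, and the error language falls out as their
# difference.  The date fragments are hypothetical stand-ins for the
# DateExpression and ValidDate languages discussed in the text.

# Overgenerating language: every "Month Day" combination.
months = {"February", "April"}
date_expression = {f"{m} {d}" for m in months for d in (28, 29, 30, 31)}

# Narrow language: only dates that actually exist (non-leap year).
max_days = {"February": 28, "April": 30}
valid_date = {f"{m} {d}" for m in months
              for d in range(1, max_days[m] + 1)}

# Invalid dates = broad language minus narrow language.
invalid_date = date_expression - valid_date
print(sorted(invalid_date))
# ['April 31', 'February 29', 'February 30', 'February 31']
```

Note that neither description enumerates errors: the narrow set is a positive grammar, and the invalid language is derived, exactly as in [DateExpression - ValidDate].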
A parser that identifies maximal instances of date expressions, tagging both the valid (VD) and the invalid (ID) dates, is presented in (RE6.7).

(RE6.7) [ [DateExpression - ValidDate] @-> %[ID ... %] , ValidDate @-> %[VD ... %] ]

In the example in (6.6) below, given by the authors, the parser identified two date expressions: first a valid one (VD) and then an invalid one (ID), differing from the valid one only in the weekday. Notice that the effect of the application of the longest match is reflected when, for instance, the invalid date Tuesday, September 16, 1996 is selected over Tuesday, September 16, 19, which is a valid date. 11

(6.6) The correct date for today is [VD Monday, September 16, 1996]. There is an error in the program. Today is not [ID Tuesday, September 16, 1996].

6.3 System Architecture
6.3.1 Introduction
After this short introduction to finite state automata, parsing methods with finite state techniques and a description of the XFST compiler, I will now proceed with a description of the implemented grammar error detector FiniteCheck. In this section an overview is given of the system's architecture and of how the system proceeds through the individual modules identifying errors in text. The types of automata used in the implementation are also described. The implementation methods and detailed descriptions of the individual modules are discussed in subsequent sections. The framework of FiniteCheck is built as a cascade of finite state transducers compiled from regular expressions including operators defined in the Xerox Finite-State Tool (XFST; see Section 6.2.3). Each automaton in the network composes with the result of the previous application. The implemented tool applies a strategy of simple dictionary lookup, incremental partial parsing with minimal disambiguation by parsing order and filtering, and error detection using subtraction of positive grammars that differ in their level of detail.
Accordingly, the current system of sequenced finite state transducers is divided into four main modules: the dictionary lookup, the grammar, the parser and the error finder; see Figure 6.1 below. The system runs under UNIX in a simple emacs environment implemented by Robert Andersson, with an XFST mode that allows menus to be used to
11 This date is, however, only valid in theory, since the Gregorian calendar was not yet in use in the year 19 AD. The Gregorian calendar that replaced the Julian calendar was introduced in Catholic countries by Pope Gregory XIII on Friday, October 15, 1582 (in Sweden 1753).
recompile files in the system. The modules are further described in the following subsection on the flow of data in the error detector. The forms of the different types of automata are discussed at the end of this section.
Figure 6.1: The System Architecture of FiniteCheck
6.3.2 The System Flow
The Dictionary Lookup
The input text to FiniteCheck is first manually tokenized so that spaces occur between all strings and tokens, including punctuation. This formatted text is then tagged with part-of-speech and feature annotations by the lookup module, which assigns to each string in the text all the lexical tags stored in the lexicon of the system. No disambiguation is involved, only a simple lookup. The underlying lexicon of around 160,000 word forms is built as a finite state transducer. The tagset is based on the tag format defined in the Stockholm Umeå Corpus (Ejerhed et al., 1992), combining part-of-speech information with feature information (see Section 6.4 and Appendix C). As an example, the sentence in (6.7a) is ungrammatical, containing a (finite) auxiliary verb followed by yet another finite verb (see (4.32) on p. 61). It will be annotated by the dictionary lookup as shown in (6.7b):

(6.7) a. Men kom ihåg att det inte ska blir någon riktig brand
      but remember that it not will[pres] becomes[pres] some real fire
      'But remember that there will not be a real fire.'

b. Men[kn] kom[vb prt akt] ihåg[ab][pl] att[sn][ie] det[pn neu sin def sub/obj][dt neu sin def] inte[ab] ska[vb prs akt] blir[vb prs akt] någon[dt utr sin ind][pn utr sin ind sub/obj] riktig[jj pos utr sin ind nom] brand[nn utr sin ind nom]

The Grammar
The grammar module includes two grammars with (positive) rules reflecting the grammatical structure of Swedish, differing in their level of detail. The broad grammar (Section 6.5) is especially designed to handle text with ungrammaticalities; its linguistic descriptions are less accurate, accepting both valid and invalid patterns. The narrow grammar (Section 6.7) is more refined and accepts only grammatical segments.
For example, the regular expression in (RE6.8) belongs to the broad grammar and recognizes potential verb clusters (VC) (both grammatical and ungrammatical) as a pattern consisting of a sequence of two or three verbs in combination with zero or more adverbs (Adv*).

(RE6.8) define VC [Verb Adv* Verb (Verb)];

This automaton accepts all the verb cluster examples in (6.8), including the ungrammatical instance (6.8c) extracted from the text in (6.7), where a finite verb
in present tense follows a (finite) auxiliary verb, instead of a verb in infinitive form (i.e. bli 'be[inf]').

(6.8) a. kan inte springa
      can not run[inf]
   b. skulle ha sprungit
      would have run[sup]
   c. ska blir
      will be[pres]

Corresponding rules in the narrow grammar, represented by the regular expressions in (RE6.9), take into account the internal structure of a verb cluster and define the grammar of modal auxiliary verbs (Mod) followed by zero or more adverbs (Adv*) and either a verb in infinitive form (VerbInf), as in (RE6.9a), or a temporal verb in infinitive (PerfInf) and a verb in supine form (VerbSup), as in (RE6.9b). These rules thus accept only the grammatical segments in (6.8) and will not include example (6.8c). The actual grammar of grammatical verb clusters is a little more complex (see Section 6.7).

(RE6.9) a. define VC1 [Mod Adv* VerbInf];
b. define VC2 [Mod Adv* PerfInf VerbSup];

The Parser
The system proceeds, and the tagged text in (6.7b) is now the input to the next phase, where various kinds of constituents are selected applying a lexical-prefix-first strategy, i.e. parsing first from the left margin of a phrase to the head and then extending the phrase by adding on complements. The phrase rules are ordered in levels. The system proceeds in three steps, first recognizing the head phrases in a certain order (verbal head vphead, prepositional head pphead, adjective phrase ap) and then selecting and extending the phrases with complements in a certain order (noun phrase np, prepositional phrase pp, verb phrase vp). The heuristic of parsing order gives the system more flexibility, in that (some) false parses can be blocked. This approach is further explained in the section on parsing (Section 6.6). The system then yields the output in (6.9). 12 Simple < and > around a phrase tag denote the beginning of a phrase, and the same signs together with a slash / indicate the end.
12 For better readability, the lexical tags are kept only in the erroneous segment and removed manually in the rest of the exemplified sentence.
(6.9) Men <vp> <vphead> kom ihåg </vphead> </vp> att <np> det </np> <vp> <vphead> inte <vc> ska[vb prs akt] blir[vb prs akt] </vc> </vphead> <np> någon <ap> riktig </ap> brand </np> </vp>

We apply the rules defined in the broad grammar set for this parsing purpose, like the one in (RE6.8) that identified the verb cluster in boldface in (6.9) above as a sequence of two verbs. The parsing output may be refined and/or revised by the application of filtering transducers. Earlier parsing decisions depending on lexical ambiguity are resolved, and phrases are extended, e.g. with postnominal modifiers (see further in Section 6.6). Other structural ambiguities, such as verb coordinations or clausal modifiers on nouns, are also taken care of (see Section 6.7).

The Error Finder
Finally, the error finder module is used to discriminate the grammatical patterns from the ungrammatical ones, by subtracting the narrow grammar from the broad grammar. These new transducers are used to mark the ungrammatical segments in a text. For example, the regular expression in (RE6.10a) identifies verb clusters that violate the narrow grammar of modal verb clusters (VC1 or VC2 in (RE6.9)) by subtracting ( - ) these rules from the more general (overgenerating) rule in the broad grammar (VC in (RE6.8)) within the boundaries of a verb cluster (<vc>, </vc>), previously marked in the parsing stage in (6.9). That is, the output of the parsing stage in (6.9) is the input to this level. By application of the marking transducer in (RE6.10b), the erroneous verb cluster consisting of two verbs in present tense in a row is annotated directly in the text, as shown in (6.10).

(RE6.10) a. define VCerror [ "<vc>" [VC - [VC1 | VC2]] "</vc>" ];
b. define markvcerror [ VCerror -> "<Error Verb after Vaux>" ... "</Error>" ];

(6.10) Men <vp> <vphead> kom ihåg </vphead> </vp> att <np> det </np> <vp> <vphead> inte <Error Verb after Vaux> <vc> ska[vb prs akt] blir[vb prs akt] </vc> </Error> </vphead> <np> någon <ap> riktig </ap> brand </np> </vp>
6.3.3 Types of Automata
In accordance with the techniques of finite-state parsing (see Section 6.2.4), two types of transducers are in general use: one that annotates text in order to select certain segments, and one that redefines or refines earlier decisions. Annotations are handled by transducers called finite state markers, which add reserved symbols to the text and mark out syntactic constituents, grammar errors, or other relevant patterns. For instance, the regular expression in (RE6.11) inserts noun phrase tags into text by application of the left-to-right, longest-match replacement operator ( @-> ) (see Section 6.2.3).

(RE6.11) define marknp [NP @-> "<np>" ... "</np>"];

The automaton finds the pattern that matches the maximal instance of a noun phrase (NP) and replaces it by inserting a beginning marker (<np>), copying the whole pattern by application of the insertion operator ( ... ) and then assigning the end marker (</np>). Three (maximal) instances of noun phrase segments are recognized in the example sentence (6.11a), discussed earlier in Chapter 4 (see (4.2) on p. 46), as shown in (6.11b), where one violates definiteness agreement (in boldface). 13

(6.11) a. En gång blev den hemska pyroman utkastad ur stan.
       one time was the[def] awful[def] pyromaniac[indef] thrown-out from the-city
       'Once the awful pyromaniac was thrown out of the city.'

b. <np> En gång </np> blev <np> den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] </np> utkastad ur <np> stan </np>.

The regular expression in (RE6.12) represents another example of an annotating automaton.

(RE6.12) define marknpdeferror [ npdeferror -> "<Error definiteness>" ... "</Error>" ];

This finite state transducer marks out agreement violations of definiteness in noun phrases (npdeferror; see Section 6.8).
It detects, for instance, the erroneous noun phrase den hemska pyroman in the example sentence, where the determiner den 'the' is in definite form and the noun pyroman 'pyromaniac' is in indefinite form (6.12). By application of the left-to-right replacement operator
13 Only the erroneous segment is marked by lexical tags.
( -> ), the identified segment is replaced by first inserting an error-diagnosis marker (<Error definiteness>) at the beginning of the identified pattern; then the pattern is copied and the error end marker (</Error>) is added.

(6.12) <np> En gång </np> blev <Error definiteness> <np> den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] </np> </Error> utkastad ur <np> stan </np>.

The marking transducers of the system have the form A @-> S ... E when marking the maximal instances of A from left to right, by application of the left-to-right, longest-match replacement operator ( @-> ), inserting a start symbol S (e.g. <np>) and an end symbol E (e.g. </np>). In cases where the maximal instances are already recognized and only the operation of replacement is necessary, the transducers take the form A -> S ... E, applying only the left-to-right replacement operator ( -> ). The other type of transducer is used for the refinement and/or revision of earlier decisions. These finite state filters can, for instance, be used to remove the noun phrase tags from the example sentence, leaving just the error marking. The regular expression in (RE6.13) replaces all occurrences of noun phrase tags with the empty string ( 0 ) by application of the left-to-right replacement operator ( -> ). The result is shown in (6.13).

(RE6.13) define removenp ["<np>" -> 0, "</np>" -> 0];

(6.13) En gång blev <Error definiteness> den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] </Error> utkastad ur stan.

These filtering transducers have the form A -> B and are used for simple replacement of instances of A by B, by application of the left-to-right replacement operator ( -> ). In cases where the context plays a crucial role, the automata are extended with requirements on the left and/or the right context and have the form A -> B || L _ R.
Here, the patterns in A are replaced by B only if A is preceded by the left context L and followed by the right context R. In some cases only the left context is constrained, in others only the right, and in some cases both are needed.
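The marker/filter division can be imitated with plain string rewriting. A sketch using Python's re.sub in place of the replacement transducers; the tags mirror those used in the text:

```python
import re

# Imitation of the two transducer types described above: a marker that
# wraps a pattern in start/end symbols, and a filter that deletes
# earlier annotations.  re.sub stands in for the XFST replacement
# operators; the <np>/<Error> tags mirror those used in the text.

def mark(pattern, start, end, text):
    # marker:  A -> S ... E  -- wrap each match in S and E
    return re.sub(pattern, lambda m: f"{start}{m.group(0)}{end}", text)

def remove(tag, text):
    # filter:  "<tag>" -> 0, "</tag>" -> 0  -- strip a tag pair, tidy spaces
    text = re.sub(f"</?{tag}>", "", text)
    return re.sub(r"\s+", " ", text).strip()

s = mark(r"den hemska pyroman", "<Error definiteness> ", " </Error>",
         "<np> den hemska pyroman </np> utkastad ur <np> stan </np>")
print(remove("np", s))
# <Error definiteness> den hemska pyroman </Error> utkastad ur stan
```

The second call reproduces, in miniature, the step from (6.12) to (6.13): the noun phrase tags are filtered away while the error annotation survives.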
6.4 The Lexicon
6.4.1 Composition of the Lexicon
The lexicon of the system is a full-form lexicon based on two resources: Lexin (Skolverket, 1992), developed at the Department of Swedish Language, Section of Lexicology, Göteborg University, and a corpus-based lexicon from the SveLex project under the direction of Daniel Ridings, LexiLogik AB. At the initial stage of lexicon composition, only the Lexin dictionary of 58,326 word forms was available to us; we chose it especially for the lexical information stored in it, in particular the information on valence. I converted the Lexin text records to one single regular expression by a two-step process using the programming language gawk (Robbins, 1996). From the Lexin records (exemplified in (6.14a) and (6.14b)) a new file was created with one lemma per row, as in (6.14c). The first line there represents the Lexin entry for the noun bil 'car' in (6.14a) and the second the entry for the verb bilar 'travels by car[pres]' in (6.14b). Only a word's part-of-speech (entry #02), lemma (entry #01) and declined forms (entry #12) are listed in the current implementation. 14 The number and type of forms vary according to the part-of-speech, and sometimes even within a part-of-speech.

(6.14) a. #01 bil #02 subst #04 ett slags motordrivet fordon #07 åka bil #09 bild 17:34, 18:36-37 #11 bil trafik -en #11 personbil #11 bil buren #11 bil fri #11 bil sjuk #11 bil sjuka #11 bil telefon #11 lastbil #12 bilen bilar #14 bi:l
b. #01 bilar #02 verb #04 åka bil #10 A & (+ RIKTNING) #12 bilade bilat bila(!) #14 2bI:lar
c. subst bil bilen bilar
   verb bilar bilade bilat bila
14 Future work will further extend the other kinds of information stored in the lexicon, such as valence and compounding.
In the next step I converted the data in (6.14c) directly to a single regular expression, as shown in (RE6.14). Each word entry in the lexicon is represented as a single finite-state transducer with the string on the LOWER side and the category and features on the UPPER side, allowing both analysis and generation. The whole dictionary is formed as the union of these automata. At this stage I used only simple tagsets, which were later converted to the SUC format (see below). Since the lexical entries are generated into a regular expression automatically, alternative versions of the lexicon are easy to create, for example with different tagsets or with other information from Lexin included (e.g. valence, compounds).

(RE6.14) [ A % - i n k o m s t 0:%[%+NSI%] |
           A % - k a s s a 0:%[%+NSI%] |
           A % - s k a t t 0:%[%+NSI%] |
           ...
           b i l 0:%[%+NSI%] |
           b i l e n 0:%[%+NSD%] |
           b i l a r 0:%[%+NPI%] |
           ...
           b i l a 0:%[%+VImp%] |
           b i l a r 0:%[%+VPres%] |
           b i l a d e 0:%[%+VPret%] |
           b i l a t 0:%[%+VSup%] |
           ...
           ö v ä r l d 0:%[%+NSI%] |
           ö v ä r l d a r 0:%[%+NPI%] |
           ö v ä r l d e n 0:%[%+NSD%] ];

The Lexin dictionary was later extended with the 100,000 most frequent word forms selected from the corpus-based SveLex. At this stage the format of the lexicon was revised. The new lexicon of 158,326 word forms was compiled into a new transducer using instead the Xerox Finite-State Lexicon Compiler (LEXC) (Karttunen, 1993), which made the lexicon more compact and efficient. This software facilitates in particular the development of natural-language lexicons: instead of regular expression declarations, a high-level declarative language is used to specify the morphotactics of a language. I was not part of the composition of the new version of the lexicon; the procedures and achievements of this work are described further in Andersson et al. (1998, 1999).
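The analysis/generation duality of the union-of-transducers lexicon can be sketched as a bidirectional lookup. The entries and tag names below are simplified assumptions modelled on example (6.14), not the system's actual transducer:

```python
# Sketch of the lexicon idea: each entry pairs a surface form (lower side)
# with a tagged reading (upper side); the union of entries supports both
# analysis (lower -> upper) and generation (upper -> lower).
ENTRIES = [
    ("bil",    "bil[+NSI]"),
    ("bilen",  "bilen[+NSD]"),
    ("bilar",  "bilar[+NPI]"),
    ("bilar",  "bilar[+VPres]"),   # homograph: noun plural vs. verb present
    ("bilade", "bilade[+VPret]"),
]

def analyze(surface):
    """Lower-to-upper lookup: surface form -> all tagged readings."""
    return [upper for lower, upper in ENTRIES if lower == surface]

def generate(tagged):
    """Upper-to-lower lookup: tagged reading -> surface form."""
    return [lower for lower, upper in ENTRIES if upper == tagged]

print(analyze("bilar"))       # ['bilar[+NPI]', 'bilar[+VPres]']
print(generate("bil[+NSI]"))  # ['bil']
```

Note that analysis of an ambiguous form returns all readings; as discussed above, no disambiguating tagger is applied, so all lexical tags are kept.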
6.4.2 The Tagset

In the present version of the lexicon, the set of tags follows the Stockholm-Umeå Corpus project conventions (Ejerhed et al., 1992), including 23 category classes and 29 feature classes (see Appendix C). Four additional categories were added to this set for the recognition of copula verbs (cop), modal verbs (mvb), verbs with an infinitival complement (qmvb) and unknown words, which obtain the tag [nil]. This morphosyntactic information is used to identify strings by their category and/or feature(s). For reasons of efficiency, the whole tag with category and feature definitions is read by the system as a single symbol and not as a separate list of atoms. An experiment conducted by Robert Andersson showed that an automaton recognizing a grammatical noun phrase then had 90% fewer states and 60% fewer transitions than one declaring a tag as a category plus a set of features (see further Andersson et al., 1999). As a consequence of this choice, the automata representing the tagset are divided both according to the category they state and according to the features, always rendering the whole tag. The automata are constructed as a union of all the tags of the same category or feature. In practice this means that the same tag occurs in as many tag definitions as it has defined characteristics. For instance, the tag defining an active verb in the present tense [vb prs akt] occurs in three definitions: first in the union of all tags defining the verb category (TagVB in (RE6.15)), then among all tags for present tense (TagPRS in (RE6.16)) and then among all tags for active voice (TagAKT in (RE6.17)).

(RE6.15) define TagVB [ "[vb an]" | "[vb sms]" |
                        "[vb prt akt]" | "[vb prt sfo]" |
                        "[vb prs akt]" | "[vb prs sfo]" |
                        "[vb sup akt]" | "[vb sup sfo]" |
                        "[vb imp akt]" | "[vb imp sfo]" |
                        "[vb inf akt]" | "[vb inf sfo]" |
                        "[vb kon prt akt]" | "[vb kon prt sfo]" |
                        "[vb kon prs akt]" ];
(RE6.16) define TagPRS [ "[pc prs utr/neu sin/plu ind/def gen]" |
                         "[pc prs utr/neu sin/plu ind/def nom]" |
                         "[vb prs akt]" | "[vb prs sfo]" |
                         "[vb kon prs akt]" ];

(RE6.17) define TagAKT [ "[vb prt akt]" | "[vb prs akt]" |
                         "[vb sup akt]" | "[vb imp akt]" |
                         "[vb inf akt]" | "[vb kon prt akt]" |
                         "[vb kon prs akt]" ];

On the other hand, the tag for an interjection ([in]), which consists only of the category, occurs just once in the tag definitions:

(RE6.18) define TagIN [ "[in]" ];

There are in total 55 different lexical-tag definitions of categories and features. One single automaton (Tag), composed as the union of these 55 lexical tags, represents all the different categories and features. The largest of them, the singular feature (TagSIN), includes 80 different tags.

6.4.3 Categories and Features

In the parsing and error detection processes, strings need to be recognized by their category and/or feature inclusion. The morphosyntactic information in the tags is used for this purpose, and automata identifying different categories and feature sets are defined. For instance, the regular expression in (RE6.19a) recognizes the tagged string kan[vb prs akt] 'can' as a verb, i.e. a sequence of one or more (the iteration sign +) letters followed by a sequence of tags, one of which is a tag containing vb (TagVB). Features are defined in the same manner, so the same string can also be recognized as a carrier of the present tense feature: the regular expression in (RE6.19b) defines the automaton for present tense as a sequence of (one or more) letters followed by a sequence of tags, where one of them fulfills the present tense feature prs (TagPRS).

(RE6.19) a. define Verb Letter+ Tag* TagVB Tag*;
         b. define Prs Letter+ Tag* TagPRS Tag*;
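The Letter+ Tag* TagX Tag* pattern, and the intersection of such recognizers, can be approximated with ordinary regular expressions. A minimal Python sketch with a drastically reduced tag inventory (an assumption for illustration, not the SUC tagset):

```python
import re

# Sketch of category/feature recognition over tagged strings, after the
# pattern Letter+ Tag* TagX Tag* in (RE6.19); tag inventory is invented.
TAG = r"\[[^\]]*\]"  # any single tag, e.g. [vb prs akt]

def recognizer(tagset):
    """Matcher for: letters, then tags, at least one of which is in tagset."""
    alts = "|".join(re.escape(t) for t in tagset)
    return re.compile(rf"\w+(?:{TAG})*(?:{alts})(?:{TAG})*$")

TagVB = ["[vb prs akt]", "[vb prt akt]", "[vb inf akt]"]
TagPRS = ["[vb prs akt]"]

verb = recognizer(TagVB)   # category: verb
prs = recognizer(TagPRS)   # feature: present tense

# Intersection as in (RE6.20): a VerbPrs string must satisfy both.
verb_prs = lambda s: bool(verb.match(s)) and bool(prs.match(s))

print(bool(verb.match("kan[vb prs akt]")))  # True
print(verb_prs("kan[vb prs akt]"))          # True
```

Union of such sets (as in the VerbTensed definition below) is just disjunction of the same kind of tests.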
By using intersection (&) of category and feature sets, category-feature combinations can also be recognized. The same string can then be recognized directly as a verb in the present tense by the regular expression VerbPrs given in (RE6.20), which lists all the verb tense combinations.

(RE6.20) define VerbImp [Verb & Imp];
         define VerbPrs [Verb & Prs];
         define VerbPrt [Verb & Prt];
         define VerbSup [Verb & Sup];
         define VerbInf [Verb & Inf];

Even higher-level sets can be built. For instance, the categories of tensed (finite) and untensed (non-finite) verbs may be defined as in (RE6.21), as unions of the appropriate verb form definitions from the verb tense feature set in (RE6.20) above. Our example string, a verb in present tense form, then falls among the finite verb forms (VerbTensed).

(RE6.21) define VerbTensed [VerbPrs | VerbPrt];
         define VerbUntensed [VerbSup | VerbInf];

6.5 Broad Grammar

The rules of the broad grammar are used to mark potential phrases in a text, both grammatical and ungrammatical. The grammar consists of valid (grammatical) rules that define the syntactic relations of constituents, mostly in terms of categories, and list their order. There are no constraints on the selections other than the types of part of speech that combine with each other to form phrases. The grammar is, in other words, underspecified and does not distinguish between grammatical and ungrammatical patterns. The parsing is incremental, i.e. it identifies first heads and then complements. This is also reflected in the broad grammar listed in (RE6.22), whose rules are divided into head rules and complement rules. The whole broad grammar consists of six rules: the head rules for the adjective phrase (AP), verbal head (VPhead) and prepositional head (PPhead), and then rules for the noun phrase (NP), prepositional phrase (PP) and verb phrase (VP).
(RE6.22) # Head rules
         define AP [(Adv) Adj+];
         define PPhead [Prep];
         define VPhead [[[Adv* Verb] | [Verb Adv*]] Verb* (PNDef & PNNeu)];
         # Complement rules
         define NP [[[(Det | Det2 | NGen) (Num) (APPhr) (Noun)] & ?+] | Pron];
         define PP [PPheadPhr NPPhr];
         define VP [VPheadPhr (NPPhr) (NPPhr) (NPPhr) PPPhr*];

An adjective phrase (AP) consists of an optional adverb and a sequence of one or more adjectives; that is, an adjective phrase contains at least one adjective. The head of a prepositional phrase (PPhead) is a preposition. A verbal head (VPhead) includes a verb preceded or followed by (zero or more) adverb(s), possibly followed by (zero or more) verb(s) and an optional pronoun. This means that a verbal head consists of at least a single verb, which in turn may be preceded or followed by adverb(s) and followed by verb(s). In order to prevent pronouns from being analyzed as determiners in noun phrases, e.g. jag anser det bra 'I think it is good', single neuter definite pronouns are included in the verbal head. The regular expression describing a noun phrase (NP) consists of two parts. The first states that a noun phrase includes a determiner (Det), a determiner with the adverbial här 'here' or där 'there' (Det2), or a possessive noun (NGen), followed by a numeral (Num), an adjective phrase (APPhr) and a (proper) noun (Noun). Since the noun is not the only constituent that can form the head of the noun phrase, all the constituents are optional; the intersection with the any-symbol (?) followed by the iteration sign (+) states that at least one of the listed constituents has to occur. The second part of the rule states that a noun phrase may consist of a single pronoun (Pron). A prepositional phrase (PP) is recognized as a prepositional head (PPheadPhr) followed by a noun phrase (NPPhr).
A verb phrase (VP) consists of a verbal head (VPheadPhr) followed by at most three (optional) noun phrases and (zero or more) prepositional phrases.

6.6 Parsing

6.6.1 Parsing Procedure

The rules of the (underspecified) broad grammar are used to mark syntactic patterns in a text. A partial, lexical-prefix-first, longest-match, incremental strategy is used for parsing. The parsing procedure is partial in the sense that only portions of the text are recognized and no full parse is provided. Patterns not recognized by the rules of the (broad) grammar remain unchanged. The maximal instances of a particular phrase are selected by application of the left-to-right, longest-match replacement operator (@->) (see Section 6.2.3). In (RE6.23) we see all the marking transducers recognizing the syntactic patterns defined in the broad grammar. The automata replace the corresponding phrase (e.g. a noun phrase, NP) with a label indicating the beginning of the pattern (<np>), the phrase itself and a label that marks the end of the pattern (</np>).

(RE6.23) define markpphead [PPhead @-> "<pphead>" ... "</pphead>"];
         define markvphead [VPhead @-> "<vphead>" ... "</vphead>"];
         define markap [AP @-> "<ap>" ... "</ap>"];
         define marknp [NP @-> "<np>" ... "</np>"];
         define markpp [PP @-> "<pp>" ... "</pp>"];
         define markvp [VP @-> "<vp>" ... "</vp>"];

The segments are built up in cascades in the sense that first the heads are recognized, starting from the left-most edge to the head (the so-called lexical prefix), and then the segments are expanded at the next level by the addition of complement constituents. The regular expressions in (RE6.24) compose the marking transducers of the separate segments into a three-step process.

(RE6.24) define parse1 [markvphead .o. markpphead .o. markap];
         define parse2 [marknp];
         define parse3 [markpp .o. markvp];

First the verbal heads, prepositional heads and adjective phrases are recognized by composition in that order (parse1). The corresponding marking transducers presented in (RE6.23) insert syntactic tags around the found phrases as in (6.15a).15 This output then serves as input to the next level, where the adjective phrases are extended and noun phrases are recognized (parse2) and marked as exemplified in (6.15b).
This output in turn serves as input to the last level, where the whole prepositional phrases and verb phrases are recognized in that order (parse3) and marked as in (6.15c).

15 The original sentence example is presented in (6.11) on p. 189.
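The effect of this @->-style bracketing can be approximated with greedy regular-expression substitution. A minimal Python sketch over an invented English toy grammar (the system itself composes xfst transducers, of course):

```python
import re

# Sketch of a marking "transducer": left-to-right replacement that brackets
# a phrase pattern with <np> ... </np>, as the @-> operator does. Python's
# re scans left to right and is greedy, which approximates longest match
# for this simple pattern; the token grammar below is invented.
DET = r"(?:the|a)"
ADJ = r"(?:old|little)"
NOUN = r"(?:frog|house)"
NP = rf"{DET}(?: {ADJ})* {NOUN}"

def mark_np(text):
    return re.sub(f"({NP})", r"<np>\1</np>", text)

print(mark_np("the old frog saw a house"))
# <np>the old frog</np> saw <np>a house</np>
```

Material not matched by any rule is left unchanged, mirroring the partial nature of the parse.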
(6.15) a. PARSE 1: VPHead .o. PPHead .o. AP
       En gång <vphead> blev </vphead> den <ap> hemska </ap> pyroman <ap> utkastad </ap> <pphead> ur </pphead> stan.
       b. PARSE 2: NP
       <np> En gång </np> <vphead> blev </vphead> <np> den <ap> hemska </ap> pyroman </np> <np> <ap> utkastad </ap> </np> <pphead> ur </pphead> <np> stan </np>.
       c. PARSE 3: PP .o. VP
       <np> En gång </np> <vp> <vphead> blev </vphead> <np> den <ap> hemska </ap> pyroman </np> <np> <ap> utkastad </ap> </np> <pp> <pphead> ur </pphead> <np> stan </np> </pp> </vp>.

During and after this parsing annotation, some phrase types are further expanded with post-modifiers, split segments are joined and empty results are removed (see Section 6.6.4). The broadness of the grammar and the lexical ambiguity of words, necessary for parsing text containing errors, also yield ambiguous and/or alternative phrase annotations. We block some of the (erroneous) alternative parses by the order in which phrase segments are selected, which bleeds some rules and leads to more accurate parsing results. The order in which the labels are inserted into the string influences the segmentation of patterns into phrases (see Section 6.6.2). Further ambiguity resolution is provided by filtering automata (see Section 6.6.3).

6.6.2 The Heuristics of Parsing Order

The order in which phrases are labeled supports ambiguity resolution in the parse to some degree. Marking verbal heads before noun phrases prevents constituents of verbal heads from being merged into noun phrases, which would yield noun phrases with too wide a range. For instance, marking the sentence in (6.16a) first for noun phrases ((6.16b) NP:)16 would interpret the pronoun De 'they' as a determiner and the verb såg 'saw' (which, exactly as in English, is homonymous with the noun 'saw') as a noun, and merge these two constituents into a noun phrase. The output would then be composed with the selection of the verbal head ((6.16b) NP .o. VPHead), which ends up within the boundaries of the noun phrase. Composing the marking transducers in the opposite order instead yields the more correct parse in (6.16c). Although the alternative of the verb being parsed as a verbal head or a noun remains (<vphead> <np> såg </np> </vphead>), the pronoun is now marked correctly as a separate noun phrase and not merged together with the main verb into a noun phrase.

16 An asterisk (*) indicates an erroneous parse.
(6.16) a. De såg ledsna ut.
          they looked sad out
          'They looked sad.'
       b. *NP: <np> De såg </np> <np> ledsna </np> ut.
          *NP .o. VPHead: <np> De <vphead> såg </vphead> </np> <np> ledsna </np> ut.
       c. VPHead: De <vphead> såg </vphead> ledsna ut.
          VPHead .o. NP: <np> De </np> <vphead> <np> såg </np> </vphead> <np> ledsna </np> ut.

This ordering strategy is not absolute, however, since the opposite scenario is possible, where parsing noun phrases before verbal heads is more suitable. Consider for instance example (6.17a) below, where the string öppna 'open' in the noun phrase det öppna fönstret 'the open window' is split over three separate noun phrase segments when verbal heads are parsed before noun phrases (6.17c), due to the homonymy between the adjective and the infinitive or imperative verb form. The opposite order, parsing noun phrases before verbal heads, yields the more correct parse in (6.17b), where the whole noun phrase is recognized as one segment.

(6.17) a. han tittade genom det öppna fönstret
          he looked through the open window
          'he looked through the open window'
       b. NP: <np> han </np> tittade genom <np> det öppna fönstret </np>
          NP .o. VPHead: <np> han </np> <vphead> tittade </vphead> genom <np> det öppna fönstret </np>
       c. VPHead: han <vphead> tittade </vphead> genom det <vphead> öppna </vphead> fönstret
          VPHead .o. NP: <np> han </np> <vphead> tittade </vphead> genom <np> det </np> <vphead> <np> öppna </np> </vphead> <np> fönstret </np>

We analyzed the ambiguity frequency in the Child Data corpus and found that occurrences of nouns recognized as verbs are more frequent than the opposite. On this ground, we chose the strategy of marking verbal heads before marking noun phrases. In the opposite scenario, the false parse can be revised and corrected by an additional filter (see Section 6.6.3). A similar problem occurs with prepositions homonymous with adjectives. For instance, the string vid is ambiguous between an adjective ('wide') and a preposition ('by') and influences the order of marking prepositional heads and noun phrases. Parsing prepositional heads before noun phrases is more suitable for preposition occurrences, as shown in (6.18c), in order to prevent the preposition from being merged into a noun phrase, as in (6.18b).

(6.18) a. Jag satte mig vid bordet.
          I sat me by the-table
          'I sat down at the table.'
       b. NP: <np> Jag </np> satte <np> mig </np> <np> vid bordet </np>.
          NP .o. PP: <np> Jag </np> satte <np> mig </np> <np> <pphead> vid </pphead> bordet </np>.
       c. PP: Jag satte mig <pphead> vid </pphead> bordet.
          PP .o. NP: <np> Jag </np> satte <np> mig </np> <pphead> <np> vid </np> </pphead> <np> bordet </np>.

The opposite order is more suitable for adjective occurrences, as in (6.19), where the adjective is joined together with the head noun when noun phrases are selected first, as in (6.19b), but where the noun phrase is split in two when the adjective is recognized as a prepositional head, as in (6.19c). Again, the choice of marking prepositional heads before noun phrases was based on the result of a frequency analysis of the corpus: the string vid occurred more often as a preposition than as an adjective.
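The ordering heuristic just described can be sketched in miniature. This is a hypothetical Python toy, with English stand-ins for the Swedish homonyms ("saw" playing the role of såg) rather than the system's xfst compositions:

```python
import re

# Sketch of the ordering heuristic: marking verbal heads before noun
# phrases prevents an ambiguous word from being swallowed into an NP.
# Grammar and tokens are invented for illustration.
def mark(text, pattern, tag):
    return re.sub(f"({pattern})", rf"<{tag}>\1</{tag}>", text)

PRON_OR_DET = r"(?:they|the)"
NOUN_OR_VERB = r"saw"  # noun/verb homonym, like Swedish "sag"

sent = "they saw"
# NPs first: "they saw" is misparsed as determiner + noun in one NP.
np_first = mark(sent, rf"{PRON_OR_DET} {NOUN_OR_VERB}", "np")
# Verbal heads first: "saw" is taken as the verb; "they" stays a clean NP.
vp_first = mark(mark(sent, NOUN_OR_VERB, "vphead"), PRON_OR_DET, "np")

print(np_first)  # <np>they saw</np>
print(vp_first)  # <np>they</np> <vphead>saw</vphead>
```

As in the system, neither order is right in every context, which is why the residual cases are handled by later filters.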
(6.19) a. Hon hade vid kjol på sig.
          she had wide skirt on herself
          'She was wearing a wide skirt.'
       b. NP: <np> Hon </np> hade <np> vid kjol </np> på <np> sig </np>.
          NP .o. PP: <np> Hon </np> hade <np> <pphead> vid </pphead> kjol </np> på <np> sig </np>.
       c. PP: Hon hade <pphead> vid </pphead> kjol på sig.
          PP .o. NP: <np> Hon </np> hade <pphead> <np> vid </np> </pphead> <np> kjol </np> på <np> sig </np>.

6.6.3 Further Ambiguity Resolution

As discussed above, the parsing order does not give the correct result in every context. Nouns, adjectives and pronouns that are homonymous with verbs may be interpreted by the parser as verbal heads, and adjectives homonymous with prepositions may be analyzed as prepositional heads. These parsing decisions can be revised at a later stage by the application of filtering transducers (see Section 6.3.3). As exemplified in (6.17) above, parsing verbal heads before noun phrases may yield noun phrases that are split into parts, because adjectives are interpreted as verbs. The filtering transducer in (RE6.25) adjusts such segments and removes the erroneous (inner) syntactic tags (i.e. replaces them with the empty string 0), so that only the outer noun phrase markers remain; it converts the split phrase in (6.20a) into the single noun phrase in (6.20b). The regular expression consists of two replacement rules that apply in parallel. They are constrained by the surrounding context: a preceding determiner (Det) and a following adjective phrase (APPhr) plus noun phrase (NPPhr) in the first rule, and a preceding determiner plus adjective phrase in the second rule.

(6.20) a. <np> han </np> <vphead> tittade </vphead> genom <np> det </np> <vphead> <np> öppna </np> </vphead> <np> fönstret </np>
       b. <np> han </np> <vphead> tittade </vphead> genom <np> det öppna fönstret </np>

(RE6.25) define adjustnpadj [
         "</np><vphead><np>" -> 0 || Det _ APPhr "</np></vphead>" NPPhr ,,
         "</np></vphead><np>" -> 0 || Det "</np><vphead><np>" APPhr _ ];
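The effect of such a filter can be sketched with two string substitutions. A minimal Python illustration over the bracketed sentence from (6.20), standing in for the parallel xfst replacement:

```python
import re

# Sketch of a filtering step like (RE6.25): remove the erroneous inner
# markers of a split noun phrase so only the outer <np> ... </np> remain.
# The bracketed string is a simplified version of example (6.20a).
SPLIT = "<np>det</np> <vphead><np>öppna</np></vphead> <np>fönstret</np>"

def adjust_np(text):
    # Drop the inner close+reopen between determiner and adjective ...
    text = re.sub(r"</np> <vphead><np>", " ", text)
    # ... and between adjective and head noun.
    text = re.sub(r"</np></vphead> <np>", " ", text)
    return text

print(adjust_np(SPLIT))  # <np>det öppna fönstret</np>
```

The real rules are additionally constrained by the surrounding Det/APPhr/NPPhr context so that well-formed parses are left alone.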
Noun phrases with a possessive noun as the modifier are split when the head noun is homonymous with a verb, as in (6.21).17 The parse is then adjusted by a filter that simply extracts the noun from the verbal head and moves the borders of the noun phrase, yielding (6.21c).

(6.21) a. barnens far hade dött
          the-children's father had died
          'the father of the children had died'
       b. <np> barnens </np> <vphead> <np> far </np> hade dött </vphead>
       c. <np> barnens far </np> <vphead> hade dött </vphead>

The filtering automaton in (RE6.26) inserts a start marker for the verbal head (i.e. replaces the empty string 0 with the syntactic tag <vphead>) right after the end of the actual noun phrase, and removes the redundant syntactic tags in the second replacement rule. The replacement procedure is (again) simultaneous, by application of parallel replacement.

(RE6.26) define adjustnpgen [
         0 -> "<vphead>" || NGen "</np><vphead>" NPPhr _ ,,
         "</np><vphead><np>" -> 0 || NGen _ $"<np>" "</np>" ];

Another ambiguity problem occurs with the interrogative pronoun var 'where', which in Swedish is homonymous with the copula verb var 'was/were'. Since verbal heads are annotated first in the system, identifying segments of maximal length, the homonymous pronoun is recognized as a verb and combined with the subsequent verb, as in (6.22) and (6.23).

(6.22) a. Var var den där överraskningen?
          where was the there surprise
          'Where was that surprise?'
       b. <vp> <vphead> <vc> <np> Var var </np> </vc> </vphead> <np> den där överraskningen </np> </vp>?

(6.23) a. Var såg du hästen Madde frågar jag.
          where saw you the-horse Madde ask I
          'Where did you see the horse, Madde? I asked.'
       b. <vp> <vphead> <vc> <np> Var såg </np> </vc> </vphead> <np> du </np> <np> hästen </np> </vp> Madde <vp> <vphead> frågar <np> jag </np> </vphead> </vp>.

17 Here the string far is ambiguous between the noun reading 'father' and the present tense verb form 'goes'.
A similar problem occurs with adjectives or participles homonymous with verbs, as in (6.24), where the adjective rädda 'scared' [pl] is identical to the infinitive or imperative form of the verb 'rescue' and is joined with the preceding copula verb to form a verb cluster.

(6.24) a. Alla blev rädda...
          all became afraid
          'All became afraid...'
       b. <np> Alla </np> <vp> <vphead> <vc> blev <np> <ap> rädda </ap> </np> </vc> </vphead> </vp>...

All verbal heads recognized as sequences of verbs beginning with a copula verb are selected by the replacement transducer in (RE6.27), which changes the verb cluster label (<vc>) to a new marking (<vccopula>). This selection makes no changes to the parsing result, in that no markings are moved or removed; its purpose is rather to prevent false error detection by marking such verb clusters as being different. For instance, applying this transducer to the example in (6.22) yields the output presented in (6.25).

(RE6.27) define SelectVCCopula [
         "<vc>" -> "<vccopula>" || _ [CopVerb / NPTags] $"<vc>" "</vc>" ];

(6.25) <vp> <vphead> <vccopula> <np> Var var </np> </vc> </vphead> <np> den där överraskningen </np> </vp>?

6.6.4 Parsing Expansion and Adjustment

The text is now annotated with syntactic tags, and some of the segments have to be further expanded with postnominal attributes and coordinations. In the current system, partitive prepositional phrases are the only postnominal attributes taken care of, the reason being that grammatical errors were found in these constructions. By application of the filtering transducer in (RE6.28), the text in (6.26a), whose partitive noun phrase is split at the parsing stage into a noun phrase followed by a prepositional head (containing the partitive preposition av 'of') and yet another noun phrase, as in (6.26b), is merged to form a single noun phrase, as in (6.26c).
This automaton removes the redundant inner syntactic markers by application of two replacement rules, constrained by the right or left context. The replacement occurs simultaneously, by application of parallel replacement.

(RE6.28) define adjustnppart [
         "</np><pphead>" -> 0 || _ PPart "</pphead><np>" ,,
         "</pphead><np>" -> 0 || "</np><pphead>" PPart _ ];
(6.26) a. Mamma och Virginias mamma hade öppnat en tygaffär i en av Dom gamla husen.
          mum and Virginia's mum had opened a fabric-store in one of the old the-houses
          'Mum and Virginia's mum had opened a fabric store in one of the old houses.'
       b. <np> Mamma </np> och <np> Virginias mamma </np> <vp> <vphead> <vc> hade öppnat </vc> </vphead> <np> en tygaffär </np> i <np> en </np> <pphead> av </pphead> <np> Dom <ap> gamla </ap> husen </np>.
       c. <np> Mamma </np> och <np> Virginias mamma </np> <vp> <vphead> <vc> hade öppnat </vc> </vphead> <np> en tygaffär </np> i <NPPart> en av Dom <ap> gamla </ap> husen </np>.

Another type of phrase that needs to be expanded is the verbal group with a noun phrase in the middle, normally occurring when a sentence is initiated by constituents other than the subject (i.e. with inverted word order; see Section 4.3.6), as in (6.27a). In the parsing phase the verbal group is split into two verbal heads, as in (6.27b), which should be joined into one, as in (6.27c).

(6.27) a. En dag tänkte Urban göra varma mackor.
          one day thought Urban do hot sandwiches
          'One day Urban thought of making hot sandwiches.'
       b. <np> En dag </np> <vphead> tänkte </vphead> <np> Urban </np> <vphead> göra </vphead> <np> varma mackor </np>.
       c. <np> En dag </np> <vphead> tänkte <np> Urban </np> göra </vphead> <np> varma mackor </np>.

The filtering automaton merging the parts of a verb cluster into a single segment is constrained so that two verbal heads are joined only if there is a noun phrase in between them and the preceding verbal head includes an auxiliary verb or a verb that combines with an infinitive verb form (VBAux). The corresponding regular expression (RE6.29) removes the redundant verbal head markers in this constrained context. The replacement works in parallel, removing both the redundant start marker (<vphead>) and the end marker (</vphead>) at the same time.
There are two (alternative) replacement rules for each tag, since the noun phrase can either occur directly after the first verbal head, as in our example (6.27) above, or, being a pronoun, be part of the first verbal head. Tags not relevant for this replacement (VCTags) are ignored (/).
(RE6.29) define adjustvc [
         "</vphead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] _ NPPhr VPheadPhr ,,
         "</vphead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr _ VPheadPhr ,,
         "<vphead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] "</vphead>" NPPhr _ $"<vphead>" "</vphead>" ,,
         "<vphead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr "</vphead>" _ $"<vphead>" "</vphead>" ];

Other filtering transducers are used to refine the parsing result. Incomplete parsing decisions are eliminated at the end of parsing. For instance, incomplete prepositional phrases, i.e. a prepositional head without a following noun phrase, are removed by the regular expression in (RE6.30a). Also removed are empty verbal heads, as in (RE6.30b), and other misplaced tags.

(RE6.30) a. define errorpphead [
            "<pphead>" -> 0 || \["<pp>"] _ ,,
            "</pphead>" -> 0 || _ \["<np>"] ];
         b. define errorvphead [
            "<vp><vphead></vphead></vp>" -> 0 ];

6.7 Narrow Grammar

The narrow grammar is the grammar proper, whose purpose is to distinguish grammatical segments from ungrammatical ones. The automata of this grammar express the valid (grammatical) rules of Swedish, constraining both the order of constituents and the feature requirements. The current grammar is based on the Child Data corpus and includes rules for noun phrases and the verbal core.

6.7.1 Noun Phrase Grammar

Noun Phrases

The rules in the noun phrase grammar are divided, following Cooper's approach (Cooper, 1984, 1986), according to the types of constituent they consist of and the feature conditions they have to fulfill (see Section 4.3.1). Altogether ten noun phrase types are implemented, listed in Table 6.3, including noun phrases with a (proper) noun, a pronoun, a determiner, an adjective or a numeral as the head, as well as partitive attributes, reflecting the profile of the Child Data corpus.
Table 6.3: Noun Phrase Types

RULE SET  NOUN PHRASE TYPE                            EXAMPLE
NP1       single noun: (Num) N                        (två) grodor '(two) frogs'
          proper noun: PNoun                          Kalle
NP2       determiner and noun: Det (DetAdv) (Num) N   de (här) (två) grodorna 'the/these (two) frogs'
          poss. noun and noun: NGen (Num) N           flickans (två) grodor 'the girl's (two) frogs'
NP3       determiner, adj. and noun: Det AP N         den lilla grodan 'the little frog'
          poss. noun, adj. and noun: NGen AP N        flickans lilla groda 'the girl's little frog'
NP4       adjective and noun: (Num) AP N              (två) små grodor '(two) little frogs'
NP5       single pronoun: PN                          han 'he'
NP6       single determiner: Det                      den 'that'
NP7       adjective: Adj+                             obehörig 'unauthorized'
NP8       determiner and adjective: Det Adj+          de gamla 'the old'
NP9       numeral: (Det) Num                          den tredje 'the third', 8
NPPart    partitive: Num PPart NP                     två av husen 'two of the houses'
          partitive: Det PPart NP                     ett av de gamla husen 'one of the old houses'

Every noun phrase type is divided into six subrules expressing the different types of errors: two for definiteness (NPDef, NPInd), two for number (NPSg, NPPl) and two for gender agreement (NPUtr, NPNeu).18 For instance, in (RE6.31) we have the set of rules representing noun phrases consisting of a single pronoun. They state the feature requirements on the pronoun as the only constituent, e.g. that a definite form of the pronoun (PNDef) is required for the phrase to count as a definite noun phrase (NPDef).

18 Utr denotes the common gender, called utrum in Swedish.
(RE6.31) define NPDef5 [PNDef];
         define NPInd5 [PNInd];
         define NPSg5 [PNSg];
         define NPPl5 [PNPl];
         define NPNeu5 [PNNeu];
         define NPUtr5 [PNUtr];

The rule set NP2 presented in (RE6.32) is more complex and defines the grammar for definite, indefinite and mixed noun phrases (see Section 4.3.1) with a determiner (or a possessive noun) and a noun. For instance, the definite form of this noun phrase type (NPDef2) is defined as a sequence of a definite determiner (DetDef), an optional adverbial (DetAdv, e.g. här 'here'), an optional numeral (Num) and a definite noun, or as a sequence of a mixed determiner (DetMixed, i.e. one that takes an indefinite noun as complement, e.g. denna 'this') or a possessive noun (NGen), followed by an optional numeral and an indefinite noun.

(RE6.32) define NPDef2 [[DetDef (DetAdv) (Num) NDef] | [[DetMixed | NGen] (Num) NInd]];
         define NPInd2 [DetInd (Num) NInd];
         define NPSg2 [[DetSg (DetAdv) | NGen] (NumO) NSg];
         define NPPl2 [[DetPl (DetAdv) | NGen] (Num) NPl];
         define NPNeu2 [[DetNeu (DetAdv) | NGen] (Num) NNeu];
         define NPUtr2 [[DetUtr (DetAdv) | NGen] (Num) NUtr];

This particular automaton (NPDef2) accepts all the noun phrases in (6.28) except the first one, which forms an indefinite noun phrase and is handled by the corresponding automaton for indefinite noun phrases of this kind (NPInd2). It also accepts the ungrammatical noun phrase in (6.28c), since it only constrains the definiteness features. This erroneous noun phrase is instead handled by the automaton representing singular noun phrases of this type (NPSg2), which states that only ordinal numbers (NumO) can be combined with singular determiners and nouns.

(6.28) a. en (första) blomma 'a [indef] (first) flower [indef]'
       b. den (här) (första) blomman 'this [def] (here) (first) flower [def]'
       c. *den (här) (två) blomman 'this [def] (here) (two) flower [def]'
       d. denna (första) blomma 'this [def] (first) flower [indef]'
       e. flickans (första) blomma 'the girl's [def] (first) flower [indef]'
The different noun phrase rules can be joined by union into larger sets, divided according to the feature conditions they meet. For instance, the set of all definite noun phrases is defined as in (RE6.33a) and the set of all indefinite noun phrases as in (RE6.33b). All noun phrases that meet definiteness agreement are then represented by the regular expression in (RE6.33c), an automaton formed as the union of all definite and all indefinite noun phrase automata.

(RE6.33) a. ### Definite NPs
            define NPDef [NPDef1 | NPDef2 | NPDef3 | NPDef4 | NPDef5 | NPDef6 | NPDef7 | NPDef8 | NPDef9];
         b. ### Indefinite NPs
            define NPInd [NPInd1 | NPInd2 | NPInd3 | NPInd4 | NPInd5 | NPInd6 | NPInd7 | NPInd8 | NPInd9];
         c. ###### NPs that meet definiteness agreement
            define NPDefs [NPDef | NPInd];

Noun phrases with partitive attributes have a noun phrase as the head and are treated separately in the grammar. Although agreement holds only in gender, between the quantifier and the noun phrase, the rules for definiteness and number state that the noun phrase has to be definite and plural; see (RE6.34).

(RE6.34) define NPPartDef [[Det | Num] PPart NPDef];
         define NPPartInd [[Det | Num] PPart NPDef];
         define NPPartSg [[DetSg | Num] PPart NPPl];
         define NPPartPl [[DetPl | Num] PPart NPPl];
         define NPPartNeu [[DetNeu | Num] PPart NPNeu];
         define NPPartUtr [[DetUtr | Num] PPart NPUtr];

Adjective Phrases

Adjective phrases occur as modifiers in two of the defined noun phrase types (NP3 and NP4) and form the head on their own in two others (NP7 and NP8). In the present implementation an adjective phrase consists of an optional adverb and a sequence of one or more adjectives, and it is likewise defined according to the feature conditions that have to be fulfilled for definiteness, number and gender, as shown in (RE6.35). The gender feature set also includes an additional definition for masculine gender.
(RE6.35) define APDef ["<ap>" (Adv) AdjDef+ "</ap>"];
         define APInd ["<ap>" (Adv) AdjInd+ "</ap>"];
         define APSg ["<ap>" (Adv) AdjSg+ "</ap>"];
         define APPl ["<ap>" (Adv) AdjPl+ "</ap>"];
         define APNeu ["<ap>" (Adv) AdjNeu+ "</ap>"];
         define APUtr ["<ap>" (Adv) AdjUtr+ "</ap>"];
         define APMas ["<ap>" (Adv) AdjMas+ "</ap>"];
One problem related to error detection concerns the ambiguity between weak and strong forms of adjectives, which coincide in the plural, whereas in the singular the weak form of adjectives is used only in definite singular noun phrases (see Section 4.3.1). Consequently, such adjectives obtain both singular and plural tags, and errors such as the one in (6.29) will be overlooked by the system. As we see in (6.29a), the adjective trasiga 'broken' can be read as both singular (and definite) and plural (and indefinite), matching the surrounding determiner and head noun, so the check for number and definiteness will succeed. Since the whole noun phrase is singular, the plural tag highlighted in bold face in (6.29b) is irrelevant and can be removed by the automaton defined in (RE6.36), allowing a definiteness error to be reported.

(6.29) a. en trasiga speldosa
          a [sg,indef] broken [sg,wk] or [pl] musical-box [sg,indef]
       b. <np> en[dt utr sin ind][pn utr sin ind sub/obj] <ap> trasiga[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] </ap> speldosa[nn utr sin ind nom] </np>

(RE6.36) define removepluraltagsnpsg [TagPLU -> 0 || DetSg "<ap>" Adj _ $"</np>" "</np>"];

Other Selections

In addition to these noun phrase rules, noun phrases with a determiner and a noun as the head that are followed by a relative subordinate clause are treated separately, for the reason that the definiteness conditions are different in this context (see Section 4.3.1). As in (6.30), the head noun, which normally is in definite form after a definite article, lacks the suffixed article and stands instead in indefinite form. In the current system, these segments are selected as separate from other noun phrases by application of the filtering transducer in (RE6.37), which simply changes the beginning noun phrase label (<np>) to the label <NPRel> in the context of a definite determiner with other constituents and the complementizer som 'that'.
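The effect of a filtering transducer like (RE6.36) can be sketched in Python: each token carries a set of morphological readings, and inside a noun phrase headed by a singular determiner the irrelevant plural reading of an adjective is dropped, so that the definiteness mismatch is no longer masked. The tag strings and helper below are illustrative stand-ins, not FiniteCheck's actual tagset or code.

```python
# Toy illustration of the filtering step in (RE6.36): an adjective like
# "trasiga" is ambiguous between a singular definite and a plural reading;
# inside an NP with a singular determiner the plural reading is irrelevant
# and is removed, so the definiteness check can subsequently flag the error.
# Tag strings are illustrative, not the SUC-style tags used by FiniteCheck.

def drop_irrelevant_plural(np_tokens):
    """np_tokens: list of (word, set of tag strings). If the NP's
    determiner is singular, strip plural readings from adjectives."""
    has_sg_det = any(t.startswith("dt.sg")
                     for _, tags in np_tokens for t in tags)
    result = []
    for word, tags in np_tokens:
        if has_sg_det and any(t.startswith("jj") for t in tags):
            # Remove plural readings of the adjective in a singular NP.
            tags = {t for t in tags if ".plu" not in t}
        result.append((word, tags))
    return result

# "en trasiga speldosa" -- the definiteness error from (6.29)
np = [("en", {"dt.sg.ind"}),
      ("trasiga", {"jj.sg.def", "jj.plu.ind", "jj.plu.def"}),
      ("speldosa", {"nn.sg.ind"})]

filtered = dict(drop_irrelevant_plural(np))
print(filtered["trasiga"])  # {'jj.sg.def'} -- only the singular reading left
```

After filtering, trasiga retains only its singular definite reading and can no longer agree with the indefinite determiner and noun, which is exactly what lets the error surface.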
The grammar is thus prepared for a later extension to detect these error types as well.
(6.30) a. Jag tycker att det borde finnas en hjälpgrupp för de elever som har lite sociala problem.
          I think that it should exist a help-group for the [pl,def] pupils [pl,indef] that have some social problems
          'I think that there should be a help-group for the pupils that have some social problems.'
       b. <np> Jag </np> <vp> <vphead> tycker </vphead> </vp> att <np> det </np> <vp> <vphead> <vc> borde finnas </vc> </vphead> <np> en </np> </vp> hjälpgrupp <np> för </np> <NPRel> de elever </np> som <vp> <vphead> har </vphead> <np> <ap> sociala </ap> problem </np> </vp>.

(RE6.37) define SelectNPRel ["<np>" -> "<NPRel>" || _ DetDef $"<np>" "</np>" (" ") {som} Tag*];

6.7.2 Verb Grammar

The narrow grammar of verbs specifies the valid rules of finite and non-finite verbs (see Section 4.3.5). The rules consider the form of the main finite verb, verb clusters and verbs in infinitive phrases.

Finite Verb Forms

The finite verb form occurs in verbal heads either as a single main verb or as an auxiliary verb in a verb cluster. The grammar rule in (RE6.38) states that the first verb in the verbal head (possibly preceded by adverb(s)) has to be tensed. Any following verbs (or other constituents) in the verbal head are then ignored (indicated by the any-symbol ?).

(RE6.38) define VPFinite [Adv* VerbTensed ?*];

Infinitive Verb Phrases

The rule defining the verb form in infinitive phrases concerns verbal heads preceded by an infinitive marker. The marking transducer in (RE6.39a) selects these verbal heads and changes the label to infinitival verbal head (<vpheadinf>). The grammar rule of the infinitive verbal core is defined in (RE6.39b), including just one verb in infinitive form (VerbInf), possibly preceded by (zero or more) adverbs and/or a modal verb also in infinitive form (ModInf).
(RE6.39) a. define SelectInfVP ["<vphead>" -> "<vpheadinf>" || InfMark "<vp>" _ ];
         b. define VPInf [Adv* (ModInf) VerbInf Adv* ?*];

Verb Clusters

The narrow grammar of verb clusters is more complex, including rules for both modal (Mod) and temporal auxiliary verbs (Perf) and for verbs combining with infinitive verbs (INFVerb), i.e. infinitive phrases without an infinitive marker (see Section 4.3.5). The grammar rules state the order of constituents and the form of the verbs following the auxiliary verb. The form of the auxiliary verb itself is defined in the VPFinite rule above (see (RE6.38)), i.e. the verb has to have finite form. The marking automaton in (RE6.40b) selects as verb clusters all verbal heads that include more than one verb, by means of the VC rule in (RE6.40a). The potential verb clusters have the form of a verb followed by (zero or more) adverbs, an (optional) noun phrase, (zero or more) adverbs and subsequently one or two verbs. Other syntactic tags (NPTags) are ignored (/ is the ignore-operator).

(RE6.40) a. define VC [[[Verb Adv*] / NPTags] (NPPhr) [[Adv* Verb (Verb)] / NPTags]];
         b. define SelectVC [VC @-> "<vc>" ... "</vc>"];

Five different rules describe the grammar of verb clusters. Three rules concern the modal verbs (VC1, VC2, VC3, presented in (RE6.41)) and two rules deal with temporal auxiliary verbs (VC4, VC5, presented in (RE6.42)). Verbs that take infinitival phrases (without the infinitival marker) (INFVerb) share two rules with the modal verbs (VC1, VC2). All the verb cluster rules have the form VBaux (NP) Adv* Verb (Verb), i.e. an auxiliary verb followed by an optional noun phrase, (zero or more) adverb(s), a verb and an optional verb. By including the optional noun phrase, the grammar also handles inverted sentences. Again, irrelevant tags (NPTags) are ignored.

(RE6.41) a. define VC1 [[[Mod | INFVerb] / NPTags] (NPPhr) [[Adv* VerbInf] / NPTags]];
         b.
define VC2 [ [Mod / NPTags] (NPPhr) [[Adv* ModInf VerbInf] / NPTags] ]; c. define VC3 [ [Mod / NPTags] (NPPhr) [[Adv* PerfInf VerbSup] / NPTags] ];
(RE6.42) a. define VC4 [[Perf / NPTags] (NPPhr) [[Adv* VerbSup] / NPTags]];
         b. define VC5 [[Perf / NPTags] (NPPhr) [[Adv* ModSup VerbInf] / NPTags]];

All five rules can be combined by union into one automaton that represents the grammar of all verb clusters, presented in (RE6.43).

(RE6.43) define VCgram [VC1 | VC2 | VC3 | VC4 | VC5];

Other Selections

Coordinations of verbal heads in verb clusters or in infinitive verb phrases are selected as separate segments by the marking transducer in (RE6.44). The automaton replaces the verbal head marking with a new label that indicates coordination of verbs (<vpheadcoord>), as exemplified in (6.31) and (6.32).

(RE6.44) define SelectVPCoord ["<vphead>" -> "<vpheadcoord>" || ["<vpheadinf>" | "</vc>"] $"<vphead>" $"<vp>" [{eller} | {och}] Tag* (" ") "<vp>" _ ];

(6.31) a. hon skulle springa ner och larma
          she would run down and alarm
          'she was about to run down and give the alarm.'
       b. <np> hon </np> <vp> <vphead> <vc> skulle <np> springa </np> </vc> </vphead> </vp> ner och <vp> <vpheadcoord> larma </vphead> </vp>

(6.32) a. det är dags att gå och lägga sig.
          it is time to go and lay oneself
          'It is time to go to bed.'
       b. <np> det </np> <vp> <vphead> är </vphead> <np> dags </np> </vp> att <vp> <vphead> gå </vphead> </vp> och <vp> <vpheadcoord> lägga <np> sig </np> </vphead> </vp>.

The infinitive marker att is in Swedish homonymous with the complementizer att 'that' and with part of för att 'because', and it is thus not necessarily followed by an infinitive, as in (6.33), (6.34) and (6.35). Such ambiguous constructions are selected as separate segments by the regular expression in (RE6.45), which changes the verbal head label to <vpheadattfinite>.
(6.33) a. Tuni ringde mig sen och sa att allt hade gått bara bra.
          Tuni called me later and said that everything had [pret] gone [sup] just good
          'Tuni called me later and said that everything had gone just fine.'
       b. Tuni <vp> <vphead> ringde </vphead> <np> mig </np> </vp> sen och <vp> <vphead> sa </vphead> </vp> att <vp> <vpheadattfinite> <np> allt </np> <vc> hade gått </vc> </vphead> <np> <ap> bara bra </ap> </np> </vp>.

(6.34) a. Men det skulle han aldrig ha gjort för att då börjar grenen att röra på sig...
          but it should he never have done because then starts [pres] the-branch to move on itself
          'But he should never have done that because then the branch starts to move.'
       b. Men <np> det </np> <vp> <vphead> <vc> skulle <np> han </np> aldrig ha <np> <ap> gjort </ap> </np> </vc> </vphead> </vp> <np> för </np> att <vp> <vpheadattfinite> då börjar </vphead> <np> grenen </np> </vp> att <vp> <vpheadinf> <np> röra </np> <pp> <pphead> på </pphead> <np> sig </np> </pp> </vphead> </vp>...

(6.35) a. så tänkte jag att nu hade jag chansen.
          so thought I that now had [pret] I the-chance
          'so I thought that now I had the chance.'
       b. <vp> <vphead> så tänkte <np> jag </np> </vphead> </vp> att <vp> <vpheadattfinite> nu hade <np> jag </np> </vphead> <np> chansen </np> </vp>.

(RE6.45) define SelectATTFinite ["<vphead>" -> "<vpheadattfinite>" || [[[{sa} Tag+] [[{för} Tag+] / NPTags]] ("</vphead></vp>")] [[{tänkte} Tag+] [[NPPhr "</vphead></vp>"] ["</vphead>" NPPhr "</vp>"]]] InfMark "<vp>" _ ];

Verbal heads with only a supine verb, as in (6.36) and (6.37), are also selected separately. They are considered grammatical in subordinate clauses, whereas main clauses with supine verbs without preceding auxiliary verbs are invalid in Swedish (see Section 4.3.5). The transducer created by the regular expression in (RE6.46) replaces a verbal head marking with <vpheadsup>.
(6.36) a. Tänk om jag bott hos pappa.
          think if I lived [sup] with daddy
          'Think if I had lived at Daddy's.'
       b. Tänk <pp> <pphead> om </pphead> <np> jag </np> </pp> <vp> <vpheadsup> bott </vphead> <pp> <pphead> <np> hos </np> </pphead> <np> pappa </np> </pp> </vp>.

(6.37) a. det var en gång en pojke som fångat en groda.
          it was a time a boy that caught [sup] a frog
          'There was once a boy that had caught a frog.'
       b. <np> det </np> <vp> <vphead> <np> var </np> </vphead> <np> en gång </np> <np> en pojke </np> </vp> som <vp> <vpheadsup> fångat </vphead> <np> en groda </np> </vp>.

(RE6.46) define SelectSupVP ["<vphead>" -> "<vpheadsup>" || _ VerbSup "</vphead>"];

6.8 Error Detection and Diagnosis

6.8.1 Introduction

The broad grammar is applied for marking both the grammatical and ungrammatical phrases in a text. The narrow grammar expresses the nature of grammatical phrases in Swedish and is then used to distinguish the true grammatical patterns from the ungrammatical ones. The automata created in the stage of error detection correspond to the patterns that do not meet the constraints of the narrow grammar and thus compile into a grammar of errors. This is achieved by subtraction of the narrow grammar from the broad grammar: the potential phrase segments recognized by the broad grammar are checked against the rules in the narrow grammar and, by looking at the difference, the constructions violating these rules are identified. The detection process is also partial in the sense that errors are located in an appropriately delimited context, i.e. a noun phrase when looking for agreement errors in noun phrases, a verbal head when looking for violations of finite verbs, etc. The replacement operator is used for selection of errors in text.
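The subtraction at the heart of the detector can be emulated over strings of tags: a candidate phrase is an error exactly when it is accepted by the broad grammar but not by the narrow one. The sketch below uses Python regular expressions over a toy tag alphabet as stand-ins for the xfst automata; it illustrates the technique, not the system's actual grammars.

```python
import re

# A phrase is flagged exactly when it lies in L(broad) - L(narrow).
# Broad grammar: any determiner followed by any noun counts as an NP.
BROAD = re.compile(r"det\.(def|ind) noun\.(def|ind)")
# Narrow grammar: determiner and noun must agree in definiteness.
NARROW = re.compile(r"det\.def noun\.def|det\.ind noun\.ind")

def is_error(tag_string):
    """True iff the tag string is accepted by the broad grammar
    but rejected by the narrow one, i.e. it is in the difference."""
    return (BROAD.fullmatch(tag_string) is not None
            and NARROW.fullmatch(tag_string) is None)

assert not is_error("det.ind noun.ind")  # agreeing NP: in both languages
assert is_error("det.def noun.ind")      # definiteness mismatch: broad only
```

Note that, just as in the thesis, only the positive (narrow) grammar has to be written; the error language falls out of the difference rather than being stipulated rule by rule.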
6.8.2 Detection of Errors in Noun Phrases

In the current narrow grammar, there are three rules for agreement errors in noun phrases without postnominal attributes and three for partitive constructions, all reflecting the features of definiteness, number and gender and differing only in the context they are detected in. We present the detection rules for noun phrases without postnominal attributes in (RE6.47) and for partitive noun phrases in (RE6.48). These automata represent the result of subtracting the narrow grammar of, e.g., all noun phrases that meet the definiteness conditions (NPDefs) ((RE6.33) on p. 208) from the overgenerating grammar of all noun phrases (NP) ((RE6.22) on p. 196). By application of a marking transducer, the ungrammatical segments are selected and annotated with appropriate diagnosis markers related to the types of rules that are violated, as presented in (RE6.47) and (RE6.48).

(RE6.47) a. define npdeferror ["<np>" [NP - NPDefs] "</np>"];
            define npnumerror ["<np>" [NP - NPNum] "</np>"];
            define npgenerror ["<np>" [NP - NPGen] "</np>"];
         b. define marknpdeferror [npdeferror -> "<Error definiteness>" ... "</Error>"];
            define marknpnumerror [npnumerror -> "<Error number>" ... "</Error>"];
            define marknpgenerror [npgenerror -> "<Error gender>" ... "</Error>"];

(RE6.48) a. define NPPartDefError ["<NPPart>" [NPPart - NPPartDefs] "</np>"];
            define NPPartNumError ["<NPPart>" [NPPart - NPPartNum] "</np>"];
            define NPPartGenError ["<NPPart>" [NPPart - NPPartGen] "</np>"];
         b. define marknppartdeferror [NPPartDefError -> "<Error definiteness NPPart>" ... "</Error>"];
            define marknppartnumerror [NPPartNumError -> "<Error number NPPart>" ... "</Error>"];
            define marknppartgenerror [NPPartGenError -> "<Error gender NPPart>" ...
"</Error>"]; The narrow grammar of noun phrases is prepared for further extension of noun phrases modified by relative clauses that in the current version of the system, are just selected as distinct from the other noun phrase types.
6.8.3 Detection of Errors in the Verbal Head

Three detection rules are defined for verb errors, identifying the three types of context they can appear in. Errors in finite verb form are checked directly in the verbal head (vphead). Errors in infinitive phrases are detected in the context of a verbal head preceded by an infinitive marker (vpheadinf). Errors in verb form following an auxiliary verb are detected in the context of previously selected (potential) verb clusters (vc). The nets of the detecting regular expressions presented in (RE6.49a) correspond (as for noun phrases) to the difference between the grammatical rules (e.g. VPInf in (RE6.39) on p. 211) and the more general rules (e.g. VPHead in (RE6.22) on p. 196), yielding the ungrammatical verbal head patterns. The annotating automata in (RE6.49b) are used for error diagnosis.

(RE6.49) a. define vpfiniteerror ["<vphead>" [VPhead - VPFinite] "</vphead>"];
            define vpinferror ["<vpheadinf>" [VPhead - VPInf] "</vphead>"];
            define VCerror ["<vc>" [VC - VCgram] "</vc>"];
         b. define markfiniteerror [vpfiniteerror -> "<Error finite verb>" ... "</Error>"];
            define markinferror [vpinferror -> "<Error infinitive verb>" ... "</Error>"];
            define markvcerror [VCerror -> "<Error verb after Vaux>" ... "</Error>"];

Also, the narrow grammar of verbs can be extended with the grammar of coordinated verbs, the use of finite verb forms after att 'that' and the bare supine verb form as predicate, all selected as separate patterns.

6.9 Summary

This chapter presented the final step of this thesis: to implement detection of some of the grammar errors found in the Child Data corpus. The whole system is implemented as a network of finite state transducers, disambiguation is minimal, achieved essentially by parsing order and filtering techniques, and the grammars of the system are always positive. The system detects errors in noun phrase agreement and errors in finite and non-finite verb forms.
The strength of the implemented system lies in the definition of grammars as positive rule sets, covering the valid rules of the language. The rule sets remain
quite small, and practically no description of errors by hand is necessary. There are altogether six rules defining the broad grammar set, and the narrow grammar set is also quite small. Other automata are used for selection and filtering. We do not have to elaborate on what errors may occur, only on in what context, and we certainly do not have to spend time stipulating their structure. The approach further aimed at minimal information loss in order to be able to handle texts containing errors. The degree of ambiguity is maximal at the lexical level, where we choose to attach all lexical tags to strings. At higher levels, structural ambiguity is treated by parsing order, grammar extension and filtering techniques. The parsing order resolves some structural ambiguities and is complemented by grammar extensions and by the application of filtering transducers that refine and/or redefine the parsing decisions. Other disambiguation heuristics are applied, for instance, to noun phrases, where pronouns that follow a verbal head are attached directly to the verbal head in order to prevent them from attaching to a subsequent noun.
Chapter 7

Performance Results

7.1 Introduction

The implementation of the grammar error detector is to a large extent based on the lexical and syntactic circumstances displayed in the Child Data corpus. The actual implementation proceeded in two steps. In the first phase we developed the grammar so that the system could run on sentences containing errors and correctly identify the errors. When the system was then run on complete texts, including correct material, the false alarms allowed by the system were revealed. The second phase involved adjustment of the grammar to improve the flagging accuracy of the system. FiniteCheck was tested for grammatical coverage (recall) and flagging accuracy (precision) on Child Data and on an arbitrary text not known to the system, in accordance with the performance test on the other three grammar checkers (see Section 5.2.3). In this chapter I present results from both the initial phase in the development of the system (Section 7.2) and the improved current version (Section 7.3). The results are further compared to the performance of the other three Swedish checkers on both Child Data (Section 7.4) and the unseen adult text (Section 7.5). The chapter ends with a short summary and conclusions (Section 7.6).

7.2 Initial Performance on Child Data

7.2.1 Performance Results: Phase I

The results of the implemented detection of errors in noun phrase agreement, verb form in finite verbs, after auxiliary verb and after infinitive markers in Child
Data from the initial Phase I in the development of FiniteCheck are presented in Table 7.1.

Table 7.1: Performance Results on Child Data: Phase I

                                 CORRECT ALARM         FALSE ALARM        PERFORMANCE
ERROR TYPE                ERRORS  Correct  Incorrect   No      Other   Recall  Precision  F-value
                                  Diagn.   Diagn.      Error   Error
Agreement in NP              15     14        1          76      64     100%      10%       18%
Finite Verb Form            110     98        0         237      19      89%      28%       42%
Verb Form after Vaux          7      6        0          61      10      86%       8%       15%
Verb Form after inf. m.       4      4        0           5       0     100%      44%       62%
TOTAL                       136    122        1         379      93      90%      21%       34%

The grammatical coverage (recall) on this training corpus was maximal, except for one erroneous verb form after an auxiliary verb and a few instances of errors in finite verb form. The overall recall rate for these four error types was 90%. When the system was tested on the whole Child Data corpus, many segments were wrongly marked as errors and the precision rate was quite low, only 21% in total, resulting in an overall F-value of 34%. Most of the false alarms occurred for errors in finite verb form, followed by errors in noun phrase agreement. Relative to the error frequency of the individual error types, errors in verb form after an auxiliary verb had the lowest precision (8%), closely followed by errors in noun phrase agreement (10%). The grammar of the system was at this initial stage based essentially on the syntactic constructions displayed in the erroneous patterns that we wanted to capture. Many of the false alarms were due to missing grammar rules when the system was tested on the whole text corpus. Other false markings of correct text occurred due to ambiguity, incorrect segmentation of the text at the parsing stage, or occurrences of other error categories than grammatical ones. Below I discuss in more detail the grammatical coverage and flagging accuracy in this initial phase.

7.2.2 Grammatical Coverage

Errors in Noun Phrase Agreement

All errors in noun phrase agreement were detected, one of them with incorrect diagnosis due to a split in the head noun.
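The recall, precision and F-values in Table 7.1 follow the standard formulas, and the summary row can be verified directly: both correct and incorrect diagnoses count as detected errors, while both "no error" and "other error" flags count as false alarms.

```python
# Verify the TOTAL row of Table 7.1. An error counts as detected whether
# or not its diagnosis is correct; false alarms cover both flags on
# correct text ("No Error") and flags on other error categories.
errors, correct_diag, incorrect_diag = 136, 122, 1
false_alarms = 379 + 93

detected = correct_diag + incorrect_diag           # 123
recall = detected / errors                         # 123/136
precision = detected / (detected + false_alarms)   # 123/595
f_value = 2 * precision * recall / (precision + recall)

print(round(recall * 100), round(precision * 100), round(f_value * 100))
# 90 21 34
```

The same computation reproduces each per-error-type row, e.g. 15/155 for the 10% precision of noun phrase agreement.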
FiniteCheck is not prepared to handle segmentation errors and, exactly like the other Swedish grammar checkers, detects the noun phrase with inconsistent use of adjectives (G1.2.3; see (4.9) on p. 49) only in part. The detector yields both the correct diagnosis of a gender mismatch and an incorrect diagnosis of a definiteness mismatch, since the first part troll 'troll' of the split head noun is indefinite and neuter and does not agree with the definite, common gender determiner den 'the', as seen in (7.1a). When the head noun has the correct form and is no longer split into two parts, the whole noun phrase is selected and a gender mismatch is reported, as seen in (7.1b).

(7.1) (G1.2.3)
      a. det va <Error definiteness><Error gender> den hemske fula troll </Error></Error> karlen tokig som...
         it was the [com,def] awful [masc] ugly [def] troll [neu,indef] man [com,def] Tokig that
         'It was the awful ugly magician Tokig that...'
      b. det va <Error gender> den hemske fula trollkarlen </Error> tokig som...
         it was the [com,def] awful [masc] ugly [def] magician [com,def] Tokig that
         'It was the awful ugly magician Tokig that...'

Errors in Finite Verb Form

Among the errors in finite verb form, none of the errors concerning main verbs realized as participles were detected (G5.2.90-G5.2.99; see (4.30) on p. 60). They require other methods for detection, since, as seen in (7.2), they are interpreted as adjective phrases.

(7.2) a. älgen sprang med olof till ett stup och kastad ner olof och hans hund
         the-moose ran with Olof to a cliff and thrown [past part] down Olof and his dog
         'The moose ran with Olof to a cliff and threw Olof and his dog over it.'
      b. <np> älgen </np> <vp> <vphead> sprang med </vphead> <np> olof </np> <pp> <pphead> till </pphead> <np> ett stup </np> </pp> </vp> och <np> <ap> kastad </ap> </np> ner <np> olof </np> och <np> hans hund </np>

Two errors were missed due to preceding verbs joined into the same segment, which was then treated as a verb cluster, as shown in (7.3) and (7.4).
(7.3) (G5.1.1)
      a. Madde och jag bestämde oss för att sova i kojan och se om vi få se vind.
         Madde and I decided ourselves for to sleep in the-hut and see if we can [untensed] see Vind
         'Madde and I decided to sleep in the hut and see if we will see Vind.'
      b. <np> Madde </np> och <np> jag </np> <vp> <vphead> bestämde </vphead> <np> oss </np> </vp> <np> för </np> att <vp> <vpheadinf> sova i </vphead> <np> kojan </np> </vp> och <vp> <vphead> se om <np> vi </np> <np> få </np> se </vphead> <np> vind </np> </vp>.

(7.4) (G5.2.40)
      a. När vi kom fram börja vi packa upp våra grejer och rulla upp sovsäcken.
         when we came forward start [untensed] we pack up our stuff and roll up the-sleeping-bag
         'When we arrived, we started to unpack our things and roll out the sleeping bag.'
      b. När <np> vi </np> <vp> <vphead> <vc> kom fram börja </vc> <np> vi </np> packa upp </vphead> <np> våra grejer </np> </vp> och <vp> <vpheadcoord> rulla upp </vphead> <np> sovsäcken </np> </vp>.

One of the errors in finite verb form was wrongly selected, as seen in (7.5b). Here, the noun bo 'nest' is homonymous with the verb bo 'live' and is joined together with the main verb into a verb cluster; the detector selects the verb cluster¹ and diagnoses it as an error in finite verb, which is actually true, but only for the main verb, the second constituent of this segment.

¹ The noun phrase tags surrounding bo are ignored in the selection as verb cluster, see (RE6.40) on p. 211.
(7.5) (G5.2.70)
      a. Då gick pojken vidare och såg inte att binas bo trilla ner.
         then went the-boy further and saw not that the-bees' nest tumble [untensed] down
         'Then the boy went further on and did not see that the nest of the bees tumbled down.'
      b. <vp> <vphead> Då gick </vphead> <np> pojken </np> <np> vidare </np> </vp> och <vp> <vphead> <np> såg </np> inte </vphead> </vp> att <np> binas </np> <vp> <vphead> <Error finite verb> <vc> <np> bo </np> trilla </vc> </Error> </vphead> </vp> ner.

Rest of the Verb Form Errors

One error in verb form after an auxiliary verb was not detected (see (7.6)); it involved coordination of a verb cluster with yet another verb, which should follow the same pattern and thus be in infinitive form (i.e. låta 'let [inf]'). The system does not take coordination of verbs into consideration, and the coordinated verb is identified as a separate verbal head with a finite verb, which is a valid form in accordance with the grammar rules of the system, so the error is overlooked.

(7.6) (G6.1.2)
      a. Ibland får man bjuda på sig själv och låter henne/honom vara med!
         sometimes must [pres] one offer [inf] on oneself and lets [pres] her/him be with
         'Sometimes one has to make a sacrifice and let her/him come along!'
      b. <vp> <vphead> Ibland <np> får </np> <np> man </np> bjuda <pp> <pphead> på </pphead> <np> sig </np> </pp> </vphead> <np> själv </np> </vp> och <vp> <vphead> låter </vphead> </vp> henne/honom <vp> <vphead> <np> vara </np> med </vphead> </vp>!

Finally, all errors in verb form after an infinitive marker were detected.

7.2.3 Flagging Accuracy

In this subsection follows a presentation of the kinds of false flaggings that occurred in this first test of the system. The description proceeds error type by error type, with specifications on whether a false alarm was due to missing grammar rules, erroneous segmentation of the text at the parsing stage, or ambiguity. Furthermore, the false alarms involving other error categories are specified.
False Alarms in Noun Phrase Agreement

The kinds and the number of false alarms occurring in noun phrases are presented in Table 7.2.

Table 7.2: False Alarms in Noun Phrases: Phase I

FALSE ALARM TYPE                   NO.
Not in Grammar:  NPInd+som           5
                 Adv in NP          28
                 other               8
Segmentation:    too long parse     26
Ambiguity:       PP                  7
                 V                   2
Other Error:     misspelling        12
                 split              48
                 sentence boundary   4

Most of these false alarms were due to the fact that the relevant constructions were not included in the grammar of the system. For instance, adverbs in noun phrases, as in (7.7a), were not covered, causing alarms in gender agreement, since in Swedish the neuter form of an adjective often coincides with the adverb of the same lemma. Further, noun phrases with a subsequent relative clause, such as (7.7b), were selected as errors in definiteness, although they are correct, since the form of the head noun is indefinite when followed by such clauses (see Section 4.3.1).

(7.7) a. Det var i skolan och jag kom lite för sent till en lektion med <Error gender> väldigt sträng lärare </Error>.
         it was in school and I came little too late to a class with very hard/strict teacher
         'It was in school and I came a little late to a class with a very strict teacher.'
      b. Jag tycker att det borde finnas en hjälpgrupp för <Error definiteness> de elever </Error> som har lite sociala problem.
         I think that it should exist a help-group for the [pl,def] pupils [pl,indef] that have some social problems
         'I think that there should be a help-group for the pupils that have some social problems.'

Other false flaggings depended on the application of longest match, resulting in noun phrases with too wide a range, as in (7.8), where the modifying predicative
complement and the subject are merged into one noun phrase, since the inverted word order forced the verb to be placed at the end of the sentence instead of in its usual place in between, i.e. skolan 'school' should form a noun phrase of its own.

(7.8) dom tänker inte hur <Error definiteness> viktig skolan </Error> är
      they think not how important [str] school [def] is
      'They do not think how important school is.'

Furthermore, due to lexical ambiguity, some prepositional phrases, such as in (7.9), and some verbs were parsed as noun phrases and later marked as errors.

(7.9) Det är en ganska stor väg ungefär <Error definiteness> vid hamnen </Error>
      it is a rather big road somewhere wide [indef]/at harbor [def]
      'It is a rather big road somewhere at the harbor.'

False flaggings involving other error categories than grammar errors were also quite common. Mostly splits, as in (7.10a), were flagged. Here, the noun ögonblick 'moment' is split, and the first part of it, ögon 'eyes', does not agree in number with the preceding singular determiner and adjective. Flaggings involving misspellings also occurred, as in (7.10b), where the newly formed word results in a noun of different gender and definiteness than the preceding determiner and causes agreement errors. Some cases of missing sentence boundary were also flagged as errors in noun phrase agreement.

(7.10) a. För <Error number> ett kort ögon </Error> blick trodde jag...
          for a [sg] short eye [pl] blinking thought I...
          'For a short moment I thought...'
       b. <Error definiteness> <Error gender> Det ända </Error> </Error> jag vet
          the [neu,def] end [com,indef] I know
          'The only thing I know...'

Furthermore, erroneous tags assigned in the lexical lookup caused trouble, for instance when many words were erroneously selected as proper names.
False Alarms in Finite Verb Form

The types and number of false alarms in finite verbs are summarized in Table 7.3. These occurred mostly because of the small size of the grammar, but also due to ambiguity problems.

Table 7.3: False Alarms in Finite Verbs: Phase I

FALSE ALARM TYPE                          NO.
Not in Grammar:  imperative                56
                 coordinated infinitive    74
                 discontinuous verb cluster 43
Ambiguity:       noun                      36
                 pronoun                    8
                 preposition/adjective     20
Other Error:     misspelling                9
                 split                     10

Imperative verb forms, which in the first phase were not part of the grammar, caused false alarms not only for verbs, as in (7.11a), but also for strings homonymous with such forms, as in (7.11b). Here the word sätt is ambiguous between the noun reading 'way' and the imperative verb form 'set'.

(7.11) a. Men <Error finite verb> titta </Error> en stock.
          but look [imp] a log
          'But look, a log.'
       b. Dom samlade in pengar <Error finite verb> på olika sätt </Error>
          they collected in money in different ways/set [imp]
          'They collected money in different ways.'

Further, coordinated infinitives, as in (7.12), were diagnosed as errors in the finite verb form, since due to the partial parsing strategy they were selected as separate verbal heads (see (6.31) and (6.32) on p. 212).

(7.12) a. hon skulle springa ner och <Error finite verb> larma </Error>
          she would run [inf] down and alarm
          'she would run down and alarm.'
       b. det är dags att gå och <Error finite verb> lägga sig </Error>.
          it is time to go and lay [inf] oneself
          'It is time to go and lay down.'
Similar problems occurred with discontinuous verb clusters, when a noun followed the auxiliary verb and the subsequent verb forms were treated as separate verbal heads (see (6.27) on p. 204). Further, primarily nouns, but also pronouns, adjectives and prepositions, were recognized as verbal heads, causing false error diagnoses. Other error categories selected as errors in finite verb form concerned both splits and misspellings, but these were considerably fewer in comparison to the similar false alarms in noun phrase agreement.

False Alarms in Verb Forms after an Auxiliary Verb

False alarms in verb forms after an auxiliary verb occurred either due to ambiguity, with nouns, pronouns, adjectives and prepositions interpreted as verbs, or due to occurrences of other error categories (Table 7.4). In the case of pronouns, they were interpreted as verbs (mostly) in front of a copula verb and merged together into a verbal cluster segment. Similar problems occurred with adjectives and participles (see (6.22)-(6.24) starting on p. 202).

Table 7.4: False Alarms in Verb Clusters: Phase I

FALSE ALARM TYPE                      NO.
Ambiguity:     noun                    26
               pronoun                 18
               preposition/adjective   17
Other Error:   misspelling              3
               split                    7

Among the false flaggings concerning other error categories, both spelling errors and splits were flagged. In (7.13) we see an example of a misspelling where the adjective rädd 'afraid' is written as red, coinciding with the verb red 'rode', and the phrase is marked as an error in verb form after an auxiliary verb.²

(7.13) pojken <Error verb after Vaux> blev red </Error>
       the-boy became rode
       'The boy became afraid.'

Furthermore, many instances of missing punctuation at a sentence boundary were flagged as errors in verb clusters, as the ones in (7.14).³

² The broad grammar rule for verb clusters joins any types of verbs, which is why the copula verb blev 'became' is included.
³ Two vertical bars indicate the missing clause or sentence boundary.

Similarly to the
performance test of the other grammar checkers, these flaggings are not included in the test. They represent correct flaggings, although the diagnosis is not correct.

(7.14) a. Jag <Error verb after Vaux> fortsatte vägen fram || då såg </Error> jag en brandbil || jag visste vad det var.
          I continued the-road forward then saw I a fire-car I knew what it was
          'I continued forward on the road, then I saw a firetruck. I knew what it was.'
       b. I hålet pojken <Error verb after Vaux> hittat fanns </Error> en mullvad.
          in the-hole the-boy found[sup] was a mole
          'In the hole the boy found a mole.'

False Alarms in Verb Forms in Infinitive Phrase

Finally, five false alarms in infinitival verbal heads occurred in constructions that do not require an infinitive verb form after att, which is both an infinitive marker 'to' and a subjunction 'that' (see (6.33)-(6.35) starting on p. 213).

7.3 Current Performance on Child Data

7.3.1 Introduction

As shown above, almost all the errors in Child Data were detected by FiniteCheck. The segments erroneously classified as errors by the implemented detector were mostly due to the small number of grammatical structures covered by the grammar, tagging problems and the high degree of ambiguity in the system. Many alarms also involved other error categories, such as misspellings, splits and omitted punctuation. In accordance with these observations, the detection performance of the system was improved in three ways in order to avoid false alarms:

- extend and correct the lexicon
- extend the grammar
- improve parsing

The full form lexicon of the system is rather small (around 160,000 words) and not without errors. So, the first and rather easy step was to correct erroneous
tagging and add new words to the lexicon. The grammar rules were extended and filtering transducers were used to block false parses. Below follows a description of the grammar extension and the other improvements made to avoid false alarms in the individual error types. Then the current performance of the system is presented (Section 7.3.3).

7.3.2 Improving Flagging Accuracy

Improving Flagging Accuracy in Noun Phrase Agreement

The grammar of adjective phrases was expanded with missing adverbs. Noun phrases followed by relative clauses display distinct agreement constraints and were selected separately by the already discussed regular expression (RE6.37) (see p. 210). This does not mean that the grammar is extended for such noun phrases, but false alarms in these constructions may be avoided. The false alarms in noun phrases caused by limitations in the grammar set were all avoided. This grammar update further improved parsing in the system and decreased the number of wide parses giving rise to false alarms. The types and number of false alarms that remain are presented in Table 7.5.

Table 7.5: False Alarms in Noun Phrases: Phase II

FALSE ALARM TYPE                                NO.
Segmentation:     too long parse                  5
Ambiguity:        PP                             10
Other Error:      misspelling                    10
                  split                          35
                  sentence boundary               2

Among these are (relative) clauses without complementizers, as in (7.15).

(7.15) a. det var den godaste frukost jag någonsin ätit...
          it was the best breakfast I ever eaten
          'It was the best breakfast I have ever eaten...'
       b. <np> det </np> <vp> <vphead> <np> var </np> </vphead> <Error definiteness> <np> den <ap> godaste </ap> frukost </np> </Error> <np> jag </np> </vp> <vp> <vphead> någonsin ätit </vphead> </vp>.
Improving Flagging Accuracy in Finite Verbs

In the case of finite verbs, the problem with imperative verbs is solved to the extent that forms that do not coincide with other verb forms are accepted as finite verb forms, e.g. tänk 'think'. The imperative forms that coincide with infinitives (e.g. titta 'look') remain a problem. The difficulty is mostly that errors in verbs realized as missing tense endings often coincide with the imperative (and infinitive) form of the verb. Allowing all imperative verb forms as grammatical finite verb forms would then mean that such errors would not be detected by the system. Normally, other hints, such as checking for end-of-sentence marking or a noun phrase before the predicate, are used to identify imperative forms of verbs. These methods are however not suitable for texts written by children, since these texts often lack end-of-sentence punctuation or capitals indicating the beginning of a sentence. A noun phrase preceding the predicate could thus be the end of a previous sentence. However, just defining imperative verb forms not coinciding with other verb forms as grammatical finite verb forms reduced the number of false alarms in imperatives by half, as shown in Table 7.6 below. False alarms on finite verbs in coordinations with infinitive verbs decreased to just nine alarms; they were blocked by selecting infinitive verbs preceded by a verbal group or infinitive phrase as a separate pattern category with the transducer in (RE6.44) (see p. 212). Discontinuous verbal groups with a noun phrase following the auxiliary verb were joined together by the automaton (RE6.29) (see p. 205), and the narrow grammar of verb clusters was expanded to include (optional) noun phrases. Almost half of such false alarms were avoided. False alarms in finite verbs occurring because of ambiguous interpretation also decreased. Some of those were avoided by the grammar update that also improved parsing.
Further adjustments included nouns interpreted as verbs in possessive noun phrases and adjectives in noun phrases interpreted as verbal heads, which were filtered out by applying the automata (RE6.26) and (RE6.25) (see p. 201). Furthermore, verbal heads with a single supine verb form were distinguished, since they are grammatical in subordinate clauses (see (RE6.46) on p. 214).
The remaining false alarms are summarized in Table 7.6.

Table 7.6: False Alarms in Finite Verbs: Phase II

FALSE ALARM TYPE                                NO.
Not in Grammar:   imperative                     27
                  coordinated infinitive          9
                  discontinuous verb clusters    28
Ambiguity:        noun                            9
                  pronoun                         1
                  preposition/adjective          14
                  other                           6
Other Error:      misspelling                    18
                  split                          14

Improving Flagging Accuracy in Verb Form after Auxiliary Verb

The ambiguity resolutions defined for finite verbs blocked not only the false alarms in finite verbs, but also those in verb clusters. Furthermore, an annotation filter (RE6.27) (see p. 203) was defined for copula verbs to block false markings of copula verbs combined with other constituents, such as pronouns, adjectives, and participles, as a sequence of verbs. The types and number of false alarms that remain are presented in Table 7.7.

Table 7.7: False Alarms in Verb Clusters: Phase II

FALSE ALARM TYPE                                NO.
Ambiguity:            noun                        4
                      pronoun                     4
                      preposition/adjective      24
Other error category: misspelling                 6
                      split                       9

Improving Flagging Accuracy in Verb Form in Infinitive Phrases

The false alarms in infinitive verb phrases occurred due to constructions that do not require an infinitive verb form after an infinitive marker. These were selected as separate patterns by the automaton (RE6.45) (see p. 213), and false markings of this type were blocked.
7.3.3 Performance Results: Phase II

The performance of the system in the new improved version (Phase II) is presented in Table 7.8. The grammatical coverage is the same for all error types, except for finite verbs, where the recall rate decreased slightly from 89% to 87%.

Table 7.8: Performance Results on Child Data: Phase II

                                 CORRECT ALARM          FALSE ALARM
                                 Correct    Incorrect   No      Other
ERROR TYPE               ERRORS  Diagnosis  Diagnosis   Error   Error   Recall  Precision  F-value
Agreement in NP              15      14         1         15      47     100%      19%       33%
Finite Verb Form            110      96         0         94      32      87%      43%       58%
Verb Form after Vaux          7       6         0         32      15      86%      11%       20%
Verb Form after inf. m.       4       4         0          0       0     100%     100%      100%
TOTAL                       136     120         1        145      94      89%      34%       49%

This decrease is due to the improvement in flagging accuracy. That is, in addition to the errors not detected by the system from the initial stage (see Section 7.2), two additional errors in finite verb form realized as bare supine were not detected (G5.2.88, G5.2.89), as a consequence of selecting all bare supine forms as separate segments, as shown in (7.16). This selection was necessary in order to avoid marking correct uses of bare supine forms as erroneous. When the grammar for the bare supine verb form is covered, these errors can be detected as well.

(7.16) (G5.2.89)
       a. Han tittade på hunden. Hunden försökt att klättra ner.
          he looked[pret] at the-dog the-dog tried[sup] to climb down
          'He looked at the dog. The dog tried to climb down.'
       b. <np> Han </np> <vp> <vphead> tittade på </vphead> <np> hunden </np> </vp>, <np> hunden </np> <vp> <vpheadsup> försökt </vphead> </vp> att <vp> <vpheadinf> klättra ner </vphead> </vp>

We were able to avoid many of the false flaggings by improving the lexical assignment of tags and expanding the grammar. The parsing results of the system improved correspondingly. The total precision rate improved from 21% to 34%. The remaining false alarms most often have to do with ambiguity.
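The recall, precision and F-values in the tables follow the standard definitions, with the convention (inferred from the reported totals) that an alarm with an incorrect diagnosis still counts as a detection. A minimal sketch, with a function name of my own choosing:

```python
def metrics(correct_diag, incorrect_diag, false_no_error, false_other, errors):
    """Recall, precision and F-value as used in Table 7.8.
    An alarm with an incorrect diagnosis still counts as a detection."""
    detected = correct_diag + incorrect_diag          # true positives
    alarms = detected + false_no_error + false_other  # everything flagged
    recall = detected / errors
    precision = detected / alarms
    f_value = 2 * recall * precision / (recall + precision)
    return recall, precision, f_value

# TOTAL row of Table 7.8: 120 + 1 detections out of 136 errors,
# with 145 + 94 false alarms
r, p, f = metrics(120, 1, 145, 94, 136)
print(round(r * 100), round(p * 100), round(f * 100))  # 89 34 49
```

The same function reproduces the per-row figures, e.g. noun phrase agreement (14, 1, 15, 47, 15) yields 100%, 19% and 33%.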
Only in the case of verb clusters is further expansion of the grammar needed. Figure 7.1 compares the number of false markings of correct text as erroneous in the initial Phase I and the current Phase II.
Figure 7.1: False Alarms: Phase I vs. Phase II

The types and number of alarms revealing other error categories are more or less constant and can be considered a side-effect of such a system. Methods for recognizing these error types are of interest. In the case of splits and misspellings, most were discovered due to agreement problems. Omission of sentence boundaries is in many cases caught by the verb cluster analysis. The overall performance of the system in detecting the four error types defined increased in F-value from 34% in the initial phase to 49% in the current improved version.

7.4 Overview of Performance on Child Data

I presented earlier, in Section 5.5, the linguistic performance on the Child Data corpus of the other three Swedish tools: Grammatifix, Granska and Scarrie. Here I discuss the results of these tools for the four error types targeted by FiniteCheck and explore the similarities and differences in performance between our system and the other tools. The purpose is not to claim that FiniteCheck is in general superior to the other tools; FiniteCheck was developed on the Child Data corpus, whereas the other tools were not. However, it is important to show that FiniteCheck represents some improvement over systems that were not designed to cover this particular data.
The grammatical coverage of these three tools and our detector for the four error types is presented in Figure 7.2. [4] The three other tools are designed to detect errors in adult texts and, not surprisingly, their detection rates are low. Among these four error types, agreement errors in noun phrases are the type best covered by these tools, whereas errors in verb form obtained in general much lower results. All three systems managed to detect at least half of the errors in noun phrase agreement. Errors in the finite verb form obtained the worst results. Grammatifix detected no or very few of the errors in verbs. Granska targeted all four error types and detected more than half of the errors in three of the types, but only 4% of the errors in finite verb form. Scarrie also had problems in detecting errors in verbs, although it performed best on finite verbs in comparison to the other tools, detecting 15% of them.

[4] The number of errors per error type is presented within parentheses next to the error type name.

Figure 7.2: Overview of Recall in Child Data

FiniteCheck, which was trained on this data, obtained maximal recall rates for errors in noun phrase agreement and verb form after infinitive markers. Errors in the other types of verb form obtained a somewhat lower recall (around 86%). Although this is a good result, we should keep in mind that FiniteCheck is here tested on the data that was used for development. That is, it is not clear if the system would
receive such high recall rates for all four error types even for unseen child texts. [5] However, the high performance in detecting errors, especially in the frequent finite verb form error type, is an obvious difference in comparison to the low performance of the other tools, which at least seems to motivate the tailoring of grammar checkers to children's texts.

Precision rates are presented in Figure 7.3. They are in most cases below 50% for all systems. The result is however relative to the number of errors: most informative are probably the rates for errors in finite verb form, a quite frequent error type. The errors in verb form after an infinitive marker are too few to draw any concrete conclusions about the outcome.

Figure 7.3: Overview of Precision in Child Data

Evaluating the overall performance of the systems in detecting these four error types, presented in Figure 7.4 below, the three other systems obtained a recall of 16% on average. The recall rate of FiniteCheck is considerably higher, which can mean that the tool is good at finding erroneous patterns in texts written by children, but that remains to be seen when tests on unseen texts are performed. Flagging accuracy is slightly above 30% for Grammatifix, Granska and FiniteCheck. Scarrie obtained slightly lower precision rates. Combining these rates and measuring the overall system performance in F-value, Grammatifix obtained the lowest rate, probably due to its low recall, closely followed by Scarrie. Granska had a slightly higher result of 23%. Our system obtained twice the value of Granska.

Figure 7.4: Overview of Overall Performance in Child Data

In conclusion, among these four error types the three other grammar checkers had difficulties in detecting the verb form errors in Child Data and only detected around half of the errors in noun phrase agreement. FiniteCheck had high recall rates for all four error types and a precision on the same level as the other tools. It is unclear how much the outcome is influenced by the fact that the system is based on exactly this data, but FiniteCheck does not seem to have the difficulties in finding errors in verb form (especially in finite verbs) that the other tools clearly display. Further evaluation of FiniteCheck on a small text not known to the system is reported in the following section.

[5] We have not been able to test the system on new child data. Texts written by children are hard to get and require a lot of preprocessing.
7.5 Performance on Other Text

In order to see how FiniteCheck would perform on unseen text of the kind used to test the other Swedish grammar checkers, a small literary text of 1,070 words describing a trip was evaluated. This text is used as a demonstration text by Granska. [6] The text includes 17 errors in noun phrase agreement, five errors in finite verb form and one error in verb form after an auxiliary verb. The purpose of this test is to see if the results are comparable to the other Swedish tools. Note that the aim is not to compare the performance between all the checkers, which would be unfair since the text is a demonstration text of Granska, but rather to see how our detector performs, in just the error types it targets, in comparison to tools designed for this kind of text. Below, I first present and discuss the results of FiniteCheck; then the performance of the three other checkers is presented, followed by a comparative discussion.

[6] Demonstration page of Granska: http://www.nada.kth.se/theory/projects/granska/.

7.5.1 Performance Results of FiniteCheck

Introduction

The text was first manually prepared and spaces were inserted where needed between all strings, including punctuation. Further, the lexicon had to be updated, since the text uses a particular jargon. [7] The detection results of FiniteCheck are presented in Table 7.9.

Table 7.9: Performance Results of FiniteCheck on Other Text

                                 CORRECT ALARM          FALSE ALARM
                                 Correct    Incorrect   No      Other
ERROR TYPE               ERRORS  Diagnosis  Diagnosis   Error   Error   Recall  Precision  F-value
Agreement in NP              17      13         1          2       4      82%      70%       76%
Finite verb form              5       5         0          1       0     100%      83%       91%
Verb form after Vaux          1       1         0          1       0     100%      50%       67%
TOTAL                        23      19         1          4       4      87%      71%       78%

FiniteCheck missed three errors in noun phrase agreement, which leaves it with a total recall of 87%. False alarms occurred in all three error types, mostly in noun
[7] FiniteCheck's lexicon would need to be extended anyway to make a general grammar checking application.
phrase agreement, and results in a total precision of 71%. Below I discuss the performance results in more detail.

Errors in Noun Phrase Agreement

Among the noun phrase agreement errors, three errors were not detected and one was incorrectly diagnosed. The latter concerned a proper noun preceded by an indefinite common gender determiner. The noun phrase was selected and marked for all three types of agreement errors, as shown in (7.17). The reason for this selection is that the noun phrase was recognized by the broad grammar as a noun phrase, but rejected by the narrow grammar as ungrammatical. This is in itself correct, since the proper noun should stand alone or be preceded by a neuter gender determiner, but the system should signal only an error in gender agreement. That is, the noun phrase was rejected as a whole by the system, since there are no rules for noun phrases with a determiner and a proper noun.

(7.17) a. Detta är sannerligen en Mekka för fjällälskaren...
          this is certainly a[com,indef] Mekka for the-mountain-lover
          'This is certainly a Mekka for the mountain-lover...'
       b. <np> Detta </np> <vp> <vphead> är sannerligen </vphead> <Error definiteness> <Error number> <Error gender> <np> en Mekka </np> </Error> </Error> </Error> </vp> <np> för </np> fjällälskaren...

The undetected errors all concerned constructions not covered by our grammar. The first one, in (7.18a), [8] involves a possessive noun phrase modifying another noun; FiniteCheck covers only noun phrases with single possessive nouns as modifiers. The other two concern numerals with nouns in the definite form; our current grammar does not cover much about numerals and definiteness.

[8] Correct forms are presented to the right, after the arrow, in the examples.

(7.18) a. den stora forsen brus → den stora forsens brus
          the big stream[nom] roar   the big stream[gen] roar
       b. två nackdelarna → två nackdelar
          two disadvantages[def]   two disadvantages[indef]
       c. två kåsorna kaffe → två kåsor kaffe
          two scoops[def] coffee   two scoops[indef] coffee

Altogether six false flaggings occurred in noun phrase agreement, four of them due to a split, thus involving another error category. Two were due to ambiguity in the parsing. Both types are exemplified in the sentence in (7.19), where in
the first case the noun fjällutrustningen 'mountain-equipment' [sg,com,def] is split and its first part does not agree with the preceding modifiers. The second case involves the complex preposition framför allt 'above all', where allt is joined with the following noun to build a noun phrase, and a gender mismatch occurs.

(7.19) a. ...i tältet och den övriga fjäll utrustningen vilar tryggheten och framför allt friheten.
          in the-tent and the[sg,com,def] rest mountain[sg/pl,neu,indef] equipment[sg,com,def] rests the-safety and above all[neu,indef] freedom[com,def]
          '...in the tent and the other mountain equipment lies the safety and above all freedom.'
       b. <pp> <pphead> i </pphead> <np> tältet </np> </pp> och <Error definiteness> <Error number> <Error gender> <np> den övriga fjäll </np> </Error> </Error> </Error> <np> utrustningen </np> <vp> <vphead> vilar </vphead> <np> tryggheten </np> </vp> och fram <pp> <pphead> <np> för </np> </pphead> <Error gender> <np> allt friheten </np> </Error> </pp>.

Errors in Verb Form

All the errors in verb form were detected. One false alarm occurred in each error type. In the case of finite verbs, the alarm was caused by homonymy in the noun styrka 'force', interpreted as the verb 'prove', as seen in (7.20).

(7.20) a. Vinden mojnar inte under natten utan fortsätter med oför minskad styrka.
          the-wind subside not during the-night but continues with undiminished force
          'The wind does not subside during the night, but continues with undiminished force.'
       b. <np> Vinden </np> mojnar inte <pp> <pphead> <np> under </np> </pphead> <np> natten </np> </pp> <vp> <vphead> <np> utan </np> fortsätter </vphead> </vp> med oför <np> minskad </np> <vp> <Error finite verb> <vphead> <np> styrka </np> </vphead> </Error> </vp>.
The false alarm in verb form after an auxiliary verb concerned the split noun sovsäcken 'sleeping-bag', where the first part sov is homonymous with the verb 'sleep' and was joined with the preceding verb to form a verb cluster, as shown in (7.21).
(7.21) a. Det finns dock två nackdelarna med tältning, pjäxorna måste i sov säcken för att inte krympa ihop av kylan...
          there exist however two disadvantages[def] with camping the-skiing-boots must into sleeping bag because not shrink together from the-cold
          'There are two disadvantages with camping: the skiing-boots must be inside the sleeping-bag in order not to shrink from the cold...'
       b. <np> Det </np> <vp> <vphead> finns dock </vphead> <np> två nackdelarna </np> </vp> med tältning, pjäxorna <vp> <vphead> <Error verb after Vaux> <vc> måste i sov </vc> </Error> </vphead> <np> säcken </np> </vp> <np> för </np> att <vp> <vpheadattfinite> inte krympa </vphead> </vp> ihop <pp> <pphead> av </pphead> <np> kylan </np> </pp>

7.5.2 Performance Results of Other Tools

Grammatifix

The results for Grammatifix are presented in Table 7.10 below, with 12 detected errors in noun phrase agreement, one detected error in finite verb form and one false alarm in verb form after an auxiliary verb. This leaves the system with a total recall of 57% and a precision of 93% for these three error types.

Table 7.10: Performance Results of Grammatifix on Other Text

                                 CORRECT   FALSE ALARM
ERROR TYPE               ERRORS  ALARM     No Error  Other Error   Recall  Precision  F-value
Agreement in NP              17     12        0          0           71%     100%       83%
Finite Verb Form              5      1        0          0           20%     100%       33%
Verb Form after Vaux          1      0        0          1            0%       0%        -
TOTAL                        23     13        0          1           57%      93%       70%

The five errors in noun phrase agreement that were missed concerned the same segment with a possessive noun modifying another noun (see (7.18a)) and the one with a numeral and a noun in the definite form (see (7.18b)). Other cases concerned a possessive proper noun with an erroneous definite noun (see (7.22a)), another definiteness error in a noun (see (7.22b)) and a strong form of adjective used in a definite noun phrase (see (7.22c)). Correct forms are presented to the right, next to the erroneous phrases.
(7.22) a. Lapplands drottningen → Lapplands drottning
          Lappland's queen[def]    Lappland's queen[indef]
       b. en ny dagen → en ny dag
          a[indef] new[indef] day[def]   a[indef] new[indef] day[indef]
       c. den djup snön → den djupa snön
          the[def] deep[str] snow[def]   the[def] deep[wk] snow[def]

No false alarms occurred other than the one in verb form after an auxiliary verb, concerning exactly the same segment and error suggestions as our detector, as exemplified in (7.21) above.

Granska

The results for Granska are presented in Table 7.11. This system detected 11 agreement errors in noun phrases and the one error in verb form after an auxiliary verb; one false alarm occurred in noun phrase agreement. No errors in finite verb form were identified. The total recall is 52% and the precision 92% for these three error types.

Table 7.11: Performance Results of Granska on Other Text

                                 CORRECT   FALSE ALARM
ERROR TYPE               ERRORS  ALARM     No Error  Other Error   Recall  Precision  F-value
Agreement in NP              17     11        0          1           65%      92%       76%
Finite Verb Form              5      0        0          0            0%       -         -
Verb Form after Vaux          1      1        0          0          100%     100%      100%
TOTAL                        23     12        0          1           52%      92%       67%

The six errors in noun phrase agreement that were missed concerned the same segment with a possessive noun modifying another noun (see (7.18a)) and both cases with a numeral and a noun in the definite form (see (7.18b-c)). Further errors concerned a possessive noun with an erroneous definite noun (see (7.23a)), a neuter gender possessive pronoun with a common gender noun (see (7.23b)) and an indefinite determiner with a definite noun (see (7.23c)).
(7.23) a. ripornas kurren → ripornas kurr
          the-grouses' hoot[def]   the-grouses' hoot[indef]
       b. mitt huva → min huva
          my[neu] hood[com]   my[com] hood[com]
       c. en smulan → en smula
          a[indef] bit[def]   a[indef] bit[indef]

One false alarm occurred in a noun phrase with a split adjective and a missing noun, as shown in (7.24). Here the adjective vinteröppna 'winter-open' (i.e. open for the winter) is split and its first part causes an agreement error in definiteness.

(7.24) ...i den andra vinter öppna husera en arg gubbe...
       in the[def] other winter[indef] open haunt[inf] an angry old-man
       '...the other cottage open for the winter was haunted by an angry old man...'

Scarrie

The results for Scarrie are presented in Table 7.12. This system detected 10 agreement errors in noun phrases and one error in finite verb form. It had six false markings concerning noun phrase agreement. The total recall is 48% and the precision 65%.

Table 7.12: Performance Results of Scarrie on Other Text

                                 CORRECT   FALSE ALARM
ERROR TYPE               ERRORS  ALARM     No Error  Other Error   Recall  Precision  F-value
Agreement in NP              17     10        2          4           59%      63%       61%
Finite Verb Form              5      1        0          0           20%     100%       33%
Verb Form after Vaux          1      0        0          0            0%       -         -
TOTAL                        23     11        2          4           48%      65%       55%

The seven errors in noun phrase agreement that were missed concerned the three our system did not find (see (7.18)) and two that Granska did not find (see (7.23a) and (7.23c)). The others are presented below: two concerned gender agreement between a determiner and a (proper) noun (see (7.25a) and (7.25b)), and one definiteness agreement with a weak form adjective together with an indefinite noun (see (7.25c)).
(7.25) a. en Mekka → ett Mekka
          a[com] Mekka   a[neu] Mekka
       b. en mantra → ett mantra
          a[com] mantra[neu]   a[neu] mantra[neu]
       c. orörda fjällnatur → orörd fjällnatur
          untouched[wk] mountain-nature[indef]   untouched[str] mountain-nature[indef]

All false alarms concerned noun phrase agreement; four of them involved other error categories, as for instance the ones presented in (7.19) or in (7.24).

7.5.3 Overview of Performance on Other Text

In Figure 7.5 I present the recall values for the three other grammar checkers and FiniteCheck for the three evaluated error types. All the tools detected 60% or more of the errors in noun phrase agreement, whereas verb form errors obtained varying results. The other tools detected at most one verb form error in total, of either the finite verb kind or after an auxiliary verb; FiniteCheck identified all six of the verb form errors. The errors in verb form are in fact quite few (six instances in total), but even for such a small amount there are indications that the other tools have problems identifying errors in verb form.

Flagging accuracy for these error types is presented in Figure 7.6. Concerning errors in noun phrase agreement, Grammatifix had no false flaggings and obtains a precision of 100%. Granska's precision rate is also quite high, with only one false alarm. Scarrie and FiniteCheck obtained a lower precision, around 70%, due to six false alarms by each tool. Concerning verb errors, the three other systems obtained full rates without any false flaggings when detection occurred, while FiniteCheck had one false alarm in each error type, thus obtaining lower precision rates. The flagging accuracy of FiniteCheck on this text is a bit lower in comparison to Grammatifix and Granska, but comparable to the results of Scarrie.
Figure 7.5: Overview of Recall in Other Text

Figure 7.6: Overview of Precision in Other Text
For the overall performance on the evaluated text, presented in Figure 7.7, with its 23 grammar errors the three other grammar checkers obtained on average 52% recall, while FiniteCheck scored 87%. The opposite holds for precision, where FiniteCheck had a slightly worse rate (71%) than Grammatifix and Granska, which had precision above 90%; Scarrie's precision rate was 65%. In the combined measure of recall and precision (F-value), our system obtained a rate of 78%, which is slightly better than the other tools, which had 70% or less.

Figure 7.7: Overview of Overall Performance in Other Text

In conclusion, this test only compared a few of the constructions covered by the other systems, represented by the error types targeted by FiniteCheck. The result is promising for our detector, which obtained comparable or better coverage rates on this text. Flagging accuracy was slightly worse, especially in comparison to Grammatifix and Granska. Moreover, the text was small with few errors, and future tests on larger unseen texts are of interest for a better understanding of the system's performance.
7.6 Summary and Conclusion

The performance of FiniteCheck was tested during the developmental stage and on the current version. The system is in general good at finding errors, and its flagging accuracy can be improved by relatively simple means. The initial performance was improved solely by extension of the grammar and some ambiguity resolution. The broad grammar was extended by filtering transducers that extended head phrases with complements, merged split constituents, or otherwise adjusted the parsing output as a disambiguation step. The narrow grammar was improved by either extension of existing grammar rules or additional selection of segments. These new selections provide a basis for definitions of new grammars, and thus the possibility of extending the detection to other types of errors. In the current version, noun phrases followed by relative clauses, coordinated infinitives and verbs in supine form were selected as separate segments and can be further covered by corresponding grammar rules.

Detection of the four error types implemented in FiniteCheck was tested on both Child Data and a short adult text, not only for our detector but also for the other three Swedish grammar checkers. [9] In the case of Child Data, FiniteCheck achieved maximal or high grammatical coverage, being based on this corpus, and a total precision of around 30%. The other tools detected in general few errors in Child Data in the included error types, with an average recall of 16%. Flagging accuracy is also around 30% for two of these tools and lower for one of them. The outcome of FiniteCheck is hard to compare to the performance of the other tools, since our system is based on the Child Data corpus which was also used for evaluation, but there are indications of differences at least in the detection of errors in verb form, especially in finite verbs, where the other tools obtained quite low recall, on average 9%.

[9] Recall that these tools target many more error types. Evaluation of these grammar checkers on all errors found in Child Data is presented in Chapter 5 (Section 5.5).
A similar effect occurred when the tools were tested on the adult text: here, too, the other tools had difficulties detecting errors in verb form (although these were few), whereas FiniteCheck identified all of them. Otherwise, FiniteCheck obtained recall on the adult text comparable to (or even better than) the three tools, and a slightly lower accuracy in comparison to two of them. The performance rates of all the tools are in general higher on this adult text than on Child Data, with a recall around 50% and a precision around 80%. Corresponding rates for Child Data are around 16% in recall [10] and 30% in precision.

[10] Here, the recall rates of FiniteCheck were not included, since it is developed on this data.

The validation tests on Child Data and the adult text indicate clearly that the children's texts and the errors in them really are different from the adult texts and
errors, and that they are more challenging for current grammar checkers, which have been developed for texts, and errors, produced by adult writers. The low performance of the Swedish tools on Child Data clearly demonstrates the need to adapt grammar checking techniques to other users, such as children. The performance of FiniteCheck is promising but at this point only preliminary. More tests are needed in order to see the real performance of this tool, both on other unseen children's texts and on texts written by other users, such as adult writers or even second language learners.
Chapter 8

Summary and Conclusion

8.1 Introduction

This concluding chapter begins with a short summary of the thesis (Section 8.2), followed by a section on conclusions (Section 8.3); finally, some future plans are discussed (Section 8.4).

8.2 Summary

8.2.1 Introduction

This thesis concerns the analysis of grammar errors in Swedish texts written by primary school children and the development of a finite state system for finding such errors. Grammar errors are more frequent for this group of writers, and the distribution of error types differs from texts written by adults. Other writing errors above word level are also discussed here, including punctuation and spelling errors resulting in existing words.

The method used in the implemented tool FiniteCheck involves subtraction of finite state automata that represent two positive grammars of varying degree of detail. The difference between the automata corresponds to a search for writing problems that violate the grammars. The technique shows promising results on the implemented agreement and verb selection phenomena.

The work is divided into three subtasks: analysis of errors in the gathered data, investigation of the possibilities for detecting these errors automatically, and finally, implementation of detection of some errors. The summary of the thesis presented below follows these three subtasks.
8.2.2 Children's Writing Errors

Data, Error Categories and Error Classification

The analysis of children's writing errors is based on empirical data of in total 29,812 words, gathered in a Child Data corpus consisting of three separate collections of handwritten and computer-written compositions by primary school children between 9 and 13 years of age (see Section 3.2). The analysis concentrates primarily on grammar. Other categories under investigation concern spelling errors which give rise to real-word strings, and punctuation.

Error classification of the involved error categories is discussed in Chapter 3 (Section 3.3), where I present a taxonomy (Figure 3.1, p. 31) and principles for classifying writing errors. Although this taxonomy was designed particularly for errors on the borderline between spelling and grammar, it can be used for classification of both spelling and grammar errors. It takes into consideration the kind of new formation involved (a new lemma or another form of the same lemma), the type of violation (change in letter, morpheme or word) and what level is affected (lexical, syntactic or semantic).

What Grammar Errors Occur?

In the survey of the rather few existing studies on grammar errors in Chapter 2 (Section 2.4), I show that the most typical grammar errors in these studies are errors in noun phrase and predicative complement agreement, verb form, and choice of prepositions in idiomatic expressions. Furthermore, some indications of errors influenced by spoken language are also evident in children's writing. However, grammar has in general low priority in research on writing in Swedish. In particular, there are no recent studies concerning grammar errors by children and certainly no studies at all for the youngest primary school children (see Section 2.3).

In the present analysis of Child Data in Chapter 4 (Section 4.3), a total of 262 grammar errors occur, spread over more than ten error types.
The expected typical errors occur, but they are not all particularly frequent. The most common errors occur in finite verb form, omission of obligatory constituents in sentences, choice of words, agreement in noun phrases, and extra added constituents in sentences.

In comparison to adult writers (Section 4.4), there are clear differences in error frequency and in the distribution of error types. Grammar errors occur on average as often as 9 times per 1,000 words in a child's text, considerably more frequent than for adult writers, who make on average one grammar error per 1,000 words. For some error types (e.g. noun phrase agreement) the frequency differs only marginally, whereas more significant differences arise, for instance, for errors in
verb form, which are on average eight times more common in Child Data. The frequency distribution across all error types is also different, although the representation of the most common error types is similar, except for finite verb form errors. The most common error types for the adults in the studies reviewed were missing or redundant constituents in sentences, agreement in noun phrases, and word choice errors. In contrast, the most common verb error among adult writers is in the verb form after an auxiliary verb, not in the finite verb form, as is the case for children.

What Real Word Spelling Errors Occur?

Spelling errors resulting in existing words are usually not captured by a spelling checker. For that reason they have been included in the present analysis, since they often require analysis of context larger than a word in order to be detected. The ones found in the Child Data corpus (presented and discussed in Section 4.5) are three times less frequent than the non-word spelling errors, where misspelled words are the most common error type. These errors indicate a clear confusion as to which form to use in which context, as well as the influence of spoken language. Splits were in general more common as real-word errors.

How Is Punctuation Used?

The main purpose of the analysis of punctuation (Section 4.6) was to investigate how children delimit text and use major delimiters and commas to signal clauses and sentences. The analysis of Child Data reveals that mostly the younger children join sentences into larger units without using any major delimiters to signal sentence boundaries. The oldest children formed the longest units with the fewest adjoined clauses. Erroneous use of punctuation is mostly represented by omission of delimiters, but also by markings occurring at syntactically incorrect places.
The punctuation analysis concludes at this point with a recommendation not to rely on sentence marking conventions in children's texts when describing the grammar and rules of a system aimed at analyzing such texts.

8.2.3 Diagnosis and Possibilities for Detection

Possibilities and Means for Detection

The errors found in Child Data were analyzed according to what means and how much context are needed to detect them. Most of the non-structural errors (i.e. substitutions of words, concerning feature mismatch) and some structural errors (i.e. omission, insertion and transposition of words) can be detected successfully by means of partial parsing. These errors concern agreement in noun phrases, verb
form or missing constituents in verb clusters, some pronoun case errors, repeated words that cause redundant constituents, some word order errors, and to some extent agreement errors in predicative complements. Furthermore, real-word spelling errors giving rise to syntactic violations can also be traced by partial parsing.

Other error types require more elaborate analysis in the form of parsing larger portions of a clause or even full sentence parsing (e.g. missing or extra inserted constituents), analysis above sentence level requiring analysis of the preceding discourse (e.g. definiteness in single nouns, reference), or even semantics and world knowledge (e.g. word choice errors). Among the most common errors in the Child Data corpus, errors in verb form and noun phrase agreement can be detected by partial parsing, whereas errors in the structure of sentences, such as insertions or omissions of constituents, and word choice errors require more elaborate analysis.

Coverage and Performance of Swedish Tools

The three existing Swedish grammar checkers Grammatifix, Granska and Scarrie are designed for and primarily tested on texts written by (mostly professional) adult writers. According to their error specifications, they cover many of the error types found in Child Data. The errors that none of these tools targets include definiteness errors in single nouns and reference errors.

The tools were tested on Child Data in order to gauge their real performance. The result of this test indicates low coverage overall, and in particular for the most common error types. The systems are best at identifying errors in noun phrase agreement, obtaining an average recall rate of 58%. However, the most common error in children's writing, finite verb form, is on average covered only to 9% (see Tables 5.4, 5.5 and 5.6 starting on p. 169, or Figure 7.2 on p. 234).
The overall grammatical coverage (recall) by the adult grammar checkers across all errors in Child Data averages around 12%, a figure almost five times lower than in the tests on adult texts provided by the developers of these tools, where the average recall rate is 57% (see Table 5.3 on p. 141). This test showed that although these three proofing tools target the grammar errors occurring in Child Data, they have problems detecting them. The reasons for this could in some cases be ascribed to the complexity of the error (e.g. insertion of optional constituents). More often, however, the low performance has to do with the high frequency of some error types (e.g. errors in finite verb form are much less frequent in adult texts; see Figure 4.5 on p. 87) and with the complexity of the sentence and discourse structure of the texts used in this study (e.g. violations of punctuation and capitalization conventions resulting in adjoined clauses).
8.2.4 Detection of Grammar Errors

Targeted Errors

Among the errors found in Child Data, errors in noun phrase agreement and in verb form, in finite and non-finite verbs, were chosen for implementation. There were two reasons for concentrating on these error types. First, they (almost all) occur among the five most common error types. Second, these error types are all limited to certain portions of text and can therefore be detected by means of partial parsing.

In the current implementation, agreement errors are detected in noun phrases with a noun, adjective, pronoun or numeral as the head, as well as in noun phrases with partitive attributes. The noun phrase rules are defined in accordance with the feature requirements they have to fulfill (i.e. definiteness, number and gender). The noun phrase grammar is prepared for further detection of errors in noun phrases with a relative subordinate clause as complement, which display different agreement conditions. In the present implementation these are selected as segments separate from the other noun phrases. The main purpose of this selection was to avoid marking correct noun phrase segments of this type as erroneous.

The verb grammar detects errors in finite form, both in bare main verbs and in auxiliary verbs in a verb cluster, as well as in non-finite forms in a verb cluster and in infinitive phrases following an infinitive marker. The grammar is designed to take into consideration the insertion of optional constituents such as adverbs or noun phrases, and it also handles inverted word order. The verb grammar, too, is prepared for expansion to cover detection of other errors in verbs. Coordinated verbs preceded by a verb cluster or infinitive phrase are selected as individual segments and invite further expansion of the system's grammar to detect errors manifested as finite verbs instead of the expected non-finite verb form.
Similarly, verbal heads in bare supine form constitute separate segments and lay a basis for the detection of omitted temporal auxiliary verbs in main clauses.

Detection Approach

The implemented grammar error detector FiniteCheck is built as a cascade of finite state transducers compiled from regular grammars, using the expressions and operators defined in the Xerox Finite-State Tool. The detection of errors in a given text is based on the difference between two positive grammars differing in degree of accuracy. This is the same method that Karttunen et al. (1997a) use for distinguishing valid and invalid date expressions. The two grammars always describe valid rules of Swedish. The first, more relaxed (underspecified) grammar is needed, in a text containing errors, to identify all segments that could contain errors; it marks both
the grammatical and the ungrammatical segments. The second grammar is a precise grammar of valid rules of Swedish and is used to distinguish the ungrammatical segments from the grammatical ones.

The parsing strategy of FiniteCheck is partial rather than full, annotating portions of text with syntactic tags. The procedure is incremental, first recognizing the heads (lexical prefix) and then expanding them with complements, always selecting maximal instances of segments. In order to prevent overlooking errors, the ambiguity in the system is maximal at the lexical level, assigning all the lexical tags present in the lexicon. Structural ambiguity at a higher level is treated partially by parsing order and partially by filtering techniques, blocking or rearranging the insertion of syntactic tags.

Performance Results

FiniteCheck was tested both on the (training) Child Data written by children and on an adult text not known to the system. In the case of Child Data, the system showed high coverage (recall) in the initial phase of development, whereas many correct segments were selected as erroneous. Many of these false alarms were avoided by extending the grammar of the system, blocking on average half of all the false markings. The remaining false alarms are more related to ambiguity in parsing or to the selection of other error categories (i.e. misspelled words, splits and missing sentence boundaries). Only in the case of verb clusters did the system mark constructions not yet covered by its grammar. Being based on this corpus, the system achieves maximal or high grammatical coverage, with a total recall rate of 89% for the four implemented error types. Precision is 34%. The other three Swedish tools had on average lower recall, with a total rate of 16% on Child Data for the four error types targeted by FiniteCheck. The corresponding total precision is on average 27%.
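As a rough illustration of the incremental, partial parsing strategy described in the Detection Approach above, the following sketch runs two stages over a string of part-of-speech tags: the first marks lexical heads, the second expands a marked head with a complement. This is only a toy emulation of a finite state cascade in Python regular expressions; the tag names and bracketing format are invented for illustration and are not the actual XFST rules of FiniteCheck.

```python
import re

# Toy two-stage cascade over a string of part-of-speech tags.
# Stage 1 groups determiner + noun into an NP head segment; stage 2
# extends an NP head with a prepositional complement. The tag set is
# hypothetical, not taken from FiniteCheck itself.

def stage1(tags: str) -> str:
    # Mark every determiner-noun pair as an NP head segment.
    return re.sub(r"DET N", "[NP DET N ]", tags)

def stage2(tags: str) -> str:
    # Expand an NP head followed by P + NP into one larger NP.
    return re.sub(r"\[NP ([^][]*) \] P \[NP ([^][]*) \]",
                  r"[NP \1 [PP P [NP \2 ] ] ]", tags)

def cascade(tags: str) -> str:
    # Each stage feeds the next, as in a transducer cascade.
    return stage2(stage1(tags))

print(cascade("DET N P DET N"))
# prints "[NP DET N [PP P [NP DET N ] ] ]"
```

Running the stages in a fixed order mirrors how parsing order is used in the system to resolve part of the structural ambiguity: once a segment is bracketed by an earlier stage, later stages can only extend it, not reanalyze it.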
Further, the performance of FiniteCheck on a text not known to the system shows that the system is good at finding errors, whereas its precision is lower. The three undetected errors in noun phrase agreement occurred due to the small size of the grammar. False flaggings involved both ambiguity problems and selections due to the occurrence of other error categories. The total grammatical coverage (recall) of FiniteCheck on this text was 87% and precision was 71%. The other three Swedish tools are (again) good at finding errors in noun phrase agreement, whereas the verb errors obtain quite low results. Their average total recall rate is 52% and precision 83% for the three evaluated error types.

The validation tests show that the performance of FiniteCheck on the four implemented error types is promising and comparable to current Swedish checkers. The low performance of the Swedish systems on children's texts indicates
that the nature of the errors found in texts written by primary school writers is different from that of adult texts, and that they are more challenging for current systems oriented towards texts written by adult writers.

8.3 Conclusion

The present work contributes to research on children's writing by revealing the nature of grammar errors in their texts, filling a gap in this research field, since few studies are devoted to grammar in writing. It further shows that it is important to develop aids for children, since there are differences in both frequency and error types in comparison to adult writers, and current tools have difficulties coping with such texts.

The findings here also show that it is plausible and promising to use positive rules for error detection. The advantage of applying positive grammars in the detection of errors is, first, that only the valid grammar has to be described; I do not have to speculate on what errors may occur. The prediction of errors is limited exactly to the portions of text that can be delimited. For example, errors in number in noun phrases with a partitive complement were not identified by any of the three Swedish checkers, probably because adults do not make these types of errors. The grammar of FiniteCheck describes the overall structure of such phrases in Swedish, including agreement between the quantifying numeral or determiner and the modifying noun phrase. It also states that the noun phrase has to be in the plural in order to be considered correct. The Swedish tools take into consideration only the agreement between the constituents, not the whole structure of the phrase. Secondly, the rule sets remain quite small. Thirdly, the grammars can be used for other purposes: since the grammars of the system describe the real grammar of Swedish, they can also be used for detection of valid noun phrases and verbs, and be applied, for instance, to extracting information from text or even to parsing.
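A minimal sketch of this positive-grammar idea, using Python regular expressions in place of the XFST automaton calculus: a broad pattern accepts any determiner-noun sequence, a narrow pattern additionally enforces agreement, and a segment is flagged exactly when it falls in the difference between the two. The lexicon and tag format below are invented for illustration; both grammars remain positive descriptions of Swedish, and no error pattern is ever written down.

```python
import re

# Toy lexicon mapping words to morphological tags (hypothetical tag
# format; the real FiniteCheck lexicon is far larger and richer).
LEXICON = {
    "en":  "DET.UTR.SG.IND",   # 'a' (uter gender)
    "ett": "DET.NEU.SG.IND",   # 'a' (neuter gender)
    "bil": "N.UTR.SG.IND",     # 'car' (uter)
    "hus": "N.NEU.SG.IND",     # 'house' (neuter)
}

# Broad grammar: any determiner followed by any noun is a candidate NP.
BROAD = re.compile(r"DET\.\w+\.\w+\.\w+ N\.\w+\.\w+\.\w+")

# Narrow grammar: the same pattern, but gender, number and definiteness
# must agree (enforced here with backreferences).
NARROW = re.compile(r"DET\.(\w+)\.(\w+)\.(\w+) N\.\1\.\2\.\3")

def flag_np(tokens):
    """Flag a segment the broad grammar accepts but the narrow grammar
    rejects, i.e. one lying in the difference of the two grammars."""
    s = " ".join(LEXICON[t] for t in tokens)
    return bool(BROAD.fullmatch(s)) and not NARROW.fullmatch(s)

print(flag_np(["en", "bil"]))   # grammatical 'en bil'        -> False
print(flag_np(["ett", "bil"]))  # gender agreement violation  -> True
```

Because the narrow pattern is simply a more constrained version of the broad one, extending detection to a new error type amounts to tightening the narrow grammar, exactly as described for the rule sets of FiniteCheck.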
The performance of FiniteCheck is promising in that good results were obtained not only on the training Child Data, but also when running FiniteCheck on adult texts, where the results were comparable to the other current tools. This perhaps also indicates that the approach could be used as a generic method for error detection. The ambiguity in the system is not fully resolved, but this does not disturb the error detection. However, false parses are hard to predict, and they may give rise to errors not being detected or to false alarms.
8.4 Future Plans

8.4.1 Introduction

The current version of the implemented grammar error detector is not intended to be considered a full-fledged grammar checker, or a generic tool for detecting errors in any text written by any writer. The present version of FiniteCheck is based on a lexicon of limited size, ambiguity in the system is not fully resolved, and it detects a limited set of grammar errors, yielding simple diagnoses. The next challenge will be to expand the lexicon, experiment with disambiguation versus error detection, expand the coverage of the system to other error types, explore the diagnosis stage, and test error detection on new texts written by different users. Furthermore, application of the system's grammars for other purposes is also interesting to explore.

8.4.2 Improving the System

The lexicon of the system has to be expanded with missing forms and new lemmas, as well as other valuable information, such as valence or compound information. The latter has practically been accomplished, being stored in the original text version of part of the lexicon.

There is a high level of ambiguity in the system, especially at the lexical level, since we do not use a tagger, which might eliminate information in incorrect text that is later needed to find the error. Unresolved ambiguity can sometimes lead to false parsing, which in turn can mean false alarms. The degree of lexical ambiguity, and its impact on parsing and by extension on error detection, can be studied by experiments with, for instance, weighted lexical annotation, i.e. lexical tags ordered by probability measures (e.g. weighted automata). Such taggers are, however, often based on texts written by adults and could give rise to deceptive results. Disambiguation is also not fully resolved at the structural level, where some insertions are blocked by parsing order and the output is further adjusted by filtering automata.
Extension of the grammars in the system has shown a positive impact on parsing, and further evaluation is needed in order to determine the degree of ambiguity and the prospects for predicting false parsing, both of which influence error detection. Another possibility is to explore the use of alternative parses, implemented for instance as charts. The rules of the broad grammar overgenerate to a great extent. One thing to experiment with is the degree of broadness, in order to see how it influences the detection process. Will the parsing of text be better at the cost of worse error
detection? How much could the grammar set be extended to improve the parsing without influencing the error detection? Since the grammars of the system are positive, experiments in using them for other purposes are in order; for instance, the more accurate narrow grammar could be applied to information extraction or even parsing.

8.4.3 Expanding Detection

The first step in expanding the detection of FiniteCheck would naturally involve the types that are already selected for such expansion, i.e. noun phrases with relative clauses, coordinated infinitives and bare supine verbs. Furthermore, the verb grammar can be expanded with other constructions, such as the auxiliary verb komma 'will', which requires an infinitive marker preceding the main verb, or main verbs that combine with infinitive phrases (see Section 4.3.5).

Further expansion would naturally concern the errors that require the least analysis. Beyond noun phrase and verb form errors, only some constructions can be detected by simple partial parsing; otherwise more complex analysis is required. The system can be further expanded to include detection of errors in predicative agreement, some pronoun case errors, some word order errors, and probably some definiteness errors in single nouns. With regard to children, most crucial would be coverage of errors with missing or redundant constituents in clauses, or word choice errors, which represent two of the more frequent error types. These errors will, as my analysis reveals, most probably require quite complex investigation with descriptions of complement structure. It would be worthwhile to analyze children's writing further, in order to investigate whether some such errors are, for instance, limited to certain portions of text and could then be detected by means of partial parsing.
Considering children as the users of a grammar checker for educational purposes, the most important development will concern error diagnosis and the error messages presented to the user. A tool that supports beginning writers in their acquisition has to place high demands on the diagnosis of, and information about, errors in order to be useful. The message to the user has to be clear and adjusted to the skills of the child. A child not familiar with a given writing error, or with the grammatical terminology associated with it, will certainly not profit from the detection of such an error or from information containing grammatical terms. Studies of children's interaction with authoring aids are needed in order to explore how alternatives in detection, diagnosis and error messages could best benefit this user group. For instance, such a tool could be used for training grammar, allowing customization and options for which error types to detect or train on. There could also be different levels of diagnosis and error messages depending on the individual child's level of writing acquisition. Other users could also find such a tool interesting, for instance second language learners.

The diagnosis stage could be refined by analysis of the ongoing processes in children's writing, which could be a step toward revealing the cause of an error. By logging all activities during the writing process on screen, all revisions could, for instance, be stored and then analyzed for repeated patterns, and certainly for whether there is a difference between making a spelling error and making a grammar error. Could a grammar checker gain from such on-line information? This analysis would further be of interest for errors on the borderline between grammar and spelling, and could aid detection of other categories of errors incorrectly detected as grammar errors.

8.4.4 Generic Tool?

The detection and overall performance of the system has so far been tested on the training Child Data corpus and on a small adult text not known to the system. The results for the four implemented error types are promising on both texts, which represent two different writing populations. This could also imply that the method is usable as a generic one. FiniteCheck obtained performance comparable to the other Swedish grammar checkers both on the adult text and on Child Data. Although FiniteCheck was based on the latter texts, a considerable difference in coverage occurred for some error types that the other tools had difficulty finding. The system needs to be tested further on other children's texts not known to the system, and also on texts from other writers, primarily texts of different genres written by adults. Furthermore, it would be interesting to test FiniteCheck on texts written by second language learners, dyslexics, or even the hearing impaired, in order to explore how generic this tool is.
8.4.5 Learning to Write in the Information Society

Some of the future work discussed above has already been initiated within the framework of a three-year project, Learning to Write in the Information Society, started in 2003 and sponsored by Vetenskapsrådet. The project group, consisting of Robin Cooper, Ylva Hård af Segerstad and me, aims to investigate the written language of school children in different modalities, and the effects of the use of computers and other communication media such as webchat and text messaging over mobile phones. The main aims are to see how writing is used today and how information technology can better be used for support. Texts written by primary school children will be gathered, in both handwritten and computer-written form. The study will also involve writing experiments with email, SMS (Short Message Service)
and webchat. Further studies dealing with interaction with different writing aids are included. The results of this study should reveal how writing aids influence children's writing, what needs and requirements this writing population places on such tools, and how writing aids can be improved to enhance writing development and instruction in school.
Bibliography

Abney, S. (1991). Parsing by chunks. In Berwick, R. C., Abney, S., and Tenny, C., editors, Principle-Based Parsing: Computation and Psycholinguistics, pages 257-278. Kluwer Academic Publishers, Dordrecht.

Abney, S. (1996). Partial parsing via finite-state cascades. In Workshop on Robust Parsing at the European Summer School in Logic, Language and Information, ESSLLI 96, Prague, Czech Republic.

Ahlström, K.-G. (1964). Studies in spelling I. Uppsala University, The Institute of Education. Report 20.

Ahlström, K.-G. (1966). Studies in spelling II. Uppsala University, The Institute of Education. Report 27.

Ait-Mohtar, S. and Chanod, J.-P. (1997). Incremental finite-state parsing. In ANLP 97, pages 72-79, Washington.

Allard, B. and Sundblad, B. (1991). Skrivandets genes under skoltiden med fokus på stavning och övriga konventioner. Doktorsavhandling, Stockholms Universitet, Pedagogiska Institutionen.

Andersson, R., Cooper, R., and Sofkova Hashemi, S. (1998). Finite state grammar for finding grammatical errors in Swedish text: a finite-state word analyser. Technical report, Göteborg University, Department of Linguistics. [http://www.ling.gu.se/~sylvana/fsg/report-9808.ps].

Andersson, R., Cooper, R., and Sofkova Hashemi, S. (1999). Finite state grammar for finding grammatical errors in Swedish text: a system for finding ungrammatical noun phrases in Swedish text. Technical report, Göteborg University, Department of Linguistics. [http://www.ling.gu.se/~sylvana/fsg/Report-9903.ps].
Appelt, D. E., Hobbs, J. R., Bear, J., Israel, D., and Tyson, M. (1993). FASTUS: A finite-state processor for information extraction from real-world text. In The Proceedings of IJCAI 93, Chambery, France.

Arppe, A. (2000). Developing a grammar checker for Swedish. In Nordgård, T., editor, The 12th Nordic Conference in Computational Linguistics, NODALIDA 99, pages 13-27. Department of Linguistics, Norwegian University of Science and Technology, Trondheim.

Arppe, A., Birn, J., and Westerlund, F. (1998). Lingsoft's Swedish Grammar Checker. [http://www.lingsoft.fi/doc.swegc].

Beesley, K. R. and Karttunen, L. (2003). Finite-State Morphology. CSLI Publications.

Bereiter, C. and Scardamalia, M. (1985). Cognitive coping strategies and the problem of inert knowledge. In Chipman, S. F., Segal, J. W., and Glaser, R., editors, Thinking and Learning Skills. Vol. 2, Research and Open Questions. Lawrence Erlbaum Associates, Hillsdale, New Jersey.

Biber, D. (1988). Variation across Speech and Writing. Cambridge University Press, Cambridge.

Birn, J. (1998). Swedish Constraint Grammar: A Short Presentation. [http://www.lingsoft.fi/doc/swecg/].

Birn, J. (2000). Detecting grammar errors with Lingsoft's Swedish grammar checker. In Nordgård, T., editor, The 12th Nordic Conference in Computational Linguistics, NODALIDA 99, pages 28-40. Department of Linguistics, Norwegian University of Science and Technology, Trondheim.

Björk, L. and Björk, M. (1983). Amerikanskt projekt för bättre skrivundervisning. Det viktiga är själva skrivprocessen - inte resultatet. Lärartidningen 1983:28, pages 30-33.

Björnsson, C.-H. (1957). Uppsatsbedömning och mätning av skrivförmåga. Licentiatavhandling, Stockholm.

Björnsson, C.-H. (1977). Skrivförmågan förr och nu. Pedagogiskt centrum, Stockholm.

Bloomfield, L. (1933). Language. Henry Holt & Co., New York.

Boman, M. and Karlgren, J. (1996). Abstrakta maskiner och formella språk. Studentlitteratur, Lund.
Britton, J. (1982). Spectator role and the beginnings of writing. In Nystrand, M., editor, The Structure of Written Communication: Studies in Reciprocity Between Writers and Readers. Academic Press, New York.

Bustamente, F. R. and León, F. S. (1996). GramCheck: A grammar and style checker. In The 16th International Conference on Computational Linguistics, Copenhagen, pages 175-181.

Calkins, L. M. (1986). The Art of Teaching Writing. Heinemann, Portsmouth.

Carlberger, J., Domeij, R., Kann, V., and Knutsson, O. (2002). A Swedish Grammar Checker. Submitted to Association for Computational Linguistics, October 2002.

Carlberger, J. and Kann, V. (1999). Implementing an efficient part-of-speech tagger. Software - Practice and Experience, 29(9):815-832.

Carlsson, M. (1981). Uppsala Chart Parser 2: System documentation (UCDL-R-81-1). Technical report, Uppsala University: Center for Computational Linguistics.

Chafe, W. L. (1985). Linguistic differences produced by differences between speaking and writing. In Olson, D. R., Torrance, N., and Hildyard, A., editors, Literacy, Language, and Learning: The Nature and Consequences of Reading and Writing. Cambridge University Press, Cambridge.

Chall, J. (1979). The great debate: ten years later, with a modest proposal for reading stages. In Resnick, L. B. and Weaver, P. A., editors, Theory and Practice of Early Reading, Vol. 2. Lawrence Erlbaum Associates, Hillsdale, New Jersey.

Chanod, J.-P. (1993). A broad-coverage French grammar checker: Some underlying principles. In The Sixth International Conference on Symbolic and Logical Computing, Dakota State University, Madison, South Dakota.

Chanod, J.-P. and Tapanainen, P. (1996). A robust finite-state parser for French. In Workshop on Robust Parsing at the European Summer School in Logic, Language and Information, ESSLLI 96, Prague, Czech Republic.

Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2, pages 113-124.

Chomsky, N. (1959). On certain formal properties of grammars. Information and Control, 1, pages 91-112.
Chrystal, J.-A. and Ekvall, U. (1996). Skribenter in spe. Elevers skrivförmåga och skriftspråkliga kompetens. ASLA-information 22:3, pages 28–35.
Chrystal, J.-A. and Ekvall, U. (1999). Planering och revidering i skolskrivande. In Andersson, L.-G. et al., editors, Svenskans beskrivning 23, pages 57–76. Lund.
Clemenceau, D. and Roche, E. (1993). Enhancing a large scale dictionary with a two-level system. In EACL-93.
Cooper, R. (1984). Svenska nominalfraser och kontext-fri grammatik. Nordic Journal of Linguistics, 7(2):115–144.
Cooper, R. (1986). Swedish and the head-feature convention. In Hellan, L. and Koch Christensen, K., editors, Topics in Scandinavian Syntax.
Crystal, D. (2001). Language and the Internet. Cambridge University Press, Cambridge.
Dahlquist, A. and Henrysson, H. (1963). Om rättskrivning. Klassificering av fel i diagnostiska skrivprov. Folkskolan 3.
Daiute, C. (1985). Writing and computers. Addison-Wesley, New York.
Domeij, R. (1996). Detecting and presenting errors for Swedish writers at work. IPLab 108, TRITA-NA-P9629, KTH, Department of Numerical Analysis and Computing Science, Stockholm.
Domeij, R. (2003). Datorstödd språkgranskning under skrivprocessen. Svensk språkkontroll ur användarperspektiv. Doktorsavhandling, Stockholms Universitet, Institutionen för lingvistik.
Domeij, R. and Knutsson, O. (1998). Granskaprojektet: Rapport från arbetet med granskningsregler och kommentarer. KTH, Institutionen för numerisk analys och datalogi, Stockholm.
Domeij, R. and Knutsson, O. (1999). Specifikation av grammatiska feltyper i Granska. Internal working paper. KTH, Institutionen för numerisk analys och datalogi, Stockholm.
Domeij, R., Knutsson, O., Larsson, S., Eklundh, K., and Rex, Å. (1998). Granskaprojektet 1996-1997. IPLab-146, KTH, Institutionen för numerisk analys och datalogi, Stockholm.
Domeij, R., Knutsson, O., and Larsson, S. (1996). Datorstöd för språklig granskning under skrivprocessen: en lägesrapport. IPLab 109, TRITA-NA-P9630, KTH, Institutionen för numerisk analys och datalogi, Stockholm.
EAGLES (1996). EAGLES Evaluation of Natural Language Processing Systems. Final Report. EAGLES Document EAG-EWG-PR.2. [http://www.issco.unige.ch/projects/ewg96/ewg96.html].
Ejerhed, E. (1985). En ytstruktur grammatik för svenska. In Allén, S., Andersson, L.-G., Löfström, J., Nordenstam, K., and Ralph, B., editors, Svenskans beskrivning 15. Göteborg.
Ejerhed, E. and Church, K. (1983). Finite state parsing. In Karlsson, F., editor, Papers from the 7th Scandinavian Conference of Linguistics. University of Helsinki. No. 10(2):410–431.
Ejerhed, E., Källgren, G., Wennstedt, O., and Åström, M. (1992). The Linguistic Annotation System of the Stockholm-Umeå Corpus Project. Report 33. University of Umeå, Department of Linguistics.
Emig, J. (1982). Writing, composition, and rhetoric. In Mitzel, H. E., editor, Encyclopedia of Educational Research. The Free Press, New York.
Flower, L. and Hayes, J. R. (1981). A cognitive process theory of writing. College Composition and Communication, 32:365–387.
Garme, B. (1988). Text och tanke. Liber, Malmö.
Graves, D. H. (1983). Writing: Teachers and Children at Work. Heinemann, Portsmouth.
Grefenstette, G. (1996). Light parsing as finite-state filtering. In Kornai, A., editor, ECAI 96 Workshop on Extended Finite State Models of Language, Budapest, Hungary.
Grundin, H. (1975). Läs- och skrivförmågans utveckling genom skolåren. Utbildningsforskning 20. Liber, Stockholm.
Gunnarsson, B.-L. (1992). Skrivande i yrkeslivet: en sociolingvistisk studie. Studentlitteratur, Lund.
Göransson, A.-L. (1998). Hur skriver vuxna? Språkvård 3.
Haage, H. (1954). Rättskrivningens psykologiska och pedagogiska problem. Folkskolans metodik.
Haas, C. (1989). Does the medium make a difference? Two studies of writing with pen and paper and with computers. Human-Computer Interaction, 4:149–169.
Hallencreutz, K. (2002). Särskrivningar och andra skrivningar - nu och då. Språkvårdssamfundets skrifter 33.
Halliday, M. A. K. (1985). Spoken and Written Language. Oxford University Press, Oxford.
Hammarbäck, S. (1989). Skriver, det gör jag aldrig. In Gunnarsson, B.-L., Liberg, C., and Wahlén, S., editors, Skrivande. Rapport från ASLA:s nordiska symposium, Uppsala, 10-12 november 1988, pages 103–113. Svenska föreningen för tillämpad språkvetenskap, Uppsala.
Hansen, W. J. and Haas, C. (1988). Reading and writing with computers: a framework for explaining differences in performance. Communications of the ACM, 31, Sept., pages 1080–1089.
Hawisher, G. E. (1986). Studies in word processing. Computers and Composition, 4:7–31.
Hayes, J. R. and Flower, L. (1980). Identifying the organisation of the writing process. In Gregg, L. W. and Steinberg, E. R., editors, Cognitive processes in writing. Lawrence Erlbaum Associates, Hillsdale, New Jersey.
Heidorn, G. (1993). Experience with an easily computed metric for ranking alternative parses. In Jensen, K., Heidorn, G., and Richardson, S. D., editors, Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.
Herring, S. C. (2001). Computer-mediated discourse. In Tannen, D., Schiffrin, D., and Hamilton, H., editors, Handbook of Discourse Analysis. Blackwell, Oxford.
Hersvall, M., Lindell, E., and Petterson, I.-L. (1974). Om kvalitet i gymnasisters skriftspråk. Pedagogisk-psykologiska problem 253.
Hopcroft, J. E. and Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, New York.
Hultman, T. G. (1989). Skrivande i skolan: sett i ett utvecklingsperspektiv. In Gunnarsson, B.-L., Liberg, C., and Wahlén, S., editors, Skrivande. Rapport från ASLA:s nordiska symposium, Uppsala, 10-12 november 1988, pages 69–89. Svenska föreningen för tillämpad språkvetenskap, Uppsala.
Hultman, T. G. and Westman, M. (1977). Gymnasistsvenska. Liber, Lund.
Hunt, K. W. (1970). Recent measures in syntactic development. In Lester, M., editor, Readings in Applied Transformational Grammar. New York.
Håkansson, G. (1998). Språkinlärning hos barn. Studentlitteratur, Lund.
Hård af Segerstad, Y. (2002). Use and Adaptation of Written Language to the Conditions of Computer-Mediated Communication. PhD thesis, Göteborg University, Department of Linguistics.
Ingels, P. (1996). A Robust Text Processing Technique Applied to Lexical Error Recovery. Licentiate Thesis. Linköping University, Sweden.
Jensen, K. (1993). PEG: The PLNLP English Grammar. In Jensen, K., Heidorn, G., and Richardson, S. D., editors, Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.
Jensen, K., Heidorn, G., Miller, L., and Ravin, Y. (1993a). Parse fitting and prose fixing. In Jensen, K., Heidorn, G., and Richardson, S. D., editors, Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.
Jensen, K., Heidorn, G., and Richardson, S. D., editors (1993b). Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.
Josephson, O., Melin, L., and Oliv, T. (1990). Elevtext. Analyser av skoluppsatser från åk 1 till åk 9. Studentlitteratur, Lund.
Joshi, A. K. (1961). Computation of syntactic structure. Advances in Documentation and Library Science, Vol. III, Part 2.
Joshi, A. K. and Hopely, P. (1996). A parser from antiquity: An early application of finite state transducers to natural language parsing. In Kornai, A., editor, ECAI 96 Workshop on Extended Finite State Models of Language, Budapest, Hungary.
Järvinen, T. and Tapanainen, P. (1998). Towards an implementable dependency grammar. In Kahane, S. and Polguere, A., editors, The Proceedings of COLING-ACL 98, Workshop on Processing of Dependency-Based Grammars, pages 1–10. Université de Montréal, Canada.
Karlsson, F. (1990). Constraint grammar as a system for parsing running text. In The Proceedings of the International Conference on Computational Linguistics, COLING 90, pages 168–173, Helsinki.
Karlsson, F. (1992). SWETWOL: Comprehensive morphological analyzer for Swedish. Nordic Journal of Linguistics, 15:1–45.
Karlsson, F., Voutilainen, A., Heikkilä, J., and Anttila, A. (1995). Constraint Grammar: a language-independent system for parsing unrestricted text. Mouton de Gruyter, Berlin.
Karttunen, L. (1993). Finite-State Lexicon Compiler. Technical Report ISTL-NLTT-1993-04-02, Xerox PARC. April 1993. Palo Alto, California.
Karttunen, L. (1995). The replace operator. In The Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ACL 95, pages 16–23, Boston, Massachusetts.
Karttunen, L. (1996). Directed replacement. In The Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, ACL 96, Santa Cruz, California.
Karttunen, L., Chanod, J.-P., Grefenstette, G., and Schiller, A. (1997a). Regular expressions for language engineering. Natural Language Engineering 2(4), pages 305–328. Cambridge University Press.
Karttunen, L., Gaál, T., and Kempe, A. (1997b). Xerox Finite-State Tool. Technical report, Xerox Research Centre Europe, Grenoble. June 1997. Meylan, France.
Karttunen, L., Kaplan, R. M., and Zaenen, A. (1992). Two-level morphology with composition. In The Proceedings of the International Conference on Computational Linguistics, COLING 92, Vol. I, pages 141–148, July 25-28, Nantes, France.
Kempe, A. and Karttunen, L. (1996). Parallel replacement in the finite-state calculus. In The Proceedings of the Sixteenth International Conference on Computational Linguistics, COLING 96, Copenhagen, Denmark.
Kirschner, Z. (1994). CZECKER - a maquette grammar-checker for Czech. The Prague Bulletin of Mathematical Linguistics 62. Universita Karlova, Praha.
Knutsson, O. (2001). Automatisk språkgranskning av svensk text. Licentiatavhandling, KTH, Institutionen för numerisk analys och datalogi, Stockholm.
Kokkinakis, D. and Johansson Kokkinakis, S. (1999). A cascaded finite-state parser for syntactic analysis of Swedish. In EACL 99, pages 245–248.
Kollberg, P. (1996). S-notation as a tool for analysing the episodic structure of revisions. In European writing conferences, Barcelona, October 1996.
Koskenniemi, K., Tapanainen, P., and Voutilainen, A. (1992). Compiling and using finite-state syntactic rules. In The Proceedings of the International Conference on Computational Linguistics, COLING 92, Vol. I, pages 156–162, Nantes, France.
Kress, G. (1994). Learning to write. Routledge, London.
Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439.
Laporte, E. (1997). Rational transductions for phonetic conversion and phonology. In Roche, E. and Schabes, Y., editors, Finite State Language Processing. MIT Press, Cambridge, Massachusetts.
Larsson, K. (1984). Skrivförmåga: studier i svenskt elevspråk. Liber, Malmö.
Ledin, P. (1998). Att sätta punkt. Hur elever på låg- och mellanstadiet använder meningen i sina uppsatser. Språk och stil, 8:5–47.
Leijonhielm, B. (1989). Beskrivning av språket i brottsanmälningar. In Gunnarsson, B.-L., Liberg, C., and Wahlén, S., editors, Skrivande. Rapport från ASLA:s nordiska symposium, Uppsala, 10-12 november 1988. Svenska föreningen för tillämpad språkvetenskap, Uppsala.
Liberg, C. (1990). Learning to Read and Write. RUUL 20. Reports from Uppsala University, Department of Linguistics.
Liberg, C. (1999). Elevers möte med skolans textvärldar. ASLA-information 25:2, pages 40–44.
Lindell, E. (1964). Den svenska rättskrivningsmetodiken: bidrag till dess pedagogisk-psykologiska grundval. Studia psychologica et paedagogica 12.
Lindell, E., Lundquist, B., Martinsson, A., Nordlund, A., and Petterson, I.-L. (1978). Om fri skrivning i skolan. Utbildningsforskning 32. Liber, Stockholm.
Linell, P. (1982). The written language bias in linguistics. Department of Communication Studies, University of Linköping.
Ljung, B.-O. (1959). En metod för standardisering av uppsatsbedömning. Pedagogisk forskning 1. Universitetsforlaget, Oslo.
Ljung, M. and Ohlander, S. (1993). Allmän Grammatik. Gleerups Förlag, Surte.
Loman, B. and Jörgensen, N. (1971). Manual för analys och beskrivning av makrosyntagmer. Studentlitteratur, Lund.
Lundberg, I. (1989). Språkutveckling och läsinlärning. In Sandqvist, C. and Teleman, U., editors, Språkutveckling under skoltiden. Studentlitteratur, Lund.
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk, Vol. 1: Transcription Format and Programs. Lawrence Erlbaum Associates, Hillsdale, New Jersey.
Magerman, D. M. and Marcus, M. P. (1990). Parsing a natural language using mutual information statistics. In AAAI 90, Boston, Ma.
Manzi, S., King, M., and Douglas, S. (1996). Working towards user-oriented evaluation. In The Proceedings of the International Conference on Natural Language Processing and Industrial Applications (NLP+IA 96), pages 155–160. Moncton, New Brunswick, Canada.
Matsuhasi, A. (1982). Explorations in the real-time production of written discourse. In Nystrand, M., editor, What writers know: the language, process, and structure of written discourse. Academic Press, New York.
Mattingly, I. G. (1972). Reading, the linguistic process and linguistic awareness. In Kavanagh, J. F. and Mattingly, I. G., editors, Language by Ear and by Eye, pages 133–147. MIT Press, Cambridge.
Moffett, J. (1968). Teaching the Universe of Discourse. Houghton Mifflin Company, New York.
Mohri, M., Pereira, F. C. N., and Riley, M. (1998). A rational design for a weighted finite-state transducer library. Lecture Notes in Computer Science 1436.
Mohri, M. and Sproat, R. (1996). An efficient compiler for weighted rewrite rules. In The Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, ACL 96, Santa Cruz, California.
Montague, M. (1990). Computers and writing process instruction. Computers in the Schools 7(3).
Nauclér, K. (1980). Perspectives on misspellings. A phonetic, phonological and psycholinguistic study. Liber Läromedel, Lund.
van Noord, G. and Gerdemann, D. (1999). An extendible regular expression compiler for finite-state approaches in natural language processing. In Workshop on Implementing Automata 99, Potsdam, Germany.
Nyström, C. (2000). Gymnasisters skrivande. En studie av genre, textstruktur och sammanhang. Doktorsavhandling, Institutionen för Nordiska språk, Uppsala Universitet.
Näslund, H. (1981). Satsradningar i svenskt elevspråk. FUMS 95: Forskningskommittén i Uppsala för modern svenska. Institutionen för nordiska språk, Uppsala universitet.
Olevard, H. (1997). Tonårsliv. En pilotstudie av 60 elevtexter från standardproven för skolår 9 åren 1987 och 1996. Svenska i utveckling nr 11. FUMS Rapport nr 194.
Paggio, P. and Music, B. (1998). Evaluation in the SCARRIE project. In First International Conference on Language Resources and Evaluation, Granada, Spain, pages 277–281.
Paggio, P. and Underwood, N. L. (1998). Validating the TEMAA LE evaluation methodology: a case study on Danish spelling checkers. Natural Language Engineering, 4(3):211–228. Cambridge University Press.
Pereira, F. C. N. and Riley, M. D. (1997). Speech recognition by composition of weighted finite automata. In Roche, E. and Schabes, Y., editors, Finite State Language Processing. MIT Press, Cambridge, Massachusetts.
Pettersson, A. (1980). Hur gymnasister skriver. Svensklärarserien 184.
Pettersson, A. (1989). Utvecklingslinjer och utvecklingskrafter i elevernas uppsatser. In Sandqvist, C. and Teleman, U., editors, Språkutveckling under skoltiden. Studentlitteratur, Lund.
Pontecorvo, C. (1997). Studying writing and writing acquisition today: A multidisciplinary view. In Pontecorvo, C., editor, Writing development: An interdisciplinary view. John Benjamins Publishing Company.
Povlsen, C., Sågvall Hein, A., and de Smedt, K. (1999). Final Project Report. Reports from the SCARRIE project, Deliverable 0.4. [http://fasting.hf.uib.no/desmedt/scarrie/final-report.html].
Ravin, Y. (1993). Grammar errors and style weakness in a text-critiquing system. In Jensen, K., Heidorn, G., and Richardson, S. D., editors, Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.
Richardson, S. D. (1993). The experience of developing a large-scale natural language processing system: Critique. In Jensen, K., Heidorn, G., and Richardson, S. D., editors, Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.
van Rijsbergen, C. J. (1979). Information Retrieval. London.
Robbins, A. D. (1996). AWK Language Programming. A User's Guide for GNU AWK. Free Software Foundation, Boston.
Roche, E. (1997). Parsing with finite-state transducers. In Roche, E. and Schabes, Y., editors, Finite State Language Processing. MIT Press, Cambridge, Massachusetts.
Sandström, G. (1996). Språklig redigering på en dagstidning. Språkvård 1.
de Saussure, F. (1922). Course in General Linguistics. Translation by Roy Harris. Duckworth, London.
Scardamalia, M. and Bereiter, C. (1986). Research on written composition. In Wittrock, M. C., editor, Handbook of Research on Teaching. Third edition. A project of the American Educational Research Association. Macmillan Publishing Company, New York.
Schiller, A. (1996). Multilingual finite-state noun phrase extraction. In ECAI 96 Workshop on Extended Finite State Models of Language, Budapest, Hungary.
Senellart, J. (1998). Locating noun phrases with finite state transducers. In The Proceedings of COLING-ACL 98, pages 1212–1219.
Severinson Eklundh, K. (1990). Global strategies in computer-based writing: the use of logging data. In 2nd Nordic Conference on Text Comprehension in Man and Machine, Täby.
Severinson Eklundh, K. (1993). Skrivprocessen och datorn. IPLab 61, KTH, Institutionen för numerisk analys och datalogi, Stockholm.
Severinson Eklundh, K. (1994). Electronic mail as a medium for dialogue. In van Waes, L., Woudstra, E., and van den Hoven, P., editors, Functional Communication Quality. Rodopi Publishers, Amsterdam/Atlanta.
Severinson Eklundh, K. (1995). Skrivmönster med ordbehandlare. Språkvård 4.
Severinson Eklundh, K. and Sjöholm, K. (1989). Writing with a computer. A longitudinal survey of writers of technical documents. IPLab 19, KTH, Department of Numerical Analysis and Computing Science, Stockholm.
Skolverket (1992). LEXIN: språklexikon för invandrare. Norstedts Förlag.
Sofkova Hashemi, S. (1998). Writing on a computer and writing with a pencil and paper. In Strömqvist, S. and Ahlsén, E., editors, The Process of Writing - a progress report, pages 195–208. Göteborg University, Department of Linguistics.
Starbäck, P. (1999). ScarCheck - a software for word and grammar checking. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.
Strömquist, S. (1987). Styckevis och helt. Liber, Malmö.
Strömquist, S. (1989). Skrivboken. Liber, Malmö.
Strömquist, S. (1993). Skrivprocessen. Teori och tillämpning. Studentlitteratur, Lund.
Strömqvist, S. (1996). Discourse flow and linguistic information structuring: Explorations in speech and writing. Gothenburg Papers in Theoretical Linguistics 78. Göteborg University, Department of Linguistics.
Strömqvist, S. and Hellstrand, Å. (1994). Tala och skriva i lingvistiskt och didaktiskt perspektiv - en projektbeskrivning. Didaktisk Tidskrift, Nr 1-2.
Strömqvist, S., Johansson, V., Kriz, S., Ragnarsdottir, H., Aisenman, R., and Ravid, D. (2002). Towards a crosslinguistic comparison of lexical quanta in speech and writing. Written Language and Literacy, Vol. 5, No. 1, pages 45–68.
Strömqvist, S. and Karlsson, H. (2002). ScriptLog for Windows - User's Manual. Department of Linguistics and University College of Stavanger: Centre for Reading Research.
Strömqvist, S. and Malmsten, L. (1998). ScriptLog Pro 1.04 - User's Manual. Göteborg University, Department of Linguistics.
Svenska Akademiens Ordlista (1986). 11 uppl. Norstedts förlag, Stockholm.
Sågvall Hein, A. (1981). An Overview of the Uppsala Chart Parser Version I (UCP-1). Uppsala University, Department of Linguistics.
Sågvall Hein, A. (1983). A Parser for Swedish. Status Report for SveUcp (UCDL-R-83-2). Uppsala University, Department of Linguistics. February 1983.
Sågvall Hein, A. (1998a). A chart-based framework for grammar checking: Initial studies. In The 11th Nordic Conference in Computational Linguistics, NODALIDA 98.
Sågvall Hein, A. (1998b). A specification of the required grammar checking machinery. In Sågvall Hein, A., editor, Reports from the SCARRIE project, Deliverable 6.5.2, June 1998. Uppsala University, Department of Linguistics.
Sågvall Hein, A. (1999). A grammar checking module for Swedish. In Sågvall Hein, A., editor, Reports from the SCARRIE project, Deliverable 6.6.3, June 1999. Uppsala University, Department of Linguistics.
Sågvall Hein, A., Olsson, L.-G., Dahlqvist, B., and Mats, E. (1999). Evaluation report for the Swedish prototype. In Sågvall Hein, A., editor, Reports from the SCARRIE project, Deliverable 8.1.3, June 1999. Uppsala University, Department of Linguistics.
Teleman, U. (1974). Manual för beskrivning av talad och skriven svenska. Lund.
Teleman, U. (1979). Språkrätt. Gleerups, Malmö.
Teleman, U. (1991a). Lära svenska: Om språkbruk och modersmålsundervisning. Skrifter utgivna av Svenska Språknämnden, Almqvist and Wiksell, Solna.
Teleman, U. (1991b). Vad kan man när man kan skriva? In Malmgren and Sandqvist, editors, Skrivpedagogik.
Teleman, U., Hellberg, S., and Andersson, E. (1999). Svenska Akademiens grammatik. Svenska Akademien.
Vanneste, A. (1994). Checking grammar checkers. Utrecht Studies and Communication, 4.
Vernon, A. (2000). Computerized grammar checkers 2000: Capabilities, limitations, and pedagogical possibilities. Computers and Composition 17, pages 329–349.
Vosse, T. G. (1994). The Word Connection. Grammar-based Spelling Error Correction in Dutch. PhD thesis, Neslia Paniculata, Enschede.
Voutilainen, A. (1995). NPtool, a detector of English noun phrases. In The Proceedings of the Workshop on Very Large Corpora, Ohio State University.
Voutilainen, A. and Padró, L. (1997). Developing a hybrid NP parser. In ANLP 97, Washington.
Voutilainen, A. and Tapanainen, P. (1993). Ambiguity resolution in a reductionistic parser. In EACL-93, pages 394–403, Utrecht.
Wallin, E. (1962). Bidrag till rättstavningsförmågans psykologi och pedagogik. Göteborgs Universitet, Pedagogiska Institutionen.
Wallin, E. (1967). Spelling. Factorial and experimental studies. Almqvist and Wiksell, Stockholm.
Wedbjer Rambell, O. (1999a). Error typology for automatic proof-reading purposes. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.
Wedbjer Rambell, O. (1999b). Swedish phrase constituent rules. A formalism for the expression of local error rules for Swedish. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.
Wedbjer Rambell, O. (1999c). Three types of grammatical errors in Swedish. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.
Wedbjer Rambell, O., Dahlqvist, B., Tjong Kim Sang, E., and Hein, N. (1999). An error database of Swedish. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.
Weijnitz, P. (1999). Uppsala Chart Parser Light: system documentation. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.
Wengelin, Å. (2002). Text Production in Adults with Reading and Writing Difficulties. PhD thesis, Göteborg University, Department of Linguistics.
Wikborg, E. (1990). Composing on the computer: a study of writing habits on the job. In Nordtext Symposium, Text structuring - reception and production strategies, Hanasaari, Helsinki.
Wikborg, E. and Björk, L. (1989). Sammanhang i text. Hallgren and Fallgren Studieförlag AB, Uppsala.
Wresch, W. (1984). The computer in composition instruction. National Council of Teachers of English.
Öberg, H. S. (1997). Referensbindning i elevuppsatser. En preliminär modell och en analys i två delar. Svenska i utveckling nr 7. FUMS Rapport nr 187.
Östlund-Stjärnegårdh, E. (2002). Godkänd i svenska? Bedömning och analys av gymnasieelevers texter. Doktorsavhandling, Institutionen för Nordiska språk, Uppsala Universitet.
Appendices
Appendix A
Grammatical Feature Categories

GENDER: com (common gender), neu (neuter gender), masc (masculine gender), fem (feminine gender)
DEFINITENESS: def (definite form), indef (indefinite form), wk (weak form of adjective), str (strong form of adjective)
CASE: nom (nominative case), acc (accusative case), gen (genitive case)
NUMBER: sg (singular), pl (plural)
TENSE: imp (imperative), inf (infinitive), pres (present), pret (preterite), perf (perfect), past perf (past perfect), sup (supine), past part (past participle), untensed (non-finite verb)
VOICE: pass (passive)
OTHER: adj (adjective), adv (adverb)
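The two-column glossary above is, in effect, a lookup table from feature tag to gloss. The sketch below encodes part of it in Python for illustration only; it mirrors the tag names in this appendix but is not part of the FiniteCheck implementation.

```python
# Glossary of grammatical feature tags, following Appendix A.
# Illustrative sketch: only three categories are shown here.
FEATURES = {
    "GENDER": {"com": "common gender", "neu": "neuter gender",
               "masc": "masculine gender", "fem": "feminine gender"},
    "NUMBER": {"sg": "singular", "pl": "plural"},
    "CASE": {"nom": "nominative case", "acc": "accusative case",
             "gen": "genitive case"},
}

def gloss(category: str, tag: str) -> str:
    """Return the human-readable gloss for a feature tag."""
    return FEATURES[category][tag]
```

For example, `gloss("GENDER", "neu")` yields "neuter gender".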
Appendix B
Error Corpora

This appendix presents the errors found in Child Data and consists of three corpora:
B.1 Grammar Errors
B.2 Misspelled Words
B.3 Segmentation Errors
Every listed instance of an error (ERROR) is indexed and followed by a suggestion for possible correction (CORRECTION) and information about which sub-corpus (CORP) it originates from, who the writer was (SUBJ), the writer's age (AGE) and sex (SEX; m for male and f for female). The different sub-corpora are abbreviated as DV Deserted Village, CF Climbing Fireman, FS Frog Story, SN Spencer Narrative, SE Spencer Expository.
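The record layout just described (ERROR, CORRECTION, CORP, SUBJ, AGE, SEX) can be sketched as a small data structure. The class below is purely illustrative of how one entry of these corpora could be represented; it is not the format used in the thesis itself.

```python
from dataclasses import dataclass

# One record of the error corpus; field names mirror the column
# headings described above. Illustrative sketch only.
@dataclass
class ErrorRecord:
    index: str        # e.g. "3.1.1"
    error: str        # the erroneous passage as the child wrote it
    correction: str   # suggested correction
    corpus: str       # one of DV, CF, FS, SN, SE
    subject: str      # writer identifier
    age: int
    sex: str          # "m" or "f"

# Sub-corpus abbreviations, as defined in this appendix.
CORPORA = {"DV": "Deserted Village", "CF": "Climbing Fireman",
           "FS": "Frog Story", "SN": "Spencer Narrative",
           "SE": "Spencer Expository"}

# Example record taken from section B.1 below.
rec = ErrorRecord("3.1.1", "dom gick till by", "byn", "DV", "haic", 11, "f")
```

Looking up `rec.corpus` in `CORPORA` resolves the abbreviation to its full sub-corpus name.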
B.1 Grammar Errors

Grammar errors are categorized by the type of error that occurred. Each entry is given as: ERROR → CORRECTION (CORP SUBJ AGE SEX).

1 AGREEMENT IN NOUN PHRASE
1.1 Definiteness agreement
Indefinite head with definite modifier
1.1.1 Jag tar den närmsta handduk och slänger den i vasken och blöter den, → handuken (CF alhe 9 f)
1.1.2 En gång blev den hemska pyroman utkastad ur stan. → pyromanen (CF frma 9 m)
1.1.3 Jag såg på ett TV program där en metod mot mobbing var att satta mobbarn på den stol och andra människor runt den personen och då fråga varför. → en stol/den stolen (SE wj16 13 f)
Definite head with possessive modifier
1.1.4 Pär tittar på sin klockan och det var tid för familjen att gå hem. → klocka (DV frma 9 m)
1.1.5 hunden sa på pojkens huvet. → huve/huvud (FS haic 11 f)
Definite head with modifier denna
1.1.6 Nu när jag kommer att skriva denna uppsatsen så kommer jag ha en rubrik om några problem och... → uppsats (SE wj03 13 f)
Definite head with indefinite modifier
1.1.7 Men senare ångrade dom sig för det var en räkningen på deras lägenhet. → räkning (DV jowe 9 f)
1.1.8 Man ska inte fråga en kompisen om något arbete, man ska fråga läraren. → kompis (SE wg05 10 m)
1.2 Gender agreement
Wrong article
1.2.1 pojken fick en grodbarn → ett (FS haic 11 f)
Wrong article in partitive
1.2.2 Virginias mamma hade öppnat en tyg affär i en av Dom gamla husen. → ett (DV idja 11 f)
Masculine form of adjective
1.2.3 sen berätta den minsta att det va den hemske fula troll karlen tokig som ville göra mos av dom för han skulle bo i deras by. → fule (DV alhe 9 f)
1.2.4 nasse blev arg han gick och la sig med dom andre syskonen. → andra (CF haic 11 f)
1.3 Number agreement
Singular modifier with plural head
1.3.1 Den dära scenen med det tre tjejerna tyckte jag att de var taskiga som går ifrån den tredje tjejen → de (SE wg09 10 m)
Singular noun in partitive attribute
1.3.2 Alla männen och pappa gick in i ett av huset. → husen (DV haic 11 f)
1.3.3 en av boven tog bensinen och gick bakåt. → bovarna (CF haic 11 f)
2 AGREEMENT IN PREDICATIVE COMPLEMENT
2.1 Gender agreement
2.1.1 då börja Urban lipa och sa: Mitt hus är blöt. → blött (CF caan 9 m)
2.1.2 den som hörde de där stygga orden vågade kanske inte spela på en konsert för att vara rädd att bli utskrattat av avundsjuka personer. → utskrattad (SE wg11 10 f)
2.2 Number agreement
Singular
2.2.1 En som är mobbad gråter säkert varje dag känner sig menigslösa. → meningslös (SE wj05 13 m)
Plural
2.2.2 Om dom som mobbar någon gång blir mobbad själv skulle han ändras helt och hållet. → mobbade (SE wj05 13 m)
2.2.3 Sjävl tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var tjejernas metoder är. → öppna, ärliga, elaka (SE wj13 13 m)
2.2.4 jag tror att dom som är s har själva varit ut satt någon gång och nu vill dom hämnas och... → utsatta (SE wj19 13 m)
2.2.5 ... för folk tänker mest på sig själv. → själva (SE wj20 13 m)
2.2.6 nasse är en gris som har massor av syskon. nasse är skär. Men nasses syskon är smutsig. → smutsiga (CF haic 11 f)
3 DEFINITENESS IN SINGLE NOUNS
3.1.1 dom gick till by → byn (DV haic 11 f)
3.1.2 dom som bodde på ön kanske försökte komma på skepp → skeppet (DV haic 11 f)
3.1.3 Jag såg en ö vi gick till ö → ön (DV haic 11 f)
3.1.4 dom sa till borgmästare vad ska vi göra! → borgmästaren (CF frma 9 m)
3.1.5 män han hade skrikit så börjar gren röra på sig → grenen (FS frma 9 m)
3.1.6 pojke hoppade ner till hunden → pojken (FS frma 9 m)
4 PRONOUN CASE
4.1 Case - Objective form
4.1.1 bilarna bromsade så att det blev svarta streck efter de. → dem (SN wg10 10 m)
4.1.2 Två av brandmännen sprang in i huset för att rädda de → dem (CF klma 10 f)
4.1.3 jag tycker synd om de → dem (SE wg16 10 f)
4.1.4 då kan ju den eleven som blir utsatt gå fram och prata med han → honom (SE wj14 13 m)
4.1.5 bara för man inte vill vara med han → honom (SE wj14 13 m)
5 FINITE MAIN VERB
5.1 Present tense
Regular verbs
5.1.1 Madde och jag bestämde oss för att sova i kojan och se om vi få se vind. → får (CF alhe 9 f)
5.1.2 När hon kommer ner undrar hon varför det lukta så bränt och varför det låg en handduk över spisen. → luktar (CF alhe 9 f)
5.1.3 undra vad det brann nånstans jag måste i alla fall larma → undrar (CF erja 9 m)
5.1.4 Få se nu vilken väg är det, den här. → Får (FS idja 11 f)
5.1.5 han kommer och klappar alla på handen utan en kille undra hur han känner sig då? → undrar (SE wj03 13 f)
5.1.6 ... det kan även vara att nån kan sparka eller att man få vara enstöring... → får (SE wj08 13 f)
5.1.7 ... där några tjejer/killar sitter och prata. → pratar (SE wj08 13 f)
5.1.8 men det kanske bero på att det var en mindre skola → beror (SE wj13 13 m)
5.1.9 ... och inte bry sig om han man inte få vara med, → får (SE wj14 13 m)
Strong verbs
5.1.10 Att stjäla är inte bra speciellt inte om man tar en sak av en person som gick för en i ett led och inte säga till att man hittade den utan att man behåller den. → säger (SE wj03 13 f)
5.2 Preterite
Regular verbs
5.2.1 vi berätta och... → berättade (DV alhe 9 f)
5.2.2 den äldsta som va 80 år berätta → berättade (DV alhe 9 f)
5.2.3 jag berätta om byn → berättade (DV alhe 9 f)
5.2.4 sen berätta den minsta → berättade (DV alhe 9 f)
5.2.5 då börja alla i hela tunneln förutom pappa och ja gråta → började (DV alhe 9 f)
5.2.6 sen cykla vi dit igen. → cyklade (DV alhe 9 f)
5.2.7 ...gick ner och hämta min och pappas cyklar... → hämtade (DV alhe 9 f)
5.2.8 Pappa gick och Knacka på en dörr till medan jag hämta cyklarna → hämtade (DV alhe 9 f)
5.2.9 Pappa gick och knacka på en dörr för att vi var väldigt hungriga → knackade (DV alhe 9 f)
5.2.10 Pappa gick och Knacka på en dörr till medan jag hämta cyklarna → knackade (DV alhe 9 f)
5.2.11 jag knacka på dörren → knackade (DV alhe 9 f)
5.2.12 men jag lugna mig och kände på marken → lugnade (DV alhe 9 f)
5.2.13 dom peka på väggen av tunneln → pekade (DV alhe 9 f)
5.2.14 jag ramla i en rutschbana → ramlade (DV alhe 9 f)
5.2.15 långt åkte ja tills jag stanna vid en port... → stannade (DV alhe 9 f)
5.2.16 när vi kom hem undra självklart mamma vart vi varit → undrade (DV alhe 9 f)
5.2.17 pappa och jag undra va nycklarna va → undrade (DV alhe 9 f)
5.2.18 sen undra han va dom bodde → undrade (DV alhe 9 f)
5.2.19 på morgonen när vi vakna... → vaknade (DV alhe 9 f)
5.2.20 men ingen öppna → öppnade (DV alhe 9 f)
5.2.21 någon eller något öppna dörren → öppnade (DV alhe 9 f)
5.2.22 vi till och med öppna pensionathem → öppnade (DV alhe 9 f)
5.2.23 Lena Ropa mamma Lena vakna → ropade (DV angu 9 f)
5.2.24 Plötsligt vakna Hon av att någon sa Lena Lena. → vaknade (DV angu 9 f)
5.2.25 Per luta sig mot en → lutade (DV anhe 11 m)
5.2.26 Sen Svimma jag | svimmade | DV anhe 11 m
5.2.27 när jag vakna satt Jag Per och Urban mitt i byn. | vaknade | DV anhe 11 m
5.2.28 och när vi kom hem så Vakna jag och allt var en dröm. | vaknade | DV anhe 11 m
5.2.29 Plötsligt börja en lavin | började | DV erha 10 m
5.2.30 när Gunnar öppna dörren till det stora huset rasa det ihop | rasade | DV erha 10 m
5.2.31 och snart rasa hela byn ihop. | rasade | DV erha 10 m
5.2.32 när Gunnar öppna dörren till det stora huset rasa det ihop | öppnade | DV erha 10 m
5.2.33 Niklas och Benny hoppa av kamelerna | hoppade | DV erja 9 m
5.2.34 och snabbt hoppa dom på kamelerna | hoppade | DV erja 9 m
5.2.35 och rusa iväg och red bort | rusade | DV erja 9 m
5.2.36 snabbt samla han ihop alla sina jägare | samlade | DV erja 9 m
5.2.37 men undra varför den är övergiven. | undrade | DV idja 11 f
5.2.38 Ida gick och tänkte på vad dom skulle göra hon snubbla på nåt | snubblade | DV jowe 9 f
5.2.39 Jag tog min väska och Madde tog sin, och vi börja gå mot vår koja, där vi skulle sova. | började | CF alhe 9 f
5.2.40 När vi kom fram börja vi packa upp våra grejer och rulla upp sovsäcken. | började | CF alhe 9 f
5.2.41 Madde vaknade av mitt skrik, hon fråga va det var för nåt. | frågade | CF alhe 9 f
5.2.42 På morgonen vaknade vi och klädde på oss sen packa vi ner våra grejer. | packade | CF alhe 9 f
5.2.43 jag sa att det inte va nåt så somna vi om. | somnade | CF alhe 9 f
5.2.44 För ett ögon blick trodde jag att den hästen vakta våran koja. | vaktade | CF alhe 9 f
5.2.45 på natten vakna jag av att brandlarmet tjöt | vaknade | CF angu 9 f
5.2.46 då börja Urban lipa och sa: Mitt hus är blöt. | började | CF caan 9 m
5.2.47 Brandkåren kom och spola ner huset | spolade | CF caan 9 m
5.2.48 Cristoffer stod och titta på ugglan i trädet | tittade | FS alca 11 f
5.2.49 Erik gick till skogen och ropa allt han kunde. | ropade | FS alhe 9 f
5.2.50 Rådjuret sprang iväg med honom. Och kasta av pojken vid ett berg. | kastade | FS angu 9 f
5.2.51 De klättra över en stock. | klättrade | FS angu 9 f
5.2.52 Pojken ropa groda groda var är du | ropade | FS angu 9 f
5.2.53 De gick ut och ropa men de fick inget svar. | ropade | FS angu 9 f
5.2.54 Ruff råka trilla ut ur fönstret. | råkade | FS angu 9 f
5.2.55 Pojken satt varje kväll och titta på grodan | tittade | FS angu 9 f
5.2.56 När pojken vakna nästa morgon och fann att grodan var försvunnen blev han orolig | vaknade | FS angu 9 f
5.2.57 Och utan att pojken visste om det hoppa grodan ur burken när han låg. | hoppade | FS caan 9 m
5.2.58 Nästa dag vakna pojken och såg att grodan hade rymt | vaknade | FS caan 9 m
5.2.59 hunden halka efter. | halkade | FS erge 9 f
5.2.60 När han landa så svepte massa bin över honom. | landade | FS erge 9 f
5.2.61 Pojken leta och leta i sitt rum. | letade | FS erge 9 f
5.2.62 Pojken leta och leta i sitt rum. | letade | FS erge 9 f
5.2.63 Hunden leta också | letade | FS erge 9 f
5.2.64 Pojken gick då ut och leta efter grodan | letade | FS erge 9 f
5.2.65 Pojken leta i ett träd | letade | FS erge 9 f
5.2.66 Då helt plötsligt ramla hunden ner från fönstret | ramlade | FS erge 9 f
5.2.67 där bodde bara en uggla som skrämde honom så han ramla ner på marken. | ramlade | FS erge 9 f
5.2.68 Där ställde pojken sig och ropa efter grodan | ropade | FS erge 9 f
5.2.69 Hej då ropa han hej då. | ropade | FS erge 9 f
5.2.70 Då gick pojken vidare och såg inte att binas bo trilla ner. | trillade | FS erge 9 f
5.2.71 när dom båda trilla i. | trillade | FS erge 9 f
5.2.72 Han ropa hallå var är du | ropade | FS haic 11 f
5.2.73 han gick upp på stora stenen ropa hallå hallå | ropade | FS haic 11 f
5.2.74 Då öppnade han fönstret & ropa på grodan. | ropade | FS jobe 10 m
5.2.75 I min förra skola hade man nåt som man kallade för kamratstödjare, Det funka väl ganska bra men... | funkade | SE wj13 13 m
5.2.76 man visade ingen hänsyn eller att man inte heja eller bara bråka | bråkade | SE wj18 13 m
5.2.77 man visade ingen hänsyn eller att man inte heja eller bara bråka | hejade | SE wj18 13 m
5.2.78 Var var den där överraskningen. Ni svara jag men båda tittade på varandra... | svarade | SN wg07 10 f
5.2.79 Ni svara jag | svarade | SN wg07 10 f
5.2.80 det gick inte så hon klättrade upp bredvid mig och medan jag för sökte lyfta upp mig skälv medan hon putta bort jackan från pelare. | puttade | SN wg16 10 f
5.2.81 medan hon putta jackan från pelaren | puttade | SN wg16 10 f
5.2.82 jag var på mitt land och bada | badade | SN wg18 10 m
5.2.83 så här börja det | började | SN wg18 10 m
5.2.84 där sövde dom mig och gipsa handen. | gipsade | SN wj05 13 m
5.2.85 Hon hade bara kladdskrivit den uppsats jag lämna in... | lämnade | SN wj16 13 f
5.2.86 ...så jag ångra verkligen att jag tog hennes uppsats... | ångrade | SN wj16 13 f
5.2.87 När jag gick förbi den djupa avdelningen så kom en annan kille och putta i mig | puttade | SN wj20 13 m
Supine
5.2.88 det låg massor av saker runtomkring jag försökt att kom till fören | försökte | DV haic 11 f
5.2.89 Han tittade på hunden, hunden försökt att klättra ner | försökte | FS haic 11 f
Participle
5.2.90 Fönstrena ser lite blankare ut där uppe sa Virginia och börjad klättra upp för den ruttna stegen. | började | DV idja 11 f
5.2.91 älgen sprang med olof till ett stup och kastad ner olof och hans hund | kastade | FS frma 9 m
5.2.92 dom letad överallt | letade | FS frma 9 m
5.2.93 när han letad kollade en sork upp | letade | FS frma 9 m
5.2.94 han letad bakom stocken | letade | FS frma 9 m
5.2.95 alla pratad om borgmästaren | pratade | CF frma 9 m
5.2.96 hunden råkade skakad ner ett getingbo | skaka | FS frma 9 m
5.2.97 det var en liten pojke som satt och snyftad | snyftade | DV haic 11 f
5.2.98 svarad han | svarade | DV alco 9 f
5.2.99 Jag tittade på Virginia som torkad av sin näsa som var blodig på tröjarmen. | torkade | DV idja 11 f
Strong verbs
5.2.100 Nästa dag så var en ryggsäck borta och mera grejer försvinna | försvann | DV erge 9 f
6 VERB CLUSTER
6.1 Verb form after auxiliary verb
Present
6.1.1 Och i morgon är det brandövning men kom ihåg att det inte ska blir någon riktig brand. | bli | CF klma 10 f
6.1.2 Ibland får man bjuda på sig själv och låter henne/honom vara med! | låta | SE wj17 13 f
Preterite
6.1.3 hon ville inte att jag skulle följde med men med lite tjat fick jag. | följa | DV alhe 9 f
Imperative
6.1.4 Men de var fult med buskar utan för som vi fick rid igenom. | rida | DV idja 11 f
6.1.5 han råkade bara kom i mot getingboet. | komma | FS haic 11 f
6.1.6 Det är något som vi alla nog skulle gör om vi inte hade läst på ett prov. | göra | SE wj20 13 m
6.1.7 Jag skrattade och undrade hur Tromben skulle ha kom igenom det lilla hålet. | kommit | DV idja 11 f
6.2 Missing auxiliary verb
Temporal ha
6.2.1 ni måste hjälpa mig om ni ska få henne. och dom lovat att bygga upp staden och de blev hotell | har/hade | DV erge 9 f
6.2.2 Men pappa frågat mig om jag ville följa med | har/hade | DV haic 11 f
7 INFINITIVE PHRASE
7.1 Verb form after infinitive marker
Present
7.1.1 Men hunden klarar att inte slår sig. | slå | FS haic 11 f
Imperative
7.1.2 glöm inte att stäng dörren | stänga | DV hais 11 f
7.1.3 jag försökt att kom till fören | komma | DV haic 11 f
7.1.4 Åt det går det nog inte att gör så mycket åt. | göra | SE wj20 13 m
7.2 Missing infinitive marker
7.2.1 Men det vågar man kanske inte i första taget för då kan man ju bli rädd att man kommer få ett kännetecken som skolans skvallerbytta eller något sånt! | kommer att få | E13 wj01 13 f
7.2.2 ... tänkte jag att om man ska hålla på så kommer det inte gå bra i skolan. | kommer det inte att gå | E13 wj06 13 f
7.2.3 Nu när jag kommer att skriva denna uppsatsen så kommer jag ha en rubrik om några problem och vad man kan göra för att förbättra dom. | kommer jag att ha | E13 wj03 13 f
8 WORD ORDER
8.1.1 När han kom hem så åt han middag gick och sedan borstade tänderna och gick och sedan lade sig. | och | FS jowe 9 f
8.1.2 för då kan man inte något ting bara kan gå på stan det då fattar hjärnan ingenting | kan bara | SE wg03 10 f
8.1.3 Jag den dan gjorde inget bättre. | Jag gjorde inget bättre den dan. | SN wg07 10 f
8.1.4 att jag har ett problem att jag måste hela tiden fuska på proven annars med på matten nog alla lektioner måste jag fuska och alltid bråka för att få uppmärksamhet. | på matten med | SE wg10 10 m
8.1.5 kompisarna gör det inte men om tvingar dom inte dig till att göra det | dom inte tvingar | SE wj12 13 f
9 REDUNDANCY
9.1 Doubled word
Following directly
9.1.1 Han tittade på sin hund hund oliver | hund | FS alhe 9 f
9.1.2 Kompisen ska få titta på en ibland också men,, men det får inte bli regelbundet för då... | men | SE wj17 13 f
9.1.3 många som mobbar har har det oftast dåligt hemma | har | SE wj19 13 m
9.1.4 vi skall i alla fall träffas idag 20 mars 1999 måndagen kanske imorgon också också | också | SN wg04 10 m
9.1.5 Jag hade tur jag klarade klarade mig | klarade | SN wg10 10 m
Word between
9.1.6 jag tycker jag att alla måste få vara med | jag tycker att alla måste få vara med | SE wg18 10 m
9.1.7 jag fick jag hjälp med det. | jag fick hjälp med det | SN wj11 13 f
9.1.8 Åt det går det nog inte att gör så mycket åt. | Åt det går det nog inte att gör så mycket. | SE wj20 13 m
9.1.9 Nasse sprang efter som en liten fnutknapp efter Bovarna. | Nasse sprang som en liten fnutknapp efter bovarna. | CF haic 11 f
9.2 Redundant word
9.2.1 Kalle som blev jätte rädd och sprang till närmaste hus som låg 9, kilometer bort | | CF anhe 11 m
9.2.2 för då kan man inte något ting bara kan gå på stan det då fattar hjärnan ingenting | | SE wg03 10 f
9.2.3 Hon och han borde pratat med en vuxen person (läraren). Eller pratat med föräldrarna. | | SE wg12 10 f
9.2.4 när De kom till en övergiven by va Tor och jag var rädda | | DV haic 11 f
10 MISSING CONSTITUENTS
10.1 Subject
10.1.1 undra vad det brann nånstans jag måste i alla fall larma | jag | CF erja 9 m
10.1.2 vidare hoppas att vi kommer att vara kompisar rätt länge | jag | SN wg04 10 m
10.1.3 Jag tror skulle hjälpa dem är att... | något/det | SE wg08 10 f
10.1.4 I början på filmen var det massa kollade på den andras papper på uppgiften | folk som | SE wg14 10 m
10.1.5 man försöker att lära barnen att om fuskar med t ex ett prov då... | de | SE wg19 10 m
10.1.6 han kommer och klappar alla på handen utan en kille undra hur han känner sig då? | jag | SE wj03 13 f
10.1.7 När jag var ungefär 5 år och gick på dagis så skulle åka på ett barnkalas hos en tjej med dagiset. | jag/vi | SN wj09 13 m
10.1.8 När man tror att man har kompisar blir ledsen när man bara går där ifrån om just kom dit | man | SE wj19 13 m
10.1.9 När man tror att man har kompisar blir ledsen när man bara går där ifrån om just kom dit | man | SE wj19 13 m
10.1.10 Dom satte av efter Billy och Åke som suttit i ett träd men blivit nerputtad av en uggla blev nästan nertrampad. | han | FS mawe 11 f
10.2 Object or other NPs
10.2.1 Om dom bråkar som är det inte så mycket man kan göra åt saken | de? | SE wg03 10 f
10.2.2 jag viste att han skulle bli lite ledsen då efter som vi hade bestämt. | det | SN wg06 10 f
10.2.3 Om man sätter barn som är lika bra som på samma ställe blir det bättre för... | varandra | SE wg18 10 m
10.3 Infinitive marker
10.3.1 Efter ha sprungit igenom häckarna två gånger så vilade vi lite... | att | SN wj03 13 f
10.4 (att) Verb
10.4.1 en port som va helt glittrig och 2 guldögon och silver mun. | hade | DV alhe 9 f
10.4.2 sedan skuttade han fram vidare till den öppna burken där grodan han. Nosade förundrat på grodan | var | FS hais 11 f
10.4.3 Jag tycker att det har med ens uppfostran om man nu ger eller inte ger hon/han den saken som man tappade. | att göra | SE wj07 13 f
10.4.4 ... så kom det några utlänningar och tog bollen och vi inte tillbaka den. | fick | SN wj13 13 m
10.4.5 då bar det av i 14 dagar och 14 äventyrsfyllda nätter jagade av älg kompis med huggorm trampat på igelkott mycket hände verkligen. | , blev (?) | DV hais 11 f
10.5 Adverb
10.5.1 tuni hade jätte ont i knät men hon ville sluta för det. | inte | SN wj03 13 f
10.6 Preposition
10.6.1 Gunnar var på semester Norge och åkte skidor. | i | DV erha 10 m
10.6.2 dom bär massor av sken smycken massor saker | av | DV haic 11 f
10.6.3 det ena huset efter det andra gjordes ordning | i | DV hais 11 f
10.6.4 Hunden hoppade ner ett getingbo. | i | FS anhe 11 m
10.6.5 Nej det var inte grodan som bodde hålet. | i | FS haic 11 f
10.6.6 Pojken som var på väg upp ett träd fick slänga sig på marken... | i | FS idja 11 f
10.6.7 att de som kollade på den andras papper skall träna mer sin uppgift | på | SE wg14 10 m
10.6.8 ... så tänkte jag att det är verklighet sånt händer verkligheten | i | SE wj06 13 f
10.6.9 Mobbning handlar nog mycket att man inte förstår olika människor. | om | SE wj20 13 m
10.6.10 men jag blev alla fall jätte rädd för... | i | SN wg18 10 m
10.6.11 mobbing är det värsta som finns och dom som gör det saknas det säkert någonting i huvudet. | hos | SE wj05 13 m
10.7 Conjunction and subjunction
10.7.1 han gick upp på stora stenen ropa hallå! hallå! | och | FS haic 11 f
10.7.2 Simon klädde på sig åt frukost. | och | FS hais 11 f
10.7.3 Det som flickan gjorde när det var en vuxen svarade i sin mobiltelefon som tappade en 100 lapp. | som | SE wg14 10 m
10.7.4 ...till exempel den här killen gör så igen så... | om | SE wj03 13 f
10.7.5 om det är en tjej man inte alls är bra kompis med kommer och sätter sig på bänken | som | SE wj17 13 f
10.8 Other
10.8.1 Alla blev rädda för hans skrik hans hämnd kunde vara som helst... | hur hemsk/vad | CF frma 9 m
10.8.2 dom gick ut på kullek och letade. och på marken och i luften. | de letade? | FS hais 11 f
10.8.3 De körde långt bort och till slut kom de fram till en gärdsgård och det var massor av hus. | där | DV alca 11 f
10.8.4 sen levde vi lyckliga våra dagar | i alla | DV hais 11 f
10.8.5 att jag har ett problem att jag måste hela tiden fuska på proven annars med på matten nog alla lektioner måste jag fuska och alltid bråka för att få uppmärksamhet. | (?) | SE wg10 10 m
10.8.6 den som hörde de där stygga orden vågade kanske inte spela på en konsert för att vara rädd att bli utskrattat av avundsjuka personer. | han/hon var | SE wg11 10 f
10.8.7 Om man inte kan det man ska göra och tittar på någon annan visar någon annans resultat sen. | (?) | SE wj05 13 m
10.8.8 För att förbättra det är nog att man ska prata med en lärare eller förälder så... | det bästa? | SE wj07 13 f
11 WORD CHOICE
11.1 Prepositions and particles
11.1.1 dom peka på väggen av tunneln | i | DV alhe 9 f
11.1.2 Vi sprang allt vad vi orkade ner till sjön och slängde ur oss kläderna. | av | DV idja 11 f
11.1.3 Jag kom ihåg allt som hänt innan jag trillat ifrån grenen. | från | CF jowe 9 f
11.1.4 Han ropade ut igenom fönstret men inget kvack kom tillbaka. | genom | FS caan 9 m
11.1.5 sen var det problem på klass fotot | med | SE wg18 10 m
11.1.6 Jag tycker att om man har svårigheter för att skriva eller nåt annat skall man visa det... | med | SE wj11 13 f
11.1.7 vi var väldigt lika på sättet alltså vi tyckte om samma saker | till | SN wg04 10 m
11.1.8 Jag blev glad på Malin att hon hjälpte mig att säga det till honom för... | (?) | SN wg06 10 f
11.1.9 han kommer och klappar alla på handen utan en kille | utom | SE wj03 13 f
11.1.10 När vi skulle gå av satt jag och dagdrömde och så gick alla av utan jag. | utom | SN wj09 13 m
11.2 Adverb
11.2.1 Jag undrar ibland vart mamma är men det är ingen som vet. | var | CF erge 9 f
11.2.2 Men vart ska jag bo? | var | CF erge 9 f
11.2.3 Men vart dom en letade hittade dom ingen groda. | var | FS anhe 11 m
11.3 Infinitive marker
11.3.1 det var onödigt och skrika pappa | att | DV alhe 9 f
11.3.2 sen gick jag in och la mig och sova | att | DV alhe 9 f
11.3.3 men jag vet inte hur man ska få dom och göra det. | att | SE wg18 10 m
11.3.4 ... men om man vill försöka bli kompis med några tjejer/killar och kanske försöker och gå fram... | att | SE wj08 13 f
11.3.5 ... det fick en och tänka till hur man kan hjälpa såna som är utsatta. | att | SE wj16 13 f
11.4 Pronoun
11.4.1 vad skulle dom göra dess pengar tog nästan slut | deras | DV jowe 9 f
11.4.2 Det är vanligt att om man har problem hemma att man lätt blir arg och det går då ut över sina kompisar. | ens | SE wj12 13 f
11.5 Blend
11.5.1 när dom kommer hem så märker inte föräldrarna något även fast att man luktar rök och sprit | även om/fastän | SE wj12 13 f
11.5.2 Det är nog inte ett ända barn som inte har något problem även fast att man inte har så stora | även om/fastän | SE wj12 13 f
11.5.3 jag sprang så fort så mycket jag var värd | allt vad | DV haic 11 f
11.6 Other
11.6.1 Hon satte sig på det guldigaste och mjukaste gräset i hela världen. | mest gulda | DV angu 9 f
11.6.2 men se där är ni ju det lilla följet bestående av snutna djur från djuraffären. | stulna | DV hais 11 f
11.6.3 Jag tittade på Virginia som torkad av sin näsa som var blodig på tröjarmen. | ärmen | DV idja 11 f
11.6.4 jag förstår inte vad fröken menar med grammatik näringsväv och allt de andra. | näringslära? | CF angu 9 f
11.6.5 Nasse sprang efter som en liten fnutknapp efter Bovarna. | ? | CF haic 11 f
12 REFERENCE
12.1 Erroneous referent
Number
12.1.1 Lena fick en kattunge...och Alexander fick ett spjut. sen gav den sej iväg när de gått och gått så hände något | de | DV angu 9 f
12.1.2 långt bort skymtade ett gult hus. vi närmade oss de sakta | det | DV hais 11 f
12.1.3 Att Urban hade en fru. och en massa ungar hade det. | de | FS alhe 9 f
12.1.4 Oliver försökte få av sig burken så aggressivt så han ramlade över kanten. Erik tittade efter honom med en frågande min När Oliver hade dom i baken så hopade Erik ner. | den | FS alhe 9 f
Gender
12.1.5 ...vad heter din mamma? Det stod bara helt still i huvudet vad var det han hette nu igen? | hon | CF hais 11 f
12.1.6 Om nu någon tappar någon som pengar... | något | SE wj07 13 f
12.2 Change of referent
12.2.1 spring ut nu vi har besökare när ni kom ut... | vi | DV hais 11 f
12.2.2 Om dom som mobbar någon gång blir mobbad själv skulle han ändras helt och hållet. | dom/han (?) | SE wj05 13 m
13 OTHER
13.1 Adverb
13.1.1 När jag var liten mindre... | lite | SN wj11 13 f
13.2 Strange construction
13.2.1 så Pär var läggdags | | DV frma 9 m
13.2.2 god natt på er Ses i morgon i går | god natt | DV hais 11 f
13.2.3 när vi rast skulle stänga affären så gömde jag mig. | | DV hais 11 f
B.2 Misspelled Words
Errors are categorized by part-of-speech and then by the part-of-speech they are realized in, indicated by an arrow (e.g. Noun → Noun: a noun becoming another noun).
ERROR | CORRECTION | CORP SUBJ AGE SEX
1 NOUN
1.1 Noun → Noun
1.1.1 Medan Oliver hoppade efter bot. | boet | FS alhe 9 f
1.1.2 Grävde sig Erik längre ner i bot | boet | FS alhe 9 f
1.1.3 men upp ur bot kom ett djur upp. | boet | FS alhe 9 f
1.1.4 Erik sprang i väg medan Oliver välte ner det surande bot. | boet | FS alhe 9 f
1.1.5 Bina som bodde i bot rusade i mot Oliver | boet | FS alhe 9 f
1.1.6 men hunden hade fastnat i buken | burken | FS frma 9 m
1.1.7 att dom bot i en jätte fin dy | by | DV alhe 9 f
1.1.8 det va deras dy. | by | DV alhe 9 f
1.1.9 Det KaM Till EN övergiven Bi | by | DV erja 9 m
1.1.10 dam bodde i en bi | by | DV erja 9 m
1.1.11 pappa vi har hittat än övergiven bi | by | DV erja 9 m
1.1.12 de var en by en öde dy. | by | DV frma 9 m
1.1.13 både pappa och jag kom då att tänka på den dyn vi va i | byn | DV alhe 9 f
1.1.14 på vägen hem undrade pär hur dyn hade kommit till. | byn | DV frma 9 m
1.1.15 jag sprang till boten | båten | DV haic 11 f
1.1.16 sen vaknade vi i botten | båten | DV haic 11 f
1.1.17 Den där scenen med dammen som tappade sedlarna | damen | SE wg09 10 m
1.1.18 Renen sprang tills dom kom till en dam | damm | FS alhe 9 f
1.1.19 kastad ner olof och hans hund i en dam | damm | FS frma 9 m
1.1.20 En dag när han var vid damen drog han med håven i vattnet och fick upp en groda. | dammen | FS alhe 9 f
1.1.21 Men damen är inte så djup. | dammen | FS jobe 10 m
1.1.22 Vi kom Över Molnen Jag och Per på en flygande fris som hette Urban. | gris??? | DV caan 9 m
1.1.23 pojken och huden kom i vattnet. | hunden | FS haic 11 f
1.1.24 de lät precis som Fjory hennes hast | häst | DV alco 9 f
1.1.25 August rosen gren har lämnat hjorden... | jorden | DV hais 11 f
1.1.26 därför skulle dom andra i klasen visa hur duktiga dom var. | klassen | SE wg02 10 f
1.1.27 Den brinnande makan | mackan | CF caan 9 m
1.1.28 huset brann upp för att makan hade tagit eld. | mackan | CF caan 9 m
1.1.29 En dag tänkte Urban göra varma makor. | mackor | CF caan 9 m
1.1.30 Manen var tjock och rökte cigarr. | mannen | CF alco 9 f
1.1.31 Ni har en son som ringt efter oss sa manen. | mannen | CF idja 11 f
1.1.32 Den gamla manen Berättade om en by han Bot i för länge sedan | mannen | DV angu 9 f
1.1.33 den här gamla manen har tagit hand om oss. | mannen | DV angu 9 f
1.1.34 manen kom ut med tre skålar härlig soppa. | mannen | DV angu 9 f
1.1.35 men så en dag kom en man som hette svarta manen | mannen | DV angu 9 f
1.1.36 för manen hade många djur | mannen | DV angu 9 f
1.1.37 Det var nog den här byn manen talade om | mannen | DV angu 9 f
1.1.38 det var svarta manen. | mannen | DV angu 9 f
1.1.39 Lena gick fram till svarta manen | mannen | DV angu 9 f
1.1.40 svarta manen blev rädd | mannen | DV angu 9 f
1.1.41 svarta manen sprang sin väg | mannen | DV angu 9 f
1.1.42 det log maser av saker runtomkring | massor | DV haic 11 f
1.1.43 dom bär maser av sken smycken | massor | DV haic 11 f
1.1.44 ... men plötsligt tog matten slut. | maten | DV erge 9 f
1.1.45 alla menen och Pappa gick in i ett av huset | männen | DV haic 11 f
1.1.46 pojken skrek ett tupp! | stup | FS haic 11 f
1.1.47 ja tak | tack | DV haic 11 f
1.1.48 just då ringde telefånen och pappa svarade: | telefonen | CF erge 9 f
1.1.49 Fram ur vasen kom det något | vassen | FS idja 11 f
1.1.50 Sen gick jag ut, och fram för mig stod värdens finaste häst. | världens | CF alhe 9 f
1.1.51 dom som borde på örn kanske försökte koma på skepp | ön | DV haic 11 f
1.2 Noun → Adjective
1.2.1 man kunde rida fyra i bred | bredd | DV idja 11 f
1.2.2 kale som blev jätte rädd... | Kalle | CF anhe 11 m
1.2.3 ... och där fans ett tempel fult med matt. | mat | DV erge 9 f
1.3 Noun → Pronoun
1.3.1 Men det han höll i var ett par hon som i sin tur satt fast i en hjort. | horn | FS anhe 11 m
1.4 Noun → Numeral
1.4.1 olof som klättrade i ett tre | träd | FS frma 9 m
1.5 Noun → Verb
1.5.1 pappa gick och knacka på en dör till | dörr | DV alhe 9 f
1.5.2 och knacka på en dör | dörr | DV alhe 9 f
1.5.3 Lena var en flika som var 8 år. | flicka | DV angu 9 f
1.5.4 Han letade i ett hål medans hunden skällde på masa bin. | massa | FS erge 9 f
1.5.5 När han landa så svepte masa bin över honom. | massa | FS erge 9 f
1.5.6 hunden hade hittat masa getingar | massa | FS haic 11 f
1.5.7 där va en masa människor | massa | DV alhe 9 f
1.5.8 Jag tycker att om man inte gillar en viss person ska man inte visa det på ett så taskigt sett. | sätt | SE wg17 10 f
1.6 Noun → Preposition
1.6.1 Då fick muffins syn på en massa in och började jaga dom. | bin | FS jowe 9 f
1.6.2 dam flyttade naturligtvis till den övergivna b in | byn | DV erja 9 m
1.7 Noun → More than one category
1.7.1 Jag hade en jacka på mig som det var ett litet håll i... | hål | SN wg16 10 f
1.7.2 Hur ska men kunna göra för att förbättra dessa problem? | man | SE wj03 13 f
1.7.3 ...och vad men kan göra för att förbättra dom. | man | SE wj03 13 f
1.7.4 Att utfrysa en kompis eller någon annan kan vara det värsta men någonsin kan göra tycker jag. | man | SE wj03 13 f
1.7.5 Precis då kom pappa och hans men. | män | DV haic 11 f
2 ADJECTIVE
2.1 Adjective → Adjective
2.1.1 Pappa du har glömt att tända brasan och det är kalt. | kallt | CF erge 9 f
2.1.2 det är den plikt att få ås att bli dryga | trygga | CF frma 9 m
2.2 Adjective → Noun
2.2.1 när hon var som best | bäst | CF hais 11 f
2.2.2 ... men inte en ända människa syntes till. | enda | CF idja 11 f
2.2.3 det här brevet är det ända jag kan ge dig idag | enda | DV jowe 9 f
2.2.4 Det är nog inte ett ända barn som inte har något problem även fast att man inte har så stora | enda | SE wj12 13 f
2.2.5 Det ända jag vet om grov mobbing är det jag har sett på tv! | enda | SE wj13 13 m
2.2.6 ... för det var det ända sättet att komma upp till en koja | enda | SN wg19 10 m
2.2.7 kalle som blev jätte räd | rädd | CF anhe 11 m
2.2.8 han blev så räd | rädd | FS frma 9 m
2.2.9 han var lite räd för kråkan | rädd | FS frma 9 m
2.2.10 jag blev alla fall jätte räd | rädd | SN wg18 10 m
2.2.11 alla var reda | rädda | CF frma 9 m
2.2.12 man behöver inte vara tycken bara för man inte vill vara med han. | tyken | SE wj14 13 m
2.3 Adjective → Verb
2.3.1 Och kanske var det ett barn till hans föra groda. | förra | FS idja 11 f
2.3.2 ... och spökena blev skända... | kända | DV erge 9 f
2.3.3 jag tror man ska ta ett lett prov först men... | lätt | SE wg03 10 f
2.3.4 pojken blev red | rädd | FS erja 9 m
3 PRONOUN
3.1 Pronoun → Pronoun
3.1.1 fortsatte det att ringa i alle fall | alla | DV idja 11 f
3.1.2 och en massa ungar hade det. | de | FS alhe 9 f
3.1.3 Han sa till hunden att vara tyst för att det skull titta efter. | de | FS caan 9 m
3.1.4 Det kom till en övergiven by | de | DV alco 9 f
3.1.5 Det KaM Till EN övergiven Bi | de | DV erja 9 m
3.1.6 när det kam hem sade pappa... | de | DV erja 9 m
3.1.7 när det hade kommit en liten bit sa pappa... | de | DV erja 9 m
3.1.8 då hörde det att det bubblade... | de | DV erja 9 m
3.1.9 Det kom till en, plats som de aldrig hade varit, på. | de | DV frma 9 m
3.1.10 Det kom till en övergiven by | de | DV jobe 10 m
3.1.11 jag förstår inte vad fröken menar med grammatik näringsväv och allt de andra. | det | CF angu 9 f
3.1.12 Och sen den dagen de brann i Kamillas lägenhet leker vi alltid brandmän. | det | CF angu 9 f
3.1.13 de börjar att skymma | det | CF frma 9 m
3.1.14 De var han och han hade hittat en partner. | det | FS caan 9 m
3.1.15 ... men de kom ingen groda den här gången heller | det | FS frma 9 m
3.1.16 De va en pojke som hette olof | det | FS frma 9 m
3.1.17 de va en älg | det | FS frma 9 m
3.1.18 mormor berättade att de fanns en by bortom solens rike | det | DV alco 9 f
3.1.19 där de fanns små röda hus med vita knutar | det | DV alco 9 f
3.1.20 ja men nu är de läggdags sa mormor. | det | DV alco 9 f
3.1.21 Anna funderade halva natten över de där med morfar | det | DV alco 9 f
3.1.22 de lät precis som Fjory hennes häst | det | DV alco 9 f
3.1.23 de såg faktiskt ut som en övergiven by | det | DV alco 9 f
3.1.24 de var bara ett fönster som lyste | det | DV alco 9 f
3.1.25 De var en kväll som Lisa jag alltså ville höra en saga... | det | DV erge 9 f
3.1.26 och dom lovat att bygga upp staden och de blev hotell | det | DV erge 9 f
3.1.27 de var en by en öde by. | det | DV frma 9 m
3.1.28 de var tid för familjen att gå hem. | det | DV frma 9 m
3.1.29 Det var dåligt väder de blåste och regnade. | det | DV hais 11 f
3.1.30 de blåste mer och mer | det | DV idja 11 f
3.1.31 Men de var fullt med buskar utanför | det | DV idja 11 f
3.1.32 Dom gick in genom dörren och blev förvånade av de dom såg. | det | DV mawe 11 f
3.1.33 de kunde berott på att dom gillade samma tjej. | det | SE wg07 10 f
3.1.34 När jag får se en son här film tänker jag på att de nog är så i de flesta skolorna | det | SE wg20 10 m
3.1.35 ... för de är nog något typiskt med de | det | SE wg20 10 m
3.1.36 ... för de är nog något typiskt med de | det | SE wg20 10 m
3.1.37 de får man nog för man får så mycket att göra när man blir större | det | SE wg20 10 m
3.1.38 Den är ju inte heller säkert att den kompisen man kollar på har rätt | det | SE wj17 13 f
3.1.39 De var bara ungdomar inga vuxna. | det | SE wj18 13 m
3.1.40 De hela började med att jag och min morfar skulle cykla ner till sjön för... | det | SN wg10 10 m
3.1.41 de verkade lugnt. | det | SN wg11 10 f
3.1.42 de va en vanlig måndag | det | SN wg20 10 m
3.1.43 ... efter som de fanns en hel del snälla kompisar i min klass så hjälpte dom mig... | det | SN wg20 10 m
3.1.44 När jag kom på fötter igen så hade de kommit cirka tolv stycken i min klass och hjälpte mig | det | SN wj10 13 m
3.1.45 det är den plikt att få ås att bli dryga | din | CF frma 9 m
3.1.46 Dem kom med en stegbil och hämtade oss. | dom | CF jobe 10 m
3.1.47 Nästa dag gick dem upp till en grotta | dom | DV angu 9 f
3.1.48 där fick dem var sin korg med saker i | dom | DV angu 9 f
3.1.49 Dem hade ett privatplan | dom | DV jobe 10 m
3.1.50 nu slår dem upp tältet för att vila... | dom | DV jobe 10 m
3.1.51 nästa morgon går dem långt långt | dom | DV jobe 10 m
3.1.52 men till slut kom dem till en övergiven by. | dom | DV jobe 10 m
3.1.53 där stannade dem och bodde där resten av livet | dom | DV jobe 10 m
3.1.54 dem kanske bodde i ett hus som dem fick hyra | dom | SE wg01 10 f
3.1.55 dem kanske bodde i ett hus som dem fick hyra | dom | SE wg01 10 f
3.1.56 ... dem måste få höga betyg annars får de skäll av sina föräldrar. | dom | SE wg01 10 f
3.1.57 Dem andra människorna som kollade på sina kompisars provpapper, | dom | SE wg01 10 f
3.1.58 ... när dem började bråka, | dom | SN wg01 10 f
3.1.59 dem kunde väl hjälpa varandra. | dom | SN wg01 10 f
3.1.60 Men dem fortsatte. | dom | SN wg01 10 f
3.1.61 Men jag fortsatte kämpa för dem två skulle kunna se på varan utan att vända bort huvudet, | dom | SN wg07 10 f
3.2 Pronoun → Noun
3.2.1 ... för att du är ju alt jag har. | allt | CF erge 9 f
3.2.2 ... och alt var en dröm. | allt | DV caan 9 m
3.2.3 Någon anan la mig på en bår... | annan | CF erge 9 f
3.2.4 och gick till en anan tunnel | annan | DV alhe 9 f
3.2.5 Det finns nog en anan väg... | annan | DV idja 11 f
3.2.6 så jag fik åka med en anan som skulle också hänga med | annan | SN wg20 10 m
3.2.7 var är set... | det | DV hais 11 f
3.2.8 var är set här | det | DV hais 11 f
3.2.9 snabbt springer dam ut ur brand bilarna | dom | CF erja 9 m
3.2.10 snabbt tar dam fram stegen | dom | CF erja 9 m
3.2.11 dam ramlar rakt ner i en damm | dom | FS erja 9 m
3.2.12 då är dam ännu närmare ljudet | dom | FS erja 9 m
3.2.13 dam bodde i en by | dom | DV erja 9 m
3.2.14 dam tåg och så med sig sina två tigrar | dom | DV erja 9 m
3.2.15 när dam hade kommit än bit in i skogen | dom | DV erja 9 m
3.2.16 å dam två tigrarna följde också med | dom | DV erja 9 m
3.2.17 dam red bod | dom | DV erja 9 m
3.2.18 när dam kam hem | dom | DV erja 9 m
3.2.19 dam flyttade naturligtvis till den övergivna in | dom | DV erja 9 m
3.2.20 där levde dam lyckliga | dom | DV erja 9 m
3.2.21 tillslut blev dam två kamelerna så trötta... | dom | DV erja 9 m
3.2.22 när dam kam hem var kl. 12 | dom | DV frma 9 m
3.2.23 hon fråga va det var för not | nåt | CF alhe 9 f
3.2.24 och efter som det inte fans not lock på burken | nåt | FS alhe 9 f
3.2.25 han har fot syn på not | nåt | FS frma 9 m
3.2.26 om det skulle hända not | nåt | DV alhe 9 f
3.2.27 om man såg en älg eller räv och not anat stort djur | nåt | DV alhe 9 f
3.2.28 en poäng alltid not | nåt | DV alhe 9 f
3.2.29 ni får gärna bo hos oss under tid en ni inte har not att bo i. | nåt | DV idja 11 f
3.2.30 det är den plikt att få ås att bli dryga | oss | CF frma 9 m
3.2.31 och la os på varsin sida av den spikiga toppen | oss | DV alhe 9 f
3.2.32 och utrusta os | oss | DV alhe 9 f
3.2.33 sa Desere med en son skarp röst hon alltid använde. | sån | DV hais 11 f
3.2.34 gick vi upp till utgången av tältet men upptäckte varan och vi blev så rädda | varann | DV alhe 9 f
3.2.35 Visa i filmen gillade inte varan | varann | SE wg06 10 f
3.2.36 det första problemet är att dom kollar på varan | varann | SE wg18 10 m
3.2.37 för då tittar man inte på varan. | varann | SE wg18 10 m
3.2.38 Men jag fortsatte kämpa för dem två skulle kunna se på varan utan att vända bort huvudet, | varann | SN wg07 10 f
3.3 Pronoun → Verb
3.3.1 om man såg en älg eller räv och not anat stort djur | annat | DV alhe 9 f
3.3.2 Vi såg ormar spindlar krokodiler ödlor och anat. | annat | DV caan 9 m
3.3.3 hanns groda var försvunnen. | hans | FS alhe 9 f
3.3.4 hanns mamma hade slängt ut den. | hans | FS alhe 9 f
3.3.5 som nu satt på hanns huvud. | hans | FS alhe 9 f
3.3.6 för att hanns kruka hade gått sönder | hans | FS alhe 9 f
3.3.7 kastad ner olof och hanns hund i en dam | hans | FS frma 9 m
3.3.8 jag fick låna hanns mobiltelefon. | hans | SN wg14 10 m
3.3.9 han frågade honom nått | nåt | DV haic 11 f
3.3.10 ... den killen eller tjejen måste ha nått problem eller... | nåt | SE wj08 13 f
3.3.11 om det kommer nån ny till klassen eller nått | nåt | SE wj08 13 f
3.3.12 ...så hon hamnade inne i skogen på nått konstigt sätt... | nåt | SN wj08 13 f
3.3.13 När det var två flickor som satt på en bänk så kom det en annan flicka som satte säg bredvid | sig | SE wg14 10 m
3.3.14 Det var också väldigt roligt för att man kände säg inte ensam om det. | sig | SN wj11 13 f
3.3.15 man får nog mer sona problem när man kommer högre upp i skolan | såna | SE wg20 10 m
3.4 Pronoun → Preposition
3.4.1 vi bar allt till mamma hos sa... | hon | DV haic 11 f
3.4.2 sen när in kompis skulle hoppa så... | min | SN wj08 13 f
3.5 Pronoun → Interjection
3.5.1 va fiffigt tänkte ja | jag | DV alhe 9 f
3.5.2 då börja alla i hela tunneln förutom pappa och ja gråta | jag | DV alhe 9 f
3.5.3 vilken fin klänning ja har | jag | DV angu 9 f
3.5.4 Madde vaknade av mitt skrik, hon fråga va det var för nåt. | vad | CF alhe 9 f
3.6 Pronoun → More than one category
3.6.1 Det var än gång än man som hette Gustav | en | CF erja 9 m
3.6.2 Det var än gång än man som hette Gustav | en | CF erja 9 m
3.6.3 än dag när Gustav var på jobbet ringde det | en | CF erja 9 m
3.6.4 han trycker på än knapp | en | CF erja 9 m
3.6.5 Gustav sitter i än av brand bilarna | en | CF erja 9 m
3.6.6 där e än | en | CF erja 9 m
3.6.7 där uppe på än balkong står det ett barn | en | CF erja 9 m
3.6.8 han hade än groda | en | FS erja 9 m
3.6.9 män än natt klev grodan upp ur glas burken | en | FS erja 9 m
3.6.10 det var än gång två pojkar | en | DV erja 9 m
3.6.11 dam bodde i än bi. | en | DV erja 9 m
3.6.12 pappa vi har hittat än övergiven bi. en DV erja 9 m 3.6.13 än dag sa Niklas ska vi rida ut en DV erja 9 m 3.6.14 när dam hade kommit än bit in i skogen en DV erja 9 m 3.6.15 än liten bit in i skogen såg dom än övergiven en DV erja 9 m by 3.6.16 än liten bit in i skogen såg dom än övergiven en DV erja 9 m by 3.6.17 Man ska vara en bra kompis, när någon vill vara en SE wg05 10 m än själv. 3.6.18 jag satt ner men packning min DV haic 11 f 3.6.19 Men var nu då? dörren går inte upp. vad CF idja 11 f 3.6.20 När simon kom ut och såg var som hade hänt... vad FS hais 11 f 3.6.21 Hans hund Taxi var nyfiken på var det var för vad FS idja 11 f något i burken. 3.6.22 Men var är det för ljud? vad FS idja 11 f 3.6.23 var fan gör du vad SE wg07 10 f 3.6.24 Sjävl tycker jag att killarnas metoder är mer vad SE wj13 13 m öppen och ärlig men också mer elak än var tjejernas metoder är. 3.6.25 Hjälp det brinner vad nånstans var CF erja 9 m 3.6.26 undra vad det brann nånstans jag måste i alla var CF erja 9 m fall larma 3.6.27 Jag visste inte att brandbilen vad på väg förbi var CF jowe 9 f min egen by. 3.6.28 Lena sa vad är vi hon såg sig omkring var DV angu 9 f 3.6.29 Visa i filmen gillade inte varan Vissa SE wg06 10 f 3.6.30 dom bråkade och lämnade visa utanför. vissa SE wg06 10 f 4 VERB 4.1 Verb Verb 4.1.1 Upp ur hålet kom en grävling och bett pojken i bet FS angu 9 f näsan 4.1.2 dom som borde på örn kanske försökte koma bodde DV haic 11 f på skepp 4.1.3 När Oliver hade dom i baken så hopade Erik hoppade FS alhe 9 f ner. 4.1.4 Och pojken hopade efter hunden. hoppade FS anhe 11 m 4.1.5 Vi hopade upp på hästarna... hoppade DV idja 11 f 4.1.6...för att hälla henne sällskap. hålla SN wj12 13 f 4.1.7 först försökte hon att lufta mig...
lyfta SN wg16 10 f 4.1.8 det log maser av saker runtomkring låg DV haic 11 f 4.1.9 han behöver inte lossas om som ingenting har låtsas SE wj14 13 m hänt, 4.1.10 brand männen rykte ut och släkte elden ryckte CF anhe 11 m 4.1.11 hunden sa på pojkens huvet. satt FS haic 11 f 4.1.12 då surade bina rakt över pojken surrade FS erja 9 m 4.1.13 sett dig hon gjorde som mannen sa sätt DV alco 9 f 4.2 Verb Noun 4.2.1 Och problemet kanske bror på att kompisarna inte tyckte om den personen beror SE wg12 10 f 4.2.2 Den gamla manen Berättade om en by han Bot i för länge sedan bott DV angu 9 f
4.2.3 Men konstigt nog ville jag se den hästen fastän fanns CF alhe 9 f den inte fans. 4.2.4 Det fans en doktor som pratade vänligt med fanns CF erge 9 f mig, 4.2.5 och efter som det inte fans not lock på burken fanns FS alhe 9 f 4.2.6 Men i hålet fans bara... fanns FS erge 9 f 4.2.7 mormor berättade att de fans en by bortom fanns DV alco 9 f solens rike 4.2.8 därde fans små röda hus med vita knutar där fanns DV alco 9 f Annas morfar hade bott 4.2.9... och där fans ett tempel fult med matt. fanns DV erge 9 f 4.2.10 men efter som de fans en hel del snälla kompisar fanns SN wg20 10 m i min klass 4.2.11 när jag kom ut ur huset sa Kamilla att jag fik fick CF angu 9 f hunden... 4.2.12 Så fik pojken ett grodbarn fick FS caan 9 m 4.2.13 Och vad fik dom se? fick FS erge 9 f 4.2.14 men med lite tjat fik jag fick DV alhe 9 f 4.2.15 och för varje djur fik man 1 eller 3 poäng fick DV alhe 9 f 4.2.16 fik man tio poäng fick DV alhe 9 f 4.2.17 först fik jag panik fick DV alhe 9 f 4.2.18 hon hoppade till när hon fik syn på oss fick DV hais 11 f 4.2.19 Men de var fult med buskar utan för som vi fik fick DV idja 11 f rid igenom. 4.2.20 så jag fik åka med en anan som skulle också fick SN wg20 10 m hänga med 4.2.21 han har fot syn på not fått FS frma 9 m 4.2.22... som dom hade fot tillsammans. fått FS haic 11 f 4.2.23 På morgonen vaknade vi och kläde på oss klädde CF alhe 9 f 4.2.24 Madde sprang upp till sitt rum och kläde på sig klädde CF alhe 9 f 4.2.25 Han kläde på sig klädde FS haic 11 f 4.2.26 Det Kam Till EN övergiven Bi kom DV erja 9 m 4.2.27 när det kam hem sade pappa... kom DV erja 9 m 4.2.28 när Niklas och Bennys halva kam fram till en kom DV erja 9 m damm 4.2.29 upp ur dammen kam två krokodiler kom DV erja 9 m 4.2.30 när dam kam hem kom DV erja 9 m 4.2.31 när dam kam hem var kl.
12 kom DV frma 9 m 4.2.32 då ko min bror kom SN wg18 10 m 4.2.33 När jag kom ut såg jag en liten eld låga koma komma CF alhe 9 f ut genom fönstret, 4.2.34 det tog en timme att koma ditt komma CF anhe 11 m 4.2.35 Pojken som var på väg upp ett träd fick slänga komma FS idja 11 f sig på marken för att inte koma i vägen för bin. 4.2.36 dom som borde på örn kanske försökte koma komma DV haic 11 f på skepp 4.2.37 hans hämnd kund vara som helst kunde CF frma 9 m 4.2.38 på vägen till pappa möte jag en katt mötte DV alhe 9 f 4.2.39 Jag gick in och sate mig vid bordet och åt. satte CF alhe 9 f 4.2.40 Han sate sig upp och lyssnade satte FS alhe 9 f 4.2.41 Hon sate sej på det guldigaste och mjukaste gräset i hela världen. satte DV angu 9 f
4.2.42 Redan nästa dag sate vi igång med reparationen satte DV idja 11 f av byn. 4.2.43 Då såg jag nåt som jag aldrig har set sett DV caan 9 m 4.2.44 Jag tycker att hon skal prata med dom. skall SE wg02 10 f 4.2.45 brandmännen släkte elden släckte CF frma 9 m 4.2.46 där nere i det höga gräset låg dalmatinen tess, sov DV hais 11 f grisen kalle-knorr... och sav 4.2.47 Ring till Börje sej att vi låst oss ute. säg CF idja 11 f 4.2.48 dam tåg och så med sig sina två tigrar tog DV erja 9 m 4.2.49... att vi åkt ner från berget och åkt så långt att var DV alhe 9 f vi inte viste va vi va. 4.2.50 typ när man pratar om grejer som inte man villa vill SE wj17 13 f att alla ska höra! 4.2.51... att Mia inte viste om att mamma var en visste CF hais 11 f strandskata. 4.2.52 Och utan att pojken viste om det hoppa grodan visste FS caan 9 m ur burken när han låg. 4.2.53 jag viste att han skulle bli lite ledsen då efter visste SN wg06 10 f som vi hade bestämt. 4.2.54 då viste jag inte vad jag skulle göra visste SN wg20 10 m 4.2.55 hon kan ju inte skylla på att hon inte märker nåt för det ärr alltid tydligt. är SE wj13 13 m 4.3 Verb Pronoun 4.3.1 mer han jag inte tänka... hann DV idja 11 f 4.4 Verb Adjective 4.4.1 å älgen bara gode glodde? FS frma 9 m 4.4.2 Niklas och Benny kunde inte hala emot hålla DV erja 9 m 4.4.3 han höll sig i och road ropade? FS frma 9 m 4.4.4 Jag såg på ett TV program där en metod mot sätta SE wj16 13 f mobbing var att satta mobbarn på den stol och andra människor runt den personen och då fråga varför. 4.4.5 Hade Erik vekt en uggla väckt FS alhe 9 f 4.5 Verb Interjection 4.5.1 jag blev jätte besviken för jag trodde att klockan var CF alhe 9 f va sådär 7. 4.5.2 men jag va visst jätte ledsen så jag gick ut. var CF alhe 9 f 4.5.3 Vi kom tillbaks vid 6 tiden, och då va vi jätte var CF alhe 9 f trötta och hungriga. 4.5.4 Klockan va ungefär 12 när jag vaknade, och va var CF alhe 9 f får jag se om inte hästen.
4.5.5 Klockan va ungefär 12 när jag vaknade, och va var CF alhe 9 f får jag se om inte hästen. 4.5.6 jag sa att det inte va nåt så somna vi om. var CF alhe 9 f 4.5.7 alla va överens var CF frma 9 m 4.5.8 De va en pojke som hette olof var FS frma 9 m 4.5.9 de va en älg var FS frma 9 m 4.5.10 Nu va det bara att hoppa ut från fönstret. var FS haic 11 f 4.5.11... att vi åkt ner från berget och åkt så långt att var DV alhe 9 f vi inte viste va vi va. 4.5.12 pappa och jag undra va nycklarna va var DV alhe 9 f
4.5.13 Det börjar med att pappa och jag va ute och var DV alhe 9 f cyklade på landet... 4.5.14... att vi inte va på toppen av berget utan i en by var DV alhe 9 f 4.5.15 han va för tung var DV alhe 9 f 4.5.16 vi va i en jätte liten och fin by var DV alhe 9 f 4.5.17 nej det va en blåmes var DV alhe 9 f 4.5.18 Sen sa pappa att vi va tvungna att leta. var DV alhe 9 f 4.5.19 om dom va öppna var DV alhe 9 f 4.5.20 När jag kom dit va redan pappa där var DV alhe 9 f 4.5.21 en port som va helt glittrig var DV alhe 9 f 4.5.22 en katt som va svart och len var DV alhe 9 f 4.5.23 en platta som nästan va omringad av lava var DV alhe 9 f 4.5.24 där va en massa människor som va fastkedjade var DV alhe 9 f med tjocka kedjor 4.5.25 där va en massa människor som va fastkedjade var DV alhe 9 f med tjocka kedjor 4.5.26 den äldsta som va 80 år berätta att... var DV alhe 9 f 4.5.27 den byn vi va i var DV alhe 9 f 4.5.28 det va deras by var DV alhe 9 f 4.5.29 det va den hemske fula trollkarlen tokig var DV alhe 9 f 4.5.30 som tur va gick hästarna i hagen. var DV idja 11 f 4.5.31... då vill ju han vara med den kompisen som var SE wg12 10 f han va med innan. 4.5.32... men eftersom det inte va så mycket mobbing var SE wj13 13 m så... 4.5.33 Det var i somras när jag, min syster och två andra var SN wj06 13 f kompisar va på vårat vanliga ställe... 4.5.34 Vi va kanske inte så bra på det utan vi ramlade var SN wj07 13 f ganska ofta. 4.5.35 det kunde ju va att en sjusovare bor där inne vara DV alhe 9 f 4.5.36... utan det kan även vara att nån kan sparka vara SE wj08 13 f eller att man få vara enstöring och sitta själv hela tiden eller kanske spotta eller bara kanske va taskiga mot den personen 4.5.37... att försöka va tuff hela tiden (eller?) vara SE wj08 13 f 4.5.38 det kan ju va att den som blir mobbad inte vara SE wj13 13 m uppför sig på rätt sätt, 4.5.39 dom vill inte va kompis med hon/han.
vara SE wj19 13 m 4.5.40 Då måste man fråga dom som inte vill va vara SE wj19 13 m kompis med en vad man gör får fel... 4.5.41 Och om kompisarna tycker att man är ful och vara SE wj19 13 m inte vill va med en som är ful så... 4.5.42 Marianne sa fort farande hur jag kunde va med henne vara SN wg07 10 f 4.6 Verb More than one category 4.6.1 så kommer det att vara svårare att skaffa jobb gått SE wg03 10 f om dom inte har gott i skolan 4.6.2 han fick hetta Hubert. heta FS haic 11 f 4.6.3 Men pojken är inte så glad för nu måste han hitta FS haic 11 f hetta en ny glasburk. 4.6.4 Men sen så dom att det var små grodor. såg FS idja 11 f 4.6.5...vi hade precis gått förbi skolan när vi så ett gäng på ca tio personer komma emot oss. såg SN wj15 13 m
4.6.6 Hela majs fältet vad svart var CF jowe 9 f 4.6.7 Oliver bodde i en liten stuga en liten bit i från var FS jowe 9 f skogen och vad väldigt intresserad av djur. 4.6.8 Hans älsklings färg vad grön var FS jowe 9 f 4.6.9 För han vad mycket trött. var FS jowe 9 f 4.6.10 till slut vad han uppe på stocken med stort var FS jowe 9 f besvär. 4.6.11 när jag senare vad klar kom grannen och var DV jowe 9 f skrek... 4.6.12 För att komma till Strömstad vad de tvungna var DV klma 10 f att åka från Göteborg... och sedan Strömstad. 4.6.13 Det var en ganska dålig lärare som inte märkte hans fusklapp han hade i pennfacket eller vad det vad. var SE wj07 13 f 5 PARTICIPLE 5.1 Participle Participle 5.1.1 Erik sprang i väg medan Oliver välte ner det surande bot. surrande FS alhe 9 f 6 ADVERB 6.1 Adverb Noun 6.1.1 snabbt hoppa dom på kamelerna och rusa iväg bort DV erja 9 m och red bod till pappa 6.1.2 dam red bod bort DV erja 9 m 6.1.3 ingen sov got den natten gott CF frma 9 m 6.1.4 Oliver hjälpte till så got han kunde. gott FS alhe 9 f 6.1.5 att säga ifrån och förklara ur den utsatta skall hur SE wj13 13 m uppföra sig. 6.1.6 När de gick ifrån tjejen som kom så var det väll väl SE wg08 10 f för att hon inte hjälpte dem med provet 6.1.7...men sen måste dom väll få skuld känslor. väl SE wj04 13 m 6.1.8 så kan man väll fortfarande vara kompis med väl SE wj07 13 f han hon. 6.1.9 det gick väll ganska bra. väl SN wj08 13 f 6.1.10 jag får väll ta av min snowboard. väl SN wj08 13 f 6.2 Adverb Adjective 6.2.1... och där fans ett tempel fult med matt. fullt DV erge 9 f 6.2.2 Men de var fult med buskar utan för som vi fik fullt DV idja 11 f rid igenom. 6.2.3 men det är ju mycket coolare att säga nej tack inte SE wj12 13 f jag röker inte en att säga ja jag är väl inre feg. 6.2.4 ny vänta nu kommer hon nu DV hais 11 f 6.2.5 ny öppna inte garderoben nu DV hais 11 f 6.2.6 Det var rät blåsigt. rätt CF idja 11 f 6.2.7...
men jag va vist jätte ledsen såjag gick ut. visst CF alhe 9 f 6.2.8 det började vist brinna visst CF jobe 10 m 6.2.9 dom hade vist ungar och där var hans groda. visst FS erge 9 f 6.2.10 då får vi Natta över i byn vist. visst DV haic 11 f 6.2.11 Och så landade du vist i en möglig ko skit också. visst DV idja 11 f
6.3 Adverb Pronoun 6.3.1 det tog en timme att koma ditt dit CF anhe 11 m 6.3.2 Men vart dom en letade hittade dom ingen än FS anhe 11 m groda. 6.3.3 men hur han en lockade så kom den inte. än FS erge 9 f 6.3.4 Det beror på att den andra har jobbat bättre en än SE wg03 10 f den andra den som kollade på honom. 6.3.5 men det kan ju vara andra saker en bara skolan? än SE wg03 10 f 6.3.6 men det är ju mycket coolare att säga nej tack jag röker inte en att säga ja jag är väl inre feg. än SE wj12 13 f 6.4 Adverb Verb 6.4.1 förts att vi inte sögs med tromben först DV idja 11 f 6.4.2 som jag förts trodde först SN wj16 13 f 6.4.3 så har gick det till: här DV hais 11 f 6.4.4 är ett sånt problem uppstår försöker man klart hjälpa till. När SE wg07 10 f 6.5 Adverb Interjection 6.5.1... att vi åkt ner från berget och åkt så långt att var DV alhe 9 f vi inte viste va vi va. 6.5.2 pappa och jag undra va nycklarna va var DV alhe 9 f 6.5.3 sen undra han va dom bodde var DV alhe 9 f 6.6 Adverb More than one category 6.6.1 Hunden hade skäll t så mycket att geting boet hade ramlat när. ner FS caan 9 m 7 PREPOSITION 7.1 Preposition Verb 7.1.1 Min kompis tänkte hämta hjälp så han hängde sig i viadukten och hoppa ber sprang till närmaste huset och sa att det var en som hade trillat ner och att han skulle ringa ambulansen. ner SN wj05 13 m 7.2 Preposition More than one category 7.2.1 kan vi inte gå nu sa Filippa men darrig röst med DV hais 11 f 7.2.2 Man beslöt att börja men marknaderna igen. med DV mawe 11 f 8 CONJUNCTION 8.1 Conjunction Noun 8.1.1 pojken fick nästan inte resa på sig fören en förrän FS haic 11 f uggla kom. 8.1.2 Pojken hinner knappt resa sig upp fören en förrän FS idja 11 f uggla kommer flygande mot honom. 8.1.3 fören pappa kom in rusande i mitt rum.
förrän DV idja 11 f 8.1.4 inte fören när jag skulle gå ner märkte jag att förrän SN wg16 10 f jag hade fastnat, 8.1.5 män än natt klev grodan upp ur glas burken men FS erja 9 m 8.1.6 män plötsligt hoppade hunden ut ur fönstret men FS erja 9 m 8.1.7 män då hoppade pojken efter men FS erja 9 m 8.1.8 gick vi upp till utgången av tältet mer upptäckte men DV alhe 9 f varan och vi blev så rädda 8.1.9 män han hade skrikit så... men/medan FS frma 9 m
8.1.10... å ställde cyklarna på den utskurna plattan. och CF alhe 9 f 8.1.11 Vi bor i samma hus jag och Kamilla å hennes och CF angu 9 f hund. 8.1.12 Så vi fick vänta tills pappa kom hem å då skulle och CF hais 11 f jag visa pappa mamma 8.1.13 å älgen bara gode och FS frma 9 m 8.1.14 å dam två tigrarna följde också med och DV erja 9 m 8.2 Conjunction More than one category 8.2.1 Då måste man fråga dom som inte vill va för SE wj19 13 m kompis med en vad man gör får fel... 8.2.2 då skulle vi samlas 11.30 får bussen gick lite för SN wg20 10 m senare 8.2.3 vi har så mycket saker så vi kan ha i byn som??? DV haic 11 f 9 INTERJECTION 9.1 Interjection Adjective 9.1.1 när vi kom in till mig så stod mamma och pappa i dörren och sa gratis till mig när jag kom. grattis CF alhe 9 f 10 OTHER 10.1.1 där e huset som brinner är CF erja 9 m 10.1.2 nu e nog alla människor ute är CF erja 9 m 10.1.3 där e än är CF erja 9 m 10.1.4 då e dam ännu närmare ljudet är CF erja 9 m 10.1.5 Att bli mobbad e nog det värsta som finns, är SE wj08 13 f 10.1.6 Han slog då till mig över kinden så att jag fick ett R. ärr SN wg15 10 m
B.3 Segmentation Errors
Errors are categorized by part-of-speech.
ERROR CORRECTION CORP SUBJ AGE SEX 1 NOUN 1.1.1 VI VAR PÅ BORÅS BAD HUS badhus SN wg13 10 m 1.1.2... har hunden fått syn på en bi kupa. bikupa FS klma 10 f 1.1.3 Han hoppar upp på bi kupan bikupan FS klma 10 f 1.1.4... så att bi kupan börjar att skaka bikupan FS klma 10 f 1.1.5 bi kupan ramlar ner till marken! bikupan FS klma 10 f 1.1.6 då kom det en bi svärm surrande förbi bisvärm FS alca 11 f 1.1.7 tillslut välte han ner hela kupan och en hel bi bisvärm FS hais 11 f svärm surrade ut. 1.1.8 Efter 5 minuter körde en brand bil in på brandbil CF idja 11 f gården. 1.1.9 Då vi kom till min by. Trillade jag av brand brandbilen CF jowe 9 f bilen 1.1.10 Men grannen intill ringde brand kåren. brandkåren CF jobe 10 m 1.1.11 när brand kåren kom hade hela vår ranch brandkåren DV idja 11 f brunnit ner till grunden. 1.1.12 brand larmet går brandlarmet CF erja 9 m 1.1.13 Just när han hörde smällen gick brand larmet brandlarmet CF klma 10 f på riktigt! 1.1.14 Han rusade ut till brandmännen som inte hade brandlarmet CF klma 10 f hört smällen och brand larmet. 1.1.15 Han jobbade som brand man brandman CF erja 9 m 1.1.16 En brand man klättrade upp till oss. brandman CF idja 11 f 1.1.17 om det fanns någon ledig brand man brandman CF idja 11 f 1.1.18 jag håller på och utbildar mig till brand man brandman CF idja 11 f 1.1.19 Petter sa att han tänkte bli Brand man när han brandman CF idja 11 f blir stor. 1.1.20 En brand man berättade att... brandman CF jowe 9 f 1.1.21 BRAND MANEN brandmannen CF erja 9 m 1.1.22 det här var en bra träning för mig sa brand brandmannen CF idja 11 f manen 1.1.23 brand menen ryckte ut och släckte elden. brandmännen CF anhe 11 m 1.1.24 jag ringde till brand stationen brandstationen CF idja 11 f 1.1.25 Och i morgon är det brand övning brandövning CF klma 10 f 1.1.26 där brand övningen skulle hålla till.
brandövningen CF klma 10 f 1.1.27 vi skulle börja göra i ordning den lilla byn som byhus DV hais 11 f bestod av 8 hus 6 affärer och ett by hus 1.1.28 Desere jobbade i en djur affär djuraffär DV hais 11 f 1.1.29 men se där är ni ju det lilla följet bestående av djuraffären DV hais 11 f snutna djur från djur affären. 1.1.30 när det lilla djur följet gått i fyra timmar djurföljet DV hais 11 f 1.1.31 Efter några sekunder stod såfus med tungan dörröppningen FS hais 11 f halvvägs hängande ut i mun i dörr öppningen. 1.1.32 hon lurade i min pojkvän massa elak heter om elakheter SN wg07 10 f Linnea. 1.1.33 han hade ett 4 mannatält I sin fik kniv. fickkniv DV alhe 9 f 1.1.34 Då sprang dom fort till tunneln och fort till skidbacken och Fort till flyg platsen flygplatsen DV erha 10 m
1.1.35 Jag hör fot steg från trappan fotsteg CF alhe 9 f 1.1.36 frukost klockan ringde frukostklockan DV hais 11 f 1.1.37 jag går ner och ringer i frukost klockan frukostklockan DV hais 11 f 1.1.38 genom att han tappat en jord fläck på fönster fönsterkarmen FS hais 11 f karmen. 1.1.39 Ronja hittade en förbands låda förbandslåda DV mawe 11 f 1.1.40 Men lars fick försäkrings pengarna försäkringspengarna CF erha 10 m 1.1.41 Hunden hoppar vid ett geting bo. getingbo FS erha 10 m 1.1.42 Geting boet trillar ner på marken. getingboet FS erha 10 m 1.1.43 Geting boet går sönder. getingboet FS erha 10 m 1.1.44 det var en gips skena som... gipsskena SN wj05 13 m 1.1.45 Nu hade han den i en ganska stor glas burk, på glasburk FS alca 11 f sitt rum. 1.1.46 så han tog med sig grodan hem i en glas burk. glasburk FS alhe 9 f 1.1.47 grodan klev upp ur glas burken. glasburken FS alca 11 f 1.1.48 hunden stack in huvudet i glas burken glasburken FS alca 11 f 1.1.49 Glas burken som hunden hade på huvudet gick glasburken FS alca 11 f i tusen bitar 1.1.50 Oliver innerligt försökte få av sig den glas glasburken FS alhe 9 f burken som... 1.1.51 Hunden hade fastnat i glas burken och ramlade glasburken FS caan 9 m ner. 1.1.52 Pojken och hunden sitter och kollar på grodan i glasburken FS erha 10 m glas burken. 1.1.53 När pojken och hunden har somnat kryper glasburken FS erha 10 m grodan ut ur glas burken. 1.1.54 Glas burken går sönder. glasburken FS erha 10 m 1.1.55 såfus hade letat i glas burken glasburken FS hais 11 f 1.1.56 han fick ha på sig glas burken över huvudet. glasburken FS hais 11 f 1.1.57 såfus landade med huvudet före och hela glas glasburken FS hais 11 f burken sprack. 1.1.58... så gick glas burken sönder. glasburken FS klma 10 f 1.1.59 dom plockade många kran kvistar och la som grankvistar DV hais 11 f täcke 1.1.60 här är också en grav sten från 1989.
gravsten DV hais 11 f 1.1.61 jag satte upp grav stenar efter dom gravstenar DV hais 11 f 1.1.62 dan efter grävde vi upp deras grav stenar gravstenar DV hais 11 f 1.1.63 hit ut går det ju bara en grus väg grusväg DV idja 11 f 1.1.64 Hästarna saktade av när dom kom ut på en grus grusväg DV idja 11 f väg. 1.1.65 vi fortsatte på den lilla grus vägen. grusvägen DV idja 11 f 1.1.66 grus vägen ledde fram till en övergiven by. grusvägen DV idja 11 f 1.1.67 Vi följde grus vägen grusvägen DV idja 11 f 1.1.68 Vi red i genom det stora hålet och kom in på grusvägen DV idja 11 f grus vägen 1.1.69 vart tionde år måste han ha 5 guld klimpar guldklimpar DV angu 9 f 1.1.70 en hund på 14 hund år hundår DV hais 11 f 1.1.71 trampat på igel kott igelkott DV hais 11 f 1.1.72 En dag hade vi en informations dag om mobbing informationsdag SE wj16 13 f 1.1.73 Då kom det upp en jord ekorre jordekorre FS alca 11 f 1.1.74 han tittade i ett jord hål. jordhål FS alhe 9 f
1.1.75 det är ju jul afton om 3 dagar julafton CF erge 9 f 1.1.76 Innan jul skulle våran klass ha jul fest. julfest SN wg02 10 f 1.1.77 sen var det problem på klass fotot klassfotot SE wg18 10 m 1.1.78 man vill ju vara fin på klass fotot klassfotot SE wg18 10 m 1.1.79 På t ex klass fotot klassfotot SE wg19 10 m 1.1.80 MIN KLASS KAMRAT VILLE INTE klasskamrat SN wg13 10 m HOPPA FRÅN HOPPTORNET 1.1.81 snabbt tog han på sig klä där kläder FS erja 9 m 1.1.82 Och så landade du visst i en möglig ko skit koskit DV idja 11 f också 1.1.83 men det finns i alla fall ingen tur med en möglig koskit DV idja 11 f ko skit. 1.1.84 De hade med sig : ett spritkök, ett tält, och kulgevär DV jobe 10 m Massa Mat, några kul gevär, och ammunition M.M. 1.1.85 När kvälls daggen kom var vi helt klara kvällsdaggen DV hais 11 f 1.1.86 Kvälls daggen hade fallit kvällsdaggen DV mawe 11 f 1.1.87 det brann på Macintosh vägen 738c Macintoshvägen CF anhe 11 m 1.1.88 Att få status är kanske det maffia ledarna maffialedarna SE wj20 13 m håller på med. 1.1.89 Hela majs fältet var svart majsfältet CF jowe 9 f 1.1.90 Vid mat bordet var det en livlig stämma matbordet DV idja 11 f 1.1.91 dom kom in till oss med 2 stora mat kassar. matkassar CF alhe 9 f 1.1.92 det var när jag gick i mellan stadiet mellanstadiet SN wj14 13 m 1.1.93 Jag satt vid middags bordet tillsammans med middagsbordet CF mawe 11 f mamma och min lillebror Simon. 1.1.94 där stannade dem och bodde där resten av livet mobiltelefonen DV jobe 10 m för mobil telefonen räckte inte enda hem. 1.1.95 alla djur rusade ut ur affären upp på mölndals Mölndalsvägen DV hais 11 f vägen 1.1.96 Han hade fångat en groda när han var i parken näckrosdammen FS alca 11 f vid den stora näckros dammen. 1.1.97 skuggorna föll förundrat på det vita parkett parkettgolvet FS hais 11 f golvet.
1.1.98 En vecka senare så var det en polis patrull som polispatrull DV alca 11 f letade efter skol klassen 1.1.99 och precis när en av dem skulle slå till mig så polissirener SN wj15 13 m hörde jag polis sirener 1.1.100 Man hämtar då en rast vakt. rastvakt SE wg07 10 f 1.1.101 följer du med på en rid tur ridtur DV idja 11 f 1.1.102 här står det August rosen gren har lämnat Rosengren DV hais 11 f jorden 1.1.103 jag hade fått en sjuk dom sjukdom CF erge 9 f 1.1.104 helt plötsligt var jag på sjuk huset. sjukhuset CF erge 9 f 1.1.105... förrän jag vaknade i en sjukhus säng. sjukhussäng CF mawe 11 f 1.1.106 jag tog mina saker ner i en sken påse skenpåse DV haic 11 f 1.1.107 dom bär massor av sken smycken skensmycken DV haic 11 f 1.1.108 Pappa det var du som la den i skrivbords lådan skrivbordslådan CF erge 9 f 1.1.109...men sen måste dom väl få skuld känslor. skuldkänslor SE wj04 13 m 1.1.110 därför är lärarens skyldig het att se till att eleven skyldighet SE wj19 13 m får hjälp. 1.1.111 Sedan var det ett sov rum med 4 bäddar. sovrum DV mawe 11 f 1.1.112 Dem kom med en steg bil och hämtade oss. stegbil CF jobe 10 m
1.1.113 det var ett stort sten hus stenhus DV erha 10 m 1.1.114 Kalle-knorr hade hittat ett stort sten kors stenkors DV hais 11 f 1.1.115 där står ett gult hus med stock rosor slingrande stockrosor DV hais 11 f efter väggarna 1.1.116 allt från att förstå en telefon apparat till att telefonapparat SE wj20 13 m förstå en människa. 1.1.117 när de var hemma så tittade de i telefon katalogen telefonkatalogen CF alca 11 f 1.1.118 ni får gärna bo hos oss under tid en ni inte har tiden DV idja 11 f nåt att bo i. 1.1.119 så kom brandbilen och räddade mamma ut toalettfönstret CF hais 11 f genom toalett fönstret. 1.1.120 där bakom några grenar låg någonting ett trähus DV hais 11 f trä hus 1.1.121 Ett vardags rum med 2 soffor 1 bord och en vardagsrum DV mawe 11 f stor öppenspis 1.1.122 Johan gick in i vardags rummet och satte upp vardagsrummet CF alca 11 f elementet. 1.1.123 hela vardags rummet stod i brand vardagsrummet CF alca 11 f 1.1.124 hans älsklings djur var groda. älsklingsdjur FS jowe 9 f 1.1.125 Hans älsklings färg vad grön älsklingsfärg FS jowe 9 f 1.1.126 Och det är nog en överlevnads instinkt. överlevnadsinstinkt SE wj20 13 m 2 ADJECTIVE/PARTICIPLE 2.1.1 Fast pappa hade utrustat alla hus brand säkra. brandsäkra DV idja 11 f 2.1.2 där va massa människor som va fast kedjade fastkedjade DV alhe 9 f med tjocka kedjor 2.1.3 Människorna hade haft färg glada dräkter på färgglada DV mawe 11 f sig 2.1.4 Tanja sydde glatt färgade kläder åt allihop glattfärgade DV mawe 11 f 2.1.5 Fönstret stod halv öppet halvöppet FS hais 11 f 2.1.6 där han låg hjälp lös på marken. hjälplös FS hais 11 f 2.1.7 Cristoffer hoppade ner och var jätte arg för att jättearg FS alca 11 f burken gick sönder. 2.1.8 Cristoffer lyfte upp hunden och var fortfarande jättearg FS alca 11 f jätte arg men... 2.1.9 Ett par horn på en hjort som blev jätte arg.
jättearg FS erge 9 f 2.1.10 Bina som var inne i boet blev jätte arga och jättearga FS alca 11 f surrade upp ur boet. 2.1.11 så kanske de blir jätte bra kompisar. jättebra SE wg16 10 f 2.1.12 och tänk om den som man skrev av hade skrivit jättebra SE wg17 10 f en jätte bra dikt 2.1.13 Det var inte så jätte djupt på den delen av jättedjupt FS alca 11 f floden som Cristoffer och hunden föll i på. 2.1.14 dom bott i en jätte fin by jättefin DV alhe 9 f 2.1.15 Sen hjälpte vi dom att göra om byn till en jätte jättefin DV alhe 9 f fin by 2.1.16 Mamma och pappa tyckte det var en jätte fin jättefin DV idja 11 f by 2.1.17 Jag hade ett jätte fint rum. jättefint DV idja 11 f 2.1.18 då blev jag jätte glad jätteglad SN wg18 10 f 2.1.19 Då blev dom jätte glada. jätteglada DV alhe 9 f 2.1.20 där man kan äta jätte god picknick jättegod DV alhe 9 f
2.1.21 det var helt lila och såg jätte hemskt ut, jättehemskt SN wj03 13 f 2.1.22 pappa och jag tänkte att vi skulle cykla upp på jättehöga DV alhe 9 f det jätte höga berget för att titta på ut sikten. 2.1.23 pappa gick ut och såg att vi va I en jätte liten jätteliten DV alhe 9 f och fin by, 2.1.24 Den andra frågan är jätte lätt jättelätt SE wj03 13 f 2.1.25 vi mulade och kastade jätte många snöbollar jättemånga SN wj10 13 m på dom 2.1.26 tuni hade jätte ont i knät jätteont SN wj03 13 f 2.1.27 Nästa dag när Oliver vaknade blev han jätte jätterädd FS jowe 9 f rädd för han såg inte grodan i glasburken. 2.1.28 Då blev Oliver jätte rädd. jätterädd FS jowe 9 f 2.1.29 jag blev jätte rädd jätterädd SN wj03 13 f 2.1.30 både muffins och Oliver blev jätte rädda. jätterädda FS jowe 9 f 2.1.31 Det blev jätte struligt med allt möjligt inblandat. jättestruligt SN wg11 10 f 2.1.32 han sade till muffins att vara jätte tyst. jättetyst FS jowe 9 f 2.1.33 man ser att det är nåt jätte viktigt hon ville jätteviktigt CF alhe 9 f berätta. 2.1.34 Med en gång blev jag klar vaken klarvaken DV idja 11 f 2.1.35 en platta som nästan va om ringad av lava. omringad DV alhe 9 f 2.1.36 vi slog upp tältet på den spik spetsiga toppen spikspetsiga DV alhe 9 f 2.1.37 det var en varm och stjärn klar natt. stjärnklar DV hais 11 f 2.1.38 En gång blev den hemska pyroman ut kastad utkastad CF frma 9 m ur stan. 2.1.39 Om man blir ut satt för något... utsatt SE wj19 13 m 2.1.40 i vart enda hus var alla saker kvar från 1600 vartenda DV hais 11 f talet 2.1.41 då bar det av i 14 dagar och 14 äventyrs fyllda äventyrsfyllda DV hais 11 f nätter 2.1.42 då kom dom till en över given by övergiven DV erge 9 f 2.1.43 de kom till en över given by övergiven DV erha 10 m 2.1.44 de kom till en över given by övergiven DV hais 11 f 2.1.45 Det var en över given by.
övergiven DV hais 11 f 2.1.46 då för stod vi att det var en över given by övergiven DV hais 11 f 2.1.47 till slut kom dem till en över given By. övergiven DV jobe 10 m 2.1.48 vi passerade många över vuxna hus övervuxna DV hais 11 f 2.1.49 Oliver fick se ett geting bo och blev hel galen. helgalen FS alhe 9 f 3 PRONOUN 3.1.1 hon hade bara drömt allt ihop. alltihop DV angu 9 f 3.1.2 simon låg på sin kudde och hade inte märkt någonting FS hais 11 f någon ting. 3.1.3 Nu ska jag visa er någon ting någonting DV hais 11 f 3.1.4 Dom flesta var duktiga på någon ting någonting DV mawe 11 f 3.1.5 för då kan man inte något ting någonting SE wg03 10 f 4 VERB 4.1.1 när jag dog 1978 i cancer återvände jag hit för fortsätta DV alco 9 f att fort sätta mitt liv här 4.1.2 Jag tror att killen inte kan för bättra sig själv... förbättra SE wj03 13 f 4.1.3 då för stod vi att det var en över given by förstod DV hais 11 f 4.1.4 medan jag för sökte lyfta upp mig skälv försökte SN wg16 10 f
4.1.5 ni för tjänar verkligen mina hem kokta kladdkakor förtjänar DV hais 11 f 4.1.6 a Tess min fina gamla hund du på minner mig påminner DV hais 11 f om någon jag har träffat förut 4.1.7 Han ring de till mig sen och sa samma sak. ringde SN wg07 10 f 4.1.8 Hon under sökte noga hans fot. undersökte DV mawe 11 f 5 ADVERB 5.1.1 Där efter dog mamma på sjukhuset. därefter CF hais 11 f 5.1.2 men han tog sig snabbt där i från. därifrån FS hais 11 f 5.1.3 när man bara går där ifrån därifrån SE wj19 13 m 5.1.4 SEN GICK VI DÄR IFRÅN därifrån SN wg13 10 m 5.1.5 Jag ställde mig på en sten och efter ett tag så därifrån SN wj01 13 f ville jag gå där ifrån, 5.1.6 så till slut så sprang dom där ifrån därifrån SN wj10 13 m 5.1.7 Bina som bodde i bot rusade i mot Oliver emot FS alhe 9 f 5.1.8 han råkade bara kom i mot getingboet. emot FS haic 11 f 5.1.9 Marianne sa fort farande hur jag kunde va med fortfarande SN wg07 10 f henne 5.1.10 Alla såg fram emot att åka framemot SN wj09 13 m 5.1.11 Då kom hunden för bi med getingar förbi FS caan 9 m 5.1.12 människor som går för bi kan höra oss. förbi DV hais 11 f 5.1.13 Eller när man går för bi varandra förbi SE wg07 10 f 5.1.14 vi hade aldrig fått smaka plättar sylt och kola förut DV hais 11 f för ut 5.1.15 Inte konstigt att vi inte har upptäckt den här förut DV idja 11 f ingången för ut 5.1.16 jag som alltid tyckt det var så högt här i från. härifrån CF idja 11 f 5.1.17 stick här i från annars är du dödens härifrån DV angu 9 f 5.1.18 I bland kan allt vara jobbigt och hemskt ibland SE wj02 13 f 5.1.19 Men i bland kan det vara så att dom tror att ibland SE wj09 13 m dom är coola 5.1.20 jag var tvungen att berätta hela historien om i igen CF hais 11 f gen. 5.1.21 vad var det han hete nu i gen? igen CF hais 11 f 5.1.22 jag vill bli kompis med henne i gen igen SN wg03 10 f 5.1.23 och så ville Johanna bli kompis i gen. igen SN wg03 10 f 5.1.24 Pojken och hunden söker i genom rummet.
igenom FS erha 10 m 5.1.25 morfar och dom andra letar och letar i genom igenom DV erge 9 f staden 5.1.26 Vi red i genom det stora hålet igenom DV idja 11 f 5.1.27 Vi red i genom byn igenom DV idja 11 f 5.1.28 när Gunnar öppna dörren till det stora huset rasa ihop DV erha 10 m det i hop 5.1.29 snart rasa hela byn i hop ihop DV erha 10 m 5.1.30 snabbt samla han i hop alla sina jägare ihop DV erja 9 m 5.1.31 Rådjuret sprang i väg med honom. iväg FS angu09 9 f 5.1.32 Han sprang i vägg och klättrade upp på en iväg FS anhe 11 m kulle. 5.1.33 Lena såg en gammal man sitta i ett tält av guld också DV angu 9 f intill sov säckarna som och så var av guld. 5.1.34 dam tåg och så med sig sina två tigrar också DV erja 9 m 5.1.35 undulater flög om kring omkring DV hais 11 f
312 Appendix B. ERROR CORRECTION CORP SUBJ AGE SEX 5.1.36 när de såg sig om kring omkring DV jowe 9 f 5.1.37 han trillar om kull. omkull FS klma 10 f 5.1.38 Han ropade igenom fönstret men inget kvack tillbaka FS caan 9 m kom till baka. 5.1.39 vi gick till baka igen tillbaka DV alhe 9 f 5.1.40 svarta manen sprang sin väg och kom aldrig tillbaka DV angu 9 f mer till baka. 5.1.41 Efter det gick vi till baka tillbaka DV idja 11 f 5.1.42... ska man lämna till baka den. tillbaka SE wg17 10 f 5.1.43 Sedan slumrade såfus, grodan och simon djupt tillsammans FS hais 11 f till sammans. 5.1.44 Men de var fult med buskar utan för som vi utanför DV idja 11 f fick rid igenom. 5.1.45 en kille blev utan för, utanför SE wj11 13 f 5.1.46 men olof var glad en då ändå FS frma 9 m 5.1.47 men om man inte får vara med än då ändå SE wj14 13 m 5.1.48 Erik letade över allt överallt FS alhe 9 f 5.1.49 Han letade över allt i sitt rum överallt FS jobe 10 m 5.1.50 Han letade under sängen under pallen i tofflorna överallt FS jowe 9 f bland kläderna ja över allt 5.1.51 Han letade över allt överallt FS mawe 11 f 5.1.52 Desere letade över allt överallt DV hais 11 f 5.1.53 jag har letat över allt överallt DV hais 11 f 6 PREPOSITION 6.1.1 fram för mig stod världens finaste häst. framför CF alhe 9 f 6.1.2 Vi gick längs vägen tills vi såg ett stort hus som låg en bit utan för själva stan utanför DV idja 11 f 7 CONJUNCTION 7.1.1 Efter som han frös och inte såg sig för snubblade han på en sten. 7.1.2... och efter som det inte fanns nåt lock på burken... 7.1.3 men jag kunde inte säga det till honom för att jag visste att han skulle bli lite ledsen då efter som vi hade bestämt. eftersom DV mawe 11 f eftersom FS alhe 9 f eftersom SN wg06 10 f 8 RUN-ONS 8.1.1 Nathalie berättade alltför mig allt för SN wg11 10 f 8.1.2 därbakom fanns 2 grodor. 
där bakom FS jowe 9 f 8.1.3 och tillslut stod vi alla på marken till slut CF idja 11 f 8.1.4 tillslut välte han ner hela kupan till slut FS hais 11 f 8.1.5 tillslut kom de fram till en gärdsgård till slut DV alca 11 f 8.1.6 men tillslut tyckte de också att... till slut DV alca 11 f 8.1.7 tillslut blev dam två kamelerna så trötta... till slut DV erja 9 m 8.1.8 tillslut kom de fram till en vacker plats till slut DV hila 10 f 8.1.9 tillslut sa pappa till slut DV idja 11 f 8.1.10 Tillslut kom dom upp mot sidan av oss och sa, till slut SN wj04 13 m 8.1.11 Tillslut kom det en massa vuxna som... till slut SN wj04 13 m 8.1.12 Vi åkte tillslut på bio. till slut SN wj04 13 m 8.1.13 mobbing råkar väldigt många utför. ut för SE wj05 13 m
Appendix C
SUC Tagset

The set of tags used was taken from the Stockholm Umeå Corpus (SUC):

Code     | Category
AB       | Adverb
DL       | Delimiter (Punctuation)
DT       | Determiner
HA       | Interrogative/Relative Adverb
HD       | Interrogative/Relative Determiner
HP       | Interrogative/Relative Pronoun
HS       | Interrogative/Relative Possessive
IE       | Infinitive Marker
IN       | Interjection
JJ       | Adjective
KN       | Conjunction
NN       | Noun
PC       | Participle
PL       | Particle
PM       | Proper Noun
PN       | Pronoun
PP       | Preposition
PS       | Possessive
RG       | Cardinal Number
RO       | Ordinal Number
SN       | Subjunction
UO       | Foreign Word
VB       | Verb

Code     | Feature
UTR      | Common (Utrum) Gender
NEU      | Neuter Gender
MAS      | Masculine Gender
UTR/NEU  | Underspecified Gender
-        | Unspecified Gender
SIN      | Singular Number
PLU      | Plural Number
SIN/PLU  | Underspecified Number
-        | Unspecified Number
IND      | Indefinite Definiteness
DEF      | Definite Definiteness
IND/DEF  | Underspecified Definiteness
-        | Unspecified Definiteness
NOM      | Nominative Case
GEN      | Genitive Case
SMS      | Compound Case
-        | Unspecified Case
POS      | Positive Degree
KOM      | Comparative Degree
SUV      | Superlative Degree
SUB      | Subject Pronoun Form
OBJ      | Object Pronoun Form
SUB/OBJ  | Underspecified Pronoun Form
PRS      | Present Verb Form
PRT      | Preterite Verb Form
INF      | Infinitive Verb Form
SUP      | Supine Verb Form
IMP      | Imperative Verb Form
AKT      | Active Voice
SFO      | S-form Voice
KON      | Subjunctive Mood
PRF      | Perfect Participle Form
AN       | Abbreviation Form
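A full SUC tag combines one part-of-speech code with zero or more feature values, e.g. "NN UTR SIN IND NOM" for a common-gender singular indefinite nominative noun. As a small illustration (not part of FiniteCheck; the function name and the trimmed code table are our own), such a tag string decomposes as follows:

```python
# Split a SUC-style tag string into the part-of-speech code and its
# feature values. POS_CODES lists only a subset of the codes above;
# parse_suc_tag is an illustrative helper, not FiniteCheck code.

POS_CODES = {
    "AB": "Adverb", "DT": "Determiner", "JJ": "Adjective",
    "NN": "Noun", "PN": "Pronoun", "PP": "Preposition", "VB": "Verb",
}

def parse_suc_tag(tag):
    # First whitespace-separated field is the POS code, the rest are features.
    code, *features = tag.split()
    return {"pos": POS_CODES.get(code, code), "features": features}

print(parse_suc_tag("NN UTR SIN IND NOM"))
# {'pos': 'Noun', 'features': ['UTR', 'SIN', 'IND', 'NOM']}
```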
Appendix D
Implementation

D.1 Broad Grammar

#### Declare categories
define PPheadPhr ["<pphead>" $"<pphead>" "</pphead>"];
define VPheadPhr ["<vphead>" $"<vphead>" "</vphead>"];
define APPhr     ["<ap>" $"<ap>" "</ap>"];
define NPPhr     ["<np>" $"<np>" "</np>"];
define PPPhr     ["<pp>" $"<pp>" "</pp>"];
define VPPhr     ["<vp>" $"<vp>" "</vp>"];

#### Head rules
define AP     [(Adv) Adj+];
define PPhead [Prep];
define VPhead [[[Adv* Verb] | [Verb Adv*]] Verb* (PNDef & PNNeu)];

#### Complement rules
define NP [[[(Det | Det2 | NGen) (Num) (APPhr) (Noun)] & ?+] | Pron];
define PP [PPheadPhr NPPhr];
define VP [VPheadPhr (NPPhr) (NPPhr) (NPPhr) PPPhr*];

#### Verb clusters
define VC [[[Verb Adv*] / NPTags] (NPPhr) [[Adv* Verb (Verb)] / NPTags]];

D.2 Narrow Grammar: Noun Phrases

############### Narrow grammar for APs:
define APDef ["<ap>" (Adv) AdjDef+ "</ap>"];
define APInd ["<ap>" (Adv) AdjInd+ "</ap>"];
define APSg  ["<ap>" (Adv) AdjSg+ "</ap>"];
define APPl  ["<ap>" (Adv) AdjPl+ "</ap>"];
define APNeu ["<ap>" (Adv) AdjNeu+ "</ap>"];
define APUtr ["<ap>" (Adv) AdjUtr+ "</ap>"];
define APMas ["<ap>" (Adv) AdjMas+ "</ap>"];
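The parser (Section D.4) turns head rules such as `AP [(Adv) Adj+]` into bracketing transducers with the longest-match replace operator `@->`. As a rough emulation of that bracketing, here is a Python regex sketch over toy category-name tokens (ours, not the actual xfst transducer):

```python
import re

# Emulate "AP @-> '<ap>' ... '</ap>'": wrap each longest, leftmost run
# matching [(Adv) Adj+] in <ap> brackets. Tokens are toy stand-ins for
# the lexical categories, not real tagged words.
def mark_ap(tokens):
    text = " ".join(tokens)
    return re.sub(r"(?:Adv )?Adj(?: Adj)*",
                  lambda m: "<ap> " + m.group(0) + " </ap>",
                  text)

print(mark_ap(["Det", "Adv", "Adj", "Adj", "Noun"]))
# Det <ap> Adv Adj Adj </ap> Noun
```

Like `@->`, `re.sub` here takes the leftmost match and extends it as far as possible, so "Adv Adj Adj" is bracketed as one phrase rather than two.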
############### Narrow grammar for NPs:

###### NPs consisting of a single noun
define NPDef1 [(Num) [NDef | PNoun]];
define NPInd1 [(Num) NInd];
define NPSg1  [(NumO) NSg | [NPl & NInd] | PNoun];
define NPPl1  [(NumC) [NPl | PNoun]];
define NPNeu1 [(Num) [NNeu | [NUtr & NInd] | PNoun]];
define NPUtr1 [(Num) [[NUtr & NPl] | [NUtr & NDef] | PNoun]];

###### NPs consisting of a determiner (or a noun in genitive) and a noun
define NPDef2 [DetDef (DetAdv) (Num) NDef] | [[DetMixed | NGen] (Num) NInd];
define NPInd2 [DetInd (Num) NInd];
define NPSg2  [[DetSg (DetAdv) | NGen] (NumO) NSg];
define NPPl2  [[DetPl (DetAdv) | NGen] (NumC) NPl];
define NPNeu2 [[DetNeu (DetAdv) | NGen] (Num) NNeu];
define NPUtr2 [[DetUtr (DetAdv) | NGen] (Num) NUtr];

###### NPs consisting of [Det (AP) N]
define NPDef3 [DetDef (DetAdv) (Num) (APDef) NDef] | [[DetMixed | NGen] (Num) (APDef) NInd];
define NPInd3 [DetInd (NumO) (APInd) NInd];
define NPSg3  [[DetSg (DetAdv) | NGen] (NumO) (APSg) NSg];
define NPPl3  [[DetPl (DetAdv) | NGen] (NumC) (APPl) NPl];
#define NPNeu3 [[DetNeu (DetAdv) | NGen] (Num) (APNeu) NNeu];
define NPNeu3 [[DetNeu (DetAdv) | NGen] (Num) [[(APNeu) NNeu] | [(APMas) NMas]]];
define NPUtr3 [[DetUtr (DetAdv) | NGen] (Num) (APUtr) NUtr];

###### NPs consisting of [Adj+ N]
# optional numbers only in NPInd and NPPl
define NPDef4 [APDef NDef];
define NPInd4 [(Num) APInd NInd];
define NPSg4  [APSg NSg];
define NPPl4  [(Num) APPl NPl];
define NPNeu4 [APNeu NNeu];
define NPUtr4 [APUtr NUtr];

###### NPs consisting of a single pronoun
define NPDef5 [PNDef];
define NPInd5 [PNInd];
define NPSg5  [PNSg];
define NPPl5  [PNPl];
define NPNeu5 [PNNeu];
define NPUtr5 [PNUtr];

###### NPs consisting of a single determiner
define NPDef6 [DetDef (DetAdv)];
define NPInd6 [DetInd];
define NPSg6  [DetSg (DetAdv)];
define NPPl6  [DetPl (DetAdv)];
define NPNeu6 [DetNeu (DetAdv)];
define NPUtr6 [DetUtr (DetAdv)];
###### NPs consisting of adjectives
define NPDef7 [APDef+];
define NPInd7 [APInd+];
define NPSg7  [APSg+];
define NPPl7  [APPl+];
define NPNeu7 [APNeu+];
define NPUtr7 [APUtr+];

###### NPs consisting of a single determiner and adjectives
define NPDef8 [DetDef APDef];
define NPInd8 [DetInd APInd];
define NPSg8  [DetSg APSg];
define NPPl8  [DetPl APPl];
define NPNeu8 [DetNeu APNeu];
define NPUtr8 [DetUtr APUtr];

###### NPs consisting of number as the main word
define NPDef9 [(DetDef) NumO];
define NPInd9 [Num];
define NPSg9  [Num];
define NPPl9  [Num];
define NPNeu9 [Num];
define NPUtr9 [Num];

###### NPs that meet definiteness agreement
### Definite NPs
define NPDef [NPDef1 | NPDef2 | NPDef3 | NPDef4 | NPDef5 | NPDef6 | NPDef7 | NPDef8 | NPDef9];
### Indefinite NPs
define NPInd [NPInd1 | NPInd2 | NPInd3 | NPInd4 | NPInd5 | NPInd6 | NPInd7 | NPInd8 | NPInd9];
define NPDefs [NPDef | NPInd];

###### NPs that meet number agreement
### Singular NPs
define NPSg [NPSg1 | NPSg2 | NPSg3 | NPSg4 | NPSg5 | NPSg6 | NPSg7 | NPSg8 | NPSg9];
### Plural NPs
define NPPl [NPPl1 | NPPl2 | NPPl3 | NPPl4 | NPPl5 | NPPl6 | NPPl7 | NPPl8 | NPPl9];
define NPNum [NPSg | NPPl];

###### NPs that meet gender agreement
### Utrum NPs
define NPUtr [NPUtr1 | NPUtr2 | NPUtr3 | NPUtr4 | NPUtr5 | NPUtr6 | NPUtr7 | NPUtr8 | NPUtr9];
### Neutrum NPs
define NPNeu [NPNeu1 | NPNeu2 | NPNeu3 | NPNeu4 | NPNeu5 | NPNeu6 | NPNeu7 | NPNeu8 | NPNeu9];
define NPGen [NPNeu | NPUtr];
########## Partitive NPs
define NPPart    [[Det | Num] PPart NP];
define NPPartDef [[Det | Num] PPart NPDef];
define NPPartInd [[Det | Num] PPart NPDef];
define NPPartSg  [[DetSg | Num] PPart NPPl];
define NPPartPl  [[DetPl | Num] PPart NPPl];
define NPPartNeu [[DetNeu | Num] PPart NPNeu];
define NPPartUtr [[DetUtr | Num] PPart NPUtr];
define NPPartDefs [NPPartDef | NPPartInd];
define NPPartNum  [NPPartSg | NPPartPl];
define NPPartGen  [NPPartNeu | NPPartUtr];

########## NPs followed by relative subclause
define SelectNPRel [
    "<np>" -> "<NPRel>" || _ DetDef $"<np>" "</np>" (" ") {som} Tag*];

D.3 Narrow Grammar: Verb Phrases

#### Infinitive VPs
# select Infinitive VPs
define SelectInfVP ["<vphead>" -> "<vpheadinf>" || InfMark "<vp>" _];
# Infinitive VP
define VPInf [Adv* (ModInf) VerbInf Adv* (NPPhr)];

#### Tensed verb first
define VPFinite [Adv* VerbTensed ?*];

#### Verb Clusters:
# select VCs
define SelectVC [VC @-> "<vc>" ... "</vc>"];
define VC1 [[[Mod | INFVerb] / NPTags] (NPPhr) [[Adv* VerbInf] / NPTags]];
define VC2 [[Mod / NPTags]  (NPPhr) [[Adv* ModInf VerbInf] / NPTags]];
define VC3 [[Mod / NPTags]  (NPPhr) [[Adv* PerfInf VerbSup] / NPTags]];
define VC4 [[Perf / NPTags] (NPPhr) [[Adv* VerbSup] / NPTags]];
define VC5 [[Perf / NPTags] (NPPhr) [[Adv* ModSup VerbInf] / NPTags]];
define VCgram [VC1 | VC2 | VC3 | VC4 | VC5];
### Coordinated VPs:
define SelectVPCoord [
    "<vphead>" -> "<vpheadcoord>" ||
    ["<vpheadinf>" | "</vc>"] $"<vphead>" $"<vp>" [{eller} | {och}] Tag* (" ") "<vp>" _];

#** ATT-VPs that do not require infinitive
define SelectATTFinite [
    "<vphead>" -> "<vpheadattfinite>" ||
    [ [[[{sa} Tag+] | [[{för} Tag+] / NPTags]] ("</vphead></vp>")]
    | [[{tänkte} Tag+] [[NPPhr "</vphead></vp>"] | ["</vphead>" NPPhr "</vp>"]]] ]
    InfMark "<vp>" _];

### Supine VPs
define SelectSupVP ["<vphead>" -> "<vpheadsup>" || _ VerbSup "</vphead>"];

D.4 Parser

###### Mark head phrases (lexical prefix)
define markpphead [PPhead @-> "<pphead>" ... "</pphead>"];
define markvphead [VPhead @-> "<vphead>" ... "</vphead>"];
define markap     [AP @-> "<ap>" ... "</ap>"];

###### Mark phrases with complements
define marknp [NP @-> "<np>" ... "</np>"];
define markpp [PP @-> "<pp>" ... "</pp>"];
define markvp [VP @-> "<vp>" ... "</vp>"];

###### Composing parsers
define parse1 [markvphead .o. markpphead .o. markap];
define parse2 [marknp];
define parse3 [markpp .o. markvp];

D.5 Filtering

################# Filtering Parsing Results

### Possessive NPs
define adjustnpgen [
    0 -> "<vphead>" || NGen "</np><vphead>" NPPhr _ ,,
    "</np><vphead><np>" -> 0 || NGen _ $"<np>" "</np>"];

### Adjectives
define adjustnpadj [
    "</np><vphead><np>" -> 0 || Det _ APPhr "</np></vphead>" NPPhr ,,
    "</np></vphead><np>" -> 0 || Det "</np><vphead><np>" APPhr _];

### Adjective form, i.e. remove plural tags if singular NP
define removepluraltagsnpsg [
    TagPLU -> 0 || DetSg "<ap>" Adj _ $"</np>" "</np>"];

### Partitive NPs
define adjustnppart [
    "</np><pphead>" -> 0 || _ PPart "</pphead><np>" ,,
    "</pphead><np>" -> 0 || "</np><pphead>" PPart _];

### Complex VCs stretched over two vpheads:
define adjustvc [
    "</vphead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] _ NPPhr VPheadPhr ,,
    "</vphead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr _ VPheadPhr ,,
    "<vphead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] "</vphead>" NPPhr _ $"<vphead>" "</vphead>" ,,
    "<vphead>" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr "</vphead>" _ $"<vphead>" "</vphead>"];

### VCs with two copula or copula and an adjective:
define SelectVCCopula [
    "<vc>" -> "<vccopula>" || _ [CopVerb / NPTags] $"<vc>" "</vc>"];

################# Removing Parsing Errors

### incomplete PPs, i.e. ppheads without a following NP
define errorpphead [
    "<pphead>" -> 0 || \["<pp>"] _ ,,
    "</pphead>" -> 0 || _ \["<np>"]];

### empty VPhead
define errorvphead ["<vp><vphead></vphead></vp>" -> 0];

D.6 Error Finder

######### Finding grammatical errors (Error marking)

###### NPs
# Define NP-errors
define npdeferror ["<np>" [NP - NPDefs] "</np>"];
define npnumerror ["<np>" [NP - NPNum] "</np>"];
define npgenerror ["<np>" [NP - NPGen] "</np>"];

# Mark NP-errors
define marknpdeferror [npdeferror -> "<Error definiteness>" ... "</Error>"];
define marknpnumerror [npnumerror -> "<Error number>" ... "</Error>"];
define marknpgenerror [npgenerror -> "<Error gender>" ... "</Error>"];

# Define NPPart-errors
define NPPartDefError ["<NPPart>" [NPPart - NPPartDefs] "</np>"];
define NPPartNumError ["<NPPart>" [NPPart - NPPartNum] "</np>"];
define NPPartGenError ["<NPPart>" [NPPart - NPPartGen] "</np>"];

# Mark NPPart-errors
define marknppartdeferror [NPPartDefError -> "<Error definiteness NPPart>" ... "</Error>"];
define marknppartnumerror [NPPartNumError -> "<Error number NPPart>" ... "</Error>"];
define marknppartgenerror [NPPartGenError -> "<Error gender NPPart>" ... "</Error>"];
###### VPs
# Define errors in VPs
define vpfiniteerror ["<vphead>" [VPhead - VPFinite] "</vphead>"];
define vpinferror    ["<vpheadinf>" [VPhead - VPInf] "</vphead>"];
define VCerror       ["<vc>" [VC - VCgram] "</vc>"];

# Mark VP-errors
define markfiniteerror [vpfiniteerror -> "<Error finite verb>" ... "</Error>"];
define markinferror    [vpinferror -> "<Error infinitive verb>" ... "</Error>"];
define markvcerror     [VCerror -> "<Error verb after Vaux>" ... "</Error>"];
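The error definitions in D.6 all follow the same subtraction pattern: a phrase is flagged when it belongs to the broad grammar minus the union of narrow, agreement-respecting variants (e.g. `[NP - NPNum]`). As a minimal Python sketch of that idea over a toy fragment (the patterns below are illustrative stand-ins of our own, not the actual FiniteCheck automata):

```python
import re

# Toy stand-ins for the grammars: the broad NP pattern accepts any
# Det+Noun tag sequence; the narrow patterns accept only the
# number-agreeing combinations (singular-singular, plural-plural).
BROAD_NP = re.compile(r"Det(Sg|Pl) N(Sg|Pl)")
NARROW_NUM = [re.compile(r"DetSg NSg"), re.compile(r"DetPl NPl")]

def np_number_error(phrase):
    """Flag a phrase accepted by the broad pattern but by no narrow one,
    i.e. the toy analogue of the language [NP - NPNum]."""
    broad = BROAD_NP.fullmatch(phrase) is not None
    narrow = any(p.fullmatch(phrase) for p in NARROW_NUM)
    return broad and not narrow

print(np_number_error("DetSg NPl"))  # True: number disagreement
print(np_number_error("DetPl NPl"))  # False: agreement holds
```

Because the narrow patterns only describe valid Swedish, no error structures need to be predicted: anything the broad grammar parses that no narrow variant accepts falls out as an error candidate.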