Comma checking in Danish
Daniel Hardt
Copenhagen Business School & Villanova University

1. Introduction

This paper describes research on using the Brill tagger (Brill 1994, 1995) to learn to identify incorrect commas in Danish. Trained on a part-of-speech tagged corpus of 600,000 words, the system identifies incorrect commas with a precision of 91% and a recall of 77%. The system was developed by randomly inserting commas into a text; the inserted commas were tagged as incorrect, while the original commas were tagged as correct. The tagger was then trained to recognize the contexts in which incorrect commas occur.

In what follows, we first describe the corpora and tag sets used in this research and give background on the Brill tagger. We then describe the methodology for learning to identify comma errors and examine some of the principles the system learned. Finally, test results are presented, and we discuss plans for future research. The method used here is quite general and could be applied fairly directly to a wide range of grammar checking problems, in Danish or other languages.

2. Background

Corpora and tag sets

This research uses two Danish corpora: the Parole corpus (Parole 1998) and the Bergenholtz corpus (Bergenholtz 1988). The Brill tagger was trained on the manually tagged Parole corpus to recognize Danish part-of-speech tags. The Danish Parole tag set consists of 151 distinct tags, encoding information such as syntactic category, number, gender, case, tense and so on. As described below, we have used a reduced version of the Danish Parole tag set for the current project.

Brill tagger

The Brill tagger learns by first tagging raw text with an Initial State Tagger, which tags each word with its most frequent tag. The resulting file is termed Dummy and is compared to a file called Truth, which has been manually tagged and is thus assumed to be completely correct.[1] The system Contextual-Rule-Learn searches for transformations that can be used to make Dummy more closely resemble Truth. The system searches among transformations that instantiate the following templates:

Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.

Learning proceeds iteratively as follows: Contextual-Rule-Learn tries every instantiation of the transformation templates and finds the transformation that results in the greatest error reduction (see Fig. 1). This transformation is output to the Context Rules list, and the transformation is applied to Dummy. The process continues until no transformation results in an improvement above a preset threshold.

[1] We ignore the lexical rules, which are learned in a separate phase; these are not relevant to the present study. See Brill (1994, 1995) for details.
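To make the Contextual-Rule-Learn search concrete, the following is a minimal illustrative sketch of such a greedy transformation-based learning loop, in Python. It is a simplified stand-in rather than the Brill tagger's actual implementation: only the first context template (the preceding or following tag) is instantiated, the corpora are represented as lists of (word, tag) pairs, and names such as contextual_rule_learn and net_gain are invented for the sketch.

# A simplified stand-in for Contextual-Rule-Learn: dummy and truth are parallel
# lists of (word, tag) pairs; only the "preceding/following tag" template is used.

def candidate_rules(dummy, truth):
    """Propose (from_tag, to_tag, context) rules from the current tagging errors."""
    rules = set()
    for i, ((_, a), (_, b)) in enumerate(zip(dummy, truth)):
        if a == b:
            continue
        if i > 0:                               # template: the preceding word is tagged z
            rules.add((a, b, ('prev', dummy[i - 1][1])))
        if i + 1 < len(dummy):                  # template: the following word is tagged z
            rules.add((a, b, ('next', dummy[i + 1][1])))
    return rules

def matches(rule, dummy, i):
    """Does the rule's from_tag and context hold at position i of dummy?"""
    from_tag, _, (side, z) = rule
    if dummy[i][1] != from_tag:
        return False
    j = i - 1 if side == 'prev' else i + 1
    return 0 <= j < len(dummy) and dummy[j][1] == z

def net_gain(rule, dummy, truth):
    """Errors fixed minus errors introduced if the rule were applied everywhere."""
    return sum((rule[1] == gold) - (dummy[i][1] == gold)
               for i, (_, gold) in enumerate(truth) if matches(rule, dummy, i))

def contextual_rule_learn(dummy, truth, threshold=2):
    """Greedy loop: repeatedly pick and apply the rule with the greatest error reduction."""
    learned = []
    while True:
        candidates = candidate_rules(dummy, truth)
        if not candidates:
            break
        best = max(candidates, key=lambda r: net_gain(r, dummy, truth))
        if net_gain(best, dummy, truth) < threshold:
            break
        learned.append(best)
        dummy = [(w, best[1]) if matches(best, dummy, i) else (w, t)
                 for i, (w, t) in enumerate(dummy)]
    return learned

On each iteration, the single rule with the largest net error reduction is selected, recorded, and applied to Dummy, mirroring the procedure described above; learning stops when no candidate clears the threshold.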

The tagger can then be run with rules that determine part-of-speech tagging for Danish, based on the Danish Parole Corpus. We term this the Base Tagger.

Fig. 1. Training the Base Tagger (Raw Text: PAROLE Corpus -> Initial State Tagger -> Dummy; Dummy and Truth: Tagged PAROLE Corpus -> Contextual-Rule-Learn -> POS Context Rules).

3. Training the comma checker

We produced the training file by tagging 600,000 words of text from the Bergenholtz corpus with the Base Tagger. We then converted the tags to the Reduced Parole Tag Set. This was done to facilitate the learning of generalizations such as "no comma between a preposition and a noun". In the original tag set there are 23 tags for common nouns, because of differences in number, gender, etc.; in the reduced tag set there are just two: N (common noun) and N_GEN (genitive noun). Other categories have similarly reduced numbers of tags.

To use this tagged file as the training corpus for developing a comma checking system, we make the simplifying assumption that all existing commas are correct, and that no additional commas would be correct. Thus all existing commas in the training corpus are given a new tag, GC (good comma). Next, two copies of the training corpus are created, Truth and Dummy. In each of these, commas are inserted at random positions (the same positions in each file). The inserted commas are labeled BC in Truth and GC in Dummy. Thus the only difference between the two files is that the randomly inserted commas are tagged BC in Truth and GC in Dummy. Then Contextual-Rule-Learn is run on these two files. The result is an ordered list of Error Context Rules for commas (see Fig. 2).

Fig. 2. Learning Comma Context Rules (Tagged Text: Bergenholtz Corpus -> Generate Errors: Insert Random Commas -> Dummy: Errors Not Marked, and Truth: Errors Marked -> Contextual-Rule-Learn -> Error Context Rules for commas).
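As an illustration of the error-generation step just described, here is a minimal sketch, assuming the tagged corpus is available as a list of (word, tag) pairs; the function name make_truth_and_dummy and the insertion rate are choices made for the sketch, not values from the paper.

import random

GC, BC = 'GC', 'BC'   # good-comma / bad-comma tags introduced in this section

def make_truth_and_dummy(tagged_corpus, insertion_rate=0.05, seed=0):
    """Create the Truth and Dummy training files.

    Existing commas are assumed correct and retagged GC in both files;
    commas inserted at random positions are tagged BC in Truth but GC in
    Dummy, so the two files differ only in the tags of the inserted commas."""
    rng = random.Random(seed)
    truth, dummy = [], []
    for word, tag in tagged_corpus:
        if word == ',':
            tag = GC                      # simplifying assumption: original commas are correct
        truth.append((word, tag))
        dummy.append((word, tag))
        if word != ',' and rng.random() < insertion_rate:
            truth.append((',', BC))       # inserted comma, marked as an error in Truth
            dummy.append((',', GC))       # the same comma, left unmarked in Dummy
    return truth, dummy

Running Contextual-Rule-Learn (as sketched in Section 2) on this Dummy/Truth pair then yields the ordered list of Error Context Rules.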

Thus what the system learns is contexts in which a comma's tag should be changed from GC to BC, and in this way marked as an error. The list of such contexts is produced by the learner as an ordered list of rules specifying when the comma tag should be changed. It is important to note that these rules are ordered, so that a decision specified by a rule early in the list will sometimes be reversed by a rule later in the list. In all, 166 Error Context Rules for commas were produced. The first 12 rules are shown below:

1. GC -> BC if one of the three following tags is End-of-sentence
2. GC -> BC if one of the two previous tags is Beginning-of-sentence
3. GC -> BC if the next tag is Preposition
4. GC -> BC if one of the two following tags is Verb(Infinitive)
5. GC -> BC if the previous tag is Conjunction
6. BC -> GC if the previous tag is Interjection
7. GC -> BC if one of the two previous tags is Subordinating Conjunction
8. GC -> BC if the previous tag is Preposition and the following tag is N
9. GC -> BC if the previous tag is Pronoun and the following tag is N
10. GC -> BC if the previous tag is Verb(past) and the following tag is Pronoun(personal)
11. BC -> GC if one of the next two tags is Subordinating Conjunction
12. GC -> BC if the previous word is "er" ("is")

The first two rules state that a comma is marked bad (BC) if it is within three words of the end of a sentence, or within two words of the beginning of a sentence. These rules were learned because there were comparatively few correct commas in these environments in the Truth file, and a large number of incorrect commas in these environments.

However, the system soon learns that these rules are overly general. For example, the sixth rule states that a comma is correct if preceded by an interjection. Interjections typically occur near the beginning or end of a sentence, as in the following example from the training corpus:

Naa/INTERJ,/GC I/PRON_PERS sidder/V_PRES stadig/RGU og/CC hygger/V_PRES jer/PRON_PERS ./XP
("Well, you are still sitting there enjoying yourselves.")

Rule 8 doesn't permit commas between prepositions and nouns, and Rule 7 doesn't permit commas near the beginning of a subordinate clause. The latter is related to the fact that a comma typically introduces a subordinate clause in Danish, a fact partially captured in Rule 11, which permits commas just before subordinating conjunctions. Rule 9 disallows commas between a Pronoun and a Noun; in the Parole corpus there is no category for Determiner, and words like "the" and "a" are tagged as pronouns.

4. The resulting system

We built a system that corrects commas in raw text, based on the rules learned above. Text is first tagged by the Base Tagger, and all commas are then tagged GC. Next, the Comma Corrector is executed: this is the tagger running with the Comma Error Rules. In the output, any incorrect commas are tagged BC (see Fig. 3). Here is a sample run of the system, with different comma positions in the (constructed) sentence "Det er godt, at du kom" ("It is good, that you came"):

Input                        Output
Det er godt, at du kom.      Det er godt, at du kom.
Det er godt at, du kom.      Det er godt at,/BC du kom.
Det er godt at du, kom.      Det er godt at du,/BC kom.
Det, er godt at du kom.      Det,/BC er godt at du kom.
Det er, godt at du kom.      Det er,/BC godt at du kom.

Of the five different comma positions, only the first is correct in Danish.[2] The system correctly labels all the other alternatives as incorrect (BC).

[2] This is in fact not entirely clear, since there are at least two distinct systems for placing commas in Danish, and this position may not be considered correct in one of the two systems. While it is difficult to get confident judgments, all my Danish informants agree that this example has only one possible comma position, which is the one accepted by the comma correction system.
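To illustrate how the learned rules are applied at correction time, here is a minimal sketch in the same spirit: all commas in freshly tagged text start out tagged GC, the ordered rules are applied one after another (so a later rule such as rule 6 or 11 can reverse an earlier decision), and whatever remains tagged BC is reported as an error. The rule encoding, the helper check_commas, and the particular tag names (XP, SP, CC, CS, INTERJ) are assumptions made for this sketch, not the system's actual formats.

GC, BC = 'GC', 'BC'

# A few of the rules from Section 3, re-encoded by hand for illustration.
# Each rule: (from_tag, to_tag, test on the tag sequence and comma position).
RULES = [
    (GC, BC, lambda tags, i: 'XP' in tags[i + 1:i + 4]),                    # 1: end of sentence in next three
    (GC, BC, lambda tags, i: i + 1 < len(tags) and tags[i + 1] == 'SP'),    # 3: next tag is a preposition
    (GC, BC, lambda tags, i: i > 0 and tags[i - 1] == 'CC'),                # 5: previous tag is a conjunction
    (BC, GC, lambda tags, i: i > 0 and tags[i - 1] == 'INTERJ'),            # 6: previous tag is an interjection
    (BC, GC, lambda tags, i: 'CS' in tags[i + 1:i + 3]),                    # 11: subordinating conj. in next two
]

def check_commas(tagged_sentence):
    """tagged_sentence: list of (word, POS tag) pairs from the Base Tagger.
    Returns the same words with every comma tagged GC or BC."""
    words = [w for w, _ in tagged_sentence]
    tags = [GC if w == ',' else t for w, t in tagged_sentence]   # label all commas GC first
    for from_tag, to_tag, applies in RULES:                      # rules are applied in order
        for i, w in enumerate(words):
            if w == ',' and tags[i] == from_tag and applies(tags, i):
                tags[i] = to_tag
    return list(zip(words, tags))

# "Det er godt at, du kom." -- the misplaced comma should come out as BC.
sentence = [('Det', 'PRON'), ('er', 'V_PRES'), ('godt', 'ADJ'), ('at', 'CS'),
            (',', ','), ('du', 'PRON_PERS'), ('kom', 'V_PAST'), ('.', 'XP')]
print(' '.join(w + '/BC' if t == BC else w for w, t in check_commas(sentence)))

Run on this tagged version of "Det er godt at, du kom.", the sketch flags the misplaced comma as BC, in line with the sample run above, while leaving the comma in the correct position untouched.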

Fig. 3. Comma Correction System (Raw Text -> Base Tagger: tagger with POS Context Rules -> Label Commas -> Comma Corrector: tagger with Comma Context Rules -> Output).

5. Empirical results

The system was tested on a separate file of text from the Bergenholtz corpus, containing 14,044 words. The file contains 869 commas; 389 additional commas were introduced at random positions, as errors. The system marked 327 commas as errors, of which 299 actually were errors. This gives a precision of 91.4% and a recall of 76.9%. Here is a list of the first 10 examples where the system incorrectly marked a comma as an error:

1. Hulgaard/EGEN,/BC Århus/EGEN
2. mener/V_PRES,/BC vi/PRON_PERS
3. mener/V_PRES,/BC han/PRON_PERS
4. menneskemassen/V_PRES,/BC der/UNIK
5. 17-13/NUM,/BC Norris-Paulsen/N
6. morderiske/V_PRES,/BC psykopatiske/V_INF
7. Sørensen/EGEN,/BC Århus/EGEN
8. nabokommunen/N,/BC på/SP
9. systemet/N,/BC kan/V_PRES
10. de/PRON_DEMO aktive/ADJ,/BC servicefunktionerne/N

In items 1 and 7, a line break was incorrectly placed immediately before the text in question. Items 4 and 6 involve mis-tagging: menneskemassen ("mass of people", a noun) and morderiske ("murderous", an adjective) are both mis-tagged as verbs.

Item 10 is an interesting case: De aktive, servicefunktionerne ("the active, service workers"). The comma is marked as incorrect because of the following rule:

GC -> BC if the previous tag is ADJ and the next tag is N

This is normally correct; commas don't tend to appear between an ADJ and an N. Here, however, "the active" is a complete NP, on a par with, e.g., "the rich", and "service workers" is a separate NP.

Table 1. Results
Total number of commas: 1,258
Incorrect commas: 389
Total system corrections: 327
Valid system corrections: 299
Precision: 91.4% (299/327)
Recall: 76.9% (299/389)

6. Discussion and further work

The system was developed using the transformation-based learning system of the Brill tagger. This learning system is limited in various ways: for example, only three words or tags before or after a position are examined. It is likely that certain patterns involving commas could be learned if that locality restriction were loosened. Furthermore, the learning system of the Brill tagger maximizes the overall success rate, using a greedy strategy. We believe precision is a more relevant measure in grammar checking problems. It would therefore be interesting to modify the learner so that it optimizes precision or some related measure, and we suspect that greedy learning may be problematic in this case. We are contemplating various experiments related to these issues. It is also possible that the precision and recall of the system would be substantially increased with a larger training corpus; work is proceeding on this. Finally, we plan to apply similar techniques to a wide variety of grammar problems, both in Danish and in other languages.

References

Bergenholtz, H. 1988. Et korpus med dansk almensprog. Hermes.
Brill, E. 1994. Some advances in transformation-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA.
Brill, E. 1995. Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics 21(4).
Golding, Andrew R. and Schabes, Yves. 1996. Combining trigram-based and feature-based methods for context-sensitive spelling correction. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
Jacobsen, Henrik Galberg and Jørgensen, Peter Stray. 1991. Politikens Håndbog i Nudansk. Politikens Forlag.
Parole. 1998. http://coco.ihu.ku.dk/~parole/par_eng.htm