LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH

Gertjan van Noord

Deliverable 3-4: Report Annotation of Lassy Small

1 Background

Lassy Small is the part of the Lassy corpus in which the syntactic annotations have been manually verified. It contains one million words. The composition of the corpus is detailed in deliverable 1.1. The annotations include syntactic dependency annotations, as documented in deliverable 3.5 [5], and the annotation of the part-of-speech and lemma of each token, as documented in [3].

2 Annotation Procedures

Both the annotation guideline manuals and the various tools we used for annotation were initially developed in the STEVIN D-Coi project. The annotation of part-of-speech and lemma proceeded in the same way as in D-Coi: part-of-speech tags and lemmas were first assigned automatically by TadPole [2]. These automatically assigned annotations were then checked and corrected by students.

The syntactic annotation procedure works in a similar way. The Alpino parser [4] is used to assign initial dependency structures automatically. These automatically assigned annotations were then checked and corrected by students, using an adapted version of TrEd (http://ufal.mff.cuni.cz/~pajas/tred/). Unlike the part-of-speech and lemma annotations, most syntactic dependency annotations were checked by students a second time. The syntactic dependency annotations and the part-of-speech and lemma annotations were then integrated into a single XML representation.

For further quality improvement, a large number of heuristics have been defined and implemented to check the resulting annotations, both for illegal annotations and for unexpected annotations. In particular, heuristics which confront the syntactic analysis with the part-of-speech tag assignments were very effective. Based on the results of these heuristics, thousands of mistakes have been corrected. This latter step therefore appears to be very important, and for this reason we provide an overview of the most important of these heuristics in the next section.

3 Verification

As indicated above, all annotations were verified systematically using dtchecks, a script which applies a large set of rules that check for either illegal or unexpected annotations. The rules are defined as XPath queries, and any annotation which matches a query is returned. Each of these annotations then undergoes another round of manual inspection, either to correct the annotation or to confirm that the exceptional analysis is indeed intended. A short sketch of how such queries can be applied to the XML annotations is given below, followed by the most important rules.
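The following is a minimal sketch of this style of checking, not the actual dtchecks script. It assumes a Lassy/Alpino-style XML encoding in which every constituent is a node element carrying attributes such as rel, cat, lemma, postag and word; the miniature sentence, the attribute values and the exact formulation of the queries are illustrative only. A well-formed annotation, such as the toy fragment below, produces no hits.

    # A minimal sketch, not the actual dtchecks script: a few of the verification
    # rules listed below, written as XPath queries and applied to a toy annotation.
    # The XML fragment and the attribute values are illustrative only.
    from lxml import etree

    toy_annotation = """
    <alpino_ds>
      <node rel="--" cat="smain">
        <node rel="su" word="Lassy" lemma="Lassy" postag="N(eigen,ev,basis,zijd,stan)"/>
        <node rel="hd" word="bestaat" lemma="bestaan" postag="WW(pv,tgw,met-t)"/>
      </node>
    </alpino_ds>
    """

    # Each rule is an XPath query; any node it returns is flagged for manual
    # inspection, after which the annotation is corrected or confirmed.
    rules = {
        "node with more than one head daughter":
            '//node[count(node[@rel="hd"]) > 1]',
        "pobj1 daughter without an accompanying obj1 or vc daughter":
            '//node[node[@rel="pobj1"] and not(node[@rel="obj1"] or node[@rel="vc"])]',
        "node with a cat attribute but without daughters":
            '//node[@cat and not(node)]',
    }

    tree = etree.fromstring(toy_annotation)
    for description, query in rules.items():
        for hit in tree.xpath(query):
            print(description + ":", etree.tostring(hit).decode())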

The most important rules are the following:

- each node has at most a single head daughter.
- check for expected daughters for a given category. For instance, a noun phrase can have a determiner, but a prepositional phrase cannot. Table 1 lists which relations are expected for which category.
- check for expected categories for a given relation. For instance, direct objects are usually NPs. The expected categories for each relation name are given in Table 2.
- check for expected relations of sister nodes for a given relation. For instance, a node with relation whd usually has a body sister node. The expected sisters are listed in Table 3.
- check the co-indexing of nodes: for every node with an index, there should be another node with the same index. Also, a node with an index cannot dominate a node with the same index.
- check that the required attributes cat, lemma and postag are present in nodes of the relevant type.
- rhd- and whd-nodes usually are co-indexed.
- check that nodes of categories such as np, smain, ... have a head node.

  Category                      Expected relations
  du                            dp, sat, nucl, tag, dlink
  mwu                           mwp
  whq                           whd, body
  oti                           cmp, body, mod
  ti                            cmp, body
  pp                            hd, obj1, mod, vc, predc, hdf, pobj1, se
  ap                            hd, mod, vc, pc, obcomp, obj1, predm, predc, me, se, pobj1, obj2
  svan                          cmp, body, mod
  cp                            cmp, body, mod
  ahi                           cmp, body
  rel                           body, rhd
  whrel                         body, rhd
  whsub                         body, whd
  conj                          cnj, crd
  advp                          hd, mod, obcomp, me
  detp                          hd, obcomp, mod, me
  inf, sv1, smain, ssub, ppart

  Table 1: Relations which are expected to occur in nodes of the given category.

  Relation   Expected categories
  hdf        mwu
  hd         mwu
  cmp        mwu, conj
  su         np, conj, cp, ti, oti, whrel, whsub, svan, mwu
  obj2       np, pp, conj, mwu, whrel
  pc         pp, conj
  vc         cp, ti, ppart, inf, oti, conj, whsub, ahi, svan, smain
  svp        pp, mwu, ti, ahi, inf
  predc      np, ap, ppart, ppres, cp, pp, conj, mwu, whrel, oti
  predm      advp, np, ap, ppart, ppres, cp, pp, conj, mwu, whrel
  ld         pp, np, conj, mwu, ap, advp, whrel
  me         np, ap, conj, mwu
  obcomp     cp, oti, ssub, ti, conj
  rhd        np, pp, ap, conj, advp, mwu, ppart, ppres
  whd        np, pp, ap, conj, advp, cp, mwu, ppart, ppres
  mod        rel, cp, np, advp, ap, ppart, pp, mwu, conj, oti, du, smain, whrel, ti, sv1, ppres
  body       ssub, ti, sv1, inf, np, conj, pp, cp, mwu, du, ppart, smain, whrel, ap, ppres
  det        detp, mwu, np, conj, ap, pp
  app        mwu, np, conj, smain
  crd        mwu

  Table 2: Categories which are expected to occur in a node with the given relation.

  Relation   Expected sister relations
  tag        tag, nucl, sat, dlink
  rhd        body, rhd
  whd        body, whd
  dp         dp
  sat        sat, nucl, dlink, tag
  nucl       tag, sat, nucl, dlink
  body       body, cmp, rhd, whd, mod
  cmp        body, cmp, mod
  crd        cnj, crd
  cnj        cnj, crd
  mwp        mwp
  hdf        hdf, hd, obj1, mod, se

  Table 3: Relations of sister nodes which are expected to occur with a node with the given relation.

- if there is a sup-node, there should be a su-node.
- if there is a pobj1-node, there should be an obj1- or vc-node.
- if there is a sat- or tag-node, there should be a nucl-node.
- a relative should be a modifier.
- a comparative should be part of an obcomp-node.
- nodes cannot have a single daughter.
- head nodes cannot have daughters, except for multi-word units.
- nodes with a cat attribute should have daughters.
- words such as inclusief, uitgezonderd, voorbehouden, ... are adjectives, not prepositions.
- PPs should have a direct object.
- the category smain does not occur as a body.
- check that subjects of infinitives in auxiliary verb and control verb contexts are co-indexed with the subject of the finite verb.
- check that modifiers are attached to the main verb in auxiliary verb constructions.
- check that multi-word units are concatenative.
- nodes with cat=rel cannot modify verbal projections.
- in a passive construction with a subject, there is a co-indexed direct object.
- subjects without co-indexing do not occur in infinitival VPs.
- auxiliaries and copulas do not take a direct object.
- VP complements should be vc, not obj1.
- a PP cannot be an obj1, unless it is the complement of another preposition.
- check that the value of postag is legal.
- if there is a subject, there should be a finite verb.
- there cannot be a space in lemma, root and word values.
- a finite verb should head one of smain, ssub, sv1.
- the body of a cp should not be an infinitival VP.

- a node with category ppart should be headed by a participle.
- a node with category inf should be headed by an infinitive.
- a node with category oti should have a body with category inf.
- a node with category sv1, smain or ssub should be headed by a finite verb.
- the complementizer om heads a node with category oti.
- the complementizer te heads a node with category ti.
- the complementizers dat and of head a node with category cp.
- a noun phrase with determiner het, dit or dat should have a head which is not a zijd noun.
- a noun phrase with determiner de, die or deze should have a head which is not a singular onz noun.
- a noun phrase with determiner het, dit, dat or deze should not have a plural head.
- words that are cmp should have a complementizer part-of-speech.
- words that are rhd should have a relative pronoun part-of-speech.
- words that have a relative pronoun part-of-speech should not be determiners.
- whq nodes should contain a part-of-speech with a wh flag.
- personal pronouns should not be determiners.
- relatives should be rhd.
- determiners can be nouns only if they are genitive.
- a determiner should be used as a determiner.
- identical sentences should have identical annotations.

In addition, we have adapted the DECCA software to the Lassy Small treebank and corrected some further mistakes [1]. Finally, we have created various frequency lists (some of these are part of the Lassy Small release) which we have inspected for outliers: for instance, words that were annotated many times with a given part-of-speech and only a small number of times with another part-of-speech were checked manually.
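The outlier inspection of these frequency lists can be sketched as follows. This is an illustrative reconstruction rather than the script that was actually used: the function name pos_outliers is hypothetical, the extraction of (word, part-of-speech) pairs from the treebank is assumed to have been done already, and the thresholds are arbitrary example values.

    # An illustrative sketch, not the script actually used: report words that almost
    # always carry one part-of-speech tag but were annotated a handful of times with
    # another tag, so that those rare cases can be checked manually.
    from collections import Counter, defaultdict

    def pos_outliers(pairs, min_majority=50, max_minority=3):
        """pairs is an iterable of (word, postag) tuples extracted from the treebank;
        the threshold values are arbitrary examples."""
        counts = defaultdict(Counter)
        for word, postag in pairs:
            counts[word][postag] += 1
        suspicious = []
        for word, tags in counts.items():
            (top_tag, top_n), *rest = tags.most_common()
            if top_n < min_majority:
                continue
            for tag, n in rest:
                if n <= max_minority:
                    suspicious.append((word, top_tag, top_n, tag, n))
        return suspicious

Each reported word then points to a small number of occurrences whose part-of-speech annotation is re-checked by hand.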

4 Limitations

Given the large number of mistakes that were corrected after the initial manual corrections, it is clear that more mistakes are still present in the released version. In addition to obvious mistakes, there are also many cases in which more than a single analysis appears to be reasonable. In such cases we have aimed at a consistent approach in which the same analysis is chosen in similar cases, but we are aware that this objective has not been achieved in all cases. We now list a number of issues for which we are aware of inconsistencies.

The following issues are relevant for the assignment of part-of-speech labels:

- The distinction between the part-of-speech labels SPEC(deeleigen) and N(eigen,...) is unreliable.
- The distinction between the part-of-speech labels SPEC(symb) and TW(hoofd,vrij) is unreliable.
- The distinction between the part-of-speech labels SPEC(vreemd) and SPEC(eigen) is unreliable.
- For names N(eigen,...), it is often arbitrary which value is assigned for gender.
- The part-of-speech assignment of words that are part of a multi-word unit is unreliable.

The following issues are relevant for the syntactic dependency annotations:

- Verbal modifiers with relation mod are almost always attached to the main verb, even in cases where the modifier clearly modifies a higher verb (such as the modal kunnen).
- For verbal modifiers that receive the relation predm, the situation is reversed: predicate modifiers are almost always attached to the finite verb, even in cases where a lower attachment is more reasonable.
- For noun phrases such as een zak friet, een berg kleren, een verzameling boeken, it is often unclear to which word a following modifier should attach. A minimal pair is een berg kleren in de voorkamer versus een berg kleren met vieze vlekken. In most cases, however, the correct attachment is unclear or irrelevant (een berg kleren van mijn dochter).
- Lists. In some cases, lists are represented as a conjunction without a coordinator. In other cases, a du category is assigned in which each element receives the dp relation.
- Incomplete coordinations. In coordinations in which one conjunct contains material that appears to be elided in the other conjunct, this material is sometimes (but not always consistently) present in the analysis of both conjuncts by means of co-indexing.

- Plaatsonderwerp er. In the Dutch linguistic tradition, some usages of the word er are called plaatsonderwerp ('place subject'). In contrast with that tradition, and for practical reasons, these cases are all assigned a modifier relation.

5 Tools

The tools that were used during the annotation process are all available free of charge.

The TadPole part-of-speech tagger and lemmatizer was developed at Tilburg University and is available free of charge under the GNU General Public License (http://ilk.uvt.nl/tadpole/).

The Alpino parser, as well as the various tools to browse, search and check dependency structures encoded in XML, was developed at the University of Groningen. Alpino is available free of charge under the GNU Lesser General Public License.

For editing the syntactic annotations we used TrEd, which was developed at Charles University in Prague. TrEd is free software distributed under the GNU General Public License. In order to use TrEd with Lassy files, a number of files are needed which define the interface and various settings specific to Lassy; these files are part of the Alpino distribution.

References

[1] Jasper Hoenderken. Inconsistencies in dependency treebanks. Master's thesis, University of Groningen, 2009.

[2] Antal van den Bosch, Bertjan Busser, Sander Canisius, and Walter Daelemans. An efficient memory-based morphosyntactic tagger and parser for Dutch. In Peter Dirix, Ineke Schuurman, Vincent Vandeghinste, and Frank van Eynde, editors, Computational Linguistics in the Netherlands 2006: Selected Papers from the Seventeenth CLIN Meeting, LOT Occasional Series, pages 99-114. LOT Netherlands Graduate School of Linguistics, 2007.

[3] Frank van Eynde. Part of speech tagging en lemmatisering van het D-COI corpus, 2005.

[4] Gertjan van Noord. At Last Parsing Is Now Operational. In TALN 2006 Verbum Ex Machina, Actes de la 13e Conférence sur le Traitement Automatique des Langues Naturelles, pages 20-42, Leuven, 2006.

[5] Gertjan van Noord, Ineke Schuurman, and Gosse Bouma. Lassy syntactische annotatie, revision 19053. http://www.let.rug.nl/vannoord/lassy/sa-man_lassy.pdf, 2010.