Topological Field Chunking in German



Similar documents
The finite verb and the clause: IP

Machine Learning for natural language processing

German Language Resource Packet

Corpuslinguistics Annis 2 -Corpus query tool searching in Falko

Search Engines Chapter 2 Architecture Felix Naumann

Exemplar for Internal Achievement Standard. German Level 1

Satzstellung. Satzstellung Theorie. learning target. rules

Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov [Folie 1]

German Language Support Package

German Language Support Package

FOR TEACHERS ONLY The University of the State of New York

GCE EXAMINERS' REPORTS

GCE German (2660) Candidate Exemplar Work: Additional Candidate Exemplar Work Unit 1 (Autumn 2008)

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Exemplar for Internal Assessment Resource German Level 1. Resource title: Planning a School Exchange

Elena Chiocchetti & Natascia Ralli (EURAC) Tanja Wissik & Vesna Lušicky (University of Vienna)

LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH

LEJ Langenscheidt Berlin München Wien Zürich New York

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models

bound Pronouns

AP WORLD LANGUAGE AND CULTURE EXAMS 2012 SCORING GUIDELINES

All the English here is taken from students work, both written and spoken. Can you spot the errors and correct them?

Processing Dialogue-Based Data in the UIMA Framework. Milan Gnjatović, Manuela Kunze, Dietmar Rösner University of Magdeburg

WhyshouldevenSMBs havea lookon ITIL andit Service Management and

LEHMAN BROTHERS SECURITIES N.V. LEHMAN BROTHERS (LUXEMBOURG) EQUITY FINANCE S.A.

Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1

HIERARCHICAL HYBRID TRANSLATION BETWEEN ENGLISH AND GERMAN

Reference Determination for Demonstrative Pronouns

A Joint Sequence Translation Model with Integrated Reordering

Chinese Open Relation Extraction for Knowledge Acquisition

0525 GERMAN (FOREIGN LANGUAGE)

IAC-BOX Network Integration. IAC-BOX Network Integration IACBOX.COM. Version English

How to make Ontologies self-building from Wiki-Texts

Phrase-Based MT. Machine Translation Lecture 7. Instructor: Chris Callison-Burch TAs: Mitchell Stern, Justin Chiu. Website: mt-class.

Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov [Folie 1]

Modalverben Theorie. learning target. rules. Aim of this section is to learn how to use modal verbs.

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University

A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow

Student Booklet. Name.. Form..

Introduction. BM1 Advanced Natural Language Processing. Alexander Koller. 17 October 2014

Vergleich der Versionen von Kapitel 1 des EU-GMP-Leitfaden (Oktober 2012) 01 July November Januar 2013 Kommentar Maas & Peither

GM0101S HEADSTART STUDENT GUIDE PREPARED IY THE DEFENSE LANGUAGE INSTITUTE FOREIGN LANGUAGE CENTER

Annotation Guidelines for Dutch-English Word Alignment

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Shallow Parsing with Apache UIMA

Chapter 6. Decoding. Statistical Machine Translation

31 Case Studies: Java Natural Language Tools Available on the Web

Statistical Machine Translation


Version 1.0: General Certificate of Secondary Education June German. (Specification 4665) Unit 4: Writing. Mark Scheme

A chart generator for the Dutch Alpino grammar

Text Generation for Abstractive Summarization

Example-Based Treebank Querying. Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

Lernsituation 9. Giving information on the phone. 62 Lernsituation 9 Giving information on the phone

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Successful Collaboration in Agile Software Development Teams

GCE EXAMINERS' REPORTS. GERMAN AS/Advanced

Coffee Break German Lesson 06

Negation and (finite) verb placement in a Polish-German bilingual child. Christine Dimroth, MPI, Nijmegen

Übungen zur Vorlesung Einführung in die Volkswirtschaftslehre VWL 1

Outline of today s lecture

AP GERMAN LANGUAGE AND CULTURE EXAM 2015 SCORING GUIDELINES

Why language is hard. And what Linguistics has to say about it. Natalia Silveira Participation code: eagles

AS Music Bach Chorale Cadence exercises

PoS-tagging Italian texts with CORISTagger

Special Topics in Computer Science

Customizing an English-Korean Machine Translation System for Patent Translation *

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR

SYNTAX AND SEMANTICS OF CAUSAL DENN IN GERMAN TATJANA SCHEFFLER

O CONNOR ebook How do I say that in English?

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

The ALeSKo learner corpus: Design annotation quantitative. [short title for running heads: The ALeSKo learner corpus]

Comprendium Translator System Overview

Complex Predications in Argument Structure Alternations

German Beginners. Stage 6 Syllabus. Preliminary and HSC Courses

(Incorporated as a stock corporation in the Republic of Austria under registered number FN m)

Varieties of specification and underspecification: A view from semantics

Hybrid Strategies. for better products and shorter time-to-market

Wie ist das Wetter? (What s the weather like?)

I Textarbeit. Text 1. I never leave my horse

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE

Transcription:

Topological Field Chunking in German Jorn Veenstra, Frank H. Müller, Tylman Ule [veenstra,fhm,ule]@sfs.uni-tuebingen.de ESSLLI Summerschool Workshop on Machine Learning Approaches in Computational Linguistics August 7, 2002 Topological Field Chunking in German p.1/35

German English Sentence structure in English lends itself to chunk analysis, followed by grammatical role assignment. In other Germanic languages the sentence structure can be different. NPs structure is different, VPs are less clustered. The strategy proposed the USA was accepted the NATO yesterday. by by Die von den USA vorgeschlagene Strategie wurde gestern von der NATO übernommen. Topological Field Chunking in German p.2/35

Topological Field Theory (TopF) Topological Field Theory gives a handle to divide the sentence into labeled verbal and non-verbal parts: Non-Verbal parts: VF: Vorfeld, MF: Mittelfeld, NF: Nachfeld Verbal parts: CF: Complementizer, LK: Linke Klammer, RK: Rechte Klammer Die von den USA vorgeschlagene Strategie wurde gestern von der NATO übernommen. Topological Field Chunking in German p.3/35

Overview Sentence Structure in German Topological Field Theory The Data: TüBa-D/Z Three TopFChunkers: FSA, PCFG, MBL Results Future Research Topological Field Chunking in German p.4/35

Sentence Structure in German I Verbs can be far apart in German sentences In so-called V2 sentences the finite verb is always in second position. The other verbal elements is further up in the sentence. Ich habe Spaghetti gestern mit meinen Freunden gegessen. I have yesterday with my friends spaghetti eaten. Topological Field Chunking in German p.5/35

Sentence Structure in German II The finite verb will stay in second position: Mit meinen Freunden Spaghetti gegessen habe. ich gestern With my friends have I yesterday spaghetti eaten. Mit meinen Freunden is in first position now, forcing the subject Ich after the verb. Topological Field Chunking in German p.6/35

Sentence Structure in German III In the Verb-Last constructions, verbs are clustered at the end of the sentence. This construction occurs e.g. in relative clauses: Er glaubt, dass ich gestern mit meinen Freunden Spaghetti gegessen habe He thinks that I yesterday with my friends spaghetti eaten have.. Topological Field Chunking in German p.7/35

Sentence Structure in German IV In yes/no-questions the finite verb is in first position, the so-called V1 sentences: Hast Spaghetti du gestern mit deinen Freunden gegessen? Have you yesterday with your friends spaghetti eaten? Topological Field Chunking in German p.8/35

Sentence Structure in German V There can be more after the last verbal chunk: Ich habe gestern Spaghetti in einer Pizzeria. gegessen I have yesterday spaghetti eaten in a pizzeria. Topological Field Chunking in German p.9/35

Sentence Structure in German VI The finite verb forms its own constituent. However, the non-finite verbal constituent can contain many verbs, the so-called verb cluster. Das sind allerdings Gelder, die vom Schatzmeister hintergezogen worden sein sollen. These are, however, funds which by the treasurer defrauded been have should. Topological Field Chunking in German p.10/35

Topological Fields Descriptive model of the constituent order in German(ic), the TopF structure gives the skeleton of the sentence. Topological fields describe sections with respect to the distributional properties of the verbs in German sentences. CF: complementizer, LK: left bracket, RK: right bracket, VF: initial field, MF: middle field, NF: final field The TopFs CF, LK and RK form the verbal frame relative to which the other fields are defined. The VF, MF and NF are defined relative to the Topological Field Chunking in German p.11/35 verbal fields.

Sentence Structure in German type compl. field topological fields verb complex (VL) (CF) (MF) (RK) (NF) finite verb verb complex (V1) (LK) (MF) (RK) (NF) finite verb verb complex (V2) (VF) (LK) (MF) (RK) (NF) left part right part Figure 1: VL=Verb-Last; V1=Verb-First; V2= Verb-Second; VF=Initial Field; MF=Middle Field; NF=Final Field Topological Field Chunking in German p.12/35

An example of an analysed sentence SIMPX 516 VF ON 515 SIMPX MF 513 514 OA PRED C MF VC LK ADJX 508 509 510 511 512 ON OD OA HD HD NX NX NX VXFIN VXFIN NX ADJX ADJX 500 501 502 503 504 505 506 507 HD HD HD HD HD HD HD HD Wer seinem Publikum eine solche Reaktion prophezeit 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 PWS PPOSAT NN ART PIDAT NN VVFIN $, VVFIN PRF KON ADJD KON ADJD $., macht sich entweder lächerlich Who his audience a such reaction predicts, makes himself or ridiculous or legendary. oder legendär. Topological Field Chunking in German p.13/35

Other Languages, e.g. Dutch Ik had willen eten gisteren met mijn vrienden pasta bij de pizzeria. I had yesterday with my friends pasta want eat at the pizzeria. Ik pasta had willen eten gisteren met mijn vrienden bij de pizzeria. Topological Field Chunking in German p.14/35

The Corpus: TüBa-D/Z German newspaper text from the Tageszeitung (taz). Annotation style comparable to VERBMOBIL about 7300 sentences annotated and POS-tagged with Topological Field structure Topological Field Chunking in German p.15/35

An example of an annotated sentence SIMPX 516 NF MOD 515 VF ON NX Katastrophenstimmung NN cngs:nsfw LK 508 509 510 511 512 HD VXFIN 500 501 502 503 504 505 506 507 herrscht 0 1 2 3 4 5 6 7 8 9 10 VVFIN pnmt:3sis MF MOD ADVX HD HD HD HD HD HD HD erst ADV, $, wenn KOUS HD NX nichts PIS cng:*** 513 ADVX mehr ADV 514 zu PTKZU OV VXINF verheimlichen Catastrophe-mood prevails only, when nothing anymore to C hide is MF ON NX SIMPX VVINF VC HD VXFIN ist VAFIN pnmt:3sis. $. Topological Field Chunking in German p.16/35

The Task: TopF chunking Find a task with high accuracy to find the basic structure of German sentences. TopFChunking finds the most common verbal parts of the Topological Field structure: CF (compl), LK (left) and RK (right). This gives us a handle to find the basic structure of the sentence. TopFChunking is clearly defined, and can be expected to be performed with a high accuracy. Topological Field Chunking in German p.17/35

TopF chunking: some examples Die von den USA vorgeschlagene Strategie wurde gestern von der NATO übernommen. Ich habe gestern mit meinen Freunden Spaghetti gegessen. Das sind allerdings Gelder, die vom Schatzmeister hintergezogen worden sein sollen. Topological Field Chunking in German p.18/35

Three Chunkers Rule-based hand-crafted Finite State Automata (FSA), by Frank H. Müller. Rule-Based corpus trained PCFG: Probabilistic Context Free Grammars, by Tylman Ule Machine Learning Approach: Memory-Based Natural Language Processing, by Jorn Veenstra Topological Field Chunking in German p.19/35

Finite-State Automata I topological fields structure can be captured by regular expression grammar, thus, it can be translated into finite state automaton (FSA) FSA = automaton which accepts input symbols according to transition function FSA can act as a transducer POS tags are input of transducer and topological fields and POS tags are output transition function is described by hand-written rules Topological Field Chunking in German p.20/35

FSA II, a simplified example simple examples: eine Tatsache, [mit der] man leben kann ein Mensch, [dem] man vertrauen kann eine Frage, die [aufgeworfen werden könnte] recursion can be covered by iteration of FSAs (cascade) Topological Field Chunking in German p.21/35

FSA III, some POS tags APPR=preposition, PRELS=relative pronoun, APPO=postposition, VVPP=past participle, VAINF=auxiliary verb infinitive, VAFIN=auxiliary verb finite, VMFIN=modal verb finite Topological Field Chunking in German p.22/35

FSA IV, cannot be too simple The seemingly simple and effective FSA: is not correct. e.g. Ich [bin] [gegangen] I have gone or: Mit meinem Bruder [gespielt] [hab] ich nie. with my brother played have I never Topological Field Chunking in German p.23/35

Probabilistic Context Free Grammars The FSA approach uses hand-crafted non-probabilistic rules to recognise the TopFChunks. The PCFG approach extracts context-free, probabilistic rules from the corpus. The PCFG chunker uses POS information only. Used the LOPAR (Schmidt, 2000). Topological Field Chunking in German p.24/35

PCFG II Design rules over POS tags to predict TopF chunks. Extract CF rules from the corpus and count their frequencies. Assign probabilities to rules according to these frequencies. For parsing new sentences parse bottom up, prune unprobable parses and work out the most probable routes. Topological Field Chunking in German p.25/35

PCFG III: an example SIMPX 516 NF 515 MOD VF ON NX Katastrophenstimmung NN cngs:nsfw LK 508 509 510 511 512 HD VXFIN 500 501 502 503 504 505 506 507 herrscht 0 1 2 3 4 5 6 7 8 9 10 VVFIN pnmt:3sis MF MOD ADVX HD HD HD HD HD HD HD erst ADV, $, wenn KOUS HD NX nichts PIS cng:*** 513 ADVX mehr ADV 514 zu PTKZU OV VXINF verheimlichen Catastrophe-mood prevails only, when nothing anymore to C hide is MF ON NX SIMPX VVINF VC HD VXFIN ist VAFIN pnmt:3sis. $. Topological Field Chunking in German p.26/35

Topological Field Chunking in German p.27/35 C SfS L PCFG IV: an example

Memory-Based Learning MBL is a machine learning approach to NLP Learning Stage is Memory-Based Classification stage is similarity-based We have used TiMBL, an MBL software tool. Approach TopFChunking as a word-by-word classification task, comparable to POS tagging and NP chunking. Topological Field Chunking in German p.28/35

Memory-Based Learning Assign to each word a tag: inside or outside one of the TopF chunks, or between two chunks. This gives us three tags (I, O & B) per TopF type (CF, LK & RK). Each word in the corpus data is annotated with one of these nine IOB tags. The MBL chunker is first trained on the corpus annotated with these IOB tags. We have used a context window of two words and POS tags to the left and to the right, and information gain to weight the feature relevance. Topological Field Chunking in German p.29/35

Memory-Based Learning C SfS L gelacht gestern hat Der Mann. gestern hat. Der Mann gelacht der DET Mann NN hat V gestern TEMP gelacht V --- I LK Topological Field Chunking in German p.30/35

Baseline As a baseline we have computer the most probable TopF IOB tag for each POS tag in the train set. We have used these TopT tags to chunk the test set. Topological Field Chunking in German p.31/35

Experimental setup first 5000 sentences as training data last 2300 sentences as test data FSA developer was not allowed to see the test set. Since the FSA and PCFG chunkers predicted too much, their output was converted to the TopF chunk format As the TüBa corpus is still under development, some changes were made during the development of the chunkers, this might have harmed the performance of the chunkers. Topological Field Chunking in German p.32/35

Results ALL LK VC C FSA TnT 94.1 96.2 92.0 93.8 FSA gold 98.4 98.8 98.3 97.5 PCFG TnT 94.4 97.0 92.2 92.3 PCFG gold 98.1 98.9 98.1 96.1 MBL TnT 93.3 96.0 90.0 91.6 MBL gold 97.2 98.0 96.7 96.6 baseline TnT 75.5 75.2 72.3 83.2 baseline gold 77.9 75.0 79.1 86.2 Topological Field Chunking in German p.33/35

Conclusions scores are high for all three approaches Many errors are caused by POS tag errors Since TopFChunking is a well defined task, rule-based and hand-crafted approaches work relatively well. Memory-Based Approaches seem to suffer more from sparse data than FSA and PCFG. Such a high performance gives a solid basis for parsing and other text analysis tasks, such as information extraction and grammatical function assignment. Topological Field Chunking in German p.34/35

Future Work Combining chunkers to optimise performance Annotating the full TopF structure of a sentence Sentence type classifier NP chunking with the German idiosyncrasies Grammatical function assignment Topological Field Chunking in German p.35/35