Building and exploiting a dependency treebank for French radio broadcasts
Christophe Cerisara, Claire Gardent and Corinna Anderson
LORIA, Nancy
2011-11-15
Goals
Corpus
Annotation Tools and Methodology
Annotation schema
The Impact of Speech Constructs on Parsing
Conclusions
Goals
Long term: use syntax to improve speech recognition (INRIA Collaborative Action Rapsodis, 2009-2010)
Medium term: build a treebank of spoken data (transcriptions of radio broadcast news)
- Empirically study speech constructs
- Analyse the impact of speech constructs on parsing
- Parse speech
The Ester Corpus
- 37 hours of manual transcriptions of French radio broadcasts (1998-2003)
- Annotated with speakers, words, noise symbols, and sometimes punctuation
- Normalisation to match the output of speech recognition systems:
  - Remove punctuation and sentence-initial capitalisation
  - Remove incomplete words: "Le pe- petit..."
  - But keep disfluencies made of complete words: "Le le petit"
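The normalisation scripts themselves are not shown in the talk; below is a minimal Python sketch of the rules just listed, assuming the Ester convention that incomplete words end with a hyphen (as in "pe-"):

```python
import re

def normalise(transcript: str) -> str:
    """Normalise a manual transcription to resemble ASR output.

    Assumption: incomplete words are marked with a trailing hyphen
    ("pe-"), as in the Ester examples above.
    """
    tokens = transcript.split()
    # Drop incomplete words such as "pe-", but keep repeated complete
    # words ("le le petit"): those are genuine disfluencies that a
    # speech recognition system would also output.
    tokens = [t for t in tokens if not t.endswith("-")]
    # Strip punctuation characters.
    tokens = [re.sub(r"[.,;:!?]", "", t) for t in tokens]
    tokens = [t for t in tokens if t]
    # Lower-case the sentence-initial capital.
    if tokens:
        tokens[0] = tokens[0].lower()
    return " ".join(tokens)

print(normalise("Le pe- petit..."))  # -> "le petit"
print(normalise("Le le petit"))      # -> "le le petit"
```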
Example transcriptions
- "quiberon Frédéric Colas France Bleue Armorique pour France-Inter" (Quiberon, Frédéric Colas, France Bleue Armorique, for France-Inter): header; no punctuation
- "bonsoir" (good evening): non-verbal utterance
- "l enquête sur l office HLM de Paris Jean Tiberi le maire de la capitale annonce lui-même dans une interview au Monde" (the inquiry into the Paris HLM office Jean Tiberi the mayor of the capital announces himself in an interview with Le Monde): incomplete utterance
- "sa mise en examen pour complicité de trafic d influence" (his indictment for complicity in influence peddling): incorrect sentence segmentation
- "je pense que cela doit conduire euh Jean Tiberi le premier euh à une réflexion" (I think this should lead uh Jean Tiberi first uh to some reflection): hesitations
Methodology for constructing the ETB
- Manual annotation supported by pre-parsing
- Active learning to selectively extend the annotated data and to improve the parser using a small training corpus (see Christophe's talk)
Manual Annotation
- Ongoing (on and off) since 2009
- Uses the JSafran framework: http://rapsodis.loria.fr/jsafran/index.html
- Iterative process (sketched in code below):
  1. Design of an annotation scheme
  2. Manual annotation of 5,000 words
  3. Training of a MaltParser model
  4. Automatic parsing of a new corpus segment
  5. Manual correction of this corpus segment
  6. Addition of the corrected segment to the training corpus
  7. Iteration from step 3
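A minimal sketch of this bootstrapping loop; train_malt_model, parse_segment and manually_correct are hypothetical stand-ins for the real JSafran/MaltParser tooling and the human annotation step:

```python
# Hypothetical sketch of the iterative annotation loop described above.
# train_malt_model, parse_segment and manually_correct stand in for the
# actual JSafran/MaltParser calls and the manual correction pass.

def bootstrap_treebank(seed_annotations, unannotated_segments):
    training_corpus = list(seed_annotations)        # step 2: ~5,000 manually annotated words
    for segment in unannotated_segments:
        model = train_malt_model(training_corpus)   # step 3
        pre_parsed = parse_segment(model, segment)  # step 4
        corrected = manually_correct(pre_parsed)    # step 5: human in the loop
        training_corpus.extend(corrected)           # step 6
    return training_corpus                          # step 7: loop repeats from step 3
```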
The ETB Annotation Schema
15 dependency relations:
- SUJ (subject)
- OBJ (object)
- POBJ (prepositional object)
- ATTS (subject attribute)
- ATTO (object attribute)
- MOD (modifier)
- COMP (complementizer)
- AUX (auxiliary)
- DET (determiner)
- CC (coordination)
- REF (reflexive pronoun)
- JUXT (juxtaposition)
- APPOS (apposition)
- DUMMY (syntactically governed but semantically empty dependent, e.g. expletive subject)
- DISFL (disfluency)
ETB and P7Dep annotations

ETB label   Description             P7Dep
MOD         modifier                mod, mod_rel, dep
COMP        complementizer          obj
DET         determiner              det
SUJ         subject                 suj
OBJ         object                  obj
DISFL       disfluency              mod
CC          coordination            coord, dep_coord
POBJ        prepositional object    a_obj, de_obj, p_obj
ATTS        subject attribute       ats
JUXT        juxtaposition           mod
MultiMots   multi-word expression   mod
AUX         auxiliary               aux_tps, aux_pass, aux_caus
DUMMY       empty dependent         aff
REF         reflexive pronoun       obj, a_obj, de_obj
APPOS       apposition              mod
ATTO        object attribute        ato
Rule Converter: ETB -> P7Dep
One-to-many mappings (disambiguated by the converter's rules):

ETB    P7Dep
MOD    mod, mod_rel, dep
CC     coord, dep_coord
POBJ   a_obj, de_obj, p_obj
AUX    aux_tps, aux_pass, aux_caus
REF    obj, a_obj, de_obj

Many-to-one mappings:
DISFL, JUXT, MultiMots, APPOS -> mod

Converter accuracy on an Ester test corpus manually annotated in the P7Dep format:
- LAS (labelled attachment score) = 92.6%
- UAS (unlabelled attachment score) = 98.5%
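The converter's actual disambiguation rules are not given in the talk; a toy sketch of the label mapping, falling back to the first P7Dep candidate when no rule applies (the `rules` hook is hypothetical):

```python
# Toy sketch of an ETB -> P7Dep label converter. The real converter
# uses disambiguation rules not shown in the talk; here, ambiguous
# labels simply fall back to the first candidate.

ETB_TO_P7DEP = {
    "MOD":   ["mod", "mod_rel", "dep"],
    "COMP":  ["obj"],
    "DET":   ["det"],
    "SUJ":   ["suj"],
    "OBJ":   ["obj"],
    "DISFL": ["mod"],
    "CC":    ["coord", "dep_coord"],
    "POBJ":  ["a_obj", "de_obj", "p_obj"],
    "ATTS":  ["ats"],
    "JUXT":  ["mod"],
    "MultiMots": ["mod"],
    "AUX":   ["aux_tps", "aux_pass", "aux_caus"],
    "DUMMY": ["aff"],
    "REF":   ["obj", "a_obj", "de_obj"],
    "APPOS": ["mod"],
    "ATTO":  ["ato"],
}

def convert_label(etb_label: str, rules=None) -> str:
    """Map an ETB dependency label to a P7Dep label.

    `rules` would hold context-sensitive disambiguation logic
    (e.g. choosing a_obj vs de_obj based on the preposition);
    without it, the first candidate is returned.
    """
    candidates = ETB_TO_P7DEP[etb_label]
    if rules is not None:
        return rules(etb_label, candidates)
    return candidates[0]

print(convert_label("POBJ"))  # -> "a_obj"
```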
Example Annotations
Figure: screenshot of the JSafran GUI for dependency tree editing
JSafran software
http://rapsodis.loria.fr/jsafran/index.html
A GUI with the following functionalities:
- Visualisation and editing of dependency graphs
- POS tagging: TreeTagger (French version) and the OpenNLP tagger (CRF trained on the French TreeBank)
- Parsing with MaltParser (ETB or FTB models)
- Training of parsing models on annotated data
- Search functions (words, dependencies, sequences, ...)
- Evaluation with the CoNLL scripts
Utterance-level annotations
Part 2 of the ETB corpus was annotated with utterance-level labels:
- GUEST: "et euh je je pense que pourri beaucoup l image de de la conduite" (and uh I I think deteriorates a lot the image of of driving)
- SPEAKER: "les deux gouvernements cherchent un compromis" (both governments are looking for a compromise)
- ELLIPSIS: "je cite de mémoire qu un tiers des morts à l avant euh n avaient pas leur ceinture et euh non un quart à l avant et je crois près du tiers à l arrière" (... from memory, that a third of the dead in the front uh were not wearing their seat belt and uh no, a quarter in the front and I think nearly a third in the back)
- HEADER: "quiberon frédéric colas france bleue armorique pour france-inter" (Quiberon, Frédéric Colas, France Bleue Armorique, for France-Inter)
Model performance on ETB Part 2
- Training corpus: 8,544 words
- Test corpus: 1,747 words
- Labelled attachment score (LAS), i.e. the percentage of tokens with the correct governor and dependency relation: 63.6%
Which constructs most affect parsing accuracy? We look at speaker/guest differences, disfluencies, and radio headers.
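As a reminder of how these scores are computed, here is a minimal sketch over aligned (governor, label) pairs; the talk itself uses the standard CoNLL evaluation scripts:

```python
def attachment_scores(gold, predicted):
    """Compute LAS and UAS over aligned token lists.

    Each token is a (governor_index, dependency_label) pair.
    LAS counts tokens whose governor AND label are both correct;
    UAS only requires the correct governor.
    """
    assert len(gold) == len(predicted)
    n = len(gold)
    las = sum(1 for g, p in zip(gold, predicted) if g == p)
    uas = sum(1 for (gh, _), (ph, _) in zip(gold, predicted) if gh == ph)
    return las / n, uas / n

gold = [(2, "SUJ"), (0, "ROOT"), (2, "OBJ")]
pred = [(2, "SUJ"), (0, "ROOT"), (2, "MOD")]
print(attachment_scores(gold, pred))  # -> (0.666..., 1.0)
```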
Impact of disfluencies
Ratio of utterances with disfluencies: 41% (the D sub-corpus)
Disfluencies were manually removed from the test corpus (an automatic approximation is sketched below). LAC = label accuracy score.

Performance on the D sub-corpus:
       W/o disfl   With disfl   Diff (w/o - with)
LAS    70.2%       66.1%        +4.1
UAS    77.2%       73.5%        +3.7
LAC    76.5%       72.7%        +3.8

Performance on the whole test corpus:
       W/o disfl   With disfl   Diff (w/o - with)
LAS    67.3%       65.7%        +1.6
UAS    74.2%       73.0%        +1.2
LAC    74.2%       72.6%        +1.6
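The removal was done manually in the talk; an automatic approximation that drops DISFL-labelled tokens from a CoNLL-style sentence and re-points governors accordingly might look like this (a sketch only, not the procedure actually used):

```python
def remove_disfluencies(sentence):
    """Drop DISFL tokens from one CoNLL-style sentence.

    `sentence` is a list of dicts with 1-based "id", "head" and
    "deprel" keys. Governors pointing at a removed token are
    re-attached to that token's own governor.
    """
    disfl = {tok["id"] for tok in sentence if tok["deprel"] == "DISFL"}
    head_of = {tok["id"]: tok["head"] for tok in sentence}

    def resolve(head):
        # Climb out of removed tokens (0 is the artificial root).
        while head in disfl:
            head = head_of[head]
        return head

    kept = [tok for tok in sentence if tok["id"] not in disfl]
    # Renumber the surviving tokens and remap their governors.
    new_id = {tok["id"]: i + 1 for i, tok in enumerate(kept)}
    for tok in kept:
        tok["head"] = new_id.get(resolve(tok["head"]), 0)
        tok["id"] = new_id[tok["id"]]
    return kept
```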
Impact of speaking style
Ratio of journalist/guest utterances: 72% / 28%
Performance on both types of speech:

       Journalist   Guest   Diff (G - J)
LAS    70.8%        65.2%   -5.6
UAS    76.5%        71.8%   -4.7
LAC    77.5%        72.0%   -5.5

Is this difference due to disfluencies? After removing disfluencies:

       Journalist   Guest   Diff (G - J)
LAS    71.2%        67.8%   -3.4
UAS    77.2%        74.1%   -3.1
LAC    78.2%        74.5%   -3.7

Disfluencies explain 40% of the degradation observed between journalist and guest speaker parsing.
Impact of headers
- Ratio of header utterances: 14%
- Guest utterances removed
- 10-fold cross-validation (a sketch of the protocol follows below)
Comparative results on headers vs journalist style:

       Journalist w/o headers   Headers   Diff (+H - -H)
LAS    70.6%                    61.7%     -8.9
UAS    76.2%                    69.7%     -6.5
LAC    77.4%                    67.5%     -9.9
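The cross-validation setup is not detailed in the talk; a minimal sketch of a 10-fold protocol over utterances, where `train` and `evaluate_las` are hypothetical stand-ins for the MaltParser training and CoNLL evaluation steps:

```python
import random

def cross_validate(utterances, k=10, seed=0):
    """Average LAS over k folds of utterance-level splits.

    `train` and `evaluate_las` are hypothetical stand-ins for
    parser training and CoNLL-style evaluation.
    """
    utterances = list(utterances)
    random.Random(seed).shuffle(utterances)
    folds = [utterances[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train_set = [u for j, f in enumerate(folds) if j != i for u in f]
        model = train(train_set)
        scores.append(evaluate_las(model, test))
    return sum(scores) / k
```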
Summarising
- Disfluencies degrade parsing performance by 1.6 points on average
- Guest utterances are harder to parse (even after disfluencies are removed), with a LAS decrease of 3.4 points
- Radio-specific constructs (headers) show a LAS decrease of 8.9 points (different syntactic structure, sparse data)
Conclusions and future work
Current status:
- Current ETB: 65,000 words (53,000 Ester 2, 12,000 Etape)
- LAS with the Mate parser: 76%
Future work:
- Continue annotation
- Automatically detect incorrect annotations
- Finer-grained annotation of disfluencies (hesitations, repairs, repetitions, false starts)
- Investigate active learning
- Investigate different parsing strategies (pre-parse disfluencies and named entities; joint model for named entity recognition and parsing)