Building and exploiting a dependency treebank for French radio broadcasts


Christophe Cerisara, Claire Gardent and Corinna Anderson
LORIA, Nancy
2011-11-15

Outline
- Goals
- Corpus
- Annotation Tools and Methodology
- Annotation schema
- The Impact of Speech Constructs on Parsing
- Conclusions

Goals
Long term:
- Use syntax to improve speech recognition (INRIA Collaborative Action Rapsodis, 2009-2010)
Medium term:
- Build a treebank of spoken data (transcriptions of radio broadcast news)
- Empirical study of speech constructs
- Analyse the impact of speech constructs on parsing
- Parse speech

The Ester Corpus
- 37 hours of manual transcriptions of French radio broadcasts (1998-2003)
- Annotations with speakers, words, noise symbols, sometimes punctuation
- Normalisation to match the output of speech recognition systems (a rough sketch of this step follows):
  - Remove punctuation and sentence-initial upper-case letters
  - Remove incomplete words: "Le pe- petit..."
  - But keep disfluencies made of complete words: "Le le petit"
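A minimal sketch of this normalisation step, assuming incomplete words are marked with a trailing hyphen as in the example above (the marker convention and the blanket lower-casing are simplifying assumptions, not the exact ESTER procedure):

```python
import re

def normalise(transcript: str) -> str:
    """Roughly mimic ASR output: no punctuation, no capitalisation,
    incomplete words dropped, repeated complete words kept."""
    # Drop incomplete words, assumed to be marked with a trailing hyphen ("pe-").
    words = [w for w in transcript.split() if not w.endswith("-")]
    # Strip punctuation and lower-case everything (the slide only mentions
    # removing sentence-initial capitals; lower-casing all is a simplification).
    words = [re.sub(r"[^\w'-]", "", w).lower() for w in words]
    return " ".join(w for w in words if w)

print(normalise("Le pe- petit chat dort."))  # -> "le petit chat dort"
print(normalise("Le le petit chat dort."))   # -> "le le petit chat dort" (disfluency kept)
```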

Example transcriptions
- "quiberon Frédéric Colas France Bleue Armorique pour France-Inter" (Quiberon Frédéric Colas France Bleue Armorique for France-Inter) -- header; no punctuation
- "bonsoir" (good evening) -- non-verbal utterance
- "l'enquête sur l'office HLM de Paris Jean Tiberi le maire de la capitale annonce lui-même dans une interview au Monde" (the inquiry into the Paris HLM office Jean Tiberi the mayor of the capital himself announces in an interview with Le Monde) -- incomplete utterances
- "sa mise en examen pour complicité de trafic d'influence" (his indictment for complicity in influence peddling) -- incorrect sentence segmentation
- "je pense que cela doit conduire euh Jean Tiberi le premier euh à une réflexion" (I think that this should lead uh Jean Tiberi first of all uh to some reflection) -- hesitations

Methodology for constructing the ETB
- Manual annotation supported by pre-parsing
- Active Learning to selectively extend the annotated data and to improve the parser using a small training corpus (Christophe's talk)

Manual Annotation
- On and off since 2009
- Uses the JSafran framework: http://rapsodis.loria.fr/jsafran/index.html
- Iterative process (sketched below):
  1. Design of an annotation scheme
  2. Manual annotation of 5000 words
  3. Training of a Malt Parser model
  4. Automatic parsing of a new corpus segment
  5. Manual correction of this corpus segment
  6. Addition of this corrected segment to the training corpus
  7. Iterate from step 3
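A schematic sketch of this bootstrap loop. All helper functions below are placeholders standing in for the JSafran / MaltParser / human annotation steps; none of them is a real API:

```python
# Placeholders only: each stub stands for a manual or tool-supported step.
def manually_annotate(segment): return segment      # steps 1-2 (human, ~5000 words)
def train_malt(training): return "model"            # step 3 (MaltParser training)
def parse(model, segment): return segment           # step 4 (automatic parsing)
def manually_correct(parsed): return parsed         # step 5 (human correction)

def build_treebank(seed_segment, remaining_segments):
    training = [manually_annotate(seed_segment)]
    for segment in remaining_segments:
        model = train_malt(training)                 # retrain on all corrected data
        corrected = manually_correct(parse(model, segment))
        training.append(corrected)                   # step 6: grow the training corpus
    return training                                  # step 7: iterate on the next segment
```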

The ETB Annotation Schema
15 dependency relations:
- SUJ (subject)
- OBJ (object)
- POBJ (prepositional object)
- ATTS (subject attribute)
- ATTO (object attribute)
- MOD (modifier)
- COMP (complementizer)
- AUX (auxiliary)
- DET (determiner)
- CC (coordination)
- REF (reflexive pronoun)
- JUXT (juxtaposition)
- APPOS (apposition)
- DUMMY (syntactically governed but semantically empty dependent, e.g. expletive subject)
- DISFL (disfluency)

ETB and PTB annotations

  ETB label   Description               P7Dep label(s)
  MOD         modifier                  mod, mod_rel, dep
  COMP        complementizer            obj
  DET         determiner                det
  SUJ         subject                   suj
  OBJ         object                    obj
  DISFL       disfluency                mod
  CC          coordination              coord, dep_coord
  POBJ        prepositional object      a_obj, de_obj, p_obj
  ATTS        subject attribute         ats
  JUXT        juxtaposition             mod
  MultiMots   multi-word expression     mod
  AUX         auxiliary                 aux_tps, aux_pass, aux_caus
  DUMMY       empty dependent           aff
  REF         reflexive pronoun         obj, a_obj, de_obj
  APPOS       apposition                mod
  ATTO        object attribute          ato

Rule Converter: ETB to PTB
Rules map each ETB label to one or more P7Dep labels (sketched below), e.g.:
  MOD    -> mod, mod_rel, dep
  CC     -> coord, dep_coord
  POBJ   -> a_obj, de_obj, p_obj
  AUX    -> aux_tps, aux_pass, aux_caus
  REF    -> obj, a_obj, de_obj
  DISFL, JUXT, MultiMots, APPOS -> mod

Converter accuracy on an ESTER test corpus manually annotated with the P7Dep format:
  LAS (labelled attachment score) = 92.6%
  UAS (unlabelled attachment score) = 98.5%
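An illustrative sketch of such a label converter. The dictionary and function names are hypothetical, and picking the first candidate is a deliberate simplification; the real converter disambiguates among candidates with context-sensitive rules (e.g. the preposition lemma for POBJ, the auxiliary type for AUX):

```python
# Illustrative only: maps an ETB relation to its candidate P7Dep labels.
ETB_TO_P7DEP = {
    "MOD":       ["mod", "mod_rel", "dep"],
    "COMP":      ["obj"],
    "DET":       ["det"],
    "SUJ":       ["suj"],
    "OBJ":       ["obj"],
    "DISFL":     ["mod"],
    "CC":        ["coord", "dep_coord"],
    "POBJ":      ["a_obj", "de_obj", "p_obj"],
    "ATTS":      ["ats"],
    "JUXT":      ["mod"],
    "MultiMots": ["mod"],
    "AUX":       ["aux_tps", "aux_pass", "aux_caus"],
    "DUMMY":     ["aff"],
    "REF":       ["obj", "a_obj", "de_obj"],
    "APPOS":     ["mod"],
    "ATTO":      ["ato"],
}

def convert_label(etb_label: str) -> str:
    # Naive fallback: take the first candidate when no contextual rule applies.
    return ETB_TO_P7DEP.get(etb_label, ["dep"])[0]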

Example Annotations
[Figure: screenshot of the J-Safran GUI for dependency tree editing]

J-Safran software: http://rapsodis.loria.fr/jsafran/index.html
GUI with the following functionalities:
- Visualisation and editing of dependency graphs
- POS tagging: TreeTagger (French version) and OpenNLP tagger (CRF trained on the French TreeBank)
- Parsing with the Malt Parser (ETB or FTB models)
- Training of parsing models on annotated data
- Search functions (words, dependencies, sequences, ...)
- Evaluation with CoNLL scripts

Utterance-level annotations
Part 2 of the ETB corpus was annotated with utterance-level annotations.
- GUEST: "et euh je je pense que pourri beaucoup l'image de de la conduite" (and hum I I think deteriorates much the image of of driving)
- SPEAKER: "les deux gouvernements cherchent un compromis" (both governments look for a compromise)
- ELLIPSIS: "je cite de mémoire qu'un tiers des morts à l'avant euh n'avaient pas leur ceinture et euh non un quart à l'avant et je crois près du tiers à l'arrière" (... I quote from memory that a third of the dead in front did not have their seat belt on uh no a quarter in front and I think nearly a third in the back)
- HEADER: "quiberon frédéric colas france bleue armorique pour france-inter" (Quiberon Frédéric Colas France Bleue Armorique for France-Inter)

Model performance on ETB Part 2
- Training corpus: 8544 words
- Test corpus: 1747 words
- Labelled attachment score (LAS), i.e. the percentage of tokens with the correct governor and dependency relation (computation sketched below): 63.6%
Which constructs most affect parsing accuracy? We look at speaker/guest differences, disfluencies and radio headlines.
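LAS and UAS can be computed directly from aligned gold and predicted (governor, relation) pairs; a minimal sketch (the token representation below is an assumption, not the JSafran or CoNLL script interface):

```python
# Each token is a (head_index, relation) pair, aligned between the gold
# and predicted analyses of the same test corpus.
def attachment_scores(gold, pred):
    assert len(gold) == len(pred)
    n = len(gold)
    # UAS: correct governor only; LAS: correct governor and relation.
    uas = sum(g_head == p_head for (g_head, _), (p_head, _) in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return las, uas

gold = [(2, "DET"), (0, "SUJ"), (2, "MOD")]
pred = [(2, "DET"), (0, "SUJ"), (2, "OBJ")]
print(attachment_scores(gold, pred))  # -> LAS ~ 0.67, UAS = 1.0
```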

Impact of disfluencies
Ratio of utterances with disfluencies: 41% (the D sub-corpus)
Disfluencies were removed manually from the test corpus.

Performance on the D sub-corpus:
          W/o disfl   With disfl   Diff (w/o - with)
  LAS     70.2%       66.1%        +4.1
  UAS     77.2%       73.5%        +3.7
  LAC     76.5%       72.7%        +3.8

Performance on the whole test corpus:
          W/o disfl   With disfl   Diff (w/o - with)
  LAS     67.3%       65.7%        +1.6
  UAS     74.2%       73.0%        +1.2
  LAC     74.2%       72.6%        +1.6

(LAC: label accuracy, i.e. the percentage of tokens with the correct dependency relation regardless of the governor.)
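For illustration only (in the experiment the removal was manual): a rough sketch of what removing disfluencies from a dependency-annotated utterance involves, assuming tokens carry id, form, head and rel fields and that DISFL marks the disfluent material (this layout is an assumption):

```python
# Drop every token attached with the DISFL relation, together with its
# dependents; token ids are not re-numbered in this sketch.
def remove_disfluencies(tokens):
    """tokens: list of dicts {"id": int, "form": str, "head": int, "rel": str}."""
    removed = {t["id"] for t in tokens if t["rel"] == "DISFL"}
    changed = True
    while changed:  # propagate removal to dependents of removed tokens
        changed = False
        for t in tokens:
            if t["id"] not in removed and t["head"] in removed:
                removed.add(t["id"])
                changed = True
    return [t for t in tokens if t["id"] not in removed]
```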

Impact of speaking style
Ratio of journalist / guest utterances: 72% / 28%

Performance on both types of speech:
          Journalist   Guest    Diff (G - J)
  LAS     70.8%        65.2%    -5.6
  UAS     76.5%        71.8%    -4.7
  LAC     77.5%        72.0%    -5.5

Is this difference due to disfluencies? After removing disfluencies:
          Journalist   Guest    Diff (G - J)
  LAS     71.2%        67.8%    -3.4
  UAS     77.2%        74.1%    -3.1
  LAC     78.2%        74.5%    -3.7

Disfluencies explain about 40% of the degradation observed between journalist and guest speech: the LAS gap shrinks from 5.6 to 3.4 points, i.e. (5.6 - 3.4) / 5.6 is roughly 0.39.

Impact of headers
- Ratio of header utterances: 14%
- Guest utterances removed
- 10-fold cross-validation

Comparative results on headers vs. journalist style:
          Journalist w/o headers   Headers   Diff
  LAS     70.6%                    61.7%     -8.9
  UAS     76.2%                    69.7%     -6.5
  LAC     77.4%                    67.5%     -9.9

Summarising
- Disfluencies degrade parsing performance by 1.6 LAS points on average
- Guest utterances are harder to parse (even after disfluencies are removed), with a LAS decrease of 3.4 points
- Radio-specific constructs (headlines) show a LAS decrease of 8.9 points (different syntactic structure, sparse data)

Conclusions and future work
Current status:
- Current ETB: 65,000 words (53,000 Ester 2, 12,000 Etape)
- LAS with the MATE parser: 76%
Future work:
- Continue annotations
- Automatically detect incorrect annotations
- Finer-grained annotation of disfluencies (hesitations, repairs, repetitions, false starts)
- Investigate Active Learning
- Investigate different parsing strategies (pre-parse disfluencies and named entities; joint model for named entity recognition and parsing)