Syntactic annotation of spontaneous speech: application to call center conversation data


Syntactic annotation of spontaneous speech: application to call center conversation data
Frédéric Béchet, Thierry Bazillon, Benoit Favre, Alexis Nasr
Aix Marseille Université, LIF-CNRS (Laboratoire d'Informatique Fondamentale de Marseille)
Workshop on Spoken Treebanks, Paris, November 15th

Content of this talk
Context of this work: syntactic analysis of speech in our lab. Why? How? For which kinds of applications?
The DECODA corpus: a call-center human-human conversation corpus.
Part-Of-Speech annotation of the DECODA corpus: a semi-supervised approach.
Syntactic dependency annotation of the DECODA corpus: training a first syntactic dependency parser for spontaneous speech.


Context of this study
Linguistic analysis of spoken messages for developing automatic speech processing systems: spoken language understanding, automatic speech recognition, natural language processing, machine learning.

Spoken Language Understanding
Applicative framework: automatic spoken dialog systems (call-routing, form filling, negotiation) and speech analytics (broadcast shows, audio archives (INA), call centers).
Main issue: processing spontaneous speech, both in human-machine dialog and in human-human conversation.

Spoken Language Understanding: speech analytics (slide figure from G. Riccardi).

Spoken Language Understanding: spoken conversation analysis (slide figure from B. Favre).

Spoken Language Understanding
Why use syntactic analysis?
From syntactic relations to semantic relations: dependency analysis (e.g. the CoNLL-2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies).
Semantic disambiguation and semantic role labelling (e.g. ESTER2 broadcast news named entity detection, Organisation vs. Location): La France a proposé au conseil de sécurité de l'ONU... / J'irais visiter la France l'année prochaine.
Language characterization: read speech / prepared speech / spontaneous speech; speaker role labelling.

Linguistic analysis of automatic transcriptions
Main characteristics: automatic transcriptions are text generated by a language model, with transcription errors (unknown words, deletions/insertions/substitutions), and arrive as a stream of words with no structure.
What kind of linguistic analysis? The traditional view of parsing based on context-free grammars is not suitable for processing automatic transcriptions, because of ungrammatical structures in spontaneous speech and transcription errors.
Parsing based on dependency structures and discriminative machine learning techniques is much easier to adapt to speech: graph-based dependency parsers (McDonald et al., 2007), partial annotation can be performed, and the dependency parsing framework generates parses much closer to predicate-argument structures.
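To make the contrast concrete, here is a minimal sketch of the arc-factored idea behind graph-based dependency parsing: every candidate head-dependent arc gets a feature-based score, and the parse is assembled from high-scoring arcs. The feature templates and the greedy head selection are illustrative simplifications; parsers in the McDonald et al. line decode with a maximum-spanning-tree algorithm instead.

```python
from collections import defaultdict

def arc_features(words, tags, head, dep):
    # Hypothetical feature templates for the candidate arc head -> dep.
    return [
        f"hw={words[head]}|dw={words[dep]}",
        f"ht={tags[head]}|dt={tags[dep]}",
        f"dist={head - dep}",
    ]

def parse_greedy(words, tags, weights):
    # Position 0 is an artificial ROOT token; every other word picks
    # its highest-scoring head independently (a real graph-based
    # parser would run Chu-Liu/Edmonds to guarantee a tree).
    heads = [None]
    for dep in range(1, len(words)):
        best, best_score = 0, float("-inf")
        for head in range(len(words)):
            if head == dep:
                continue
            score = sum(weights[f] for f in arc_features(words, tags, head, dep))
            if score > best_score:
                best, best_score = head, score
        heads.append(best)
    return heads

# Toy run with untrained (zero) weights: every word attaches to ROOT.
weights = defaultdict(float)
print(parse_greedy(["<ROOT>", "je", "lis", "le", "journal"],
                   ["ROOT", "clo", "v", "det", "nc"], weights))
```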

Approaches developed at the LIF
Development of spoken resources to train statistical models: POS tagger / chunker / dependency parser (MACAON NLP suite, DECODA project).
Use of existing linguistic resources: syntactic/semantic lexicons (DicoValence, Dubois-Charlier).
Integration of the speech recognition and linguistic analysis processes: NLP tools keeping ambiguities at each processing level, from word lattices to hypothesis lattices, joint processing, with a cost function dependent on the targeted application.

Content of this talk
Context of this work: syntactic analysis of speech in our lab. Why? How? For which kinds of applications?
The DECODA corpus: a call-center human-human conversation corpus.
Part-Of-Speech annotation of the DECODA corpus: a semi-supervised approach.
Syntactic dependency annotation of the DECODA corpus: training a first syntactic dependency parser for spontaneous speech.

The DECODA project
DEpouillement automatique de COnversations provenant de centres D'Appels (automatic analysis of conversations from call centers).
Partners: Université d'Avignon et des Pays de Vaucluse (UAPV), Laboratoire d'Informatique Fondamentale de Marseille (LIF), Sonear, RATP.
Program: ANR Contint 2009. Start: October 2009. Duration: 36 months.

The DECODA project
Applicative framework: the RATP call center (Paris public transport authority): route assistance, complaints, information desk, etc. All conversations are recorded (goal: quality control + statistics), and forms are manually filled by the operators, giving a partial (and noisy) description of the conversations.
Goals of the project:
Applicative: automatic spoken conversation analysis tools (summarization, speech analytics) and an interface helping the operators fill the info forms during a conversation.
Scientific: limiting the need for supervision to build models (machine learning with weakly supervised methods) and linguistic analysis of spontaneous speech conversations (robust syntactic/semantic analyses).

The DECODA corpus
Data collection: Paris public transport authority (RATP) call center. Easy collection of large amounts of data (>1000 calls a day), large range of speakers, very few personal data (easy to anonymise without erasing a lot of signal), various acoustic quality (cell phones + noisy environments).
Current state of the corpus: 1514 dialogs selected from 2 days of the call center traffic, 74 hours of signal, average duration 3 minutes (12% over 5 minutes).

The DECODA corpus
Transcription process: each file is manually anonymised; manual segmentation (dialog sections, speakers); manual transcription with Transcriber, following the ESTER transcription guide.
Corpus statistics: 1514 files, 96,103 speaker turns, 482,745 words after tokenization (most frequent word: euh). The total vocabulary of the corpus is 8806 words.

The DECODA corpus: annotation process
Semantic annotations: manual annotation of the whole corpus based on the RATP ontology (10 top call types).
Syntactic annotations, 4 levels: Part-Of-Speech tags, named entities, syntactic chunks, syntactic dependencies.
Method: manual annotation of a subset of the corpus (100 dialogs), then projection toward the whole corpus.

Content of this talk
Context of this work: syntactic analysis of speech in our lab. Why? How? For which kinds of applications?
The DECODA corpus: a call-center human-human conversation corpus.
Part-Of-Speech annotation of the DECODA corpus: a semi-supervised approach.
Syntactic dependency annotation of the DECODA corpus: training a first syntactic dependency parser for spontaneous speech.

Adapting MACAON to spontaneous speech transcriptions
MACAON NLP suite: POS tagger, chunker, named entities, dependency analysis.
Main feature: ambiguity management through hypothesis lattices (input/output, XML format), allowing direct integration of word lattices produced by a speech recognizer (a filter converts HTK lattices to the MACAON format).
Model training: P7 French Treebank, then adaptation to the DECODA corpus (semi-supervised adaptation).

Semi-supervised adaptation
Method: selection of 2 sub-corpora from the DECODA corpus, TRAIN (728 dialogs, 255K words) and GOLD (156 dialogs, 55K words), with 3 levels of annotation: POS / "simple" disfluencies / named entities.
Manual annotation of the GOLD corpus.
Semi-supervised adaptation: automatic annotation of the TRAIN corpus with the baseline MACAON models (FTB); a web correction interface based on regular expressions applied to the whole TRAIN corpus; MACAON models retrained on the corrected TRAIN corpus; evaluation on the GOLD corpus to verify the quality of the corrections.
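A minimal sketch of this adaptation loop, with a unigram tagger standing in for the MACAON models; the stand-in tagger, the rule set in correct() and the toy data are all hypothetical, as the real pipeline retrains the full MACAON models:

```python
from collections import Counter, defaultdict

def train_unigram(tagged):
    # tagged: list of (word, tag); keep each word's most frequent tag.
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, default="nc"):
    return [(w, model.get(w, default)) for w in words]

def correct(tagged):
    # Stand-in for the regular-expression web corrections: fillers
    # become interjections, dislocated strong pronouns become clo.
    fixed = []
    for w, t in tagged:
        if w in {"euh", "bah", "ben", "hein"}:
            t = "I"
        elif w in {"moi", "toi", "nous"}:
            t = "clo"
        fixed.append((w, t))
    return fixed

def accuracy(model, gold):
    hyp = tag(model, [w for w, _ in gold])
    return sum(h == g for (_, h), (_, g) in zip(hyp, gold)) / len(gold)

# 1) tag TRAIN with the baseline model, 2) correct, 3) retrain,
# 4) evaluate on GOLD (all data here is toy).
baseline = train_unigram([("je", "clo"), ("veux", "v"), ("moi", "nc")])
corrected = correct(tag(baseline, ["moi", "je", "veux", "euh"]))
adapted = train_unigram(corrected)
gold = [("moi", "clo"), ("je", "clo"), ("veux", "v"), ("euh", "I")]
print(accuracy(baseline, gold), accuracy(adapted, gold))  # 0.5 1.0
```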

Levels of annotation
Incremental process.
POS tagging: every word receives a tag, adding tags when needed to represent every phenomenon (disfluencies, spelled words, etc.).
Marking simple disfluencies: speech markers (euh, bah, ben, hein...), false starts (bon bonjour), simple repetitions of single words (le le le le) or bigram/trigram repetitions (je veux je veux dire). Goal: prevent breaking named entities and syntactic chunks.
Named entities.
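A regular-expression sketch of the marker and repetition marking, assuming whitespace-tokenized turns; the marker inventory and the <F>/<R> bracket notation are illustrative, not the DECODA guidelines themselves (false starts are left out):

```python
import re

MARKERS = r"\b(euh|bah|ben|hein)\b"
# Repeated trigram, bigram or single word, tried longest-first so
# that "le le le le" is caught once rather than wrapped twice.
REPEATS = r"\b(\w+ \w+ \w+|\w+ \w+|\w+)( \1\b)+"

def mark_disfluencies(turn):
    turn = re.sub(MARKERS, r"<F>\1</F>", turn)     # speech markers
    turn = re.sub(REPEATS, r"<R>\g<0></R>", turn)  # simple repetitions
    return turn

print(mark_disfluencies("euh je veux je veux dire le le le le principe"))
# <F>euh</F> <R>je veux je veux</R> dire <R>le le le le</R> principe
```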

Example of disfluency and NE annotations
Raw: voilà euh je viens de descendre du bus et j' ai oublié en fait euh un une petite boite euh dans le bus euh où il y a mes affaires euh c' est le c' est le deux cent quatre-vingt-un qui je pense qu' il arrive à Eurostar là à Europarc pardon
Cleaned: je viens de descendre du bus et j' ai oublié en fait une petite boîte dans le bus où il y a mes affaires c' est le <E_T>deux-cent-quatre-vingt-un</E_T> qui je pense qu' il arrive à <E_N>Eurostar</E_N> là à <E_A>Europarc</E_A>

[Workflow diagram: the GOLD corpus receives manual (gold) annotation; the TRAIN corpus is annotated automatically by the MACAON tools, the automatic annotations are manually corrected, the corrected annotations are used to retrain the MACAON models, and the retrained models are evaluated against the GOLD corpus.]

Error correction at the POS level
Baseline MACAON models trained on the P7 French Treebank (the newspaper Le Monde): written French vs. spoken French.
Personal pronouns:
Tu: toi tu veux prendre cet appel? (clo)
C': c'est pas toujours le cas (clo)
Nous: nous on est arrivés avec notre idée (clo)
Moi: moi j'ai envie de te parler (nc)
Toi: toi tu vas la récupérer (clo)
Lexical ambiguities: bon c'est vrai qu'il a pas tort quoi (interjections); moi j'aime que le métro (negation adverb); bah, nan (unknown words, added as interjections).
Repetitions: c'est un peu le le le principe (clo clo det).
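As an illustration of what such a correction rule can look like, here is a sketch of one contextual rule for the dislocated pronouns above; the rule is expressed in code rather than the project's regular-expression format, and the pronoun inventories are assumptions:

```python
def retag_dislocations(tagged):
    # A strong pronoun immediately followed by a subject clitic is a
    # dislocated subject: retag it clo.
    strong = {"moi", "toi", "lui", "nous", "vous", "eux"}
    clitics = {"je", "tu", "il", "elle", "on", "nous", "vous",
               "ils", "elles", "ça"}
    out = list(tagged)
    for i in range(len(out) - 1):
        word, _ = out[i]
        if word.lower() in strong and out[i + 1][0].lower() in clitics:
            out[i] = (word, "clo")
    return out

print(retag_dislocations([("toi", "nc"), ("tu", "clo"), ("veux", "v")]))
# [('toi', 'clo'), ('tu', 'clo'), ('veux', 'v')]
```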

Evaluation of the POS model adaptation
2 models: baseline models trained on the FTB; DECODA models trained on the corrected TRAIN corpus.
2 test corpora: DECODA GOLD and EPAC GOLD (broadcast conversation: radio interviews, radio talk shows).
Results: over 50% relative error reduction.

Corpus / Models   Baseline (FTB)   Adapted models (DECODA)
DECODA            25.4%            11.6%
EPAC              16.2%            7.2%
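The headline figure is a relative error reduction; a quick check against the table above:

```python
# Relative error reduction = (baseline error - adapted error) / baseline error.
for corpus, base, adapted in [("DECODA", 25.4, 11.6), ("EPAC", 16.2, 7.2)]:
    print(corpus, f"{(base - adapted) / base:.1%}")
# DECODA 54.3%
# EPAC 55.6%
```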

Content of this talk
Context of this work: syntactic analysis of speech in our lab. Why? How? For which kinds of applications?
The DECODA corpus: a call-center human-human conversation corpus.
Part-Of-Speech annotation of the DECODA corpus: a semi-supervised approach.
Syntactic dependency annotation of the DECODA corpus: training a first syntactic dependency parser for spontaneous speech.

Syntactic dependency annotation
Method: manual annotation on the DECODA GOLD; training of a graph-based dependency parser on the GOLD corpus; application to the whole DECODA corpus.
Annotation process: DECODA GOLD without disfluencies, with NE and gold POS; chunking with MACAON on the gold POS; dependency links annotated at the chunk level (to speed up the annotation process, and because the Spoken Language Understanding modules use them); automatic projection of the links from the chunk to the word level.
Annotation guide: derived from the French TreeBank annotation guide (http://alpage.inria.fr/statgram/frdep/publications/ftb-guidedepsurface.pdf), with simplification of some annotation conventions since word-to-word links weren't needed here. 16 types of syntactic dependencies.

Syntactic dependency annotation
1. subject (suj): Jean est mon ami
2. impersonal subject (suj_imp): il pleut beaucoup ce matin
3. direct object (obj): je lis le journal
4. indirect object with the preposition de (de_obj): il se souvient de ses vacances
5. indirect object with the preposition à (a_obj): il pense à toi
6. indirect object introduced by another preposition (p_obj): il compte sur toi
7. locative object (p_obj_loc): j'habite à Marseille
8. coordination (coord): du pain et des jeux
9. dependent of the coordination (dep_coord): du pain et des jeux
10. subject attribute (ats): je suis content
11. object attribute (ato): il me trouve intelligent
12. reflexive pronoun (aff): je me lève
13. relative subordinate clause (mod_rel): l'homme qui rit
14. comparative (arg_comp): il est plus grand que toi
15. adverbial phrase (mod): il travaille depuis deux jours
16. noun complement (dep): le journal de mon ami

Syntactic dependency annotation
All the dialogs have been annotated by two human annotators; a subset of 20 dialogs was annotated by both in order to check inter-annotator agreement. A web interface was used for the annotation task.
Every dialog is segmented into chunks described by: the chunk position in the dialog, the chunk content, the POS tag of each word inside the chunk, and the chunk type.
Sentences can contain chunks or groups of chunks not connected to the rest of the sentence: spoken disfluencies such as false starts, or juxtaposed structures. Juxtaposed structures are very frequent in oral conversations, as speakers don't always use relative pronouns, subordinating conjunctions or coordinating conjunctions to articulate their speech.
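One simple way to quantify inter-annotator agreement at the chunk level is attachment agreement, a UAS-style measure; the sketch below assumes each annotation is a list with one governor index per chunk and None for unattached chunks, which is an assumed data format, not the project's actual one:

```python
def attachment_agreement(ann_a, ann_b):
    # Fraction of chunks whose governor is identical in the two
    # annotations (None marks an unattached chunk).
    assert len(ann_a) == len(ann_b)
    return sum(a == b for a, b in zip(ann_a, ann_b)) / len(ann_a)

# Two hypothetical annotations of a 5-chunk turn:
print(attachment_agreement([2, None, 0, 2, 2], [2, None, 0, 3, 2]))  # 0.8
```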

Examples of chunks on the DECODA data

Web annotation interface

Annotation process: issues with the chunking process
The chunk grammar was defined on written text; detachments make some rules inapplicable:
det+nc+pro. Written: le gouvernement lui se réserve le droit d'intervenir. Oral: le trajet moi ça me semble très long.
det+np+nc. Written: le Molière comédien est moins célèbre que le Molière auteur. Oral: le Navigo monsieur c'est quinze euros vingt-cinq.
Some chunks are "broken" by speech disfluencies: les horaires du bus numéro je pense trois cent trente.

Annotation process: issues with spoken phenomena
Lots of multiple relations: lui il passe à Villemomble ce bus.
Some dependencies are difficult to assess: j'arrive pas à accéder au / ouais ça marche pas / site web.
Sequences of chunks with no dependencies (different dialog acts): bonne journée // merci // au revoir.
Agrammaticality: donc c'est les horaires qu'il faut que je prenne le bus?
Cleft sentences (phrases clivées): ce que vous voulez, c'est les horaires.

From chunk dependencies to word dependencies
Annotating dependencies inside the chunks.
Patterns for the non-ambiguous cases:
GN: det nc* (le bus // mon fils // cette grève) gives det(1,2).
GV: clneg v advneg adv vppart vppart* (n'ai pas totalement été remboursée) gives mod(1,2), mod(3,2), mod(4,2), aux(2,5), aux(5,6).
Lexicalized patterns for some ambiguous cases:
GP: prep* clo vinf: pour le prendre gives obj(2,3); pour me dire gives a_obj(2,3); pour y aller gives p_obj_loc(2,3).
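A sketch of this pattern-based projection, assuming chunks arrive as POS sequences plus words; the pattern encoding and the small clitic lexicon are illustrative reductions of the rules above:

```python
def gn_links(pos):
    # GN pattern det nc*: the determiner attaches to the noun, det(1,2).
    if pos and pos[0] == "det" and all(p == "nc" for p in pos[1:]):
        return [("det", 1, 2)]
    return []

def gp_links(pos, words):
    # GP pattern prep* clo vinf, lexicalized on the clitic: the label
    # of the clitic-infinitive link depends on which clitic it is.
    by_clitic = {"le": "obj", "me": "a_obj", "y": "p_obj_loc"}
    if pos[-2:] == ["clo", "vinf"]:
        n = len(pos)
        return [(by_clitic.get(words[-2], "obj"), n - 1, n)]
    return []

print(gn_links(["det", "nc"]))                    # [('det', 1, 2)]
print(gp_links(["prep", "clo", "vinf"],
               ["pour", "y", "aller"]))           # [('p_obj_loc', 2, 3)]
```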

Automatic dependency parsing
Training a dependency parser on the DECODA gold corpus.
Main issues: non-projectivity, multiple roots, overlapping parses.

Adding dummy relations (governor after / governor before):

                  Non-projective   Multi-root   Overlapping
Baseline          8.21%            40.82%       1.29%
Governor after    7.83%            0%           4.81%
Governor before   6.49%            0%           2.68%
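One plausible reading of these transforms, sketched below: when a turn has several roots, every extra root is re-attached to the nearest root that follows it (governor after) or precedes it (governor before), leaving a single root per turn. The head encoding, with 0 marking a root, is an assumption:

```python
def attach_extra_roots(heads, after=True):
    # heads[i-1] is the governor of word i, 0 marking a root.
    roots = [i for i, h in enumerate(heads, start=1) if h == 0]
    keep = roots[-1] if after else roots[0]
    fixed = list(heads)
    for r in roots:
        if r == keep:
            continue
        candidates = [x for x in roots if (x > r if after else x < r)]
        fixed[r - 1] = min(candidates) if after else max(candidates)
    return fixed

# "bonne journée // merci // au revoir": roots at journée(2),
# merci(3) and revoir(5) before the transform.
print(attach_extra_roots([2, 0, 0, 5, 0], after=True))  # [2, 3, 5, 5, 0]
```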

Preliminary results
MATE parser (http://code.google.com/p/mate-tools): a graph-based dependency parser with 2nd-order features, MIRA training, non-projectivity allowed.
Training and testing on the DECODA GOLD: 80% training, 10% dev, 10% test.

                  DEV                 TEST
                  LAS       UAS       LAS       UAS
Baseline          87.72%    91.41%    87.71%    91.19%
Governor after    87.33%    91.15%    87.87%    91.36%
Governor before   87.57%    91.41%    87.37%    90.80%

Conclusions and perspectives
Linguistic analysis of the DECODA corpus: POS, "simple" disfluencies, named entities, chunking, dependency links. Manual annotations on the GOLD corpus, automatic projection toward the whole DECODA corpus, and a preliminary evaluation of the quality of the automatic annotations.
Integration into the automatic speech processing tools: adaptation of the MACAON models, adding the "disfluency normalisation" module, integration with ASR output, and evaluation on a Spoken Language Understanding task.