Syntactic annotation of spontaneous speech: application to call center conversation data Frédéric Béchet, Thierry Bazillon, Benoit Favre, Alexis Nasr Aix Marseille Université LIF-CNRS Laboratoire d'Informatique Fondamentale de Marseille Workshop on Spoken Treebanks Paris, November 15th
Content of this talk Context of this work Syntactic analysis of speech in our lab Why? How? For which kinds of application? The DECODA corpus A call-center human-human conversation corpus Part-Of-Speech annotation of the DECODA corpus A semi-supervised approach Syntactic dependency annotation of the DECODA corpus Training a first syntactic dependency parser for spontaneous speech
Content of this talk Context of this work Syntactic analysis of speech in our lab Why? How? For which kinds of application? The DECODA corpus A call-center human-human conversation corpus Part-Of-Speech annotation of the DECODA corpus A semi-supervised approach Syntactic dependency annotation of the DECODA corpus Training a first syntactic dependency parser for spontaneous speech
Context of this study Linguistic analysis of spoken messages for developing automatic speech processing systems Spoken Language Understanding Automatic Speech Recognition Natural Language Processing Machine Learning
Spoken Language Understanding Applicative framework Automatic spoken dialog systems Call-routing Form filling Negotiation Speech analytics Broadcast shows Audio archives (INA) Call centers Main issue: processing spontaneous speech Human-Machine dialog Human-Human conversation
Spoken Language Understanding Speech analytics from G. Riccardi
Spoken Language Understanding Spoken conversation analysis from B. Favre
Spoken Language Understanding Why use syntactic analysis? Syntactic relations → Semantic relations Dependency analysis Ex: The CoNLL-2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies Semantic disambiguation Semantic role labelling Ex: ESTER2 BN named entity detection Organisation vs. Location La France a proposé au Conseil de sécurité de l'ONU... J'irais visiter la France l'année prochaine. Language characterization Read speech / prepared speech / spontaneous speech Speaker role labelling
Linguistic analysis of automatic transcriptions Main characteristics Automatic transcriptions = text generated by a language model Transcription errors (unknown words, deletions/insertions/substitutions) Stream of words, no structure What kind of linguistic analysis? The traditional view of parsing based on context-free grammars is not suitable for processing automatic transcriptions Ungrammatical structures in spontaneous speech Transcription errors Parsing based on dependency structures and discriminative machine learning techniques is much easier to adapt to speech processing Graph-based dependency parser (McDonald et al., 2007) Partial annotation can be performed The dependency parsing framework generates parses much closer to predicate-argument structures
Approaches developed at the LIF Development of spoken resources to train statistical models POS tagger / Chunker / Dependency parser MACAON NLP suite DECODA project Using existing linguistic resources Syntactic/semantic lexicons DicoValence, Dubois-Charlier Integration of the speech recognition and linguistic analysis processes NLP tools keeping ambiguities at each processing level From word lattices to hypothesis lattices Joint processing Cost function dependent on the targeted application
Content of this talk Context of this work Syntactic analysis of speech in our lab Why? How? For which kinds of application? The DECODA corpus A call-center human-human conversation corpus Part-Of-Speech annotation of the DECODA corpus A semi-supervised approach Syntactic dependency annotation of the DECODA corpus Training a first syntactic dependency parser for spontaneous speech
The DECODA project DEpouillement automatique de COnversations provenant de centres D'Appels Partners Université d'Avignon et des Pays de Vaucluse (UAPV) Laboratoire d'Informatique Fondamentale de Marseille (LIF) Sonear RATP Program ANR Contint 2009 Start: October 2009 Duration: 36 months
The DECODA project Applicative framework: the RATP call center Paris public transport authority Route assistance, complaints, information desk, ... Recording of all conversations Goal: quality control + statistics Forms manually filled by the operators Partial (and noisy) description of the conversations Goals of the project Applicative: automatic spoken conversation analysis tools (summarization, speech analytics); an interface helping the operators fill the info forms during a conversation Scientific: limiting the need for supervision to build models (machine learning with weakly supervised methods); linguistic analysis of spontaneous speech conversations (robust syntactic/semantic analyses)
The DECODA corpus Data collection Paris public transport authority (RATP) call center Easy collection of large amounts of data >1000 calls a day Large range of speakers Very few personal data Easy to anonymise without erasing a lot of signal Various acoustic quality Cell phones + noisy environments Current state of the corpus 1514 dialogs selected from 2 days of the call center traffic 74 hours of signal Average duration: 3 minutes (12% over 5 minutes)
The DECODA corpus Transcription process Each file is manually anonymised Manual segmentation Dialog sections Speakers Manual transcription with Transcriber ESTER transcription guide Corpus statistics 1514 files 96103 speaker turns 482745 words after tokenization Most frequent word: euh The total vocabulary of the corpus is 8806 words Example
The DECODA corpus: annotation process Semantic Annotations Manual annotation of the whole corpus based on the RATP ontology 10 top call types Syntactic Annotations 4 levels Part-Of-Speech tags Named Entities Syntactic chunks Syntactic dependencies Method Manual annotation of a subset of the corpus (100 dialogs) Projection toward the whole corpus
Content of this talk Context of this work Syntactic analysis of speech in our lab Why? How? For which kinds of application? The DECODA corpus A call-center human-human conversation corpus Part-Of-Speech annotation of the DECODA corpus A semi-supervised approach Syntactic dependency annotation of the DECODA corpus Training a first syntactic dependency parser for spontaneous speech
Adapting MACAON to spontaneous speech transcriptions Macaon NLP suite POS tagger Chunker Named Entities Dependency analysis Main feature Ambiguity management: hypothesis lattices (input/output) XML format Direct integration of word lattices produced by a speech recognizer Filter: HTK lattices → MACAON Model training: French Treebank (Paris 7) Adaptation to the DECODA corpus Semi-supervised adaptation
Semi-supervised adaptation Method Selection of 2 sub-corpora from the DECODA corpus «TRAIN»: 728 dialogs, 255K words «GOLD»: 156 dialogs, 55K words 3 levels of annotation POS / «simple» disfluencies / named entities Manual annotation of the «GOLD» corpus Semi-supervised adaptation Automatic annotation of the «TRAIN» corpus with the baseline MACAON models (FTB) Web correction interface based on regular expressions applied to the whole TRAIN corpus MACAON models retrained on the corrected «TRAIN» corpus Evaluation on the GOLD corpus to verify the quality of the corrections
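The regex-based correction step could look like the following sketch. The rule contents, the tagset and the `word/TAG` text format are assumptions for illustration, not the project's actual correction rules:

```python
import re

# Hypothetical regex correction rules applied to automatically POS-tagged
# utterances written as word/TAG pairs. Rule contents and tags are assumed.
RULES = [
    # the baseline FTB models mistag the filler "euh"; force an interjection tag
    (re.compile(r"\beuh/\w+"), "euh/I"),
    # detached "moi" before a subject clitic is a pronoun, not a common noun
    (re.compile(r"\bmoi/nc\b"), "moi/clo"),
]

def correct(tagged_line):
    """Apply every correction rule to one tagged utterance."""
    for pattern, replacement in RULES:
        tagged_line = pattern.sub(replacement, tagged_line)
    return tagged_line
```

In this scheme each rule fires on the whole TRAIN corpus at once, which is what makes the web-based correction interface efficient: one rule fixes many occurrences of the same systematic error.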
Levels of annotation Incremental process POS tagging Every word receives a tag Adding tags when needed to represent every phenomenon Disfluencies, spelled words, etc. Marking simple disfluencies Speech markers euh, bah, ben, hein... False starts bon bonjour Simple repetitions Single words: le le le le Bigram or trigram repetitions: je veux je veux dire Goal Prevent breaking Named Entities and syntactic chunks Named Entities
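The filler and simple-repetition marking described above can be sketched as follows. The filler list and the tag names are assumptions based on the examples given, not the project's actual annotation scheme:

```python
# Hypothetical sketch of "simple disfluency" marking: tag filler words and
# mark repeated 1-3 grams, keeping the last occurrence as fluent speech.
FILLERS = {"euh", "bah", "ben", "hein"}

def mark_disfluencies(tokens):
    """Return (token, tag) pairs with FILLER / REPEAT / O tags (tags assumed)."""
    tagged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in FILLERS:
            tagged.append((tokens[i], "FILLER"))
            i += 1
            continue
        repeated = False
        # look for a repeated n-gram (trigram, bigram, then single word)
        for n in (3, 2, 1):
            if i + 2 * n <= len(tokens) and tokens[i:i + n] == tokens[i + n:i + 2 * n]:
                for tok in tokens[i:i + n]:
                    tagged.append((tok, "REPEAT"))
                i += n
                repeated = True
                break
        if not repeated:
            tagged.append((tokens[i], "O"))
            i += 1
    return tagged
```

On "euh je veux je veux dire", this marks "euh" as a filler and the first "je veux" as a bigram repetition, leaving the final "je veux dire" untouched.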
Example of disfluencies and NE annotations voilà euh je viens de descendre du bus et j' ai oublié en fait euh un une petite boite euh dans le bus euh où il y a mes affaires euh c' est le c' est le deux cent quatre-vingt-un qui je pense qu' il arrive à Eurostar là à Europarc pardon je viens de descendre du bus et j' ai oublié en fait une petite boîte dans le bus où il y a mes affaires c' est le <E_T>deux-cent-quatre-vingt-un</E_T> qui je pense qu' il arrive à <E_N>Eurostar</E_N> là à <E_A>Europarc</E_A>
[Diagram: semi-supervised annotation loop — the MACAON tools produce automatic annotations of the TRAIN corpus; these are manually corrected; the MACAON models are retrained on the corrected annotations; the retrained models are evaluated against the manually annotated GOLD corpus.]
Error Correction at the POS level Baseline MACAON models trained on the French Treebank (Paris 7): the newspaper «Le Monde», i.e. written French. Spoken French differs: Personal pronouns — Tu: toi tu veux prendre cet appel? (clo) C': c'est pas toujours le cas (clo) Nous: nous on est arrivés avec notre idée (clo) Moi: moi j'ai envie de te parler (nc) Toi: toi tu vas la récupérer (clo) Lexical ambiguities — bon c'est vrai qu'il a pas tort quoi (interjections); moi j'aime que le métro (negation adverb); bah, nan (unknown words, added as interjections) Repetitions — C'est un peu le le le principe (clo clo det)
Evaluation of the POS model adaptation 2 models Baseline models trained on the FTB DECODA models trained on the corrected TRAIN corpus 2 test corpora DECODA GOLD EPAC GOLD Broadcast conversation (radio interviews, radio talk shows) Results: over 50% error reduction POS error rate (baseline FTB → adapted DECODA models): DECODA 25.4% → 11.6%; EPAC 16.2% → 7.2%
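The error-reduction figure can be checked directly from the table's error rates:

```python
def relative_error_reduction(baseline, adapted):
    """Relative reduction of the error rate brought by adaptation."""
    return (baseline - adapted) / baseline

# POS error rates (in %) from the table above
decoda = relative_error_reduction(25.4, 11.6)  # ~0.54
epac = relative_error_reduction(16.2, 7.2)     # ~0.56
```

Both corpora thus show a relative error reduction slightly above 50%, consistent with the headline figure.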
Content of this talk Context of this work Syntactic analysis of speech in our lab Why? How? For which kinds of application? The DECODA corpus A call-center human-human conversation corpus Part-Of-Speech annotation of the DECODA corpus A semi-supervised approach Syntactic dependency annotation of the DECODA corpus Training a first syntactic dependency parser for spontaneous speech
Syntactic Dependency annotation Method Manual annotation on the DECODA «GOLD» Training of a Graph-based dependency parser on the GOLD corpus Application to the whole DECODA corpus Annotation process DECODA «GOLD» without disfluencies, with NE and gold POS Chunking process with MACAON on the gold POS Dependency links at the chunk level To speed up the annotation process Used by the Spoken Language Understanding modules Automatic projection of the links from the chunk to the word level. Annotation guide Derived from the French TreeBank annotation guide http://alpage.inria.fr/statgram/frdep/publications/ftb-guidedepsurface.pdf Simplification of some annotation conventions since word-to-word links weren't needed here. 16 types of syntactic dependencies
Syntactic Dependency annotation 1. subject (suj): Jean est mon ami 2. impersonal subject (suj_imp): il pleut beaucoup ce matin 3. direct object (obj): je lis le journal 4. indirect object with {de} preposition (de_obj): il se souvient de ses vacances 5. indirect object with {à} preposition (a_obj): il pense à toi 6. indirect object introduced with another preposition (p_obj): il compte sur toi 7. locative object (p_obj_loc): j'habite à Marseille 8. coordination (coord): du pain et des jeux 9. dependant of the coordination (dep_coord): du pain et des jeux 10. subject attribute (ats): je suis content 11. object attribute (ato): il me trouve intelligent 12. reflexive pronoun (aff): je me lève 13. relative subordinate clause (mod_rel): l'homme qui rit 14. comparative (arg_comp): il est plus grand que toi 15. adverbial phrase (mod): il travaille depuis deux jours 16. noun complement (dep) : le journal de mon ami
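As a toy illustration of these labels, the example "je lis le journal" can be encoded as head/label pairs. The token indices, the head choices, the root convention and the "det" label are assumptions for illustration, not the project's gold annotation:

```python
# (index, token, head_index, label); head 0 denotes the root.
sentence = [
    (1, "je",      2, "suj"),   # subject of "lis"
    (2, "lis",     0, "root"),
    (3, "le",      4, "det"),   # determiner link (word-level label, assumed)
    (4, "journal", 2, "obj"),   # direct object of "lis"
]

def dependents_of(head_index):
    """Return the tokens governed by the given head index."""
    return [token for (_, token, head, _) in sentence if head == head_index]
```

Each word has exactly one governor, so the structure is a tree rooted at the main verb; this is the property the dummy-relation strategies below are designed to restore for spontaneous speech.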
Syntactic Dependency annotation All the dialogs were annotated by one of two human annotators. A subset of 20 dialogs was annotated by both annotators in order to measure inter-annotator agreement. A web interface was used for the annotation task. Every dialog is segmented into chunks described by: the chunk position in the dialog the chunk content the POS tag of each word inside the chunk the chunk type Sentences can contain chunks or groups of chunks not connected to the rest of the sentence: spoken disfluencies such as false starts or juxtaposed structures. Juxtaposed structures are very frequent in oral conversations: speakers don't always use relative pronouns, subordinating conjunctions or coordinating conjunctions to articulate their speech.
Examples of chunks on the DECODA data
WEB annotation interface
Annotation process: issues with the chunking process Chunk grammar defined on written text Detachment makes some rules inapplicable det+nc+pro : Written: le gouvernement lui se réserve le droit d'intervenir Oral: le trajet moi ça me semble très long det+np+nc : Written: le Molière comédien est moins célèbre que le Molière auteur Oral: le Navigo monsieur c'est quinze euros vingt-cinq Some chunks are «broken» by speech disfluencies les horaires du bus numéro je pense trois cent trente
Annotation process: issues with spoken phenomena Lots of multiple relations lui il passe à Villemomble ce bus Some dependencies are difficult to assess J'arrive pas à accéder au / ouais ça marche pas / site web Sequences of chunks with no dependencies (different dialog acts) bonne journée // merci // au revoir Agrammaticality donc c'est les horaires qu'il faut que je prenne le bus? Cleft sentences (phrases clivées) Ce que vous voulez, c'est les horaires
From chunk dependencies to word dependencies Annotating dependencies inside the chunks Patterns for the non-ambiguous cases GN : det nc* le bus // mon fils // cette grève det(1,2) GV : clneg v advneg adv vppart vppart* n'ai pas totalement été remboursée mod(1,2), mod(3,2) mod(4,2), aux(2,5), aux(5,6) Lexicalized patterns for some ambiguous cases GP : prep* clo vinf pour le prendre pour me dire pour y aller obj(2,3) a_obj(2,3) p_obj_loc(2,3)
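This pattern-based projection can be sketched as follows. The pattern inventory and the lexicon entries below are simplified assumptions, not the project's actual resources:

```python
# Non-ambiguous GN pattern: in a "det nc" chunk, attach the determiner
# to the noun, yielding det(1,2) as in the slide above.
def gn_links(pos_tags):
    links = []
    if pos_tags[:2] == ["det", "nc"]:
        links.append(("det", 1, 2))  # det(1,2): word 1 depends on word 2
    return links

# Lexicalized choice for ambiguous "prep clo vinf" GP chunks: the relation
# of the clitic (word 2) to the infinitive (word 3) depends on the verb.
GP_LEXICON = {"prendre": "obj", "dire": "a_obj", "aller": "p_obj_loc"}

def gp_link(infinitive):
    """Pick the clitic's relation for the given infinitive (fallback assumed)."""
    return (GP_LEXICON.get(infinitive, "obj"), 2, 3)
```

Non-ambiguous patterns fire purely on POS sequences; the lexicalized ones consult a verb lexicon, which is where resources like DicoValence could plug in.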
Automatic dependency parsing Training a dependency parser on the DECODA gold corpus Main issues Non-projectivity Multiple roots Overlapping parses
Adding dummy relations: two strategies, attaching extra roots to a governor after or before them
                 Non-projective  Multi-root  Overlapping
Baseline         8.21%           40.82%      1.29%
Governor after   7.83%           0%          4.81%
Governor before  6.49%           0%          2.68%
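The two attachment strategies can be sketched as follows. The head-array representation and the exact chaining scheme are assumptions for illustration:

```python
# Sketch of the "dummy relation" strategies: when a parse has several roots,
# chain the extra roots so that a single root remains.
def attach_extra_roots(heads, direction="after"):
    """heads[i] is the head of token i+1 (0 = root). Attach all but one
    root to the following (or preceding) root with a dummy link."""
    roots = [i + 1 for i, h in enumerate(heads) if h == 0]
    new_heads = list(heads)
    if direction == "after":
        # each extra root is governed by the next root; the last one survives
        for r, nxt in zip(roots[:-1], roots[1:]):
            new_heads[r - 1] = nxt
    else:
        # each extra root is governed by the previous root; the first survives
        for prev, r in zip(roots[:-1], roots[1:]):
            new_heads[r - 1] = prev
    return new_heads
```

Either direction removes all multi-root structures (the 0% column above), at the cost of introducing some overlapping or non-projective dummy links.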
Preliminary results MATE parser (http://code.google.com/p/mate-tools) Graph-based dependency parser 2nd-order features; MIRA training; non-projectivity allowed Training and testing on the DECODA GOLD: 80% training, 10% dev, 10% test
                 DEV LAS   DEV UAS   TEST LAS  TEST UAS
Baseline         87.72%    91.41%    87.71%    91.19%
Governor after   87.33%    91.15%    87.87%    91.36%
Governor before  87.57%    91.41%    87.37%    90.80%
Conclusions and perspectives Linguistic analysis of the DECODA corpus POS, «simple» disfluencies, Named Entities, chunking, dependency links Manual annotations on the «GOLD» corpus Automatic projection toward the whole DECODA corpus Preliminary evaluation of the quality of the automatic annotations Integration into the automatic speech processing tools Adaptation of the MACAON models Adding the «disfluency normalisation» module Integration with ASR output Evaluation on a Spoken Language Understanding task