Handling multiply-annotated multimodal corpora

Jean, University of Edinburgh
CLARIN/FLARENET multimodal workshop, Nov 2009

Outline 1 Who am I? (AMI and the NITE XML Toolkit) 2 3

Background
- research on spoken dialogue and how small groups interact
- designed and developed the NITE XML Toolkit in response to the needs of distributed multi-disciplinary groups that share data
- consult for or support many types of data collection involving language plus something else (video annotation, eyetracking, dialogue system logs, ...)
- secondary contributor to TEI, ISO, and W3C standards efforts

AMI Corpus (100 hours)
- 4 close- and 2 wide-view cameras, 4 head-set and 8 array microphones, presentation screen capture, whiteboard capture, pen devices, plus extra site-dependent devices

AMI Annotations
- transcription with word-level timings from forced alignment
- timestamping against signal: head gestures; hand gestures for addressing and interactions with objects; location in room; gaze
- annotation above words: dialogue acts (some with addressing), named entities, topic segments, linked extractive and abstractive summaries, subjectivity
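The distinction between annotation timestamped against the signal and annotation built above words can be sketched in a few lines. The following is a minimal illustration, not the AMI or NITE XML Toolkit data model: the `Span` class, the `overlapping` helper, and the gesture are hypothetical, and only the timings for "the" and "doesn't" come from the example figure later in these slides. It shows how a signal-timed gesture can be related to words purely through time overlap.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One annotation element with start and end times in seconds (hypothetical type)."""
    label: str
    start: float
    end: float


def overlapping(target: Span, layer: list[Span]) -> list[Span]:
    """Return the elements of `layer` whose time interval overlaps `target`."""
    return [s for s in layer if s.start < target.end and target.start < s.end]


# Word layer: timings come from forced alignment in a real corpus.
# The values for "government" and "have" are invented for illustration.
words = [
    Span("the", 47.48, 47.61),
    Span("government", 47.61, 47.96),
    Span("doesn't", 47.96, 48.18),
    Span("have", 48.18, 48.35),
]

# Signal-timed layer: an invented head gesture, stamped against the recording.
head_gestures = [Span("nod", 47.90, 48.20)]

for gesture in head_gestures:
    hits = overlapping(gesture, words)
    print(gesture.label, [w.label for w in hits])
```

Layers built above words (dialogue acts, named entities) would instead point at word elements and inherit their timings, which is why the word-level alignment matters so much.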

The AMI Method
- Use hand-annotation to create automatic mark-up components
- Guess what features will help with new problems using hand-annotations
- Share automatic annotations so you'll know what will happen if you plug in someone else's technology
- Informed by all this, decide what new applications will work and build them

Personal goal: increase inter-disciplinary collaboration
- technologists often make dumb choices in their data collection and annotation
- computational linguistics is stuck in a rut - surely knowing something about the properties and structure of the data helps
- soft sciences can really benefit from some automation, and they need to save money on data collection even more than systems developers do

NITE XML Toolkit
- Open source toolkit for handling annotations with temporal ordering and full structural relations
- Data storage format designed to support distributed corpus development
- Libraries for data handling, query, and writing graphical user interfaces
- End user browsing and annotation tools for common tasks
- Command line utilities for analysis, feature extraction

Example interface (one of many)

Typical data

[Figure: layered annotation of the utterance "the government doesn't have to deal with it" on a shared timeline (t in seconds, 47.0 to 49.0). Layers shown: syntax tree (nt nodes such as S, VP, NP, PP, EDITED), dialogue act (da: statement), disfluency (reparandum, repair), movement (source, target), kontrast (backgd, contrast), markable (organisation, med-gen, non-concrete, old), words with part-of-speech tags (DT NN VBZ-RB VB TO VB IN PRP), syllables, phonwords with timings (e.g. "the" 47.48-47.61, "doesn't" 47.96-48.18), phones, prosodic phrases (disfl, minor, major), and accents (nuclear, plain).]

Community support
- stand-off annotation using multiple files under version control
- dependency structure for keeping track of which annotations rely on which versions of which other annotations
- multiple competing annotations for the same thing (different humans for a reliability assessment, different automatic processes for a competition)
- logical query language - because this is the only way to analyse this kind of data
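Stand-off annotation with competing layers can be made concrete with a small sketch. The XML below is an invented layout, not the NITE XML Toolkit storage format, and the comparison is plain Python rather than the toolkit's query language; the element names, the `from`/`to` attributes, and the dialogue-act labels are all assumptions for illustration. Each annotator's file holds only pointers into a shared words file, so two people can label the same material without touching each other's data.

```python
import xml.etree.ElementTree as ET

# Base layer: one file of words, each with a stable id (hypothetical layout).
WORDS_XML = """<words>
  <w id="w1">okay</w>
  <w id="w2">the</w>
  <w id="w3">government</w>
  <w id="w4">doesn't</w>
  <w id="w5">have</w>
</words>"""

# Two competing dialogue-act layers, one file per annotator.
# They contain no text, only references to word ids.
ACTS_A = """<acts annotator="A">
  <da label="backchannel" from="w1" to="w1"/>
  <da label="statement" from="w2" to="w5"/>
</acts>"""

ACTS_B = """<acts annotator="B">
  <da label="assess" from="w1" to="w1"/>
  <da label="statement" from="w2" to="w5"/>
</acts>"""


def load_words(xml_text):
    """Return the ordered list of word ids and a map from id to word text."""
    root = ET.fromstring(xml_text)
    ids = [w.get("id") for w in root.iter("w")]
    text = {w.get("id"): w.text for w in root.iter("w")}
    return ids, text


def resolve(acts_xml, ids, text):
    """Map each (first_id, last_id) span to (label, covered text) for one layer."""
    out = {}
    for da in ET.fromstring(acts_xml).iter("da"):
        first, last = da.get("from"), da.get("to")
        covered = ids[ids.index(first): ids.index(last) + 1]
        out[(first, last)] = (da.get("label"), " ".join(text[i] for i in covered))
    return out


ids, text = load_words(WORDS_XML)
a = resolve(ACTS_A, ids, text)
b = resolve(ACTS_B, ids, text)

# Spans both annotators marked but labelled differently.
disagreements = {
    span: (a[span][0], b[span][0])
    for span in a.keys() & b.keys()
    if a[span][0] != b[span][0]
}
print(disagreements)
```

Because the layers only reference ids, a change to the words file can silently invalidate them, which is the reason for tracking which annotation versions depend on which.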

Software support is everything
"Build a better mousetrap, and the world will beat a path to your door." Ralph Waldo Emerson

It ain't so
- People will use a bad tool they already have even if it's inappropriate
- the less computational the field, the harder it is to get people to install software

No one can afford to support every feature
- open data formats are essential
- interoperability reduces risk and gives people the ability to try wild and wonderful new things with their data
- for most users, you have to suggest data paths

People are reluctant to share data
- First person to do X gets a decent paper even if X is stupid or trivial; others get them for beating the results
- Researchers always worry that someone will publish the second paper before they do the first one
- Current situation is iniquitous - can't beat the results without the data, but only friends get it
- Not enough to find funding and infrastructure for data releases - need to convince people to let go

Individual researchers think short term
- deadline driven, always rushed for time
- Data quality suffers unless they intend to release from the beginning
- Documentation (if it exists) usually suffers from nerdview

Tag set re-use sounds good, but often is bad
- New research often requires new theoretical advances and new ways of thinking about tags
- People just blindly apply a tag set without thinking about what they need
- Usable tags developed for one data set/use don't actually fit others

What's an annotation standard?
- Hopeless if not informed by past tag sets from all corners of the set of users you hope to attract
- Cleaned-up union of previously used tags is never usable
- Best option: describe the space that a tag set for X can occupy and encourage documentation relating tags to this space

What would really make research better?
- an open data format that tools can import and export, containing all the representational richness that any data and tool might need
- coordination of enough tools developers to establish good data paths
- carrots for data sharing

Attracting users
- demonstration using a reformulation of some well-loved corpus
- give people something they don't have, as a lure
- Sometimes, bad speech recognition is better than no transcription, but people are still doing without - hook up with webasr?
- unskilled forced alignment tool (only takes a bad ASR system and dictionary entries by a computer-literate language speaker)
- Speech spotter to find areas of speech
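To give a sense of how little is needed for a first-pass speech spotter, here is a minimal sketch that marks regions whose short-term energy exceeds a threshold. This is a stand-in technique chosen for illustration, not the tool the slide has in mind; the function name `spot_speech`, the 20 ms frame, and the threshold are assumptions, and the input is a synthetic tone rather than real speech.

```python
import math


def spot_speech(samples, rate, frame_ms=20, threshold=0.01):
    """Return (start_s, end_s) regions whose frame RMS energy exceeds `threshold`."""
    frame = int(rate * frame_ms / 1000)
    regions, start = [], None
    for i in range(0, len(samples) - frame + 1, frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(x * x for x in chunk) / frame)
        active = rms > threshold
        if active and start is None:
            start = i                      # region opens at this frame
        elif not active and start is not None:
            regions.append((start / rate, i / rate))
            start = None                   # region closes before this frame
    if start is not None:                  # signal ended while still active
        regions.append((start / rate, len(samples) / rate))
    return regions


# Synthetic test signal: 1 s silence, 1 s of a 200 Hz tone, 1 s silence.
rate = 8000
silence = [0.0] * rate
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
signal = silence + tone + silence

print(spot_speech(signal, rate))
```

On real recordings a fixed energy threshold will also fire on door slams and keyboard noise, so the output is best treated as a pointer to where a human or a recogniser should look.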

Encouraging data sharing
- Often easier to get researchers to agree in principle to data release at some specific time in the future than now, no matter how old the data is
- Pressure from funders, except it's too hard for them to get the conditions right
- As in American psychology, pressure from the professional association that controls journals