Transcription bottleneck of speech corpus exploitation

Caren Brinckmann
Institut für Deutsche Sprache, Mannheim, Germany
Lesser Used Languages and Computer Linguistics (LULCL) II, Nov 13/14, 2008, Bozen

Overview
- Introduction
  - Written corpora vs. speech corpora
  - Speech corpus annotation
- Transcription bottleneck
  - Crowdsourcing the orthographic transcription
  - Automatic broad phonetic alignment
  - Query-driven annotation
- Summary

Written vs. speech corpora
- Written corpora can be compiled/accessed more easily
  - web as corpus
  - large available corpora, e.g. DeReKo for German (3.4 billion words): http://www.ids-mannheim.de/kl/projekte/korpora/
- Written corpora can be exploited without any annotation, e.g. extraction of higher-order collocations in CCDB: http://corpora.ids-mannheim.de/ccdb/
- Limited availability of speech corpora
- Speech corpora need at least a basic transcription

Speech corpus annotation
- "Basic" transcription: orthographic transcription
  - languages without standardized orthography?
- Text-to-audio alignment
- Phonetic transcription for phonetic and phonological research
- Prosody, information structure, coreferences, POS, ...

Transcription bottleneck
- Reliable orthographic transcription: only feasible for near-native speakers
  - problem: minority languages / dialectal speech
  - approach: crowdsourcing the orthographic transcription
- Phonetic transcription: manual annotation is very time-consuming (ratio of recording time to annotation time of about 1:200) and requires considerable skill
  - approaches: automatic broad phonetic alignment, query-driven annotation

Crowdsourcing: Introduction
- Term coined by Jeff Howe (Wired, June 2006)
- Outsourcing: subcontracting a process, such as product design or manufacturing, to a third-party company
- Crowdsourcing: outsourcing a task traditionally performed by an employee or contractor to an undefined, generally large group of people
- Classical crowdsourcing: self-service restaurants, supermarkets, IKEA, ATMs, ticket machines
- New: use the Internet to publicize and manage crowdsourcing projects
- "Wisdom of crowds": aggregating information in groups results in decisions that are often better than any single member of the group could have made

Amazon Mechanical Turk (mturk.com)

Distributed Proofreaders (pgdp.net)

Recording Teenagers: Ph@ttSessionz (LMU Munich)

Key guidelines for successful crowdsourcing
1. Be focused: vaguely defined problems get vague answers
2. Get your filters right: use crowd and experts to extract the best answers
3. Tap the right crowds: find the best experts in the mass
4. Build community into social networks
(BusinessWeek, September 25, 2006)

Possible application: speech corpus "German Today"
- Recordings in 160+ towns throughout the German-speaking area of Europe (D, A, CH, LUX, I, B, FL)
- 4 high school students (aged 16-20) in every town and 2 older adults (aged 50-60) in 80 towns
- 800+ speakers
- 90 minutes per speaker
- 1200 hrs. of speech
- Material: read speech, interview, map task
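The corpus figures above can be checked with a little arithmetic. The sketch below uses the lower bounds from the slide ("160+" towns taken as 160) and, as an assumption, applies the 1:200 ratio quoted earlier for manual phonetic transcription to the whole corpus. The constant and function names are illustrative and not part of the talk.

```python
# Lower-bound figures taken from the slide ("160+ towns", "800+ speakers").
TOWNS = 160
STUDENTS_PER_TOWN = 4
TOWNS_WITH_ADULTS = 80
ADULTS_PER_TOWN = 2
MINUTES_PER_SPEAKER = 90

# Assumption: the 1:200 ratio for manual phonetic transcription quoted in the
# talk holds uniformly across the corpus.
MANUAL_PHONETIC_FACTOR = 200


def corpus_estimate():
    """Return (speakers, hours of speech, hours of manual phonetic annotation)."""
    speakers = TOWNS * STUDENTS_PER_TOWN + TOWNS_WITH_ADULTS * ADULTS_PER_TOWN
    speech_hours = speakers * MINUTES_PER_SPEAKER / 60
    annotation_hours = speech_hours * MANUAL_PHONETIC_FACTOR
    return speakers, speech_hours, annotation_hours


speakers, speech_hours, annotation_hours = corpus_estimate()
print(f"{speakers} speakers, {speech_hours:.0f} h of speech, "
      f"{annotation_hours:.0f} h of manual phonetic annotation")
```

Under these assumptions, a fully manual phonetic transcription would take on the order of 240,000 working hours, which is why the talk turns to crowdsourcing, automatic alignment and query-driven annotation.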

Map Task
[Figure: example maps for Bruneck and Landeck, each with a marked start ("Start") and destination ("Ziel")]

Crowdsourcing the orthographic transcription
Dialectal spontaneous speech (map task data) can be transcribed reliably only by (near-)native speakers of the dialect.
Possible crowdsourcing implementation:
- central database of speech signals, metadata, transcripts, and information about the users/transcribers
- web-based transcription software, e.g. WebTranscribe (as used in Ph@ttSessionz)
- clearly defined task: transcribe each inter-pause stretch with standard German orthography
- quality assurance: parallel transcription, evaluation + control tasks (as employed by CastingWords on mturk.com)
- recruit transcribers: contact the schools where the recordings took place and/or the speakers directly
- community: points / virtual titles, rewards (e.g. visit to IDS), games, ...
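One way to picture the "parallel transcription" quality check is to have two transcribers work on the same inter-pause stretch and flag the stretch for review when their transcripts diverge too much. The following is a minimal sketch under that assumption. The word-level edit distance, the 0.2 threshold and all function names are illustrative choices, not the procedure used by WebTranscribe or CastingWords.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def disagreement(transcript_1, transcript_2):
    """Edit distance normalised by the length of the longer transcript."""
    words_1 = transcript_1.lower().split()
    words_2 = transcript_2.lower().split()
    longest = max(len(words_1), len(words_2))
    return word_edit_distance(words_1, words_2) / longest if longest else 0.0


def needs_review(transcript_1, transcript_2, threshold=0.2):
    """Flag a stretch when the two parallel transcripts diverge too much."""
    return disagreement(transcript_1, transcript_2) > threshold


# Two hypothetical transcripts of the same inter-pause stretch.
first = "dann gehst du links am See vorbei"
second = "dann gehst links am Seh vorbei"
print(round(disagreement(first, second), 2), needs_review(first, second))
```

Stretches that pass the check could be accepted automatically, while flagged ones go to a third transcriber or an expert.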

Automatic broad phonetic alignment
Input:
- speech signal
- orthographic transcription
- canonical/phonemic transcription of all words in the corpus
  - pronunciation lexicon
  - grapheme-to-phoneme converter
- language-specific phoneme models (e.g. trained HMMs)
Output:
- time-aligned broad phonetic transcription
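The canonical transcription that feeds the aligner can be pictured as a lexicon lookup with a grapheme-to-phoneme fallback for unknown words. The sketch below illustrates only that input-preparation step, not the HMM-based alignment itself. The lexicon entries, the SAMPA-like symbols and the handful of letter-to-sound rules are toy assumptions and far from a realistic German converter.

```python
# Toy pronunciation lexicon (SAMPA-like symbols); entries are illustrative only.
LEXICON = {
    "guten": ["g", "u:", "t", "@", "n"],
    "tag": ["t", "a:", "k"],
}

# Toy letter-to-sound rules, tried in this order at each position.
# Assumption: NOT a realistic German grapheme-to-phoneme converter.
G2P_RULES = [("sch", "S"), ("ch", "x"), ("ei", "aI"), ("ie", "i:")]


def g2p(word):
    """Fallback conversion: apply the first matching rule, else copy the letter."""
    phones, i = [], 0
    while i < len(word):
        for grapheme, phone in G2P_RULES:
            if word.startswith(grapheme, i):
                phones.append(phone)
                i += len(grapheme)
                break
        else:
            phones.append(word[i])
            i += 1
    return phones


def canonical_transcription(utterance):
    """Look each word up in the lexicon, falling back to g2p for unknown words."""
    return [(word, LEXICON.get(word) or g2p(word))
            for word in utterance.lower().split()]


print(canonical_transcription("Guten Tag Schein"))
```

An aligner would then match this phoneme sequence against the speech signal using the language-specific phoneme models.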

Example: orthographic transcription

Munich Automatic Segmentation System (MAUS)

Modelling post-lexical phonological processes

Obvious errors

Evaluation: comparison with manual transcription
Van Bael et al. (2006, 2007) compared 10 aligners for Dutch with a manually obtained reference transcription.
Results:
- Best performance: canonical transcription + modelling of post-lexical phonological processes with a decision tree
- Number of remaining disagreements with the reference transcription (14.6% for spontaneous speech, 8.1% for read speech) only slightly higher than human inter-labeller disagreement scores reported in the literature
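A disagreement percentage of this kind can be illustrated by counting the substitutions, deletions and insertions needed to turn the reference phone sequence into the automatic one, relative to the number of reference phones. This is a sketch of one common way to compute such a score. Whether it matches the exact metric of Van Bael et al. is an assumption, and the example sequences are invented.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_phone in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_phone in enumerate(hyp, start=1):
            cost = 0 if ref_phone == hyp_phone else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def percent_disagreement(ref, hyp):
    """Disagreements as a percentage of the number of reference phones."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Invented example: a manual reference vs. an automatic transcription.
reference = ["h", "a:", "b", "@", "n"]
automatic = ["h", "a:", "m"]
print(percent_disagreement(reference, automatic))
```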

Task-based evaluation
- access specific portions of the speech signal for further manual annotation?
- duration-based analyses (only large, significant effects can be found)
- analyses in the frequency domain (e.g. formant slope)

Phonetic aligners for less-resourced languages?
- build your own using HTK
  - but: you need at least one hour of phonetically segmented and labelled speech data
- find an aligner for a language that is phonetically similar to your target language and use its pre-built HMMs
  - add a pronunciation lexicon and/or grapheme-to-phoneme rules
  - define a mapping between the phonemes of your target language and the HMM-modelled language
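The phoneme mapping in the second option can be sketched as a simple lookup table: each target-language phoneme is replaced by the closest phoneme for which the borrowed aligner has a pre-built HMM. The inventory and the mapping below are hypothetical placeholders. A real mapping would have to be designed by someone who knows both languages.

```python
# Hypothetical phoneme inventory covered by the donor language's pre-built HMMs.
DONOR_PHONES = {"i:", "e:", "S", "a", "t", "n", "k"}

# Hypothetical mapping from target-language phonemes that the donor lacks
# to the closest available donor phoneme.
PHONE_MAP = {"y:": "i:", "2:": "e:", "C": "S"}


def map_to_donor(phones):
    """Return (mapped sequence, phones with neither a donor model nor a mapping)."""
    mapped, unmapped = [], []
    for phone in phones:
        donor = phone if phone in DONOR_PHONES else PHONE_MAP.get(phone)
        if donor is None:
            unmapped.append(phone)  # needs a mapping decision before alignment
        else:
            mapped.append(donor)
    return mapped, unmapped


print(map_to_donor(["k", "y:", "C", "@"]))
```

Reporting unmapped phonemes explicitly makes gaps in the mapping visible before any alignment is attempted.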

Traditional corpus annotation process
Gut (2008)

Problems with sequential corpus creation
- too time-consuming: many years of annotation work before corpus can be exploited and any results can be published
- very error-prone: limited reliability of annotations due to coder drift
- restricted corpus queries: failed/impossible queries lead to re-annotation of corpus

Cyclic and iterative corpus annotation ("agile corpus creation")
Gut (2008)

Query-driven phonetic annotation of "German Today"

Advantages of agile corpus creation
- Query-driven approach tests suitability and consistency of annotation schema
- very little data has to be re-annotated or discarded
- design errors, annotation errors and conceptual inadequacies become immediately visible
- successive cycles improve annotation schema and limit it to the elements necessary for the queries
- saves time
- early publication of first results

Combining automatic and query-driven annotation
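The combination can be pictured as follows: the automatic alignment provides time-stamped segments for the whole corpus, and a research query selects only those segments that still need detailed manual annotation. The sketch below assumes a very simple segment record and a query by phone label. The data structure, field names and example values are illustrative and not taken from the "German Today" tooling.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    word: str
    phone: str
    start: float  # seconds
    end: float    # seconds


# Invented output of an automatic broad phonetic alignment.
ALIGNMENT = [
    Segment("guten", "g", 0.00, 0.05),
    Segment("guten", "u:", 0.05, 0.16),
    Segment("guten", "t", 0.16, 0.22),
    Segment("tag", "t", 0.40, 0.47),
    Segment("tag", "a:", 0.47, 0.62),
    Segment("tag", "k", 0.62, 0.70),
]


def query(segments, phone):
    """Indices of all automatically aligned segments carrying the given label."""
    return [i for i, seg in enumerate(segments) if seg.phone == phone]


def pending(segments, phone, annotations):
    """Query hits that have not been manually annotated yet."""
    return [i for i in query(segments, phone) if i not in annotations]


manual_annotations = {2: "aspirated"}  # segment index -> manual label
print(pending(ALIGNMENT, "t", manual_annotations))
```

Only the pending hits are passed to a human annotator, so manual effort grows with the queries actually asked rather than with the size of the corpus.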

Summary
- speech corpora need at least a basic (orthographic) transcription to be exploitable
  - difficult to produce for languages/dialects with only few native speakers
  - approach: use crowdsourcing
- phonological research further requires phonemic/phonetic segmentation and labelling
  - very time-consuming
  - approach: combine automatic broad phonetic alignment with query-driven annotation

References
Brinckmann, C., Kleiner, S., Knöbl, R., and Berend, N. (2008): German Today: an areally extensive corpus of spoken Standard German. Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco.
Draxler, C. (2005): WebTranscribe: an extensible web-based speech annotation framework. Proceedings of the 8th International Conference on Text, Speech and Dialogue (TSD 2005), Karlovy Vary, Czech Republic, 61-68.
Keibel, H. and Belica, C. (2007): CCDB: a corpus-linguistic research and development workbench. Proceedings of Corpus Linguistics 2007, Birmingham, United Kingdom.
Raffelsiefen, R. and Brinckmann, C. (2007): Evaluating phonological status: significance of paradigm uniformity vs. prosodic grouping effects. Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS XVI), Saarbrücken, Germany, 1441-1444.
Schiel, F. (2004): MAUS Goes Iterative. Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 1015-1018.
Van Bael, C., Boves, L., van den Heuvel, H., and Strik, H. (2006): Automatic phonetic transcription of large speech corpora. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, 4-11.
Van Bael, C., Boves, L., van den Heuvel, H., and Strik, H. (2007): Automatic phonetic transcription of large speech corpora. Computer Speech and Language 21 (4), 652-668.
Voormann, H. and Gut, U. (2008): Agile corpus creation. Corpus Linguistics and Linguistic Theory 4 (2), 235-251.

Thank you!
brinckmann@ids-mannheim.de