Transcription bottleneck of speech corpus exploitation

1 Transcription bottleneck of speech corpus exploitation
Caren Brinckmann
Institut für Deutsche Sprache, Mannheim, Germany
Lesser Used Languages and Computer Linguistics (LULCL) II, Nov 13/14, 2008, Bozen

2 Overview
- Introduction
  - Written corpora vs. speech corpora
  - Speech corpus annotation
  - Transcription bottleneck
- Crowdsourcing the orthographic transcription
- Automatic broad phonetic alignment
- Query-driven annotation
- Summary

3 Written vs. speech corpora
- Written corpora can be compiled and accessed more easily
  - web as corpus
  - large available corpora, e.g. DeReKo for German (3.4 billion words)
- Written corpora can be exploited without any annotation, e.g. extraction of higher-order collocations in CCDB
- Limited availability of speech corpora
- Speech corpora need at least a basic transcription

4 Speech corpus annotation
- "Basic" transcription: orthographic transcription
  - languages without standardized orthography?
- Text-to-audio alignment
- Phonetic transcription for phonetic and phonological research
- Prosody, information structure, coreferences, POS, ...

5 Transcription bottleneck
- Reliable orthographic transcription: only feasible for near-native speakers
  - problem: minority languages / dialectal speech
  - → crowdsourcing the orthographic transcription
- Phonetic transcription: manual annotation is very time-consuming (about 1:200, i.e. roughly 200 hours of work per hour of speech) and requires considerable skill
  - → automatic broad phonetic alignment
  - → query-driven annotation

7 Crowdsourcing: Introduction
- Term coined by Jeff Howe (Wired, June 2006)
- Outsourcing: subcontracting a process, such as product design or manufacturing, to a third-party company
- Crowdsourcing: outsourcing a task traditionally performed by an employee or contractor to an undefined, generally large group of people
- Classical crowdsourcing: self-service restaurants, supermarkets, IKEA, ATMs, ticket machines
- New: use the Internet to publicize and manage crowdsourcing projects
- "Wisdom of crowds": aggregation of information in groups results in decisions that are often better than could have been made by any single member of the group

8 Amazon Mechanical Turk (mturk.com)

9 Distributed Proofreaders (pgdp.net)

10 Recording Teenagers: (LMU Munich)

11 Key guidelines for successful crowdsourcing
1. Be focused: vaguely defined problems get vague answers
2. Get your filters right: use crowd and experts to extract the best answers
3. Tap the right crowds: find the best experts in the mass
4. Build community into social networks
(BusinessWeek, September 25, 2006)

12 Possible application: speech corpus "German Today"
- Recordings in 160+ towns throughout the German-speaking area of Europe (D, A, CH, LUX, I, B, FL)
- 4 high school students (aged 16-20) in every town and 2 older adults (aged 50-60) in 80 towns
- 800+ speakers, 90 minutes per speaker: about 1200 hrs. of speech
- Material:
  - read speech
  - interview
  - map task
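
As a rough check on these figures, the speaker count and the amount of speech can be recomputed from the per-town numbers and combined with the 1:200 ratio for manual phonetic transcription given on slide 5. This is only a back-of-the-envelope sketch: it assumes exactly 160 towns (the slide says "160+") and applies the ratio to the whole corpus, so the results are lower bounds and not figures reported in the talk.

```python
# Back-of-the-envelope corpus size for "German Today".
# Assumption: exactly 160 towns (the slide says "160+"), so all results are lower bounds.
TOWNS = 160
STUDENTS_PER_TOWN = 4
TOWNS_WITH_ADULTS = 80
ADULTS_PER_TOWN = 2
MINUTES_PER_SPEAKER = 90
MANUAL_RATIO = 200  # 1:200 for manual phonetic transcription (slide 5)


def corpus_estimate():
    """Return (speakers, hours of speech, hours of manual phonetic transcription)."""
    speakers = TOWNS * STUDENTS_PER_TOWN + TOWNS_WITH_ADULTS * ADULTS_PER_TOWN
    speech_hours = speakers * MINUTES_PER_SPEAKER / 60
    manual_hours = speech_hours * MANUAL_RATIO
    return speakers, speech_hours, manual_hours


if __name__ == "__main__":
    speakers, speech_hours, manual_hours = corpus_estimate()
    print(f"{speakers} speakers, {speech_hours:.0f} h of speech")
    print(f"manual phonetic transcription at 1:{MANUAL_RATIO}: {manual_hours:.0f} h")
```

Under these assumptions the per-town numbers reproduce the 800 speakers and 1200 hours on the slide, and a complete manual phonetic transcription would need on the order of 240,000 working hours.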

14 Map Task
[Figure: two map task maps, Bruneck and Landeck, each with a start point ("Start") and a finish point ("Ziel")]

15 Crowdsourcing the orthographic transcription
Dialectal spontaneous speech (map task data) can be transcribed reliably only by (near-)native speakers of the dialect.
Possible crowdsourcing implementation:
- central database of speech signals, metadata, transcripts, and information about the users/transcribers
- web-based transcription software, e.g. WebTranscribe (as used in [...])
- clearly defined task: transcribe each inter-pause stretch in standard German orthography
- quality assurance: parallel transcription, evaluation + control tasks (as employed by CastingWords on mturk.com)
- recruit transcribers: contact the schools where the recordings took place and/or the speakers directly
- community: points / virtual titles, rewards (e.g. visit to IDS), games, ...
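
The quality-assurance step by parallel transcription could be sketched as follows: two transcribers type the same inter-pause stretch, and stretches on which they agree too little are flagged for a third transcriber or an expert. The function names, the use of Python's `difflib` as the agreement measure, and the threshold of 0.8 are all assumptions made for illustration, not part of the proposal on the slide.

```python
# Sketch: quality assurance by parallel transcription.
# Assumptions: word-level agreement via difflib, review threshold 0.8 (both illustrative).
from difflib import SequenceMatcher


def agreement(transcript_a, transcript_b):
    """Word-level agreement in [0, 1]; 1.0 means identical word sequences."""
    words_a = transcript_a.lower().split()
    words_b = transcript_b.lower().split()
    return SequenceMatcher(a=words_a, b=words_b, autojunk=False).ratio()


def needs_review(transcript_a, transcript_b, threshold=0.8):
    """Flag a stretch whose two parallel transcripts agree less than the threshold."""
    return agreement(transcript_a, transcript_b) < threshold


if __name__ == "__main__":
    a = "du gehst jetzt nach links"
    b = "du gehst nach links"
    print(round(agreement(a, b), 3), needs_review(a, b))
```

Normalising case before comparison is a design choice; a real project would also have to decide how to treat punctuation and hesitation markers.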

17 Automatic broad phonetic alignment
Input:
- speech signal
- orthographic transcription
- canonical/phonemic transcription of all words in the corpus
  - pronunciation lexicon
  - grapheme-to-phoneme converter
- language-specific phoneme models (e.g. trained HMMs)
Output:
- time-aligned broad phonetic transcription
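
The way the canonical transcription is obtained from the two listed resources can be sketched as a lexicon lookup with a grapheme-to-phoneme fallback for words missing from the lexicon. The lexicon entries and the letter rules below are toy values in a SAMPA-like notation, chosen only for illustration; a real system would use a full pronunciation lexicon and a trained or hand-written converter.

```python
# Sketch: canonical (phonemic) transcription from an orthographic transcript.
# LEXICON and G2P_RULES are toy, illustrative values in a SAMPA-like notation;
# they are assumptions, not data from the corpus.
LEXICON = {
    "haben": "h a: b @ n",
    "wir": "v i: 6",
}

# Toy grapheme-to-phoneme fallback: longest-match letter rules.
G2P_RULES = {"sch": "S", "ch": "x", "a": "a", "e": "@", "l": "l", "n": "n", "t": "t"}


def g2p(word):
    """Convert a word with the toy rules, trying 3-, 2-, then 1-letter chunks."""
    phones, i = [], 0
    while i < len(word):
        for length in (3, 2, 1):
            chunk = word[i:i + length]
            if chunk in G2P_RULES:
                phones.append(G2P_RULES[chunk])
                i += len(chunk)
                break
        else:
            phones.append(word[i])  # unknown letter: pass through unchanged
            i += 1
    return " ".join(phones)


def canonical(transcript):
    """Look each word up in the lexicon; fall back to the G2P rules."""
    return [(w, LEXICON.get(w) or g2p(w)) for w in transcript.lower().split()]


if __name__ == "__main__":
    for word, phones in canonical("wir haben schalten"):
        print(word, "->", phones)
```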

18 Example: orthographic transcription

19 Munich Automatic Segmentation System (MAUS)

20 Modelling post-lexical phonological processes
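
One way to picture the modelling of post-lexical processes is as a set of rewrite rules that expand a canonical form into candidate pronunciations, among which an aligner can then choose the best match to the signal. The sketch below works under that assumption; the two rules (schwa elision and a following place assimilation) are illustrative, and the code is not the MAUS implementation.

```python
# Sketch: generate pronunciation variants from a canonical form with rewrite rules.
# The rules are illustrative assumptions, not the rule set of any real aligner.
RULES = [
    (("@", "n"), ("n",)),      # schwa elision before /n/
    (("b", "n"), ("b", "m")),  # place assimilation of /n/ after /b/
]


def apply_rule(phones, pattern, replacement):
    """Return all forms obtained by applying one rule at exactly one position."""
    results = []
    n = len(pattern)
    for i in range(len(phones) - n + 1):
        if phones[i:i + n] == pattern:
            results.append(phones[:i] + replacement + phones[i + n:])
    return results


def variants(canonical):
    """All forms reachable from the canonical form, including the form itself."""
    start = tuple(canonical.split())
    seen, todo = {start}, [start]
    while todo:
        form = todo.pop()
        for pattern, replacement in RULES:
            for new_form in apply_rule(form, pattern, replacement):
                if new_form not in seen:
                    seen.add(new_form)
                    todo.append(new_form)
    return sorted(" ".join(form) for form in seen)


if __name__ == "__main__":
    print(variants("h a: b @ n"))
```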

21 Obvious errors

22 Evaluation: comparison with manual transcription
- Van Bael et al. (2006, 2007) compared 10 aligners for Dutch with a manually obtained reference transcription.
- Results:
  - Best performance: canonical transcription + modelling of post-lexical phonological processes with a decision tree
  - Number of remaining disagreements with the reference transcription (14.6% for spontaneous speech, 8.1% for read speech) only slightly higher than human inter-labeller disagreement scores reported in the literature
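
A disagreement score of this kind can be computed by aligning the automatic transcription with the reference and counting the mismatches. The sketch below assumes that disagreement is defined as substitutions plus deletions plus insertions, divided by the number of reference symbols; the exact definition used in the cited studies should be checked against the papers.

```python
# Sketch: percentage disagreement between an automatic and a reference transcription.
# Assumption: disagreement = (substitutions + deletions + insertions) / reference length.


def disagreement_percent(reference, automatic):
    """Both arguments are space-separated phone strings."""
    ref, hyp = reference.split(), automatic.split()
    if not ref:
        raise ValueError("reference transcription is empty")
    # dist[i][j]: minimal number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)


if __name__ == "__main__":
    print(disagreement_percent("h a: b @ n", "h a: b m"))
```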

23 Task-based evaluation
- access specific portions of the speech signal for further manual annotation?
- duration-based analyses (only large, significant effects can be found)
- analyses in the frequency domain (e.g. formant slope)

24 Phonetic aligners for less-resourced languages?
- build your own using HTK
  - but: you need at least one hour of phonetically segmented and labelled speech data
- find an aligner for a language that is phonetically similar to your target language and use its pre-built HMMs
  - add a pronunciation lexicon and/or grapheme-to-phoneme rules
  - define a mapping between the phonemes of your target language and the HMM-modelled language
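
The phoneme mapping in the second option can be sketched as a translation table applied before alignment and undone afterwards. The donor inventory and the mapping below are hypothetical examples and do not describe the symbol set of any real aligner; the sketch also assumes that the aligner returns exactly one segment per input phone.

```python
# Sketch: map target-language phonemes onto the inventory of a donor aligner.
# DONOR_INVENTORY and PHONEME_MAP are hypothetical, not a real model's symbol set.
DONOR_INVENTORY = {
    "a", "a:", "e:", "i:", "o:", "u:", "@", "p", "b", "t", "d", "k", "g",
    "m", "n", "l", "r", "s", "z", "S", "x", "f", "v",
}

# Target phonemes without a donor model are mapped to the closest available one.
PHONEME_MAP = {"T": "s", "D": "z", "L": "l"}


def to_donor(phones):
    """Translate a phone string; raise if a phone has neither a model nor a mapping."""
    mapped = []
    for phone in phones.split():
        donor = phone if phone in DONOR_INVENTORY else PHONEME_MAP.get(phone)
        if donor is None or donor not in DONOR_INVENTORY:
            raise KeyError(f"no donor model for phoneme {phone!r}")
        mapped.append(donor)
    return mapped


def to_target(original, aligned):
    """Restore target-language labels on aligned (label, start, end) segments.

    Assumes one aligned segment per input phone, in the same order.
    """
    return [(phone, start, end)
            for phone, (_, start, end) in zip(original.split(), aligned)]


if __name__ == "__main__":
    print(to_donor("T a: l"))
```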

26 Traditional corpus annotation process (Gut 2008)

27 Problems with sequential corpus creation
- too time-consuming: many years of annotation work before the corpus can be exploited and any results can be published
- very error-prone: limited reliability of annotations due to coder drift
- restricted corpus queries: failed or impossible queries require re-annotation of the corpus

28 Cyclic and iterative corpus annotation ("agile corpus creation") (Gut 2008)

29 Query-driven phonetic annotation of "German Today"

32 Advantages of agile corpus creation
- Query-driven approach tests suitability and consistency of annotation schema
- very little data has to be re-annotated or discarded
- design errors, annotation errors and conceptual inadequacies become immediately visible
- successive cycles improve annotation schema and limit it to the elements necessary for the queries
- saves time
- early publication of first results

33 Combining automatic and query-driven annotation
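
The combination might look like this in practice: the automatic alignment is used only to locate the stretches that a given research question needs, and only those stretches are annotated or corrected by hand. The segment fields, the example query, and the padding margin below are assumptions made for illustration and do not describe the tools used for "German Today".

```python
# Sketch: use an automatic alignment to select material for manual annotation.
# Segment fields, the query, and the margin are assumptions made for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    recording: str
    word: str
    phone: str
    start: float  # seconds
    end: float    # seconds


def query(segments, phone, word_suffix=""):
    """Select segments with the given label in words ending in word_suffix."""
    return [s for s in segments if s.phone == phone and s.word.endswith(word_suffix)]


def worklist(hits, margin=0.05):
    """Padded time spans, sorted by recording and start time, for manual checking."""
    spans = [(s.recording, max(0.0, s.start - margin), s.end + margin, s.word)
             for s in hits]
    return sorted(spans)


if __name__ == "__main__":
    segments = [
        Segment("rec1", "haben", "@", 1.20, 1.25),
        Segment("rec1", "haben", "n", 1.25, 1.32),
        Segment("rec1", "bitte", "@", 0.02, 0.06),
    ]
    for span in worklist(query(segments, "@", word_suffix="en")):
        print(span)
```

Padding each span by a small margin is a design choice that lets the annotator hear the immediate context and correct boundaries that the aligner placed slightly off.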

34 Summary
- speech corpora need at least a basic (orthographic) transcription to be exploitable
  - difficult to produce for languages/dialects with only few native speakers
  - → use crowdsourcing
- phonological research further requires phonemic/phonetic segmentation and labelling
  - very time-consuming
  - → combine automatic broad phonetic alignment with query-driven annotation

35 References
Brinckmann, C., Kleiner, S., Knöbl, R., and Berend, N. (2008): German Today: an areally extensive corpus of spoken Standard German. Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco.
Draxler, C. (2005): WebTranscribe: an extensible web-based speech annotation framework. Proceedings of the 8th International Conference on Text, Speech and Dialogue (TSD 2005), Karlovy Vary, Czech Republic.
Keibel, H. and Belica, C. (2007): CCDB: a corpus-linguistic research and development workbench. Proceedings of Corpus Linguistics 2007, Birmingham, United Kingdom.
Raffelsiefen, R. and Brinckmann, C. (2007): Evaluating phonological status: significance of paradigm uniformity vs. prosodic grouping effects. Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS XVI), Saarbrücken, Germany.
Schiel, F. (2004): MAUS Goes Iterative. Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal.
Van Bael, C., Boves, L., van den Heuvel, H., and Strik, H. (2006): Automatic phonetic transcription of large speech corpora. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy.
Van Bael, C., Boves, L., van den Heuvel, H., and Strik, H. (2007): Automatic phonetic transcription of large speech corpora. Computer Speech and Language 21 (4).
Voormann, H. and Gut, U. (2008): Agile corpus creation. Corpus Linguistics and Linguistic Theory 4 (2).

36 Thank you!