Efficient diphone database creation for MBROLA, a multilingual speech synthesiser
Institute of Linguistics, Adam Mickiewicz University, Poznań
OWD 2010, Wisła-Kopydło, Poland
Why?
- useful for testing speech models in linguistic work
- easy manipulation of duration and pitch values
- easy to create new synthetic voices
Recently used for:
- expressive speech
- dialogue synthesis
- voice quality
- under-resourced languages
- large speech corpora evaluation (ACCS)
Ph.D. thesis context
- to model different speech styles that align with the speaker in a consultation situation and in a stress situation, based on the phonetic and linguistic characteristics of the speaker's speech
- to design and build a speech synthesis component and a style selection module for an adaptive dialogue system
Ph.D. thesis context: adaptive dialogue system
- to adapt its speech by selecting a speech style appropriate for the speaker's level of speech arousal
- to improve human-computer interaction at emergency unit control centres and the help desks of call centres by making the dialogue more natural
Objectives
- minimisation of the material to be recorded and annotated for synthetic voice creation
- automation of the process of synthetic voice creation
MBROLA voice creation (Dutoit et al. 1996)
- Creating the text corpus
  - list of phones with allophones (PL)
  - list of diphones (DL), DL = PL²
  - list of words
  - words in carrier sentences
- Recording the corpus
  - with monotonous intonation
- Segmenting the corpus
  - at the phone level, automatically and/or manually
  - extracting diphones
- Equalising the corpus (mbrolation)
  - energy level normalisation
  - pitch normalisation
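The relation DL = PL² can be illustrated with a minimal sketch: every ordered pair of phones is a candidate diphone, so an inventory of n phones yields n² entries. The function name `diphone_list` and the use of `_` for silence are assumptions made for this illustration, not part of the original tool chain.

```python
from itertools import product


def diphone_list(phones):
    """Return every ordered pair of phones (the Cartesian square of the list)."""
    return list(product(phones, repeat=2))


# Toy inventory; "_" stands for silence (a hypothetical symbol choice).
toy_phones = ["_", "a", "t"]
toy_diphones = diphone_list(toy_phones)

print(len(toy_diphones))   # 9, i.e. 3 squared
print(toy_diphones[:3])    # [('_', '_'), ('_', 'a'), ('_', 't')]
```

Pairs are kept as tuples rather than joined strings so that multi-character phone symbols stay unambiguous.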
Mbrolation
The Mbrolator is a software suite for voice creation.
- database file in the SEG format
  - diphone filename
  - diphone start and end
  - diphone label
  - diphone subsplitting
- restrictions put on the diphone files:
  - 16000 Hz sampling rate
  - no longer than 10000 samples
  - context of 800 samples on the left and right sides
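The restrictions on diphone files can be checked before mbrolation with a small validator. This is only a sketch: `check_diphone` and its parameters are hypothetical, and it assumes that the 10000-sample limit applies to the diphone itself (start to end) and that the 800-sample context must lie inside the same file. At 16000 Hz, 10000 samples correspond to 0.625 s and 800 samples to 50 ms.

```python
SAMPLE_RATE_HZ = 16000       # required sampling rate
MAX_DIPHONE_SAMPLES = 10000  # assumed to limit the diphone, not the whole file
CONTEXT_SAMPLES = 800        # required context on each side


def check_diphone(sample_rate, file_length, start, end):
    """Return the list of violated restrictions (empty if the file passes).

    file_length, start and end are given in samples.
    """
    problems = []
    if sample_rate != SAMPLE_RATE_HZ:
        problems.append("sampling rate is not 16000 Hz")
    if end - start > MAX_DIPHONE_SAMPLES:
        problems.append("diphone longer than 10000 samples")
    if start < CONTEXT_SAMPLES:
        problems.append("less than 800 samples of left context")
    if file_length - end < CONTEXT_SAMPLES:
        problems.append("less than 800 samples of right context")
    return problems


print(check_diphone(16000, 5000, 800, 4200))   # []
print(len(check_diphone(22050, 5000, 100, 4900)))  # 3
```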
Phonetically rich sentence extractor
- to select from a text corpus the smallest possible set of sentences that contains the largest number of diphones
Available text resources
- 1623 sentences from the BOSS corpus
- 8828 sentences from the Jurisdict database
- 10451 sentences altogether
- transcription in
  - Polish SAMPA = 37 phonemes
  - Polish Extended-SAMPA (PE-SAMPA) = 40 phonemes
[Figure: Sentence extraction procedure]
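Selecting a small sentence set with maximal diphone coverage is an instance of the set cover problem. The sketch below uses the standard greedy heuristic (repeatedly pick the sentence that adds the most uncovered diphones). It is an assumption that the extractor works along these lines; the slides do not specify the algorithm. The names `diphones_of` and `select_sentences` and the toy corpus are hypothetical.

```python
def diphones_of(phones):
    """Set of adjacent phone pairs in one transcribed sentence."""
    return set(zip(phones, phones[1:]))


def select_sentences(corpus):
    """Greedy set cover over a dict mapping sentence id -> list of phones.

    Returns the chosen sentence ids (in selection order) and the set of
    diphones they cover together.
    """
    remaining = {sid: diphones_of(phones) for sid, phones in corpus.items()}
    covered, chosen = set(), []
    while remaining:
        # Sentence contributing the most diphones not yet covered.
        best = max(remaining, key=lambda sid: len(remaining[sid] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Toy corpus; "_" marks silence at sentence edges (an assumed convention).
toy_corpus = {
    "s1": ["_", "t", "a", "k", "_"],
    "s2": ["_", "t", "a", "_"],
    "s3": ["_", "k", "o", "t", "_"],
    "s4": ["_", "t", "a"],
}
chosen, covered = select_sentences(toy_corpus)
print(chosen)        # ['s1', 's3', 's2']
print(len(covered))  # 9
```

Sentence `s4` is never selected because all of its diphones are already supplied by `s1`. The greedy heuristic does not guarantee the smallest possible set, but it is a common practical approximation.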
Results
- SAMPA (38 × 38 = 1444 diphones): 1008 diphones in 211 sentences out of 10451
- PE-SAMPA (41 × 41 = 1681 diphones): 1095 diphones in 201 sentences out of 10451
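The arithmetic behind these figures can be reproduced in a few lines. The inventory sizes 38 and 41 are one larger than the 37 and 40 phonemes listed for the two transcription alphabets; reading the extra symbol as a pause or silence marker is an assumption, as is the helper name `coverage`. The square of the inventory is a theoretical upper bound, since not every phone pair necessarily occurs in the language.

```python
def coverage(found, inventory_size):
    """Return (theoretical diphone count, percentage of it that was found)."""
    total = inventory_size ** 2
    return total, 100.0 * found / total


results = [
    # (alphabet, inventory size, diphones found, sentences selected)
    ("SAMPA", 38, 1008, 211),
    ("PE-SAMPA", 41, 1095, 201),
]

for name, size, found, sentences in results:
    total, percent = coverage(found, size)
    print(f"{name}: {found}/{total} diphones ({percent:.1f}%) in {sentences} sentences")
# SAMPA: 1008/1444 diphones (69.8%) in 211 sentences
# PE-SAMPA: 1095/1681 diphones (65.1%) in 201 sentences
```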
Diphone extractor
- to automatically cut out diphones from the recordings, based on the phone-level annotations of those recordings
Available material
- 1580 sentences from the BOSS corpus
- recorded in a professional recording studio
- male voice, recorded with monotonous intonation
- annotated in Polish Extended-SAMPA
  - automatic annotation
  - manual correction
[Figure: Diphone extractor architecture]
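How cut points might be derived from a phone-level annotation can be sketched as follows. The sketch assumes the usual definition of a diphone as the stretch from the middle of one phone to the middle of the next, with the phone boundary as the internal split point; whether the actual extractor places its cuts exactly there is not stated in the slides. `Phone` and `diphone_cuts` are hypothetical names, and times are given in samples.

```python
from typing import NamedTuple


class Phone(NamedTuple):
    label: str
    start: int  # first sample of the phone
    end: int    # sample at which the phone ends


def diphone_cuts(phones):
    """Return (label, start, boundary, end) for each pair of adjacent phones.

    start and end are the midpoints of the left and right phone; boundary
    is the end of the left phone (assumed equal to the start of the right).
    """
    cuts = []
    for left, right in zip(phones, phones[1:]):
        start = (left.start + left.end) // 2
        end = (right.start + right.end) // 2
        cuts.append((f"{left.label}-{right.label}", start, left.end, end))
    return cuts


annotation = [Phone("_", 0, 1600), Phone("t", 1600, 2400), Phone("a", 2400, 4800)]
for cut in diphone_cuts(annotation):
    print(cut)
# ('_-t', 800, 1600, 2000)
# ('t-a', 2000, 2400, 3600)
```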
Diphone extraction results
- SAMPA: 1039 diphones from 1580 sentences
- PE-SAMPA: 1058 diphones from 1580 sentences
Tools combination and evaluation
- 226 sentences recorded by a male speaker
- sentences annotated automatically
- 1002 extracted diphones
- voice creation
- total time: ca. 5 hours
Tools combination and evaluation (continued)
- original
- fully automatic
- manual correction (micro-voice)
Conclusions
- The phonetically rich sentence extractor and the diphone extractor seem to be indispensable in voice creation.
Acknowledgements
This work was partly funded by:
- the research supervisor project grant to Prof. Grażyna Demenko and the author, No. N N104 119838
- the international cooperation scholarship funded by Bielefeld University, Germany
- the scholarship for scientific achievements funded by the Kulczyk Family Foundation
The author is very grateful to Prof. Grażyna Demenko for providing the text and speech corpora and to Prof. Dafydd Gibbon for his invaluable advice on the system design and implementation.
Thank you!