Application of speaker adaptation methods used in speech synthesis for speaker de-identification
Miran Pobar, Department of Informatics, University of Rijeka, Croatia
COST ACTION IC 1206 Meeting, Mataró, Spain, November 25-26 2013

Outline
- Introduction
- Motivation
- Previous research
- The idea / planned work
- Conclusions and future work

Introduction
- Speaker de-identification: a natural and intelligible sounding voice that doesn't reveal the speaker's identity [Jin et al., 2009]
- Not concerned with identification based on message content
- Applications: call centers, anonymous help lines

Motivation
- Application of voice transformation (VT) for de-identification, without the need for previous enrollment of new speakers

Previous research
- [Jin et al., 2009]: "Voice Convergin: Speaker De-Identification by Voice Transformation"
- Application of voice transformation to speaker de-identification
- Goal: the voice sounds natural and intelligible, but doesn't reveal the speaker's identity
- Results show successful de-identification while preserving intelligibility
- The speaker enrollment phase limits applicability

Previous research: speaker adaptation in speech synthesis, esp. bilingual speech synthesis
- A Bilingual HMM-Based Speech Synthesis System for Closely Related Languages [Justin et al., 2012]
- A Comparison of Two Approaches to Bilingual HMM-Based Speech Synthesis [Pobar et al., 2013]
- The basic concepts were already introduced in [Latorre et al., 2006], [Wu et al., 2009], [Liang et al., 2008]

Bilingual speech synthesis
- Bilingual speech system: synthesizing speech in two languages with the same speaker identity
- Quite a large number of languages have only small database resources, and it is hard to recruit speakers fluent in both languages
- Goal: building the synthetic voice in the target language, even though data in the target language does not exist for the target speaker
Data from multiple speakers, with adaptation to the target speaker. Two approaches:

Joined phoneme set approach
- Data from both languages treated as monolingual, labeled with a joined phoneme set
- Adapted to the target speaker using CMLLR adaptation

State mapping approach
- Speaker-independent models trained separately for both languages
- Mapping between states derived automatically from the minimum KL divergence between states
- The mapping is used to apply transforms calculated for one language to the other language

Open questions:
- How can finding the phoneme/state similarities and differences between spoken data in similar languages be automated?
- A similar method applies in de-identification: transformation of one voice to the target voice
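The CMLLR adaptation mentioned above amounts to estimating an affine transform x̂ = Ax + b that is applied consistently to the model (or feature) space. As a minimal illustrative sketch, ignoring transform estimation and using a toy 2-D transform (all names and values are made up, not from the slides):

```python
# Sketch of applying a CMLLR-style affine transform x_hat = A x + b
# to a single feature vector; A and b would normally be estimated
# from the target speaker's adaptation data.

def apply_cmllr(A, b, x):
    """Apply the affine transform x_hat = A x + b to one feature vector."""
    n = len(b)
    return [sum(A[i][j] * x[j] for j in range(n)) + b[i] for i in range(n)]

A = [[1.0, 0.0], [0.0, 2.0]]   # toy transform matrix (assumed, for illustration)
b = [0.5, -0.5]                # toy bias vector
print(apply_cmllr(A, b, [1.0, 1.0]))  # [1.5, 1.5]
```

In a real system the same estimated transform is shared across whole regression classes of HMM states, which is what makes it possible to reuse it across languages via state mapping.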
HMM bilingual system overview
- Truly bilingual model: a speaker-independent bilingual voice, i.e. an average voice (training data from multiple speakers in both languages)
- Speaker adaptive training [Yamagishi et al., 2003]
- The obtained voice is used as the basis for adaptation to the target speaker identity
- CMLLR adaptation of the average voice models to the target speaker
- Diagram adapted from [Yamagishi et al., 2009]

Database preparation
- Data labeled with a common phoneme set
- Conducted with help from a phonetic expert with knowledge of both languages
- The Slovene and Croatian databases already had phonetic dictionaries closely related to the SAMPA phonetic alphabets of the respective languages, so we decided to map the phonemes with the help of the SAMPA alphabet

State mapping approach
- Train two average voice models, one in the source and one in the target language
- Target speaker: speaks one language, we want synthesis in both
- Intra-lingual CMLLR adaptation for the target speaker's language
- Application of transforms trained in one language to the other via state mapping

State mapping [Wu et al., 2009]:
1. Establish state mappings between the source state models and the target state models, i.e. find the nearest state model in the target language for each state model in the source language.
2. Train the transforms for the average voice model in the target language using the adaptation data.
3. Adapt each state model in the source language using the transform attached to the mapped state model in the target language.
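Step 1 above, finding the nearest target-language state for each source-language state by minimum KL divergence, can be sketched for diagonal-covariance Gaussian states using the closed-form KL divergence. The state names and values below are illustrative only:

```python
import math

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """Closed-form KL divergence KL(p || q) between two
    diagonal-covariance Gaussians given as (mean, variance) lists."""
    return 0.5 * sum(
        vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def map_states(src_states, tgt_states):
    """For each source-language state, pick the target-language state
    with minimum KL divergence (the nearest state model)."""
    return {
        name: min(tgt_states, key=lambda t: kl_diag_gauss(mu, var, *tgt_states[t]))
        for name, (mu, var) in src_states.items()
    }

# Toy one-dimensional "states" (names and values assumed for illustration)
src = {"hr_a": ([0.0], [1.0])}
tgt = {"sl_a": ([0.1], [1.0]), "sl_e": ([2.0], [1.0])}
print(map_states(src, tgt))  # {'hr_a': 'sl_a'}
```

Real HMM states bundle several streams (spectrum, F0, duration) per state, so a production mapping would aggregate divergences over streams rather than use a single Gaussian per state.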
Test configurations

Evaluation results
- Figure: Mean opinion scores for the systems. Left: Croatian synthesis language, right: Slovene synthesis language
- Figure: Similarity to the target speaker (degradation mean opinion score). Left: Croatian synthesis language, right: Slovene synthesis language

Summary of results
- Results are similar for both approaches in the intralingual situation
- Asymmetric results in the cross-lingual situation
- The results suggest that although there are differences between the two approaches to phoneme mapping, the target speaker itself may play an important role in the performance of bilingual systems
- In the S1* cases the duration model for each phoneme is also shared between the trained models, but that is not the case in S2*

De-identification using VT
Source: Tanja Schultz, Qin Jin, Arthur Toth, Alan W Black: Speaker De-Identification, COST Action IC 1206 Meeting, Zagreb, July 8-9 2013

Speaker de-identification via VT [Jin et al., 2009]
- Various strategies tested against GMM and Phonetic speaker identification systems:
  - Standard voice transformation
  - De-duration voice transformation
  - Double voice transformation: composition of de-duration + standard VT
  - Transterpolated VT: linear inter-/extrapolation of MCEPs, x = s + f(v - s)
- Results improve from 92% for GMM and 42% for Phonetic with standard voice transformation to 100% for GMM and 87.5% for Phonetic with transterpolated VT
- Better results on a large database (95 male and 102 female speakers)
- Human listening tests: intelligibility, quality of de-identification, comparison with a naive pitch-shifting approach
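The transterpolation formula x = s + f(v - s) is applied per MCEP coefficient: f between 0 and 1 interpolates between the source vector s and the voice-transformed vector v, while f > 1 extrapolates beyond the transformed voice. A minimal sketch (vector values are illustrative only):

```python
def transterpolate(s, v, f):
    """Linear inter-/extrapolation between a source MCEP vector s and its
    voice-transformed counterpart v: x = s + f * (v - s).
    f = 1.0 reproduces the standard VT output; f > 1.0 extrapolates past it."""
    return [si + f * (vi - si) for si, vi in zip(s, v)]

src = [1.0, 2.0]       # toy source MCEP frame
transformed = [3.0, 0.0]  # toy VT output for the same frame
print(transterpolate(src, transformed, 0.5))  # halfway: [2.0, 1.0]
print(transterpolate(src, transformed, 1.5))  # extrapolated past the target
```

Extrapolation (f > 1) pushes the spectrum further from the source identity than the standard transform does, which is consistent with the stronger de-identification rates reported for transterpolated VT.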
Speaker de-identification via VT [Jin et al., 2009]: conclusions
- Voice transformation strategies can be used to de-identify speech
- Standard VT performed well against GMM, but poorly against Phonetic SID
- Different VT strategies were investigated: transterpolation worked best
- Human tests: de-identified voices are understandable while providing securely transmitted content
- VT targets frame-level and some prosodic characteristics, BUT SID is still possible on higher-level aspects, like prosody and word choice
- Secure ways to re-identify a (selection of) speaker(s) are needed
- The need for enrollment of new speakers limits applications

Speaker de-identification via VT: planned improvements
- Can we avoid the enrollment phase?
- Similar idea to state mapping in bilingual speech synthesis: apply a transform learned on one speaker pair to another speaker
- Automatic mapping of states (synthesis)
  - Synthesis: find a suitable transform via minimum KL divergence between states
  - De-identification: find a suitable transform via SID in a closed set of source speakers

Learning phase:
1. Build a database of a number of male and female speakers
2. Train a SID system using the closed set of speakers in the database
3. Calculate transforms for each speaker in the database towards one target speaker for de-identification

De-identification phase:
1. Forced classification of the new speaker into one of the speakers from the set in the SID system (the source speaker)
2. VT of the speech using the transform learned on the source-target speaker pair
3. If enough data is collected from the new speaker, learn the speaker's GMM and transform and include it in the source speaker set, expanding the set of de-identification transforms

Planned setup
- Database: the multilingual COST278BN corpus (BCN)
- COST278BN: audio indexing of broadcast news files in different languages; complete news shows broadcast by different TV stations in different countries
- 16 TV stations, 11 languages (Basque, Croatian, Czech, Dutch, Galician, Greek, Hungarian, Portuguese, Slovakian, Slovenian (2 sets), Spanish)
- Audio and video: the audio is 16 kHz, 16 bit, wave format; the video (only 6 data sets also come with video) is mostly in Real Media Video
- The aim is to provide data that can be used for testing, model adaptation and research on front-end processing algorithms such as speech/non-speech and speaker turn segmentation, background classification, speaker clustering, etc., which are supposed to be only weakly language dependent

Planned setup
- Speaker identification: ALIZE, an open source GMM/UBM-based platform
- Voice transformation: HTS? Matlab?
- Evaluation:
  - Automatic: speaker identification using ALIZE (baseline)
  - Subjective: human listening tests

Conclusions
- De-identification of speakers using VT
- The suggested improvements increase the possibilities for application
- Goal: eliminate the need for enrollment of new speakers to be de-identified, which significantly increases application possibilities for de-identification
- A pool of transformations; apply a transformation according to the speaker identification result
- Expand the set of transformations using data collected from new speakers
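The forced-classification step of the planned de-identification phase can be sketched as: score the new speaker's frames against each source-speaker model, pick the best-scoring speaker, and reuse the transform precomputed for that source-target pair. The single-Gaussian speaker models and transform labels below are stand-ins for real GMMs and VT transforms (all names and values assumed for illustration):

```python
import math

def gauss_loglik(x, mu, var):
    """Log-likelihood of one feature vector under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mu, var)
    )

def pick_transform(frames, speaker_models, transforms):
    """Force-classify the new speaker into the closed source-speaker set,
    then return the de-identification transform of the winning speaker."""
    scores = {
        spk: sum(gauss_loglik(x, mu, var) for x in frames)
        for spk, (mu, var) in speaker_models.items()
    }
    best = max(scores, key=scores.get)
    return best, transforms[best]

# Toy setup: two source speakers and their precomputed transforms
models = {"spk_a": ([0.0], [1.0]), "spk_b": ([5.0], [1.0])}
transforms = {"spk_a": "T_a_to_target", "spk_b": "T_b_to_target"}
frames = [[0.1], [-0.2], [0.3]]  # new speaker's features, close to spk_a
print(pick_transform(frames, models, transforms))
```

The same scoring machinery supports the planned expansion step: once enough frames are collected, a new speaker model and its transform can simply be added to `speaker_models` and `transforms`.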
References
- Jin, Q., Toth, A. R., Schultz, T., & Black, A. W. (2009). Voice convergin: Speaker de-identification by voice transformation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), pp. 3909-3912. IEEE.
- Pobar, M., Justin, T., Žibert, J., Mihelič, F., & Ipšić, I. (2013). A comparison of two approaches to bilingual HMM-based speech synthesis. In Text, Speech, and Dialogue, Lecture Notes in Computer Science, Vol. 8082, pp. 44-51.
- Vandecatseye, A., Martens, J. P., Neto, J., Meinedo, H., Garcia-Mateo, C., Dieguez, J., Mihelic, F., Zibert, J., Nouza, J., David, P., Pleva, M., Cizmar, A., Papageorgiou, H., & Alexandris, C. (2004). The COST278 pan-European broadcast news database. In Proceedings of LREC (Lisbon), pp. 873-876.
- Zibert, J., Mihelic, F., Martens, J. P., Meinedo, H., Neto, J., Docio, L., Garcia-Mateo, C., David, P., Zdansky, J., Pleva, M., Cizmar, A., Zgank, A., Kacic, Z., Teleki, C., & Vicsi, K. (2005). The COST278 broadcast news segmentation and speaker clustering evaluation - overview, methodology, systems, results. In Proceedings of Interspeech (Lisbon), pp. 629-632.
- Yamagishi, J., Kobayashi, T., Nakano, Y., Ogata, K., & Isogai, J. (2009). Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Transactions on Audio, Speech, and Language Processing, 17(1), 66-83.