SOME ASPECTS OF ASR TRANSCRIPTION BASED UNSUPERVISED SPEAKER ADAPTATION FOR HMM SPEECH SYNTHESIS
Bálint Tóth, Tibor Fegyó, Géza Németh
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics

Abstract. Statistical parametric synthesis offers numerous techniques for creating new voices; speaker adaptation is one of the most exciting of them. However, it still requires high-quality audio data with a low signal-to-noise ratio and precise labeling. This paper presents an automatic speech recognition (ASR) based unsupervised adaptation method for Hidden Markov Model (HMM) speech synthesis, together with its quality evaluation. The adaptation technique automatically controls the number of phone mismatches. The evaluation involves eight different HMM voices, covering both supervised and unsupervised speaker adaptation, and the effects of segmentation and linguistic labeling errors in the adaptation data are also investigated. The results show that unsupervised adaptation can speed up the creation of new HMM voices with quality comparable to supervised adaptation.

Key words: HMM-based speech synthesis, unsupervised adaptation, automatic speech recognition

1 Introduction

In the last decade the primary goal of speech synthesis has been to achieve natural-sounding, high-quality voices. As the results of unit selection and statistical parametric speech synthesis improve, new challenges emerge. Creating a new voice that matches the voice characteristics of a target speaker is an attractive one. Context-independent unit selection synthesis demands a well-constructed speech database with hours of speech, its phonetic transcription and precise labeling for each new voice. This method is time consuming and requires considerable human interaction. Statistical parametric synthesis offers speaker adaptation techniques, where only a moderately sized speech database is needed to create a voice similar to the target speaker's.
Human interaction is still necessary for precise phonetic transcription and labeling.
As the quality of statistical parametric speech synthesis approaches that of state-of-the-art unit selection methods, it has become a focused research area. Usually the HMM paradigm, well known from the speech recognition domain, is used in statistical speech synthesis [1]. It has numerous advantages over unit selection: a small footprint and the possibility of creating various voices [2], emotional speech [3], and adapting the voice characteristics to a target speaker [4], [5]. Recently hybrid approaches have also been proposed, such as target cost prediction for unit selection systems by HMMs [6], smoothing the segment sequence of unit selection systems with statistical models and/or their dynamic features [7], and mixing unit selection with statistical parametric speech synthesis [8].

2 SUPERVISED AND UNSUPERVISED ADAPTATION

In HMM speech synthesis and recognition the two main techniques of speaker adaptation are maximum likelihood linear regression (MLLR) [4] and maximum a posteriori (MAP) estimation [5]. MLLR is applied when the amount of adaptation data is small; MAP requires more data, as the Gaussian distributions are updated individually. In both cases supervised speaker adaptation uses precise phonetic transcriptions, manually transcribed or automatically annotated segmentation, and linguistic labels. The advantages of unsupervised adaptation for HMM speech synthesis are quite appealing: the creation of target voices becomes automatic, which is favorable if several voices are required or if pre-processing of the speech data is not possible. Probably the most advanced method would be to build a full-context speech recognizer and train the HMMs on its output. Although no studies have been carried out, this is likely to be computationally infeasible and would probably produce inaccurate labels.
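To illustrate the linear-regression family of adaptation methods mentioned above, the following toy sketch estimates a single shared affine transform of Gaussian means from adaptation data, which is the core idea behind MLLR. This is a simplified sketch with made-up data: real MLLR uses per-frame posteriors, per-component covariances and regression classes, none of which appear here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "average voice" model: K Gaussian means in D dimensions
# (deterministic, well-conditioned values chosen for illustration).
K, D, N = 4, 2, 400
means = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Simulated target-speaker data: a true affine shift of each mean plus noise.
A_true = np.array([[1.1, 0.0], [0.0, 0.9]])
b_true = np.array([0.5, -0.3])
comp = rng.integers(0, K, size=N)                # frame-to-Gaussian assignment
x = means[comp] @ A_true.T + b_true + 0.05 * rng.normal(size=(N, D))

# MLLR-style estimate: one shared transform W = [A; b] minimizing
# sum_n ||x_n - W^T xi_n||^2 over extended means xi = [mu, 1]
# (the ML solution under identity covariances is ordinary least squares).
xi = np.hstack([means[comp], np.ones((N, 1))])   # N x (D+1)
W, *_ = np.linalg.lstsq(xi, x, rcond=None)       # (D+1) x D
A_est, b_est = W[:D].T, W[D]

# Adapt all model means with the estimated transform.
adapted_means = means @ A_est.T + b_est
```

With enough adaptation frames, the estimated transform recovers the true speaker shift closely, which is why a single linear transform works even with little per-Gaussian data.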
In Automatic Speech Recognition (ASR) systems both supervised and unsupervised adaptation are used to increase recognition accuracy. The unsupervised method requires less manual work but more adaptation data; about one hour per speaker is used in practice [9]. In [10] an interesting method of unsupervised speaker adaptation was introduced: only phonetic labels were used for adaptation, and the transformation matrices were computed from triphone models. The results of that study show that the degradation in quality and naturalness is caused mainly by limiting full-context labels to triphone labels, not by triphone mismatches. Another study [11] investigates a two-pass decision tree construction technique for unsupervised adaptation, in which the decision trees of full-context models are built in two phases: first the segmental, then the supra-segmental features are processed. According to the results of [11] there is no perceived quality difference between supervised and unsupervised adaptation; however, the average voice was trained on ASR corpora, so it produces very low quality synthetic speech (see the MOS values in [11]), which may hide the quality degradation caused by the two-pass method.
Another important aspect is described in [12], where several tests of different TTS systems were carried out with the same labels on both clean and noisy speech databases. The results of [12] show that HMM-based adaptive speech synthesis is far more robust than concatenative, speaker-dependent HMM-based, or hybrid speech synthesis approaches.

3 ASR-BASED UNSUPERVISED SPEAKER ADAPTATION

Complementing the results of [9], [10], [11], [12], our concept is to evaluate the quality of adaptation with inaccurate, noisy phonetic transcription. The consequences of inaccurate phonetic transcription are phoneme mismatches, as well as inaccurate segmentation and linguistic labels due to phoneme mismatch accumulation. Speech recognizers for a given context perform quite well, but their output still contains various mismatches.

Fig. 1. Block diagram of the proposed unsupervised adaptation method
3.1 The Proposed Method

The speech recordings from the target speaker are recognized, then phone boundaries are determined with forced alignment based on the recognition results. If the results of forced alignment do not satisfy an item-drop criterion (described in 3.3), that part of the recordings is rejected. When phone boundary detection is accepted for at least ten minutes of recordings, linguistic labeling is carried out. Finally the adaptation is applied. The block diagram of the proposed method is shown in Fig. 1.

3.2 Automatic Recognition of the Speech Corpus and Phonetic Transcription

The TTS adaptation database is transcribed automatically with an LVCSR ASR system [9]. The output will contain recognition errors, which can be significantly reduced if the content of the TTS adaptation database and the ASR training database are from the same domain. The next processing step transforms the orthographic output of the ASR system into a phonetic representation. This may be done either by dictionary lookup or by rule-based software modules.

3.3 Phone Boundary Detection

The phone boundaries in the TTS adaptation database are marked automatically, based on the phonetic transcription described in section 3.2, using the ASR system in forced alignment mode with a narrow beam only. As the word-level ASR can produce recognition errors, the recognized phone sequence is likely to be longer or shorter than the correct transcription. If, at the beginning of an audio segment, a word is misrecognized with more or fewer phones than the correct word, the forced alignment procedure probably gives bad results for the whole audio segment. If this happens at the end of an audio segment, it is less severe, because it produces only a few phone mismatches.
To avoid using adaptation data with critical phone error accumulation, the following drop criterion was introduced:

  e_accumulation = (1 / i_max) · Σ_{i=1}^{i_max} [ ((100 − p_ci) / 100) · ((i_max − i + 1) / i_max) ] ≤ ε   (1)

where i is the position of the phone, i_max is the length of the phone sequence, p_ci is the confidence that the i-th phone is correctly recognized, on the [0..100] interval (computed by the ASR), and ε is the limit of the drop criterion on the [0..1] interval (0 means there were no errors, 1 is the theoretical worst case). In this way mistakes at the beginning are weighted more heavily than those at the end, and error accumulation is avoided.
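A minimal sketch of this drop criterion in Python, under our reading of the garbled Eq. (1): each phone's error (100 − p_ci)/100 is weighted by (i_max − i + 1)/i_max, so early errors count almost fully and late errors only fractionally. The function names, epsilon value, and confidence sequences below are illustrative, not from the paper.

```python
def error_accumulation(conf):
    """Position-weighted error score of one audio segment.

    conf: list of ASR phone confidences p_ci in [0, 100], in phone order.
    Early mistakes get weight ~1, late mistakes weight ~1/i_max.
    """
    i_max = len(conf)
    return sum(
        ((100 - p) / 100) * ((i_max - i + 1) / i_max)
        for i, p in enumerate(conf, start=1)
    ) / i_max

def drop_segment(conf, epsilon=0.05):
    """Reject the segment if the accumulated error exceeds epsilon."""
    return error_accumulation(conf) > epsilon

# A segment misrecognized at the start accumulates more error than one
# misrecognized at the end, even with identical confidences overall.
early_bad = [10, 95, 95, 95, 95]
late_bad = [95, 95, 95, 95, 10]
assert error_accumulation(early_bad) > error_accumulation(late_bad)
```

This matches the motivation in section 3.3: an early misrecognition derails forced alignment for the rest of the segment, so it should push the score past the drop threshold much faster than a late one.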
4 Results

To measure the difference between the proposed method and the supervised adaptation technique, a listening test was conducted. In the experiment a modified Hungarian version of HTS [13] was used. The average voice was computed from five speakers (1.5-2 hours of phonetically balanced speech from each). The adaptation database contained 10 minutes of semi-spontaneous speech (parliamentary speeches by politicians) from each of four different speakers. For adaptation the Constrained Maximum Likelihood Linear Regression (CMLLR) method was used. For speech recognition a state-of-the-art Hungarian LVCSR system was applied [14]. The triphone-based acoustic model was trained on 5 hours of speech from 500 speakers. The training corpus of the morpheme trigram language model contained 1.2 million words in the domain of political news. The average accuracy of the system is 72%, while the average phone accuracy is above 85%. The phone-level accuracy of the recognizer on the TTS adaptation database is shown in Table 1.

Table 1. Accuracy of the recognizer for the four speakers

  Speaker      Phone accuracy
  Speaker #1   58%
  Speaker #2   79%
  Speaker #3   87%
  Speaker #4   90%

For supervised speaker adaptation a consensus manual phonetic transcription with punctuation was created, and the segmentation and linguistic labels were determined automatically. For unsupervised adaptation the phonetic transcription was determined from the recognition results; the segmentation and linguistic labels were determined in the same way as for supervised adaptation. The supervised and unsupervised adaptations of all four speakers (eight systems altogether) were involved in the test.

4.1 Experimental Conditions

The experiment consisted of three main parts: paired comparison, a Mean Opinion Score (MOS) test, and naturalness evaluation.
In the first part, test subjects had to rate how similar two synthesized samples were on a five-point scale. The text of the utterance in each pair was always the same. Altogether 24 pairs were played: 8 pairs were from the same system; 8 pairs came from the same speaker with different adaptation methods; and 8 pairs were compiled from different speakers. Pair comparison is beneficial as the first part because test subjects
get used to the synthetic voice and give consistent answers in the MOS test of the second part. There, the test subjects had to rate the quality of 32 samples, 4 from each system. In the last part, test subjects had to decide how similar the synthesized samples were to the natural voice of the original speaker. This was carried out with 40 synthesized samples (5 for each system). The order of the three parts was chosen to minimize the chance that the test subjects memorize the speakers. The samples were selected from a large set in order to obtain information about the systems rather than about individual speech samples. In every part the synthesized samples were pseudo-randomly selected from the larger sample database, keeping the distribution over samples and the eight systems even. The authors carried out a pre-test with four subjects to verify the effectiveness of the test design. The results of the pre-test were promising, so the same design was kept. Altogether 25 test subjects (19 male, 6 female) were involved in the test. The test was internet-based; the average age was 35, the youngest subject was 21 and the oldest 67 years old. Ten test subjects were speech experts.

4.2 Analysis of Results

Table 2 shows the results of the experiment. The first three columns (similarity to synthesized voice) relate to the first part of the test, the fourth column (similarity to the native voice of the same speaker) relates to the third part, and the last column (MOS) relates to the second part. The "s" rows correspond to supervised adaptation, while "u" rows refer to unsupervised adaptation. In the first and third parts 1 refers to the lowest and 5 to the highest similarity. In the MOS test 1 is the worst and 5 the best value. Except for the third column, higher values represent better results.
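The even pseudo-random selection described above can be sketched as follows: draw a fixed number of samples per system so every system is represented equally, then shuffle the presentation order. This is a hypothetical illustration; the names (`sample_db`, `select_samples`) and the seeded generator are ours, not from the paper.

```python
import random

def select_samples(sample_db, per_system, seed=42):
    """Pick per_system samples from each system, then shuffle play order.

    sample_db: dict mapping system id -> list of candidate samples.
    Returns a list of (system, sample) pairs with an even count per system.
    """
    rng = random.Random(seed)  # seeded for a reproducible playlist
    chosen = []
    for system, samples in sample_db.items():
        picks = rng.sample(samples, per_system)   # even count per system
        chosen.extend((system, s) for s in picks)
    rng.shuffle(chosen)                           # randomize presentation order
    return chosen

# 8 systems (4 speakers x supervised/unsupervised), 4 MOS samples each -> 32
db = {f"spk{i}_{m}": [f"utt{j}" for j in range(10)]
      for i in range(1, 5) for m in ("s", "u")}
playlist = select_samples(db, per_system=4)
```

Fixing the per-system count rather than sampling uniformly over the pooled database is what keeps the distribution over the eight systems even.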
Individual analysis of the results. The first two columns show that test subjects can tell if the samples were generated from the same speaker with the same method (s-s, u-u samples). There is a minor impact of using different adaptation methods: s-u and u-s samples consistently score lower than s-s and u-u pairs. The third column shows that, for these four speakers, the subjects could tell if the synthesized samples were from different speakers. Based on the values of the fourth column, both supervised and unsupervised samples are considered moderately similar to the native speakers, but they are still scored much better than different speakers. The relatively low values may result from the adaptation data being semi-spontaneous speech, including stutters, echo, coughs and hesitations. This is also the reason for the rather low MOS scores, shown in the fifth column. The standard deviations and confidence intervals (α = 0.05) are also shown in Table 2.
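The reported statistics can be reproduced as follows: mean, sample standard deviation, and a 95% confidence half-interval (α = 0.05) under the common normal approximation (z = 1.96). The scores below are made up for illustration; they are not values from Table 2.

```python
import math
import statistics

def confidence_interval(scores, z=1.96):
    """Return (mean, sample std dev, half-width of the 95% CI)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)            # sample standard deviation (n-1)
    half = z * sd / math.sqrt(len(scores))   # normal-approximation half-width
    return mean, sd, half

scores = [4, 3, 5, 4, 4, 3, 4, 5, 3, 4]     # hypothetical 1-5 listener ratings
mean, sd, half = confidence_interval(scores)
# The score is then reported as mean +/- half at alpha = 0.05.
```

With only 25 subjects per cell, a t-based interval would be slightly wider; the normal approximation is shown here for simplicity.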
Table 2. Results of the listening test (s: supervised, u: unsupervised): similarity to the synthesized voice (same speaker and method, same speaker with different methods, different speaker), similarity to the native voice of the same speaker, and MOS, for Speakers #1-#4, with standard deviations and confidence intervals (α = 0.05). (Numeric values omitted.)

Analyzing the trends of the results. Each part of the test shows that the difference between supervised and unsupervised adaptation decreases as the phone accuracy of the ASR system (see Table 1) increases. This trend can be seen by examining the following pairs: s-s and u-u samples compared to s-u and u-s samples from the same speaker; the similarity of the u and s samples of speakers #1-#4 to a different speaker; the similarity of the u and s samples of speakers #1-#4 to the native voice of the same speaker; and the MOS scores of the s and u samples. The results show that, with good phone accuracy, the proposed unsupervised adaptation method produced quality similar to supervised adaptation on semi-spontaneous adaptation data. Creating new HMM voices can be sped up by the proposed method. Even a phone accuracy as low as 58% may still allow unsupervised adaptation to create a voice comparable to the supervised one.

5 CONCLUSIONS

In this paper a method for unsupervised adaptation of HMM-based speech synthesis systems was introduced and its quality was evaluated. As the results are quite promising, further studies will be carried out. The parameters of the drop criterion (described in 3.3) will be fine-tuned and other types of drop criteria will be investigated. Unsupervised minimum generation error linear regression (MGELR) and constrained structural maximum a
posteriori linear regression (CSMAPLR) adaptation methods will be evaluated. Listening tests will be carried out using the adaptation data presented in this paper and with studio-quality data as well.

Acknowledgments. This research was supported by the TELEAUTO (OM /2007) project of the Hungarian National Office for Research and Technology and by the ETOCOM project (TAMOP /1/KMR ) through the Hungarian National Development Agency in the framework of the Social Renewal Operative Programme supported by the EU and co-financed by the European Social Fund, and by the KMOP /A project through the Hungarian National Development Agency.

References

1. Black, A., Zen, H., Tokuda, K.: Statistical parametric speech synthesis. In: ICASSP 2007 (2007)
2. Iwahashi, N., Sagisaka, Y.: Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks. Speech Communications, Vol. 16, no. 2 (1995)
3. Tachibana, M., Yamagishi, J., Masuko, T., Kobayashi, T.: Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing. IEICE Trans. Inf. Syst., Vol. E88-D, no. 11 (2005)
4. Tamura, M., Masuko, T., Tokuda, K., Kobayashi, T.: Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR. In: ICASSP 2001 (2001)
5. Ogata, K., Tachibana, M., Yamagishi, J., Kobayashi, T.: Acoustic model training based on linear transformation and MAP modification for HSMM-based speech synthesis. In: ICSLP 2006 (2006)
6. Kawai, H., Toda, T., Ni, J., Tsuzaki, M., Tokuda, K.: XIMERA: A new TTS from ATR based on corpus-based technologies. In: ISCA SSW5 (2004)
7. Plumpe, M., Acero, A., Hon, H.-W., Huang, X.-D.: HMM-based smoothing for concatenative speech synthesis. In: ICSLP 1998 (1998)
8.
Okubo, T., Mochizuki, R., Kobayashi, T.: Hybrid voice conversion of unit selection and generation using prosody dependent HMM. IEICE Trans. Inf. Syst., Vol. E89-D, no. 11 (2006)
9. Mihajlik, P., Fegyó, T., Tüske, Z., Ircing, P.: A morpho-graphemic approach for the recognition of spontaneous speech in agglutinative languages like Hungarian. In: Interspeech 2007 (2007)
10. King, S., Tokuda, K., Zen, H., Yamagishi, J.: Unsupervised adaptation for HMM-based speech synthesis. In: Interspeech 2008 (2008)
11. Gibson, M.: Two-pass decision tree construction for unsupervised adaptation of HMM-based synthesis models. In: Interspeech 2009 (2009)
12. Yamagishi, J., Ling, Z., King, S.: Robustness of HMM-based speech synthesis. In: Interspeech 2008 (2008)
13. Tóth, B., Németh, G.: Hidden Markov model based speech synthesis system in Hungarian. Infocommunications Journal, Vol. LXIII, no. 2008/7 (2008)
14. Mihajlik, P., Tarján, B., Tüske, Z., Fegyó, T.: Investigation of morph-based speech recognition improvements across speech genres. In: Interspeech 2009 (2009)
More informationFunctional Auditory Performance Indicators (FAPI)
Functional Performance Indicators (FAPI) An Integrated Approach to Skill FAPI Overview The Functional (FAPI) assesses the functional auditory skills of children with hearing loss. It can be used by parents,
More informationTranscription System Using Automatic Speech Recognition for the Japanese Parliament (Diet)
Proceedings of the Twenty-Fourth Innovative Appications of Artificial Intelligence Conference Transcription System Using Automatic Speech Recognition for the Japanese Parliament (Diet) Tatsuya Kawahara
More informationTranSegId: A System for Concurrent Speech Transcription, Speaker Segmentation and Speaker Identification
TranSegId: A System for Concurrent Speech Transcription, Speaker Segmentation and Speaker Identification Mahesh Viswanathan, Homayoon S.M. Beigi, Alain Tritschler IBM Thomas J. Watson Research Labs Research
More informationTwo Related Samples t Test
Two Related Samples t Test In this example 1 students saw five pictures of attractive people and five pictures of unattractive people. For each picture, the students rated the friendliness of the person
More informationTagging with Hidden Markov Models
Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,
More informationSecure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics
Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics Anastasis Kounoudes 1, Anixi Antonakoudi 1, Vasilis Kekatos 2 1 The Philips College, Computing and Information Systems
More informationINTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)
INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA) As with other parametric statistics, we begin the one-way ANOVA with a test of the underlying assumptions. Our first assumption is the assumption of
More informationTechnologies for Voice Portal Platform
Technologies for Voice Portal Platform V Yasushi Yamazaki V Hitoshi Iwamida V Kazuhiro Watanabe (Manuscript received November 28, 2003) The voice user interface is an important tool for realizing natural,
More informationSubjective SNR measure for quality assessment of. speech coders \A cross language study
Subjective SNR measure for quality assessment of speech coders \A cross language study Mamoru Nakatsui and Hideki Noda Communications Research Laboratory, Ministry of Posts and Telecommunications, 4-2-1,
More informationReading Competencies
Reading Competencies The Third Grade Reading Guarantee legislation within Senate Bill 21 requires reading competencies to be adopted by the State Board no later than January 31, 2014. Reading competencies
More informationTEXT TO SPEECH SYSTEM FOR KONKANI ( GOAN ) LANGUAGE
TEXT TO SPEECH SYSTEM FOR KONKANI ( GOAN ) LANGUAGE Sangam P. Borkar M.E. (Electronics)Dissertation Guided by Prof. S. P. Patil Head of Electronics Department Rajarambapu Institute of Technology Sakharale,
More informationTRAFFIC MONITORING WITH AD-HOC MICROPHONE ARRAY
4 4th International Workshop on Acoustic Signal Enhancement (IWAENC) TRAFFIC MONITORING WITH AD-HOC MICROPHONE ARRAY Takuya Toyoda, Nobutaka Ono,3, Shigeki Miyabe, Takeshi Yamada, Shoji Makino University
More informationEmotion Detection from Speech
Emotion Detection from Speech 1. Introduction Although emotion detection from speech is a relatively new field of research, it has many potential applications. In human-computer or human-human interaction
More informationUsing the Amazon Mechanical Turk for Transcription of Spoken Language
Research Showcase @ CMU Computer Science Department School of Computer Science 2010 Using the Amazon Mechanical Turk for Transcription of Spoken Language Matthew R. Marge Satanjeev Banerjee Alexander I.
More informationSWING: A tool for modelling intonational varieties of Swedish Beskow, Jonas; Bruce, Gösta; Enflo, Laura; Granström, Björn; Schötz, Susanne
SWING: A tool for modelling intonational varieties of Swedish Beskow, Jonas; Bruce, Gösta; Enflo, Laura; Granström, Björn; Schötz, Susanne Published in: Proceedings of Fonetik 2008 Published: 2008-01-01
More informationII. DISTRIBUTIONS distribution normal distribution. standard scores
Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,
More informationHardware Implementation of Probabilistic State Machine for Word Recognition
IJECT Vo l. 4, Is s u e Sp l - 5, Ju l y - Se p t 2013 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2
More informationAutomated Transcription of Conversational Call Center Speech with Respect to Non-verbal Acoustic Events
Automated Transcription of Conversational Call Center Speech with Respect to Non-verbal Acoustic Events Gellért Sárosi 1, Balázs Tarján 1, Tibor Fegyó 1,2, and Péter Mihajlik 1,3 1 Department of Telecommunication
More informationDescriptive Statistics
Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize
More informationTalking machines?! Present and future of speech technology in Hungary
INVITED PAPER Talking machines?! Present and future of speech technology in Hungary GÉZA NÉMETH, GÁBOR OLASZY, KLÁRA VICSI, TIBOR FEGYÓ Budapest University of Technology and Economics, Department of Telecommunications
More informationDIXI A Generic Text-to-Speech System for European Portuguese
DIXI A Generic Text-to-Speech System for European Portuguese Sérgio Paulo, Luís C. Oliveira, Carlos Mendes, Luís Figueira, Renato Cassaca, Céu Viana 1 and Helena Moniz 1,2 L 2 F INESC-ID/IST, 1 CLUL/FLUL,
More informationInput Support System for Medical Records Created Using a Voice Memo Recorded by a Mobile Device
International Journal of Signal Processing Systems Vol. 3, No. 2, December 2015 Input Support System for Medical Records Created Using a Voice Memo Recorded by a Mobile Device K. Kurumizawa and H. Nishizaki
More informationAdditional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
More informationSlovak Automatic Dictation System for Judicial Domain
Slovak Automatic Dictation System for Judicial Domain Milan Rusko 1(&), Jozef Juhár 2, Marián Trnka 1, Ján Staš 2, Sakhia Darjaa 1, Daniel Hládek 2, Róbert Sabo 1, Matúš Pleva 2, Marián Ritomský 1, and
More informationPresentation Video Retrieval using Automatically Recovered Slide and Spoken Text
Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text Matthew Cooper FX Palo Alto Laboratory Palo Alto, CA 94034 USA cooper@fxpal.com ABSTRACT Video is becoming a prevalent medium
More information31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
More informationSpeech Transcription
TC-STAR Final Review Meeting Luxembourg, 29 May 2007 Speech Transcription Jean-Luc Gauvain LIMSI TC-STAR Final Review Luxembourg, 29-31 May 2007 1 What Is Speech Recognition? Def: Automatic conversion
More informationUsing Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents
Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents Michael J. Witbrock and Alexander G. Hauptmann Carnegie Mellon University ABSTRACT Library
More informationGrant: LIFE08 NAT/GR/000539 Total Budget: 1,664,282.00 Life+ Contribution: 830,641.00 Year of Finance: 2008 Duration: 01 FEB 2010 to 30 JUN 2013
Coordinating Beneficiary: UOP Associated Beneficiaries: TEIC Project Coordinator: Nikos Fakotakis, Professor Wire Communications Laboratory University of Patras, Rion-Patras 26500, Greece Email: fakotaki@upatras.gr
More informationDesign and Data Collection for Spoken Polish Dialogs Database
Design and Data Collection for Spoken Polish Dialogs Database Krzysztof Marasek, Ryszard Gubrynowicz Department of Multimedia Polish-Japanese Institute of Information Technology Koszykowa st., 86, 02-008
More informationTHE RWTH ENGLISH LECTURE RECOGNITION SYSTEM
THE RWTH ENGLISH LECTURE RECOGNITION SYSTEM Simon Wiesler 1, Kazuki Irie 2,, Zoltán Tüske 1, Ralf Schlüter 1, Hermann Ney 1,2 1 Human Language Technology and Pattern Recognition, Computer Science Department,
More informationAutomatic Evaluation Software for Contact Centre Agents voice Handling Performance
International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 Automatic Evaluation Software for Contact Centre Agents voice Handling Performance K.K.A. Nipuni N. Perera,
More informationConvention Paper Presented at the 118th Convention 2005 May 28 31 Barcelona, Spain
Audio Engineering Society Convention Paper Presented at the 118th Convention 25 May 28 31 Barcelona, Spain 6431 This convention paper has been reproduced from the author s advance manuscript, without editing,
More informationTraining Ircam s Score Follower
Training Ircam s Follower Arshia Cont, Diemo Schwarz, Norbert Schnell To cite this version: Arshia Cont, Diemo Schwarz, Norbert Schnell. Training Ircam s Follower. IEEE International Conference on Acoustics,
More informationTesting Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationProgram curriculum for graduate studies in Speech and Music Communication
Program curriculum for graduate studies in Speech and Music Communication School of Computer Science and Communication, KTH (Translated version, November 2009) Common guidelines for graduate-level studies
More informationA General Evaluation Framework to Assess Spoken Language Dialogue Systems: Experience with Call Center Agent Systems
Conférence TALN 2000, Lausanne, 16-18 octobre 2000 A General Evaluation Framework to Assess Spoken Language Dialogue Systems: Experience with Call Center Agent Systems Marcela Charfuelán, Cristina Esteban
More informationKNOWLEDGE-BASED IN MEDICAL DECISION SUPPORT SYSTEM BASED ON SUBJECTIVE INTELLIGENCE
JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 22/2013, ISSN 1642-6037 medical diagnosis, ontology, subjective intelligence, reasoning, fuzzy rules Hamido FUJITA 1 KNOWLEDGE-BASED IN MEDICAL DECISION
More informationEvaluation of speech technologies
CLARA Training course on evaluation of Human Language Technologies Evaluations and Language resources Distribution Agency November 27, 2012 Evaluation of speaker identification Speech technologies Outline
More informationTranscription System for Semi-Spontaneous Estonian Speech
10 Human Language Technologies The Baltic Perspective A. Tavast et al. (Eds.) 2012 The Authors and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms
More informationBuilding A Vocabulary Self-Learning Speech Recognition System
INTERSPEECH 2014 Building A Vocabulary Self-Learning Speech Recognition System Long Qin 1, Alexander Rudnicky 2 1 M*Modal, 1710 Murray Ave, Pittsburgh, PA, USA 2 Carnegie Mellon University, 5000 Forbes
More informationABSTRACT 2. SYSTEM OVERVIEW 1. INTRODUCTION. 2.1 Speech Recognition
The CU Communicator: An Architecture for Dialogue Systems 1 Bryan Pellom, Wayne Ward, Sameer Pradhan Center for Spoken Language Research University of Colorado, Boulder Boulder, Colorado 80309-0594, USA
More informationGender Identification using MFCC for Telephone Applications A Comparative Study
Gender Identification using MFCC for Telephone Applications A Comparative Study Jamil Ahmad, Mustansar Fiaz, Soon-il Kwon, Maleerat Sodanil, Bay Vo, and * Sung Wook Baik Abstract Gender recognition is
More informationTracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object
More informationPractical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
More informationChapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS
Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple
More informationSPEAKER IDENTITY INDEXING IN AUDIO-VISUAL DOCUMENTS
SPEAKER IDENTITY INDEXING IN AUDIO-VISUAL DOCUMENTS Mbarek Charhad, Daniel Moraru, Stéphane Ayache and Georges Quénot CLIPS-IMAG BP 53, 38041 Grenoble cedex 9, France Georges.Quenot@imag.fr ABSTRACT The
More information