Evaluation of Acoustic Model and Unsupervised Speaker Adaptation for Child Speech Recognition in Real Environment


Vol. 47 No. 7, July 2006

Mitsuru Samejima, Randy Gomez, Akinobu Lee, Hiroshi Saruwatari and Kiyohiro Shikano

Children's utterances differ substantially from adult speech, not only in their acoustic properties but also in their incorrect pronunciation and ill-formed speaking style. Rapid physiological changes during growth also prevent accurate speech recognition with a single model. Moreover, natural read speech is difficult to collect from children, since forcing them to read a sentence precisely makes the utterances far from spontaneous. In this research, we evaluated acoustic models and an unsupervised adaptation method using a large amount of real spontaneous child speech collected automatically through an actual spoken dialogue system. An acoustic model trained on actual spontaneous speech achieves a word accuracy of 71.1%, outperforming one trained on read speech by 23.9%. A detailed investigation is carried out by age class (infant pupils, lower-grade elementary schoolers and higher-grade elementary schoolers), and the accuracy for infant pupils improved greatly with the age-dependent model. We then propose a speaker clustering method that enables unsupervised speaker adaptation based on HMM Sufficient Statistics on an automatically collected database in which no user tag is available. Clustering the 59,966 utterances into 200 speaker clusters and selecting the nearest cluster for each input to construct the adapted model yields a further improvement in recognition accuracy of 1.5% over the age-class dependent models.

Graduate School of Information Science, Nara Institute of Science and Technology
Presently with Graduate School of Engineering, Nagoya Institute of Technology

Fig. 1 Spoken-dialogue information agent Takemarukun.

Table 1 Breakdown of spontaneous child speech data collected by spoken dialogue system.

Table 2 Most frequent utterances in each child age class. Counts and shares of the ten most frequent utterances per class (utterance text and class labels not recovered).

Rank  Class 1        Class 2          Class 3
1     834 (8.03%)    1,638 (4.25%)    299 (2.44%)
2     251 (2.42%)    731 (1.90%)      218 (1.78%)
3     239 (2.30%)    673 (1.75%)      203 (1.65%)
4     169 (1.63%)    672 (1.74%)      198 (1.61%)
5     157 (1.51%)    594 (1.54%)      164 (1.34%)
6     147 (1.42%)    558 (1.45%)      163 (1.33%)
7     128 (1.23%)    518 (1.35%)      152 (1.24%)
8     119 (1.15%)    438 (1.14%)      147 (1.20%)
9     115 (1.11%)    433 (1.12%)      130 (1.06%)
10    114 (1.10%)    377 (0.98%)      119 (0.97%)

Table 3 Conditions of acoustic analysis.

Sampling      16 kHz / 16 bit
Frame length  25 ms
Frame shift   10 ms
Features      MFCC + ΔMFCC + ΔPower (25 dimensions)
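The analysis conditions in Table 3 translate directly into frame sizes in samples. The sketch below assumes that 25 ms is the analysis window and 10 ms the frame shift, which is the usual reading of such a table but is not stated explicitly in the recovered text; the function name `frame_count` is hypothetical.

```python
def frame_count(num_samples, sample_rate_hz=16000, window_ms=25, shift_ms=10):
    """Return how many full analysis frames fit into a signal.

    Assumed reading of Table 3: 16 kHz sampling, 25 ms window, 10 ms shift.
    """
    window = sample_rate_hz * window_ms // 1000  # 400 samples at 16 kHz
    shift = sample_rate_hz * shift_ms // 1000    # 160 samples at 16 kHz
    if num_samples < window:
        return 0  # signal shorter than one window
    return 1 + (num_samples - window) // shift


# One second of audio at 16 kHz
frames_in_one_second = frame_count(16000)
```

Under these assumptions a one-second utterance yields 98 feature vectors, each of the 25 dimensions listed in Table 3.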

Table 4 Experimental conditions for evaluation of age class dependent models.

Table 5 Breakdown of acoustic models.

Table 6 Specifications of test set for each age class.

Fig. 2 Results of children's speech recognition using SI model. Word accuracy: SI 71.1%, CSRC 47.2% (a difference of 23.9%).
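The results above are reported as word accuracy. As a minimal sketch, assuming the conventional definition (N - S - D - I) / N, where N is the number of reference words and S, D and I are substitutions, deletions and insertions, the measure can be computed from the word-level edit distance. The function names and the toy word sequences are illustrative and do not come from the paper.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion of a reference word
                           cur[j - 1] + 1,            # insertion of a hypothesis word
                           prev[j - 1] + (r != h)))   # match or substitution
        prev = cur
    return prev[-1]


def word_accuracy(ref_words, hyp_words):
    """Word accuracy in percent: 100 * (N - S - D - I) / N."""
    n = len(ref_words)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return 100.0 * (n - edit_distance(ref_words, hyp_words)) / n


# Toy example: one substitution (b -> x) and one deletion (d)
accuracy = word_accuracy("a b c d".split(), "a x c".split())
```

Because insertions are penalized, word accuracy can fall below zero for very poor hypotheses. The 23.9% gap in Fig. 2 is the absolute difference between the two accuracies (71.1 - 47.2).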

Fig. 3 Evaluation of age-dependent models. Reported values: 82.1%, 77.6%, 53.6% (series labels not recovered).

Fig. 4 Automatic speaker clustering.

Fig. 5 Distance measuring using vowel segments.
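The automatic speaker clustering of Fig. 4 relies on K-means over MFCC-based features. The individual steps are not recoverable from this text, so the following is only a generic K-means sketch. It assumes that each utterance is summarized by one fixed-length vector (for example an average MFCC vector over vowel segments) and that the first k vectors serve as initial centroids; both choices, and all names, are illustrative assumptions rather than the authors' procedure.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(vector, centroids):
    """Index of the centroid closest to the given vector."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(vector, centroids[c]))


def kmeans(points, k, max_iterations=100):
    """Plain K-means; returns (centroids, assignment)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]  # assumed deterministic initialization
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [nearest_cluster(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # converged: no utterance changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Toy two-dimensional "utterance vectors" forming two obvious groups
utterance_vectors = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
                     [10.0, 11.0], [1.0, 0.0], [11.0, 10.0]]
centroids, assignment = kmeans(utterance_vectors, k=2)
selected = nearest_cluster([9.0, 9.0], centroids)  # cluster chosen for a new input
```

In the paper's setting the same idea would be applied to the 59,966 collected utterances with N = 200 clusters, with `nearest_cluster` playing the role of selecting the neighbor cluster for each input utterance.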

Table 7 Experimental conditions for evaluation of proposed adaptation method.

Table 8 Selection rate of age-class model for age evaluation data (row and column labels not recovered).

89.5%    3.3%    3.5%
10.5%   96.8%   17.3%
 0.0%    0.0%   79.3%

Fig. 6 HMM-statistics speaker adaptation using multiple initial acoustic models and automatic speaker clustering.

Fig. 7 Results of children's speech recognition using adapted models.
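Adaptation based on HMM Sufficient Statistics builds the adapted model by pooling precomputed statistics of the selected clusters instead of re-training on their audio. The sketch below shows only the pooling step for a single diagonal-covariance Gaussian, using the standard re-estimation formulas (mean = pooled first-order sum / pooled occupancy, variance = pooled second-order sum / pooled occupancy - mean squared). The dictionary layout and the function name are assumptions for illustration; the paper's full procedure in Fig. 6 is not recoverable here.

```python
def combine_statistics(stats_list):
    """Pool sufficient statistics of one Gaussian across selected clusters.

    Each entry is a dict (hypothetical layout) with:
      'occ'    : occupancy count (float)
      'first'  : per-dimension sum of observations
      'second' : per-dimension sum of squared observations
    Returns the re-estimated (mean, variance) as lists.
    """
    occ = sum(s["occ"] for s in stats_list)
    if occ <= 0:
        raise ValueError("total occupancy must be positive")
    dim = len(stats_list[0]["first"])
    first = [sum(s["first"][d] for s in stats_list) for d in range(dim)]
    second = [sum(s["second"][d] for s in stats_list) for d in range(dim)]
    mean = [f / occ for f in first]
    variance = [sec / occ - m * m for sec, m in zip(second, mean)]
    return mean, variance


# Toy statistics: cluster A saw the values 1 and 3, cluster B saw 5 and 7
cluster_a = {"occ": 2.0, "first": [4.0], "second": [10.0]}
cluster_b = {"occ": 2.0, "first": [12.0], "second": [74.0]}
mean, variance = combine_statistics([cluster_a, cluster_b])
```

Pooling the statistics gives the same mean and variance as estimating directly from all four values, which is why the adapted model can be built quickly once the nearest cluster or clusters have been selected.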

Table 9 Word accuracy of adapted model for each cluster granularity (cluster counts and the first accuracy column not recovered).

81.1%   78.0%
83.7%   78.5%
84.1%   78.4%
84.3%   79.6%
84.2%   79.2%
83.1%   78.3%
83.5%   78.6%

1) software/
2) Narayanan, S. and Potamianos, A.: Creating conversational interfaces for children, IEEE Trans. Speech and Audio Processing, Vol.10, No.2, pp (Feb. 2002).
3) Arunachalam, S., Gould, D., Andersen, E., Byrd, D. and Narayanan, S.: Politeness and frustration language in child-machine interactions, Proc. EUROSPEECH, pp (Sep. 2001).
4) Potamianos, A., Narayanan, S. and Lee, S.: Automatic speech recognition for children, Proc. EUROSPEECH, pp (Sep. 1997).
5) Wilpon, J.G. and Jacobsen, C.N.: A study of speech recognition for children and elderly, Proc. ICASSP, pp (May 1996).
6) Shobaki, K., Hosom, J.P. and Cole, R.A.: The OGI Kids' Speech Corpus and recognizers, Proc. ICSLP, Vol.4, pp (2000).
7) 2-Q-7, pp (Sep. 2000).
8) Vol.J87-D-II, No.8, pp (2004).
9) Vol.J87-D-II, No.3, pp
10) Yoshizawa, S., Baba, A., Matsunami, K., Mera, Y., Yamada, M. and Shikano, K.: Unsupervised speaker adaptation based on Sufficient HMM Statistics of selected speakers, Proc. ICASSP, pp (2001).
11) 2004-SLP-53-9, pp (2004).
12) Itou, K., Yamamoto, M., Takeda, K., Takezawa, T., Matsuoka, T., Kobayashi, T., Shikano, K. and Itahashi, S.: JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research, The Journal of the Acoustical Society of Japan (E), Vol.20, pp (1999).
13) 2002 SLP-48-1 (2003).
14) Kawahara, T., Lee, A., Kobayashi, T., Takeda, K., Minematsu, N., Sagayama, S., Itou, A., Ito, K., Yamamoto, M., Yamada, A., Utsuro, T. and Shikano, K.: Free software toolkit for Japanese large vocabulary continuous speech recognition, Proc. ICSLP, Ob(16)-V-07, pp.iv (2000).
15) N-gram 2003-SLP-45-13, pp (2003).
16) Leggetter, C.J. and Woodland, P.C.: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language, Vol.9, pp (1995).
17) Gomez, R., Lee, A., Saruwatari, H. and Shikano, K.: Unsupervised Speaker Adaptation Based on HMM Sufficient Statistics Using Multiple Acoustic Models Under Noisy Environment, Proc. Acoustical Society of Japan, pp (2004).
18) Nguyen, L., Matsoukas, S., Davenport, J., Kubala, F. and Schwartz, R.: Progress in transcription of Broadcast News using Byblos, Speech Communication, Vol.38, pp (2002).
19) S83-48, pp (1983).
20) 2004-SLP, pp (2004).

Randy Gomez: 1998, Mindanao State University-Iligan Institute of Technology; 2002, University of New South Wales, Electrical Engineering.

