
J. Comput. Sci. & Technol., Nov. 2003, Vol.18, No.6, pp.

Towards Robustness to Speech Rate in Mandarin All-Syllable Recognition

CHEN YiNing, ZHU Xuan, LIU Jia and LIU RunSheng

Department of Electronic Engineering, Tsinghua University, Beijing, P.R. China

E-mail: chenyining99@mails.tsinghua.edu.cn

Received April 26, 2002; revised February 14

Abstract In Mandarin all-syllable recognition, many insertion errors occur due to the influence of non-consonant syllables. Introducing a duration model into the recognition process is a direct way to reduce these errors, but it usually does not work as well as expected, because duration is sensitive to speech rate. To address this problem, this paper proposes a novel context-dependent duration distribution normalized by speech rate and applies it to a speech recognition system based on an improved Hidden Markov Model (HMM) framework. To realize the algorithm, the authors employ a new method to estimate the speech rate of a sentence, then compute the duration probability combined with the speech rate, and finally apply this duration information in the post-processing stage. The duration model is thus incorporated efficiently, with little change to the recognition process or the resource demand. The experimental results indicate that the syllable error rates decrease significantly on two different speech corpora. In particular, insertion errors are reduced by about sixty to eighty percent.

Keywords speech recognition, speech rate, duration distribution

1 Introduction

The performance of large-vocabulary continuous speech recognition (LVCSR) systems degrades dramatically as the speech rate varies, while the human auditory system remains robust under this condition [1]. Consequently, it is natural to incorporate speech rate information into the framework of speech recognition.
On the other hand, it is well known that properly introducing a duration model can remove a great many insertion errors in an LVCSR system. However, duration has been shown to be not only context dependent but also speech rate dependent, so both factors must be combined when building the duration model. Several methods have been proposed recently to solve this problem in different ways [2–4]. The most popular ones are transition probability adaptation and speech-rate-dependent acoustic modeling. However, any of these methods brings a sharp increase in either the space complexity [2] or the time complexity [3,4], in exchange for about a 10% reduction in word error rate.

Our method is different. Since duration is closely related to speech rate, we build novel duration models normalized by the speech rate and apply them in the second decoding stage. The modified system also achieves an error reduction of about 10 percent, while the decoding time rises by only about 10% and the memory demand is nearly unchanged. The new duration information therefore improves performance at low cost.

This paper is organized as follows. Section 2 introduces our framework employing the duration models. Section 3 describes how the duration normalized by speech rate is used in the training and decoding processes. Section 4 presents the experimental results. Finally, Section 5 gives our conclusions and future work.

The research is supported by the National Natural Science Foundation of China (Grant No ).

2 Duration Modeling

Our baseline system is quite different from the systems in [2–4]. It derives from the segmental HMM [5,6], which avoids the problems that transition probabilities cause in a traditional HMM. Our duration model is therefore distinct from those used in the earlier methods.

2.1 Baseline Framework

In our framework, a segment of speech observation feature vectors $O = (o_1, o_2, \ldots, o_T)$ represented by the state sequence $S = (s_1, s_2, \ldots, s_N)$ is modeled as

$$p(O, D \mid S) = p(O \mid D, S)\, P(D \mid S) \tag{1}$$

where $D = (d_1, d_2, \ldots, d_N)$ is the duration of each state, $T$ is the number of frames in this speech segment and $N$ is the number of states in the sequence $S$. Here, $p(O, D \mid S)$ is known as the likelihood function, the first factor on the right is the observation distribution and the second factor is the duration distribution.

Since the computational complexity of the segmental HMM is significantly higher than that of the conventional HMM [5], two assumptions are introduced:

• The duration and the observation vectors are independent. Then (1) simplifies to

$$p(O, D \mid S) = P(O \mid S)\, P(D \mid S) \tag{2}$$

• The feature vectors are independent of one another, and so are the state durations, thus

$$P(O \mid S) = \prod_{i=1}^{T} p(o_i \mid S) \tag{3}$$

$$P(D \mid S) = \prod_{k=1}^{N} p(d_k \mid s_k) \tag{4}$$

Therefore, our framework leaves no place for transition probabilities. The duration distribution can be used in either the first or the second stage of the decoding process. In this paper, we use the duration models in the second pass.

2.2 Duration Model Estimation

In the training process, we first obtain supervised segmentations of the training set using only the observation probabilities $p(o \mid s)$, without considering the duration model. Then the histogram of durations is collected for each state. The 1-D Gaussian distribution

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{5}$$

is found to fit the empirical duration distribution satisfactorily, so we employ it in our system. Fig.1 shows the empirical duration distribution and its Gaussian fit for the first state of the semi-syllable [p]. Here the independent variable of the Gaussian distribution is limited to [1, 12], which is reasonable in real speech.

Fig.1. Duration distribution for the first state of [p].
(a: Empirical distribution (solid line); b: Gaussian fit (dashed line).)

To estimate the Gaussian distribution, only the mean $\mu$ and the variance $\sigma^2$ are needed. They are calculated as follows:

$$\mu = \frac{\sum_{i=1}^{d_{\max}} h(i)\, i}{\sum_{i=1}^{d_{\max}} h(i)} \tag{6}$$

$$\sigma^2 = \frac{\sum_{i=1}^{d_{\max}} h(i)\, i^2}{\sum_{i=1}^{d_{\max}} h(i)} - \mu^2 \tag{7}$$

where $h(i)$ is the number of occurrences of duration $i$ and $d_{\max}$ is the largest duration allowed.

3 Duration Normalization with Speech Rate

To use the duration information normalized by speech rate, different methods are employed in the training and decoding processes. In both processes, we first make a robust estimate of the speaking rate and then normalize the duration with it.

3.1 Duration Normalization in Training Process

In the training process, the duration normalization is done with an EM-like (Expectation Maximization) iteration. Since the speech rate is hard to estimate directly, we treat it as the latent variable of an EM algorithm. Through the iteration, the speech-rate-independent duration distribution can be estimated robustly. The details of our method are given in the following steps:

Step 1. Align the speech data using only the observation probabilities $p(o \mid s)$.

Step 2. Get the duration $d_{il}$ of each state $i$ in sentence $l$.

Step 3. Set the loop variable $n = 1$.

Step 4. Estimate the Gaussian duration distribution $N(\mu_i, \sigma_i^2)$ of each state $i$ with (6) and (7) (maximization step of the EM algorithm).

Step 5. Calculate the speech rate of each sentence $l$:

$$\mathrm{speed}_l = \frac{1}{M} \sum_{k=1}^{M} \frac{d_{kl}}{\mu_k} \tag{8}$$

Here the speed is the average ratio between a state's duration in the sentence and that state's mean duration; $d_{kl}$ is the duration of the state $s_{kl}$ in sentence $l$, $\mu_k$ is the mean duration of the state $s_k$ over the whole speech database, and $M$ is the number of states in the sentence, not counting silence states.

Step 6. Normalize the duration of each state in each sentence:

$$d_{kl} \leftarrow d_{kl} / \mathrm{speed}_l \tag{9}$$

Step 7. Calculate the duration probability of the whole corpus, $p(n)$, with (5) (expectation step of the EM algorithm).

Step 8. IF $(p(n) - p(n-1)) / p(n-1) < \mathrm{Threshold}$ THEN end, ELSE set $n = n + 1$ and GOTO Step 4. The Threshold is an experimental parameter that stops the training.

In the training process, we first align the speech data using only the observation probabilities (Step 1) and obtain the duration of each state in all sentences at the same time (Step 2). Then, from these durations and the mean duration of each state (Step 4), we obtain the speech rate estimate (Step 5). The duration is normalized by the speech rate to give a speech-rate-independent duration (Step 6). If the duration model has not yet converged (Step 8), another cycle starts from Step 4. This iterative algorithm yields the normalized duration probability distributions.

3.2 Duration Normalization in Decoding Process

In the decoding process, the method is different.
Because recognition errors are unavoidable, estimating the speech rate is considerably harder here. We use the following steps:

Step 1. Recognize the speech data using only the observation probabilities $p(o \mid s)$.

Step 2. Get the duration $d_{il}$ of each state $i$ in sentence $l$.

Step 3. Calculate the speech rate of each state in each sentence; the speech rate of state $k$ in sentence $l$ is $d_{kl} / \mu_k$.

Step 4. Sort the per-state speech rates within each sentence. In sentence $l$, the set $\{ d_{kl} / \mu_k,\ k = 1, 2, \ldots, M \}$ is ranked into $\{ d'_{kl} / \mu'_k,\ k = 1, 2, \ldots, M \}$.

Step 5. Compute the speech rate of each sentence. For sentence $l$,

$$\mathrm{speed}_l = \frac{1}{M} \sum_{k=M/4}^{3M/4} \frac{d'_{kl}}{\mu'_k} \tag{10}$$

Step 6. Normalize the duration of each state in each sentence with (9).

Step 7. Calculate the duration probability $p(d_{kl} \mid s_k)$ of each state with (5).

In the first stage of the decoding process, we obtain the best recognition results using only the observation distributions (Step 1) and collect the durations of all states (Step 2). From these durations and the mean duration of each state, we can estimate the speech rate of the sentence, much as in the training process. However, where a syllable has been inserted or deleted, the speech rate $d_k / \mu_k$ of the erroneous state $s_k$ is much higher or lower than the real speech rate of the sentence, so an accurate speech rate is difficult to obtain. To solve this problem, we first sort the per-state speech rates (Steps 3 and 4) and then compute the sentence speed from only those items ranked within $[M/4, 3M/4]$ (Step 5). Finally, the normalized duration is obtained (Step 6) and the duration probability of each state is calculated (Step 7). Because the disturbance from inserted and deleted syllables is discarded, reliable duration probabilities can be obtained.
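The per-sentence computations behind Eqs. (5) to (10) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: all function names and the toy sentence are hypothetical, silence states are assumed to have been removed already, and the trimmed estimator averages over the retained ratios (an assumption made here so that its value stays on the same scale as Eq. (8); Eq. (10) as printed divides by $M$).

```python
import math


def fit_duration_gaussian(hist):
    """Eqs. (6)-(7): mean and variance from a histogram {duration: count}."""
    total = sum(hist.values())
    mu = sum(count * d for d, count in hist.items()) / total
    var = sum(count * d * d for d, count in hist.items()) / total - mu * mu
    return mu, var


def state_ratios(durations, means):
    """Per-state speech rate d_k / mu_k (silence states assumed removed)."""
    return [d / mu for d, mu in zip(durations, means)]


def sentence_rate(durations, means):
    """Eq. (8): plain average of the per-state ratios."""
    ratios = state_ratios(durations, means)
    return sum(ratios) / len(ratios)


def trimmed_sentence_rate(durations, means):
    """Idea of Eq. (10): sort the ratios and keep only the middle half.

    Assumption: the kept ratios are averaged over their own count so the
    result stays comparable to Eq. (8); Eq. (10) as printed divides by M.
    """
    ratios = sorted(state_ratios(durations, means))
    m = len(ratios)
    kept = ratios[m // 4:(3 * m) // 4] or ratios  # fall back for tiny M
    return sum(kept) / len(kept)


def normalize(durations, rate):
    """Eq. (9): divide every state duration by the sentence speech rate."""
    return [d / rate for d in durations]


def log_duration_prob(d, mu, var):
    """Logarithm of the Gaussian density in Eq. (5)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (d - mu) ** 2 / (2.0 * var)


# Hypothetical sentence: every state has a mean duration of 4 frames, the
# speaker is 1.5 times slower than average, and one very short state comes
# from a wrongly inserted syllable.
means = [4.0] * 8
durations = [6.0] * 7 + [1.0]
plain = sentence_rate(durations, means)
robust = trimmed_sentence_rate(durations, means)
print(plain, robust)
```

In this toy case the single outlier pulls the plain average of Eq. (8) down to about 1.34, while the trimmed estimate stays at 1.5, which is the behavior the sorting step is meant to provide.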
3.3 Utilizing Normalized Duration in Decoding Process

Mandarin speech includes 408 different non-tone syllables, called all-syllables. The all-syllable list contains many highly confusable syllables. The recognition accuracy of all-syllables has a great effect on the performance of a dictation machine or a dialog system. The syllable error rate of all-syllables can also represent the performance of the acoustic models in a continuous speech recognition system.

In the first stage, our baseline speaker-independent speech recognition system produces N-best all-syllable recognition results with 3 different syllable numbers. In the second stage, the duration models are used to improve the recognition accuracy. The whole process consists of the following two steps.

Step 1. With the assumptions in Subsection 2.1, the log-likelihood $LL(O, D \mid S) = \log p(O, D \mid S)$ can be computed as

$$LL(O, D \mid S) = \sum_{i=1}^{T} \log p(o_i \mid S) + \sum_{k=1}^{M} \log p(d_k \mid s_k) \tag{11}$$

Here the duration probability $p(d_k \mid s_k)$ of state $k$ is obtained through the steps in Subsection 3.2, and $p(o_i \mid S)$ is already available from the first stage of our baseline system. With these terms, $LL(O, D \mid S)$ can be calculated for each candidate at low computational cost, and the candidate with the maximum $LL(O, D \mid S)$ is taken as the result. Because candidates with the wrong number of syllables usually have poor duration probabilities, the duration information effectively improves the recognition accuracy.

Step 2. The experiments reveal that some insertion errors still exist after the previous step. For example, the syllable [tang] is sometimes recognized as [ta] followed by [ang], as shown in Fig.2. Errors of this kind account for about 70% of all insertion errors, so further processing is required. In the best candidate, a syllable is easily split into two syllables, the second of which is often a non-consonant syllable. Accordingly, the speech segment spanned by each pair of neighboring syllables is aligned again with the acoustic models of other syllables. This gives a new log-likelihood $LL(O, D \mid S)$, since the duration probability has changed. The new log-likelihood is then compared with the original one.
If the new one is higher, the merged syllable replaces the original two syllables. Thus [tang] is selected as the correct result if $LL_{[tang]}(O, D \mid S)$ is higher than $LL_{[ta]}(O, D \mid S) + LL_{[ang]}(O, D \mid S)$. Since this method merges two neighboring syllables and then checks the result, we call it the Merge-Check algorithm.

Fig.2. Insert error example.

4 Experimental Results

The following experiments use two different speech corpora to test the robustness of our algorithm. The basic acoustic units adopted in our system are bi-phone units, which include 101 initials, 146 tonal finals, 1 pause and 1 silence. Each initial has 2 states and each final has 4, for a total of 788 states. Each state is modeled by a Gaussian mixture whose number of components is carefully selected.

4.1 Experiment with 863 Corpus

The first experiment is done with the National 863 Standard Mandarin Speech Corpus [7]. The training database includes 20 hours of data spoken by 34 female speakers. The testing database contains 3.6 hours of data spoken by 6 female speakers. The speech rates of these sentences range from about 0.6 to 1.5 times the normal speech rate. The speech is sampled at 16 kHz and quantized linearly to 16 bits. A 20 ms frame length and a 10 ms frame overlap are used. The features are 15 mel-frequency perceptual linear predictive coefficients (MF-PLP) [8], including $C_0$, together with their first- and second-order derivatives. The syllable error rate over the 408 Mandarin non-tone syllables is used to measure the system performance. The result is shown in Table 1.

Table 1. Word Error Rate Improvement for the 863 Corpus (%)
Columns: Baseline | Method A | Method B | Decrease from Baseline to Method B
Rows: Delete error rate :44 | Insert error rate | Substitute error rate | Error rate

The error rate of our baseline system for the acoustic models is 22.69%, which compares well with published Mandarin speech recognition systems. The training and decoding processes follow the steps described previously. In the baseline system, we take the best candidate with the maximum $p(o \mid s)$. In the method A system, we use the candidate after Step 1 in Subsection 3.3, which achieves the best syllable number. In the method B system, we use the candidate after Step 2 in Subsection 3.3, which improves on method A by means of the Merge-Check algorithm.

Both steps in Subsection 3.3 clearly help to decrease the error rate. After Step 2 in particular, the syllable error rate is cut down to 12.58%, while the insert error rate is reduced by 80.50%. The delete error rate increases, but it is only 0.27% higher.

4.2 Experiment with Microsoft Corpus

The corpus in this section is supplied by Microsoft Research Asia [9]. The training set is read by 100 male speakers, each speaking approximately 200 sentences, for a total of 19,688 sentences and 454,315 syllables. The test set involves 25 male speakers, with 20 test sentences per speaker. The speech rate of the test set ranges from 0.8 to 1.3 times the normal rate. The speech is sampled at 16 kHz and quantized to 16 bits. A 20 ms frame length and a 10 ms frame overlap are used. The features are 13 mel-frequency cepstral coefficients (MFCC), including $C_0$, together with their first- and second-order derivatives. The syllable error rate over the 408 Mandarin non-tone syllables is again used as the performance measure.

The error rate of our baseline system for the acoustic models is 26.91%, which is comparable with the performance announced by Microsoft Research Asia in 2001 [9]. The same experiment was run without changing any parameter in our system. Table 2 shows the comparative performance. Here, method C introduces the duration model into the system without speech rate normalization.

Table 2. The Improvement of Word Error Rates with the Microsoft Corpus (%)
Columns: Baseline | Method B | Method C | Decrease from Baseline to Method B
Rows: Delete error rate :38 | Insert error rate | Substitute error rate | Error rate

The results of method B show that our method achieves almost the same improvement on the Microsoft Corpus, although no parameters were changed. The results of method C show that employing the un-normalized duration model causes a sharp drop in system performance.

In addition, the algorithms proposed in this paper have low computational complexity. The recognition time for method B increases by just 10%, and the memory demand is about 6K bytes more than that of the baseline system.

4.3 Relationship Between Insert Error and Speech Rate

Fig.3 shows the probability of insert error versus speech rate, collected from the Microsoft Corpus. A speech rate above 1.0 means that the sentence is spoken slower than normal. The insert probability is the percentage of sentences at that speech rate that contain insert errors. For example, when the speech rate is about 1.3, insert errors appear in every sentence. Clearly, the insert errors increase as the speech rate rises.

Fig.4 shows the probability of insert error versus speech rate after using our method. The reduction in insert errors at high speech rate is much larger than that at low speech rate, so the insert error is now almost independent of the speech rate. When the speech rate is ultra-high, however, our algorithms are of little help.

Fig.3. Probability of insert error vs. speech rate of Microsoft Corpus.
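The insert probability plotted against speech rate is, by the definition in Subsection 4.3, the percentage of sentences at a given speech rate that contain at least one insert error. A minimal Python sketch of that bookkeeping follows; the function name, the bin width of 0.1 and the toy data are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def insert_error_probability(sentences, bin_width=0.1):
    """Percentage of sentences with at least one insertion, per speech-rate bin.

    `sentences` is an iterable of (speech_rate, n_insertions) pairs.
    Returns {bin_center: percent}, ordered by increasing speech rate.
    """
    totals = defaultdict(int)
    with_insert = defaultdict(int)
    for rate, n_insertions in sentences:
        b = round(rate / bin_width)  # index of the nearest bin center
        totals[b] += 1
        if n_insertions > 0:
            with_insert[b] += 1
    return {
        round(b * bin_width, 6): 100.0 * with_insert[b] / totals[b]
        for b in sorted(totals)
    }


# Made-up (speech_rate, n_insertions) pairs, not data from the paper.
toy = [(0.8, 0), (0.8, 0), (1.0, 0), (1.0, 1), (1.3, 2), (1.3, 1)]
result = insert_error_probability(toy)
print(result)
```

A sentence counts once however many insertions it contains, which matches the per-sentence definition given in the text rather than a per-syllable insertion rate.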

Fig.4. Probability of insert error vs. speech rate of Microsoft Corpus.

5 Conclusions

As shown in Subsections 4.1 and 4.2, our method achieves nearly the same improvement with different corpora and different features. With the speech-rate-independent duration model, the total word error rate decreases by about 10 percent. The reduction of the insert errors is the most remarkable: they are cut down to 20%–40% of their baseline level. This performance is much better than that of the method using the un-normalized duration model. As Subsection 4.3 shows, our method can also make the insert error independent of speech rate.

As with other methods, decreasing the insert errors also increases the delete errors. In most Mandarin speech recognition systems, however, the delete error rate is very low, and in our baseline system it is no more than 1%. After applying the normalized duration model in our system, the increase in delete errors is less than 0.5%, which is considered acceptable.

Since our method is used in the second stage of the recognition process and does not need adaptation, the computational complexity is just 10% higher than that of the conventional HMM. Furthermore, the algorithm can be naturally adopted in any system that contains a duration model.

The results in Subsection 4.3 show that our method works well when the speech rate varies within the common range. Future work will address ultra-high and ultra-low speech rates.

Acknowledgement We are grateful to Microsoft Research Asia for supplying the Corpus.

References

[1] David W Carroll. Psychology of Language. Third Edition, Brooks/Cole Publishing Company.
[2] Zheng J, Franco H, Stolcke A. Rate-of-speech modeling for large vocabulary conversational speech recognition. In Proc. the ISCA ITRW ASR2000, Paris, France, 2000, pp.
[3] Martinez F, Tapias D, Alvarez J. Towards speech rate independence in large vocabulary continuous speech recognition. In Proc. ICASSP, vol. 2, New York, NY, USA, May 12–15, 1998, pp.
[4] Kwon O W, Un C K. Context dependent word duration modeling for Korean connected digit recognition. Electronic Letters, 1995, 31(19).
[5] Steve Young. Statistical modeling in continuous speech recognition. In Proc. Int. Conf. Uncertainty in Artificial Intelligence, Seattle, WA, Aug. 2001, pp.
[6] Liu Jia, Pan S X. A new robust telephone speech recognition algorithm with the multi-model structures. Chinese Journal of Electronics, Apr. 2000, 9(2).
[7] Wang R H. National performance assessment of speech recognition systems of Chinese. In Proc. Oriental COCOSDA Workshop'99, Taipei, 1999, pp.
[8] Woodland P C, Gales M J F, Pye D et al. Broadcast news transcription using HTK. In Proc. ICASSP'97, Los Alamitos, CA, USA, 1997, pp.
[9] Eric Chang, Yu Shi, Jianlai Zhou et al. Speech lab in a box: A Mandarin speech toolbox to jumpstart speech related research. In Proc. Eurospeech, Aalborg, Denmark, 2001, pp.

CHEN YiNing received the B.S. and M.S. degrees in electronic engineering from Tsinghua University in 1999 and 2001 respectively. He is a Ph.D. candidate at the Department of Electronic Engineering of Tsinghua University. His research interests are speech recognition and spoken language processing.

ZHU Xuan received the B.S. degree in electronic engineering from Beijing University of Astronautics and Aeronautics in 1998 and the M.S. degree in electronic engineering from Tsinghua University. She is a Ph.D. candidate at the Department of Electronic Engineering of Tsinghua University. Her research interests are speech recognition and embedded signal processing system design.

LIU Jia received the B.S., M.S. and Ph.D. degrees in electronic engineering from Tsinghua University in 1983, 1986 and 1990 respectively, and was a postdoctoral researcher at Cambridge University from 1992. He is a professor at the Department of Electronic Engineering of Tsinghua University. His research interests are speech recognition, speech coding, speech synthesis and speech ASIC design. Dr. Liu is a member of IEEE and a senior member of China Institute of Electronics.

LIU RunSheng received the B.S. degree in electronic engineering from Tsinghua University. He is a professor at the Department of Electronic Engineering of Tsinghua University. His research interests are speech recognition, IC design and CAD.