Japanese Phoneme Recognition Based on Recurrent Neural Network Integrating Dynamic Parameters

Journal of Communication and Computer 9 (2012), David Publishing

Mohammed Rokibul Alam Kotwal 1, Konica Bhowmik 1, Md. Merajul Islam 2 and Mohammad Nurul Huda 1
1. Department of Computer Science and Engineering, United International University, Dhaka-1209, Bangladesh
2. Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka-1216, Bangladesh

Received: August 04, 2011 / Accepted: September 06, 2011 / Published: March 31,

Mohammad Nurul Huda, Ph.D., associate professor, research fields: phonetics, automatic speech recognition, neural networks, artificial intelligence, algorithms.
Corresponding author: Mohammed Rokibul Alam Kotwal, research assistant, lecturer, research fields: neural networks, phonetics, automatic speech recognition and data mining. rokib_kotwal@yahoo.com.

Abstract: This paper presents a method for Japanese phoneme recognition based on a recurrent neural network (RNN) that integrates dynamic parameters (Δ and ΔΔ). In automatic speech recognition (ASR), systems based on articulatory features (AFs) or distinctive phonetic features (DPFs) outperform systems based on acoustic features. Their performance can be improved further by incorporating articulatory dynamic parameters. We propose a phoneme recognition system that comprises three stages: (1) DPF extraction from acoustic features using an RNN, (2) incorporation of dynamic parameters into a multilayer neural network (MLN) for reducing DPF context, and (3) addition of an Inhibition/Enhancement (In/En) network for categorizing the DPF movement more accurately, followed by the Gram-Schmidt orthogonalization procedure for decorrelating the inhibited/enhanced data vector before it is passed to a hidden Markov model (HMM)-based classifier. Experiments on Japanese Newspaper Article Sentences (JNAS) show that the proposed method provides a higher phoneme correct rate than the method that does not incorporate dynamic articulatory parameters. Moreover, it reaches a higher performance with fewer mixture components in the HMMs.

Key words: Distinctive phonetic feature, multi-layer neural network, recurrent neural network, inhibition/enhancement network, local features.

1. Introduction

In automatic speech recognition (ASR), articulatory features (AFs) or distinctive phonetic features (DPFs) play an important role [1-3]. These features provide a higher word recognition performance in both clean and noise-corrupted acoustic environments [4-5]. Moreover, a higher phoneme recognition performance in different acoustic environments is also achieved using these features [6-7]. The reason for the better recognition performance is that these features generate a wide margin of acoustic likelihood between two phonemes, and this margin is not affected much by noisy environments. Besides, to resolve coarticulation effects, these methods incorporate a context window of limited size instead of context-sensitive triphone models, which require a large-scale speech corpus and a large number of speech parameters.

The context window in a multilayer neural network (MLN)-based speech recognition system reduces the coarticulation effect slightly and consequently provides a reasonable performance with fewer mixture components in the hidden Markov models (HMMs). On the other hand, a recurrent neural network (RNN) with feedback connections models a context window of up to several frames and shows a better performance [8]. This performance was improved slightly further by adding an MLN in the method we proposed earlier [9], which reduces DPF fluctuations at phoneme boundaries. The reason for not obtaining a

higher performance improvement is that the second-stage MLN cannot handle a longer context.

In this paper, we propose a phoneme recognition method that incorporates dynamic articulatory parameters (Δ and ΔΔ) at the second stage to reduce the coarticulation effect further. The method comprises three stages: (i) DPF extraction from acoustic features using a recurrent neural network (RNN), (ii) incorporation of dynamic parameters into a multilayer neural network (MLN) for constraining the context, and (iii) addition of an Inhibition/Enhancement (In/En) network for categorizing the DPF movement more accurately, followed by the Gram-Schmidt (GS) orthogonalization procedure for decorrelating the inhibited/enhanced data vector before it is passed to an HMM-based classifier. The novelty of this paper is the incorporation of dynamic articulatory parameters to reduce the coarticulation effect further.

The paper is organized as follows: Section 2 discusses the articulatory features. Section 3 explains the system configurations of the existing and proposed methods. The experimental database and setup are provided in Section 4, while experimental results are analyzed in Section 5. Finally, Section 6 draws some conclusions and remarks on future work.

2. Distinctive Phonetic Features

A phone can easily be identified by its unique set of articulatory features or distinctive phonetic features (DPFs) [10-11]. The Japanese balanced DPF set [4] for classifying Advanced Telecommunications Research Institute International (ATR) phonemes has 15 elements. These DPF elements are mora, high, low, intermediate between high and low <nil>, anterior, back, intermediate between anterior and back <nil>, coronal, plosive, affricate, continuant, voiced, unvoiced, nasal and semi-vowel. Table 1 shows a part of this balanced DPF set.
Here, present and absent elements of the DPFs are indicated by + and - signs, respectively.

Table 1 Japanese balanced DPF-set.
DPF/Phone: a, e, f, r
DPFs: mora, high, low, nil, anterior, back, nil

3. Phoneme Recognition Methods

3.1 The Existing Method

The existing method comprises two neural networks, (i) an RNN and (ii) an MLN, which together are called a hybrid neural network (HNN) and are shown in Fig. 1. The RNN represents dynamics in a sequence of acoustic features to resolve coarticulation effects, and the MLN reduces fluctuation of DPF patterns. The external input acoustic vector for the RNN at time t is formed by taking the preceding (t - 3)-th and succeeding (t + 3)-th frames together with the current t-th frame. Each frame is composed of 25 local features (LFs) [12], the same features as those used in DPF-based phoneme recognition with an MLN [4]. The RNN outputs 45 DPF values, of which 15 are for the preceding frame, 15 for the current frame, and the rest for the succeeding frame. Next, the MLN outputs 45 DPF values for the current input frame by reducing DPF fluctuation. After that, the 45-dimensional DPF vector output by the MLN is fed into the In/En network, which will be described in Section 3.2.2, to obtain categorical DPF movements. The elements of the inhibited/enhanced data vector are then decorrelated from each other by the GS orthogonalization procedure [9] before being passed to the HMM-based classifier.

A fully recurrent neural network (FRNN), which has a hidden layer of 350 units and an output layer, is used for this approach. At each time step, the total input vector is formed by taking the output layer (OL) feedback values and the hidden layer (HL) feedback values together with the external input of (25 × 3) LF values for that time.
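To make the formation of the total input vector concrete, the following is a minimal Python/NumPy sketch of one FRNN time step. It is an illustration under stated assumptions rather than the authors' implementation: the function names (`external_input`, `frnn_step`, `run_frnn`, `make_random_params`), the random weights, the clamping of frame indices at utterance edges, and the choice to compute the output layer from the hidden layer alone are all hypothetical. The layer sizes, the initial feedback value of 0.1, and the sigmoid nonlinearity follow the values reported in the paper.

```python
import numpy as np

N_LF = 25    # local features per frame (from the text)
N_CTX = 3    # frames t-3, t, t+3
N_HID = 350  # hidden units (from the text)
N_OUT = 45   # 15 DPFs x 3 frames


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def external_input(lf, t, offset=3):
    """Stack frames t-offset, t, t+offset of an (n_frames, 25) LF matrix.

    Assumption: indices are clamped at the utterance edges.
    """
    idx = np.clip([t - offset, t, t + offset], 0, len(lf) - 1)
    return lf[idx].reshape(-1)  # 75 values


def frnn_step(x_ext, h_prev, y_prev, W_h, b_h, W_o, b_o):
    """One time step: external input plus HL and OL feedback."""
    total = np.concatenate([x_ext, h_prev, y_prev])  # 75 + 350 + 45 = 470
    h = sigmoid(W_h @ total + b_h)
    # Assumption: the output layer reads the hidden layer only.
    y = sigmoid(W_o @ h + b_o)
    return h, y


def make_random_params(rng, scale=0.1):
    """Hypothetical untrained weights, only to exercise the shapes."""
    n_total = N_LF * N_CTX + N_HID + N_OUT
    W_h = rng.normal(0.0, scale, (N_HID, n_total))
    W_o = rng.normal(0.0, scale, (N_OUT, N_HID))
    return W_h, np.zeros(N_HID), W_o, np.zeros(N_OUT)


def run_frnn(lf, params):
    """Run the network over an utterance; feedback starts at 0.1."""
    h = np.full(N_HID, 0.1)
    y = np.full(N_OUT, 0.1)
    outputs = []
    for t in range(len(lf)):
        h, y = frnn_step(external_input(lf, t), h, y, *params)
        outputs.append(y)
    return np.stack(outputs)  # (n_frames, 45)
```

In this sketch the total input has 470 elements per time step, which is simply the sum of the 75 external LF values, the 350 hidden-layer feedback values, and the 45 output-layer feedback values.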

Fig. 1 The existing method [9] without articulatory dynamic parameters. (Block diagram: speech, local feature extraction, RNN with external inputs x(t-3), x(t), x(t+3) of 25 LFs each, a hidden layer of 350 units with HL and OL feedback, and outputs y(t-3), y(t), y(t+3) of 15 DPFs each; then MLN, Inhibition/Enhancement network (45), Gram-Schmidt orthogonalization (45), and an HMM using acoustic models (AMs) and a phone list to produce phoneme strings.)

The feedback values of the hidden layer and the output layer at time t = 0 are assumed to be 0.1. The back-propagation through time algorithm is used for training the RNN. The MLN has three layers, including two hidden layers and an output layer, and is trained using the standard back-propagation algorithm. The hidden layers have 180 and 90 units, respectively.

3.2 Proposed Method

The proposed method is depicted in Fig. 2 and comprises three stages:
(1) DPF extraction from acoustic features using a recurrent neural network (RNN);
(2) Incorporation of dynamic parameters into a multilayer neural network (MLN) for constraining the context;
(3) Addition of an Inhibition/Enhancement (In/En) network for categorizing the DPF movement more accurately, followed by the Gram-Schmidt (GS) orthogonalization procedure for decorrelating the inhibited/enhanced data vector before it is passed to the HMM-based classifier.

Fig. 2 Proposed method with articulatory dynamic parameters.

3.2.1 DPF Extractor

The RNN, which has the same architecture and learning mechanism as described in Section 3.1, generates a 45-dimensional DPF vector (15 DPFs × 3) for the current input frame t. The 45-dimensional context-dependent DPF vector provided by the RNN at time t and its corresponding Δ and ΔΔ vectors, calculated by three-point linear regression (LR), are fed into the subsequent MLN, which has four layers including two hidden layers of 300 and 100 units, respectively. This MLN (MLNDyn) is trained using the standard back-propagation algorithm and outputs a 45-dimensional DPF vector in which context effects for the current t-th frame are reduced.

3.2.2 Inhibition/Enhancement Network

The In/En network is used to obtain modified DPF patterns from the patterns produced by the RNN + MLN. The algorithm for this network is given below:

Step 1: For each element of the DPF vectors, find the acceleration (ΔΔ) parameters by using three-point LR.
Step 2: Check whether ΔΔ is positive (concave pattern), negative (convex pattern), or zero (steady state).
Step 3: Calculate f(ΔΔ):

$$
f(\Delta\Delta) =
\begin{cases}
\dfrac{c_1}{1 + (c_1 - 1)\, e^{\beta \Delta\Delta}}, & \text{if the pattern is convex},\\[2ex]
c_2 + \dfrac{2(1 - c_2)}{1 + e^{\beta \Delta\Delta}}, & \text{if the pattern is concave},\\[2ex]
1.0, & \text{if steady state}.
\end{cases}
$$

Step 4: Find the modified DPF patterns by multiplying the DPF patterns by f(ΔΔ).

4. Experiments

4.1 Speech Database

The following two clean data sets are used in our experiments.

D1. Training Data Set
A subset of the Acoustical Society of Japan (ASJ) Continuous Speech Database comprising 4,503 sentences uttered by 30 different male speakers (16 kHz, 16 bit) is used [13].

D2. Test Data Set
This test data set comprises 2,379 JNAS [14] sentences uttered by 16 different male speakers (16 kHz, 16 bit).

4.2 Experimental Setup

The frame length and frame rate (frame shift between two consecutive frames) are set to 25 ms and 10 ms, respectively, to obtain acoustic features from the input speech. The LFs form a 25-dimensional vector consisting of 12 delta coefficients along the time axis, 12 delta coefficients along the frequency axis, and the delta coefficient of the log power of the raw speech signal [12].
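The dynamic parameters of Section 3.2.1 and the In/En steps of Section 3.2.2 can be sketched in Python/NumPy as follows. This is a hedged illustration, not the authors' code. The function names are hypothetical. The three-point linear regression is taken to be the least-squares slope through frames t-1, t, t+1, ΔΔ is assumed to be the Δ of Δ, edge frames are assumed to be repeated at utterance boundaries, and the exponent is assumed to be βΔΔ so that f stays between c2 and c1 and equals 1 at ΔΔ = 0. The default constants (c1 = 4.0, c2 = 0.25, β = 80) are the values reported for the experiments.

```python
import numpy as np


def three_point_delta(x):
    """Slope of the least-squares line through frames t-1, t, t+1.

    x has shape (n_frames, dim). For three equally spaced points the
    slope reduces to (x[t+1] - x[t-1]) / 2.
    Assumption: the first and last frames are repeated at the edges.
    """
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0


def append_dynamics(dpf):
    """Append delta and delta-delta to a (n_frames, 45) DPF matrix.

    Assumption: delta-delta is the delta of the delta track, giving a
    45 + 45 + 45 = 135 dimensional input for the second-stage MLN.
    """
    d = three_point_delta(dpf)
    dd = three_point_delta(d)
    return np.hstack([dpf, d, dd])


def inhibit_enhance(dpf, c1=4.0, c2=0.25, beta=80.0):
    """Steps 1 to 4 of the In/En network, applied element-wise."""
    # Step 1: acceleration parameters.
    dd = three_point_delta(three_point_delta(dpf))
    # Clipping the exponent only guards against overflow.
    z = np.exp(np.clip(beta * dd, -60.0, 60.0))
    f_convex = c1 / (1.0 + (c1 - 1.0) * z)         # dd < 0: enhance
    f_concave = c2 + 2.0 * (1.0 - c2) / (1.0 + z)  # dd > 0: inhibit
    # Steps 2 and 3: choose the factor from the sign of dd.
    f = np.where(dd < 0, f_convex, np.where(dd > 0, f_concave, 1.0))
    # Step 4: modify the DPF patterns.
    return dpf * f
```

Under these assumptions, a local peak (convex pattern) is scaled by a factor that approaches c1, a local dip (concave pattern) by a factor that approaches c2, and a steady track is left unchanged.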
Phoneme correct rate (PCR) for the D2 data set is evaluated using an HMM-based classifier. The D1 data set is used to design 38 Japanese monophone HMMs with five states, three loops, and left-to-right models. In the HMMs, the output probabilities are represented in the form of Gaussian mixtures, and diagonal covariance matrices are used. The number of mixture components is set to 1, 2, 4, 8, and 16.

In our experiments with the RNN and MLN, the non-linear function for the hidden and output layers is a sigmoid from 0 to 1, 1/(1 + exp(-x)). For the In/En network, c1, c2, and β are set to 4.0, 0.25, and 80, respectively.

To evaluate PCRs on the D2 data set and observe the effects of the articulatory dynamic parameters (Δ and ΔΔ), the following six experiments are designed, where the input features for the HMM-based classifier are DPFs of 45 dimensions for both the existing and proposed methods.
(1) DPF(RNN+Not-Δ.MLN,dim:45);
(2) DPF(RNN+Not-Δ.MLN+GS,dim:45);
(5) DPF(RNN+Δ.MLN,dim:45);
(9) DPF(RNN+Not-Δ.MLN+In/En+GS,dim:45);
(i) DPF(RNN+Δ.MLN+GS,dim:45);
(q) DPF(RNN+Δ.MLN+In/En+GS,dim:45) [Proposed].

5. Experimental Results and Analysis

Figs. 3 and 4 show the effects of the ΔDPF and ΔΔDPF parameters, which are input to the second-stage MLN of the hybrid neural network (HNN)-based phoneme recognizer. From Fig. 3, in which GS orthogonalization is not used, it is observed that adding Δ and ΔΔ parameters to the method (1) increases PCR by 1.37% at 16 mixture components. Similarly, Fig. 4 shows that Δ and ΔΔ parameters yield an improvement of 2.34% in PCR at 16 mixture components for the HNN-based method (i) with the GS orthogonalization procedure.

Fig. 3 Effects of articulatory dynamic parameters (Δ and ΔΔ) on the method (1), DPF(RNN+Not-Δ.MLN,dim:45). (Plot, clean condition: phoneme correct rate (%) versus number of mixture components for methods (1) and (5).)

Fig. 4 Effects of articulatory dynamic parameters (Δ and ΔΔ) on the method (2) containing GS orthogonalization, DPF(RNN+Not-Δ.MLN+GS,dim:45). (Plot, clean condition: phoneme correct rate (%) versus number of mixture components for methods (2) and (i).)

Fig. 5 also shows the effect of using ΔDPF and ΔΔDPF as input to the second-stage MLN in the hybrid neural network-based phoneme recognizers with In/En and GS. In the figure, the addition of Δ and ΔΔ parameters always increases PCR. For example, at 16 mixture components, the proposed method (q), which incorporates Δ and ΔΔ parameters, improves PCR by 0.73% in comparison with the method (9). It is claimed that the proposed method reduces the number of mixture components in the HMMs and hence the computation time. For example, in Fig. 5, a phoneme correct rate of approximately 81.50% is obtained by the methods (9) and (q) at 16 mixture components and one mixture component, respectively.

Fig. 5 Effects of articulatory dynamic parameters (Δ and ΔΔ) on the method (9) containing In/En and GS orthogonalization, DPF(RNN+Not-Δ.MLN+In/En+GS,dim:45). (Plot, clean condition: phoneme correct rate (%) versus number of mixture components for methods (9) and (q).)

6. Conclusions

This paper has presented an articulatory feature-based phoneme recognition method for an ASR system that uses a hybrid neural network and integrates articulatory dynamic parameters into it. From the experiments on Japanese Newspaper Article Sentences (JNAS), the following conclusions are drawn:
(1) The proposed method provides a higher phoneme correct rate than the method that does not incorporate dynamic articulatory parameters.
(2) It reaches a higher phoneme recognition performance with fewer mixture components in the HMMs.

In the near future, the authors would like to run experiments on Bangla phonemes spoken by Bangladeshi people. Moreover, we intend to evaluate word recognition performance using the proposed method.

References

[1] K. Kirchhoff, et al., Combining acoustic and articulatory feature information for robust speech recognition, Speech Commun. 37 (2002)
[2] K. Kirchhoff, Robust Speech Recognition Using Articulatory Information, Ph.D. thesis, University of Bielefeld, Germany, July
[3] K.Y. Leung, M.W. Mak, S.Y. Kung, Applying articulatory features to telephone-based speaker verification, Proc. IEEE ICASSP 04, 2004, pp
[4] T. Fukuda, W. Yamamoto, T. Nitta, Distinctive phonetic feature extraction for robust speech recognition, Proc. ICASSP 03, 2003, pp
[5] T. Fukuda, T. Nitta, Orthogonalized distinctive phonetic feature extraction for noise-robust automatic speech recognition, The Institute of Electronics, Information and Communication Engineers (IEICE) Transactions on Information and Systems 5 (2004)
[6] Huda, et al., Distinctive phonetic feature (DPF) based phone segmentation using 2-stage multilayer neural network, NCSP 07, Shanghai, China,
[7] L. Ansary, et al., Modeling phones coarticulation effects in a neural network based speech recognition system, Proc. Interspeech,
[8] T. Robinson, An application of recurrent nets to phone probability estimation, IEEE Trans. Neural Networks 5 (1994).
[9] M.N. Huda, et al., Phoneme recognition based on hybrid neural network with inhibition/enhancement of distinctive phonetic feature (DPF) trajectories, InterSpeech 08, Brisbane, Australia,
[10] S. King, P. Taylor, Detection of phonological features in continuous speech using neural networks, Computer Speech and Language 14 (2000)
[11] E. Eide, Distinctive features for use in an automatic speech recognition system, Proc. Eurospeech 2001, pp
[12] T. Nitta, Feature extraction for speech recognition based on orthogonal acoustic-feature planes and LDA, Proc. ICASSP 99, 1999, pp
[13] T. Kobayashi, et al., ASJ continuous speech corpus for research, Acoustical Society of Japan Trans. 48 (1992)
[14] JNAS: Japanese Newspaper Article Sentences, available online at: