Ericsson T18s Voice Dialing Simulator

Mauricio Aracena Kovacevic, Anna Dehlbom, Jakob Ekeberg, Guillaume Gariazzo, Eric Lästh and Vanessa Troncoso
Dept. of Signals, Sensors and Systems
Royal Institute of Technology
SE-100 44 Stockholm, SWEDEN
project.yellow@sweden.com

Abstract

This work describes the development of an isolated word recognition system for use in mobile phones. Using a TMS320C6701 EVM, a voice dialing system with functionality similar to that of the Ericsson T18s mobile phone has been designed. The speech recognition algorithm is based on a discrete Hidden Markov Model (HMM), which is used to model the spectral characteristics of the speech. Detection is performed by comparing the observed data of the spoken word with the models of the prerecorded words in the phone book. Tests conducted in a noisy environment with limited training data show that the EVM implementation has an average detection ratio of 65 % and the Matlab implementation a detection ratio of 80 %.

Introduction

Mobile phones are becoming so small that in some situations, e.g. when driving a car, it is difficult and therefore undesirable to use a keypad. Voice-controlled dialing is then a good solution to this kind of problem. Some investigations have used different types of speech features and then compared prerecorded data with new data. Other works have tried neural networks or matched filters to solve the recognition problem. Our assignment, as participants in a project course in signal processing at the Royal Institute of Technology, was to implement a voice dialing system similar to the one found in the Ericsson T18s mobile phone. The implementation consists of a speech recognition algorithm running on the Texas Instruments DSP board TMS320C6701 EVM (an evaluation module) and a user interface displayed on the host PC.
A discrete Hidden Markov Model (HMM) of the words forms the basis of our speech recognition system.

The Speech Recognition System

Overview

The user of the voice recognition system says the word corresponding to the phone number to be called into the microphone. The noisy speech is recorded and saved as a sampled speech signal. The speech signal then passes through an FIR filter, and the start and end points of the word are detected. Thereafter the speech signal, now containing only the word, is divided into frames. For each frame, coefficients that represent the characteristics of the speech are extracted using the Mel Cepstrum method. The extracted coefficients are vector quantized, using an optimized codebook, so that one integer is obtained for each frame. We call this number an observation. A sequence of observations is collected in a vector and used in conjunction with the HMM.

The HMM has two main tasks. The first is to train models for the finite set of words used in the phone book. The second is the recognition itself. Based on the HMM for each word, as determined by the training process, the probability of the observation sequence is computed. A Maximum Likelihood (ML) detector finally chooses the word having the largest probability. If the largest probability is below a given threshold, all the words in the phone book are rejected.

Functionality

The functionality of our system is the same as that of the Ericsson T18s. A maximum of ten labels can be recorded. A label consists of a label name, a phone number and an HMM of the recorded word. Each recording contains only one word. The voice dialing system has three main functionalities:
1. Making a phone call through the voice dialer, which is one of the most important functionalities. Given a phone book with ten prerecorded labels, the system detects the word spoken by the user and then displays the corresponding label name and phone number on the screen.
2. Recording a new label in the phone book. This includes registering the label name and phone number of the new label as well as training the HMM on the word spoken by the user.
3. Improving an existing label by letting the user repeat the word. The increased amount of training data makes recognition of the corresponding word more robust. It is also possible to change the label name and the phone number without making a new recording.

If an incorrect word is spoken, the system displays a rejection message to the user and no label is matched.

Theoretical Solution

Signal Preprocessing

The speech signal is recorded via a microphone, sampled at 8000 Hz and preemphasized by a first-order high-pass FIR filter to limit the background noise and to enhance the higher frequency components. The next step is start and end point detection, for which an energy-based algorithm is used. This removes the background noise before and after the spoken word. Details on this algorithm can be found in [1] and [2]. After the detection, the remaining speech signal is divided into frames of 256 samples each. Successive frames overlap by 156 samples, and each frame is weighted using a Hamming window.

Features Extraction

In speech recognition, one of the most important steps is to extract features of the speech signal so as to obtain observations that represent the characteristics of the speech as well as possible. In our system the Mel Cepstrum, a cepstrum computed on a mel scale, provides the necessary features.
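The preemphasis and framing steps described under Signal Preprocessing can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: the preemphasis coefficient `alpha = 0.95` and all function names are assumptions (the text only specifies a first-order high-pass FIR filter), while the frame length of 256 samples and the overlap of 156 samples are taken from the text. Start and end point detection is omitted.

```python
import math


def preemphasize(signal, alpha=0.95):
    """First-order high-pass FIR filter: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.95 is an assumed value; the paper does not state the coefficient.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


def hamming(length):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def split_into_frames(signal, frame_len=256, overlap=156):
    """Cut the signal into overlapping, Hamming-weighted frames.

    Only full frames are kept; trailing samples that do not fill a
    frame are dropped (an assumption of this sketch).
    """
    hop = frame_len - overlap  # 100 samples between frame starts
    window = hamming(frame_len)
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
        start += hop
    return frames
```

With these settings successive frames start 100 samples apart, so one second of speech sampled at 8000 Hz yields 78 full frames.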
A mel is a unit of measure of perceived pitch, and its scale gives an accurate representation of how pitch is perceived, which lets the method outperform many others. The procedure for calculating the Mel Cepstrum coefficients [3] for each frame is as follows:

1. The spectrum is filtered using a set of bandpass filters with a bandwidth of 300 mel. The center frequencies are spaced 150 mel apart (see footnote 1), leading to overlapping filters. The conversion between mel and Hz is given by

   $F_{\mathrm{mel}} = 1000 \log_2 (1 + F_{\mathrm{Hz}} / 1000)$

   so that, for example, 1000 Hz corresponds to 1000 mel.
2. The integral of the log power spectrum within each band is computed. The calculated values form a sequence.
3. By computing the inverse discrete Fourier transform of the sequence and taking the symmetry into account, nine mel coefficients are finally obtained.

Vector quantization

To utilize the HMM, the speech signal needs to be converted into a finite number of discrete observations. For this purpose, a vector quantizer is used. The feature vector of mel coefficients is assigned the index of the nearest codebook vector in the Euclidean distance sense. In our system, the codebook consists of 256 vectors. The codebook was created from 4000 feature vectors obtained from different speakers. The vector quantizer was trained using the Linde-Buzo-Gray algorithm [4].

Hidden Markov Model

In order to successfully detect the spoken word, we need a statistical model of the word that takes the time-varying behavior of the speech into account. The HMM is well suited for this purpose and is therefore used in our system. A spoken word can be divided into phonemes [3], during which the statistical nature of the sound is fairly stationary. As previously mentioned, the statistics are in our case given by the indices of the vectors in the codebook. These indices are also referred to as observation symbols and are directly related to the spectral characteristics of the signal. Loosely speaking, each phoneme corresponds to a state in the HMM.
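The vector quantization step described above maps each feature vector to the index of its nearest codebook vector. The following is a minimal Python sketch, assuming a codebook that has already been trained (the Linde-Buzo-Gray training itself is not shown). The function names and the tiny two-dimensional codebook are hypothetical; the real system uses 256 codebook vectors of nine mel coefficients each.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(feature_vector, codebook):
    """Return the index of the codebook vector nearest to feature_vector.

    Minimizing the squared distance selects the same vector as
    minimizing the Euclidean distance itself.
    """
    best_index = 0
    best_dist = squared_distance(feature_vector, codebook[0])
    for index in range(1, len(codebook)):
        dist = squared_distance(feature_vector, codebook[index])
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def observation_sequence(feature_vectors, codebook):
    """Turn one feature vector per frame into one observation symbol per frame."""
    return [quantize(vector, codebook) for vector in feature_vectors]


# Toy example with a hypothetical 3-entry, 2-dimensional codebook.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
toy_features = [[0.1, -0.1], [3.5, 4.2], [1.0, 1.0]]
toy_observations = observation_sequence(toy_features, toy_codebook)  # [0, 2, 1]
```

Each frame is thus reduced to a single integer, which is what allows a discrete HMM to be used.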
Footnote 1: Experimentally tested. See ref. [3].

Assigned to each state is an observation symbol probability distribution, which describes the statistical nature of the particular phoneme. The HMM used herein is illustrated in Figure 2 and described in more detail in [1] and [5]:
1. There are six hidden states in our Left-Right model (see Section 2.6.4 of [1]). The states are denoted $S = \{S_1, S_2, \ldots, S_6\}$. Let $q_t$ represent the state at time $t$.
2. Each state can emit any of 256 different discrete observation symbols. The set of symbols is denoted $V = \{v_1, v_2, \ldots, v_{256}\}$.
3. The state transition probability distribution is $A = \{a_{ij}\}$, where $a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i]$ for $1 \le i, j \le 6$.
4. The set of observation symbol probability distributions is denoted $B = \{b_j(k)\}$, where $j$ is the state number and $b_j(k) = P[v_k \text{ at } t \mid q_t = S_j]$ for $1 \le j \le 6$, $1 \le k \le 256$.
5. The initial state distribution is denoted $\pi_i = \Pr[q_1 = S_i]$ for $1 \le i \le 6$. Due to the particular Left-Right model being used, $\pi_i = 1$ for $i = 1$ and $a_{ij} = 0$ for $j > i + 2$.

Let $\lambda_v = (A, B, \pi)$ be a compact notation for the parameters of the HMM corresponding to the $v$-th word stored in the phone book, $1 \le v \le 10$.

Figure 2: Six-state Left-Right HMM (states S1 to S6) and the observation sequence $O_1, O_2, O_3, \ldots, O_T$ over time.

Training the HMM recognizer

Since each word is characterized by $\lambda_v$, training must be performed prior to detection in order to determine $\lambda_v$ for the particular word. The received observation sequence $O = O_1 O_2 \ldots O_T$, corresponding to the word being trained, forms the training data. The observation sequence consists of the extracted features of the word, in this case the indices corresponding to the quantized Mel Cepstrum coefficients. A maximum likelihood estimate of the model parameters $\lambda_v$ is obtained by maximizing $\Pr[O \mid \lambda_v]$ with respect to $\lambda_v$. Since this results in an intractable optimization problem, we resort to a suboptimal procedure based on the Baum-Welch algorithm [5]. This algorithm iterates until at least a local optimum is found and makes extensive use of the Forward and Backward algorithms [5]. A random initialization is used for the parameters of interest, $\lambda_v$. From the output of the Forward algorithm, a scaling variable is computed.
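As a rough illustration of this scaling, and of the ML detection with rejection threshold outlined in the Overview, the following Python sketch computes $\log \Pr[O \mid \lambda_v]$ with a scaled Forward recursion and picks the best-scoring model. It is not the authors' Matlab or C code: the function names, the 0-based state and symbol indices, and the `threshold` parameter are assumptions made for illustration, and the scaling follows the scheme described in [5]. Baum-Welch training is not shown.

```python
import math


def forward_log_likelihood(A, B, pi, observations):
    """Scaled Forward algorithm: returns log Pr[O | lambda] for lambda = (A, B, pi).

    A[i][j] is the transition probability from state i to state j,
    B[j][k] the probability of symbol k in state j, pi[i] the initial
    state probability. Indices are 0-based in this sketch.
    """
    n_states = len(pi)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    log_likelihood = 0.0
    for t in range(len(observations)):
        if t > 0:
            # Induction step on the previously scaled forward variables.
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n_states))
                * B[j][observations[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)  # per-step scaling variable
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        alpha = [a / scale for a in alpha]
        log_likelihood += math.log(scale)
    return log_likelihood


def recognize(models, observations, threshold=float("-inf")):
    """Return the index of the most likely model, or None if it is rejected.

    models is a list of (A, B, pi) tuples; threshold is a hypothetical
    rejection level on the log-likelihood.
    """
    scores = [forward_log_likelihood(A, B, pi, observations)
              for (A, B, pi) in models]
    best = max(range(len(scores)), key=lambda v: scores[v])
    if scores[best] < threshold:
        return None
    return best
```

In this sketch the per-step sum `scale` plays the role of the scaling variable, and the log-likelihood is accumulated from its logarithms so that the product of many small probabilities is never formed explicitly.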
The use of this variable keeps the dynamic range of the computations within reasonable bounds. Furthermore, it is needed in the Backward algorithm. Note that if a word is repeated several times, more training data is available and a better statistical model can be obtained.

Speech recognition

The words are detected by comparing the received observation sequence, $O = O_1 O_2 \ldots O_T$, corresponding to the spoken word with the statistical models of the words stored in the phone book. The detection of a word is based on computing $\Pr[O \mid \lambda_v]$ for all models $\lambda_v$ (the words in the phone book). The Forward algorithm is again used in the calculations. Finally, the word that maximizes $\Pr[O \mid \lambda_v]$ with respect to $v$ is chosen.

Implementation

The EVM we use is equipped with a Digital Signal Processor (DSP), which provides the computational power needed for the speech recognition. The algorithms described in earlier sections were first implemented in Matlab in order to test and evaluate the techniques. This allowed a smooth transition to a C-language implementation running on the EVM. The development tools include both Code Composer Studio, for compiling and debugging on the EVM, and Visual C++, for developing the portion of the program running on the PC. The implementation of the algorithms on the EVM can be divided into three main problems:

1. Programming specific functions for the speech recognition algorithm.
2. Using the EVM for acoustic recording and playback.
3. Communication with the PC.

Speech recognition algorithm

The speech recognition algorithm runs entirely on the EVM. To facilitate the translation from Matlab to C, convenience functions for common vector and matrix operations were developed.

Acoustic Recording and Playback

In order to manage both acoustic recording and playback, functions from the codec library API
were used [6]. These functions are specific to the EVM in use and relieve the programmer from having to deal directly with the hardware registers.

PC-EVM Communication

Communication between the PC and the EVM is based on the mailbox system available on the EVM [6]. The mailboxes are located on the EVM and are used for transmitting and receiving 32-bit messages over the PCI bus. Orders are passed from the PC to the EVM by a mailbox message. Depending on the message, the EVM takes the appropriate action and returns the result to the PC by another mailbox message. Figure 3 illustrates the PC-EVM communication process used for voice recognition.

Figure 3: PC-EVM communication process. Flowchart steps (PC and EVM columns): Init state; User chooses voice recognition; Send Voice Recognition message; Read incoming mailbox; Start voice recognition; Send Matched label message; Retrieve Matched label; Valid label? No: Display message "No label found". Yes: Download label sound file; Display name & number; Play label sound file; Play dial sound file.

Results

The recognition system was tested and trained using the same speaker. The overall result of the testing, using the Matlab implementation, is presented in Figure 4. The average recognition ratio is 80 %. For the EVM implementation, the performance drops to an average of about 65 %. However, the latter test was conducted in a considerably noisier environment. Moreover, a slightly different start and end point detection was used. The Ericsson T18s was also tested, giving a recognition ratio close to 100 %. In this case, the location of the microphone close to the mouth is believed to contribute to the excellent recognition capabilities.

Figure 4: Performance in voice recognition using the Matlab implementation. The figure shows how often the correct label was chosen for the different labels.

Similar to the Ericsson T18s system, performance drops when a user other than the one who trained the phone book tries to use the voice dialer. The similarity between the voice used for training the labels in the phone book and the voice used in the detection process is therefore a critical factor for obtaining acceptable results.

Further Improvements

As indicated in the Results section, the Matlab implementation appears to perform better than the implementation running on the EVM. We are currently investigating the cause of this difference, which is believed to be due, in part, to inconsistent numerical rounding and to the difference in start/end detection algorithms.

From an algorithmic perspective, there are a number of possible improvements. For example, the training of the models could be enhanced. The start/end detection could also be made more robust and less sensitive to background noise. The use of other features, e.g. Perceptual Linear Prediction (PLP) [7], has been shown [8] to perform excellently in noisy environments. Another possible improvement is to find better initial values when training the HMM [5].

Conclusions

This work considered the implementation of an isolated word recognition system similar in functionality to the one used in the Ericsson T18s mobile phone. The actual voice recognition algorithm was implemented on a TMS320C6701 EVM, whereas the user interface was displayed on a host PC in a Windows environment. In order to detect the spoken words, an HMM was used in
conjunction with ML decoding. Although satisfactory performance was obtained, particularly taking the limited amount of training into account, the performance level of the Ericsson T18s was not completely attained. Microphones better suited to our application are currently being investigated, as well as improvements to the detection algorithm itself.

Acknowledgement

The authors thank Dr. Peter Händel, Prof. Arne Leijon, Ph.D. student Rickard Stridh and Ph.D. student George Jöngren for their encouraging comments and suggestions that improved this paper.

References

[1] M. Aracena Kovacevic, A. Dehlbom, J. Ekeberg, G. Gariazzo, V. Troncoso, E. Lästh, Ericsson T18s Voice Dialing Simulator, Final Report in KTH-S3-2E1366, May 2000.
[2] L.R. Rabiner and M.R. Sambur, An algorithm for determining the endpoints of isolated utterances, The Bell System Technical Journal, Vol. 54, No. 2, pp. 297, February 1975.
[3] J. Deller, J. Proakis, J. Hansen, Discrete-Time Processing of Speech Signals, Macmillan Publishing Company, 1993.
[4] Y. Linde, A. Buzo, and R.M. Gray, An algorithm for vector quantizer design, IEEE Trans. Comm., Vol. COM-28, pp. 84-95, 1980.
[5] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
[6] TMS320C6201/6701 Evaluation Module Technical Reference, Literature Number SPRU305, Texas Instruments, 1998.
[7] H. Hermansky, B.A. Hanson, and H. Wakita, Perceptually based linear predictive analysis of speech, IEEE Proc. ICASSP 85, Vol. 1, pp. 509-512, 1985.
[8] M. Gadallah, E. Soleit, A. Mahran, Noise immune speech recognition system, Proceedings of the Sixteenth National Radio Science Conference (NRSC 99), 1999, pp. C21/1-C21/8.