KEYWORD SPOTTING USING HIDDEN MARKOV MODELS. by Şevket Duran B.S. in E.E., Boğaziçi University, 1997


KEYWORD SPOTTING USING HIDDEN MARKOV MODELS

by Şevket Duran
B.S. in E.E., Boğaziçi University, 1997

Submitted to the Institute for Graduate Studies in Science and Engineering in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

Boğaziçi University, 2001

ACKNOWLEDGEMENTS

To Dr. Levent M. Arslan: Thank you for the sacrifices of your personal time that you have made unselfishly to help me prepare this thesis. Thank you for encouraging me to study in the area of speech processing. It is a privilege for me to be your student. Şevket Duran

ABSTRACT

KEYWORD SPOTTING USING HIDDEN MARKOV MODELS

The aim of a keyword spotting system is to detect a small set of keywords in continuous speech. It is important to obtain the highest possible keyword detection rate without increasing the number of false insertions. Modeling only the keywords is not enough: to separate keywords from non-keywords, models for out-of-vocabulary words are needed as well. This out-of-vocabulary modeling is done with garbage models, whose structure and type have a great effect on overall system performance. The subject of this M.S. thesis is to examine context-independent phonemes as garbage models and to evaluate the performance of different criteria as confidence measures for out-of-vocabulary word rejection. Two databases were collected over telephone lines, one for keyword spotting and one for isolated word recognition experiments. For keyword spotting, the use of monophone models together with a one-state general garbage model gives the best performance. For confidence measures, using average phoneme likelihoods together with phoneme durations performs best.

ÖZET

SAKLI MARKOV MODELLERİ KULLANILARAK ANAHTAR KELİME YAKALAMA

Anahtar kelime yakalama sisteminin amacı sürekli bir sesin içinde barınan küçük bir anahtar kelimeler gurubu ortaya çıkarmaktır. Bu sistemde önemli olan, kelime olmadığı halde hata verme oranını artırmaksızın olası en yüksek anahtar kelime bulma oranını elde etmektir. Bunun için sadece anahtar kelimeleri modelleme yapmak yeterli değildir. Anahtar kelimeleri, olmayanlardan ayırmak için, sözlük dışı kelimelerin modellemesi de gerekmektedir. Bu modelleme, yapısı ve türü itibarıyla tüm sistem performansı üzerinde büyük etkisi bulunan garbage modellemesi ile yapılmaktadır. Bu tezin konusu garbage modelleri olarak bağımsız içerikli sesbirimleri (monophone) incelemek ve sözlük dışı kelime dışlamaları için güvenilirlik oranları bazında değişik kriterlerin performansını değerlendirmektir. Anahtar kelime yakalama ve telefon üzerinden yalıtılmış ses tanıma denemeleri için iki veritabanı oluşturuldu. Anahtar kelime bulma için en iyi performansı tek fazlı genel garbage modelleme ile birlikte tek-sesbirimsel modellerin kullanılması verdi. Güvenilirlik oranları içinse süreleri ile ortalama sesbirim benzeşmelerinin birlikte kullanımı en iyi performansı gösterdi.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
ÖZET
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
2. BACKGROUND
   2.1. Speech Recognition Problem
   2.2. Speech Recognition Process
      Gathering Digital Speech Input
      Feature Extraction
   2.3. Hidden Markov Model
      Assumptions in the Theory of HMMs
         The Markov Assumption
         The Stationarity Assumption
         The Output Independence Assumption
      Three Basic Problems of HMMs
         The Evaluation Problem
         The Decoding Problem
         The Learning Problem
      The Evaluation Problem and the Forward Algorithm
      The Decoding Problem and the Viterbi Algorithm
      The Learning Problem
         Maximum Likelihood (ML) Criterion
         Baum-Welch Algorithm
      Types of Hidden Markov Models
   2.4. Use of HMMs in Speech Recognition
      Subword Unit Selection
      Word Networks
      Training of HMMs
      Recognition
         Viterbi Based Recognition
         N-Best Search
   2.5. Keyword Spotting Problem
3. PROPOSED KEYWORD SPOTTING ALGORITHM
   3.1. Introduction
   3.2. Experiment Data
   3.3. Performance of a System
   3.4. System Structure
   3.5. Performance of Monophone Models for Isolated Word Recognition
4. CONFIDENCE MEASURES FOR ISOLATED WORD RECOGNITION
   4.1. Introduction
   4.2. Experiment Data
   4.3. Minimum Edit Distance
   4.4. Phoneme Durations
   4.5. Garbage Model Using Same 512 Mixtures
   4.6. Comparison of Confidence Measures
5. CONCLUSION
APPENDIX A: SENTENCES USED FOR KEYWORD SPOTTING
APPENDIX B: MINIMUM EDIT DISTANCE ALGORITHM
REFERENCES

LIST OF FIGURES

Figure 2.1. The waveform and spectrogram of ev and ben eve
Figure 2.2. The waveform and spectrogram of okul and holding
Figure 2.3. Components of a typical recognition system
Figure 2.4. The spectrogram of /S/ sound and /e/ sound in word Sevket
Figure 2.5. Flowchart of deriving Mel Frequency Cepstrum Coefficients
Figure 2.6. A simple isolated speech unit recognizer that uses null-grammar
Figure 2.7. The expanded network using the best match triphones
Figure 2.8. The null-grammar network showing the underlying states
Figure 3.1. General structure of the proposed keyword spotter
Figure 3.2. ROC points for different alternatives for garbage model
Figure 3.3. ROC points for different number of keywords for keyword spotting
Figure 3.4. Network structure for the keyword spotter used as a post-processor for isolated word recognizer
Figure 3.5. ROC curves for monophone and garbage model based out-of-vocabulary word rejection
Figure 4.1. ROC curves before/after applying Minimum Edit Distance Revision
Figure 4.2. Forced alignment of the waveform for keyword iszbankasizkurucu
Figure 4.3. Forced alignment of the waveform for keyword milpa
Figure 4.4. ROC curves for phoneme duration based confidence measure
Figure 4.5. Likelihood profiles for ceylangiyim and the base garbage model proposed
Figure 4.6. ROC curves for different emphasis values for power value
Figure 4.7. ROC curves for different power values with emphasis set to
Figure 4.8. ROC curve for phoneme duration based confidence measure and confidence measure with likelihood ratio scoring included

LIST OF TABLES

Table 3.1. Database used for keyword spotting
Table 3.2. Number of occurrences of the keywords used for keyword spotting tests
Table 3.3. Results for monophone model based out-of-vocabulary word rejection for isolated word recognition
Table 3.4. Results for general garbage model based out-of-vocabulary word rejection for isolated word recognition
Table 4.1. Average phoneme durations in Turkish
Table 4.2. Computation time required with/without phoneme duration evaluation

1. INTRODUCTION

Communication between people and computers through more natural interfaces is an important issue if computers are to be part of our daily lives. To interact with a computer you normally have to use your hands, whether the device is a keyboard, a mouse, or the dialing pad on your phone when you access information on a computer over a telephone line. A more natural input interface is speech. Human-computer interaction via speech involves speech recognition [1, 2, 3] and speech synthesis [4]. Speech recognition is the conversion of a speech signal into text; synthesis is the opposite. Speech recognition may range from understanding simple commands to extracting all the information in the speech signal, such as all the words, the meaning, and the emotional state of the speaker. After many years of work, speech recognition is now at a level mature enough to be used in practical applications, thanks to the algorithms developed and the increase in computational power. Speech recognition may be speaker dependent or speaker independent. If the application is for home use, where the same person will use the same microphone in the same place, the problem is simple and a robust algorithm is not needed. But an application that must recognize speech over a public telephone network, where the speaker and the channel the speech passes through differ from call to call, requires a robust algorithm. If the task is recognition of isolated words or phrases, the problem is easier, as long as the speakers give only the required input. If the speakers also use other words in addition to the keywords you require, you need to perform keyword spotting, which means recognizing the keywords among other non-keyword filler words. Going further, recognition from a large vocabulary where all of the words must be recognized is called dictation, which is a harder task.
We will be dealing with the keyword-spotting problem in this thesis.

For speech recognition, the digitized speech signal, which is in the time domain, must be transformed into another domain. Generally, a portion of the speech is taken and a feature vector is derived to represent it. These feature vectors are then used to guess the sequence of words that generated the speech signal. We need algorithms that account for the variability in the speech signal. The most common technique for acoustic modeling is called hidden Markov modeling (HMM), and it is the model used in this thesis.

In order to have an operating-system-independent notation, we preferred not to use the non-ANSI characters in the Turkish character set. We use lower case letters for characters that are in the ANSI character set and upper case letters for Turkish characters: /S/ instead of /ş/, /U/ instead of /ü/, and so on. We use /Z/ for interword silence. So the phrase savaş alanı is represented as savaSZalanI.

In this thesis we investigate candidate garbage models for keyword spotting and try to find a good confidence measure for detection of out-of-vocabulary words in an isolated word recognizer. In Chapter 2, we give the theory of each step in the speech recognition process and give details of the techniques we have used. In Chapter 3, we study the keyword-spotting algorithm we have proposed and conclude that using the monophone models of the words together with a one-state, 16-mixture general garbage model with different bonus values gives the best performance; we also evaluate the performance of monophone models as garbage models for isolated word recognition. In Chapter 4, we evaluate several measures and decide that both likelihood and phoneme duration are important for obtaining a good confidence measure. Finally, in Chapter 5 we give our conclusions from these experiments and suggest some directions for future study.
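The notation above amounts to a small lookup table. The sketch below assumes Python; only the /ş/→/S/, /ü/→/U/ and space→/Z/ mappings are stated explicitly in the text, so the remaining Turkish letters are mapped by analogy and should be treated as an assumption:

```python
# ANSI-only notation used in this thesis: Turkish-specific letters become
# upper case, inter-word silence is written /Z/.  Only ş->S, ü->U and the
# /Z/ convention are given explicitly; the other mappings are assumed.
TURKISH_TO_ANSI = {
    "ş": "S", "ü": "U",   # stated in the text
    "ç": "C", "ğ": "G",   # assumed by analogy
    "ı": "I", "ö": "O",   # assumed by analogy
    " ": "Z",             # inter-word silence
}

def transliterate(text: str) -> str:
    """Rewrite a Turkish phrase in the ANSI notation of this thesis."""
    return "".join(TURKISH_TO_ANSI.get(ch, ch) for ch in text)

print(transliterate("savaş alanı"))  # -> savaSZalanI
```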

2. BACKGROUND

2.1. Speech Recognition Problem

The speech signal differs depending on whether the input is given as isolated words or as continuous speech. If the speaker knows that a computer will try to recognize the speech, he or she may pause between words. In continuous speech, however, some sounds disappear and sometimes there is no silence between words. A word may be hard to say in a particular context; an exaggerated example is a tongue twister like SemsiZpaSaZpasajIndaZsesiZbUzUSesiceler. Even in normal cases there are great differences in the characteristics of the speech signal. Figure 2.1 shows the same /e/ sound in ev and ben eve. The waveforms are shown at the top of the figure. The spectrograms at the bottom show the energy at different frequencies versus time, with darkness indicating amplitude. The effect of the context on the characteristics of the /e/ sound can be seen, and it leads us to model each phoneme according to its neighboring phonemes.

Figure 2.1. The waveform and spectrogram of ev (on the left) and ben eve (on the right).

Spontaneous speech may contain fillers that are not words, like ee or himm; this is another difficulty in continuous speech recognition. The task should be known while designing the algorithm: if a recognizer is to be used for continuous speech recognition, the training data should consist of continuous speech as well.

The main difficulty of the speech recognition problem comes from the variability of the source of the signal. First, the characteristics of phonemes, the smallest sound units, depend on the context in which they appear. An example of phonetic variability is the acoustic difference of the phoneme /o/ in the Turkish words okul and holding; see Figure 2.2, where the marked region corresponds to the /o/ sound.

Figure 2.2. The waveform and spectrogram of okul (on the left) and holding (on the right)

The environment also causes variability. The same speaker will say a word differently according to his or her physical and emotional state, speaking rate, or voice quality. Differences in the vocal tract size and shape of different people cause variability as well. The problem is to find the meaningful information in the speech signal. The meaningful information is different for speech recognition and for speaker recognition; the same information in the speech signal may be necessary for some application and redundant for

some other application. For speaker independent speech recognition we have to remove as many speaker-related features as possible.

2.2. Speech Recognition Process

Figure 2.3 shows the main components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10 milliseconds. These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.

Figure 2.3. Components of a typical speech recognition system

Gathering Digital Speech Input

Speech recognition is the process of converting a digital speech signal into words. To capture the speech signal we need a device that converts the physical speech wave into a digital signal. This may be a microphone that converts the speech into an analog signal together with a sound card, an A/D converter that turns the analog signal into a digital one. Another way of obtaining digital speech input is to use a telephone card that converts the analog

signal that comes from the telephone line into a digital signal. There are also devices that can take the digital signal coming from E-1 or T-1 lines directly; Dialogic has the JCT LS240 and JCT LS300 for T-1 lines and E-1 lines, respectively. We have been using a JCT LS120 card, which is a speech-processing card for 12 analog lines. We used the 8 kHz sampling rate of the telephone lines and converted the µ-law encoded signal into a 16-bit linear encoded signal before processing.

Feature Extraction

To get rid of the redundancies in the speech signal mentioned earlier, we have to represent the signal by only the perceptually most important speaker-independent features [5]. The speech signal is produced by passing the excitation signal generated by the larynx through the vocal tract. We are interested in the properties of the speech determined by the overall shape of the vocal tract. To distinguish phonemes better (the voiced/unvoiced distinction), we examine whether the vocal folds are vibrating but ignore variations in the frequency of vibration. The spectrum of voiced sounds has several sharp peaks, which are called formant frequencies. The spectrum of unvoiced sounds looks like a white noise spectrum. Figure 2.4 shows the spectrum of the unvoiced sound /S/ and the voiced sound /e/.

Figure 2.4. The spectrum (found using 256 point FFT) of /S/ sound (on the left) and /e/ sound (on the right) in word Sevket

Since our ears are insensitive to phase effects, we use the power spectrum as the basis for the speech recognition front-end. The power spectrum is represented on a log scale. When

the overall gain of the signal varies, the shape of the log power spectrum stays the same but is shifted up or down. The convolutional effects of the telephone line multiply the signal in the linear power spectrum; in the log power spectrum the effect is additive. Since a voiced speech waveform corresponds to the convolution of a quasi-periodic excitation signal with a time-varying filter (the shape of the vocal tract), we can separate them in the log power spectrum. Assigning a lower limit to the log function solves the problem of low energy levels in some parts of the spectrum. Before computing short-term power spectra, the waveform is processed by a simple pre-emphasis filter that gives a 6 dB/octave increase in gain; this makes the average speech spectrum roughly flat.

We have to extract the effects caused by the shape of the vocal tract. One method is to predict the coefficients of the filter that corresponds to the shape of the vocal tract. The vocal tract is assumed to be a combination of lossless tubes of different radii, and the number of parameters derived corresponds to the number of tubes assumed. The filter is assumed to be an all-pole linear filter. The parameters are called Linear Predictive Coding (LPC) parameters and the procedure is known as LPC analysis; there are different methods to calculate these coefficients [6].

To calculate the short-term spectra we take overlapping portions of the waveform. We take a frame of 25 milliseconds and multiply it with a window function to avoid artificial high frequencies; we use a Hamming window. Then we apply the Fourier transform. We have to remove the harmonic structure at multiples of the fundamental frequency, f0, because it is the effect of the excitation signal. The smoothed spectrum without the effect of the excitation signal corresponds to the Fourier transform of the LPC parameters. We use a different method and group components of the power spectrum into frequency bands.
Grouping is not linear; the sensitivity of the human ear is taken into account. The bands are linear up to 1 kHz and logarithmic at higher frequencies, so the frequency bands are broader at higher frequencies. The positions of the bands are set according to the mel frequency scale [7].
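The framing steps described above can be sketched in a few lines of Python. The 25 ms frame length and Hamming window are from the text, and the 10 ms step follows the typical feature rate mentioned earlier; the pre-emphasis coefficient 0.97 is an assumed typical value, not one stated here:

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    # First-order high-pass filter y[n] = x[n] - alpha * x[n-1];
    # alpha near 1 gives roughly the +6 dB/octave tilt described above.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frames(signal, rate=8000, frame_ms=25, step_ms=10):
    # Overlapping 25 ms frames taken every 10 ms, each multiplied by a
    # Hamming window to suppress artificial high frequencies at the edges.
    flen = int(rate * frame_ms / 1000)
    step = int(rate * step_ms / 1000)
    win = np.hamming(flen)
    return [win * signal[i:i + flen]
            for i in range(0, len(signal) - flen + 1, step)]
```

At the 8 kHz telephone sampling rate this gives 200-sample frames with an 80-sample step.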

The relation between the mel frequency scale and the linear frequency scale is as follows:

    Mel(f) = 2595 log10(1 + f / 700)    (2.1)

Figure 2.5. Flowchart for deriving Mel Frequency Cepstrum Coefficients

To calculate the filterbank coefficients, the magnitude coefficients of the spectrum are accumulated after windowing with triangular windows. Triangular filters are

spread over the whole frequency range from zero up to the Nyquist frequency. We have chosen 16 filter banks. Since the shape of the spectrum imposed by the vocal tract is smooth, energy levels in adjacent bands are correlated. We have to remove this correlation, since in the further statistical analysis we assume that feature vector elements are uncorrelated and use a diagonal variance vector. Removing the correlation also allows the number of parameters to be reduced without loss of useful information. The discrete cosine transform (a version of the Fourier transform using only cosine basis functions) converts the set of log energies to a set of cepstral coefficients, which are largely uncorrelated. The formula for the Discrete Cosine Transform is:

    c_i = sqrt(2/N) * sum_{j=1..N} m_j cos(pi * i / N * (j - 0.5)),    i = 1, ..., P    (2.2)

where {m_j} are the log filter bank amplitudes, N is the number of filterbank channels, which we set to 16, and P is the required number of cepstral coefficients, which we set to 12. Figure 2.5 shows the steps in obtaining Mel Frequency Cepstrum Coefficients (MFCCs).

Many systems use the rate of change of the short-term power spectrum as additional information. The simplest way to obtain this dynamic information is to take the difference between consecutive frames, but this is too sensitive to random interframe variations. So linear trends are estimated over sequences of typically five or seven frames [8]. We use five frames, so there will be a delay of two times the step size in real-time operation:

    d_t = G * (2 c_{t+2} + c_{t+1} - c_{t-1} - 2 c_{t-2})    (2.3)

where d_t is the difference evaluated at time t; c_{t+2}, c_{t+1}, c_{t-1}, c_{t-2} are the coefficients at times t+2, t+1, t-1 and t-2, respectively; and G is a gain factor. Some systems use acceleration features as well as linear rates of change; these second-order dynamic features need longer sequences of frames for reliable estimation [9].
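Equations (2.1)-(2.3) can be written out directly. The sketch below uses N = 16 filterbank channels and P = 12 cepstral coefficients as above; the gain factor G is left as a plain parameter, since its selected value is not recoverable from this excerpt:

```python
import math

def mel(f):
    # Eq. (2.1): mel frequency corresponding to linear frequency f (Hz).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def cepstral_coefficients(log_energies, P=12):
    # Eq. (2.2): DCT of the N log filter-bank amplitudes -> P MFCCs.
    # enumerate() gives j = 0..N-1, so (j + 0.5) matches (j - 0.5)
    # for the 1-based index of the formula.
    N = len(log_energies)
    return [math.sqrt(2.0 / N) *
            sum(m * math.cos(math.pi * i / N * (j + 0.5))
                for j, m in enumerate(log_energies))
            for i in range(1, P + 1)]

def delta(cs, t, G=1.0):
    # Eq. (2.3): linear trend over five frames around time t.
    return G * (2 * cs[t + 2] + cs[t + 1] - cs[t - 1] - 2 * cs[t - 2])
```

Note that delta(cs, t) needs frames t-2 through t+2, which is the source of the two-step delay mentioned above.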

Since cepstral coefficients are largely uncorrelated, probability estimates are easier in further analysis; we can simply calculate Euclidean distances from reference model vectors. Statistically based methods weight the coefficients by the inverse of their standard deviations computed around their overall means. Current representations concentrate on the spectrum envelope and ignore fundamental frequency, although we know that even in isolated-word recognition fundamental frequency contours carry important information.

At the acoustic-phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of training data. Effects of context at the acoustic-phonetic level are handled by training separate models for phonemes in different contexts; this is called context dependent acoustic modeling. Word-level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Another technique is to add the different pronunciations of a word directly to the network; after common nodes in the network are pruned, it represents the different pronunciations of the same word.

2.3. Hidden Markov Model

The most widely used recognition algorithm of the past fifteen years is the Hidden Markov Model (HMM) [10, 11, 12]. Although there have been some attempts at using neural networks, they have not been very successful. A Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transition probabilities are assigned to the transitions among the states. In a particular state, an outcome or observation can be generated according to the associated probability distribution. An external observer can only see the outcome, not the states; the states are therefore hidden from the outside. The following part is the theory of HMMs, taken from the tutorial [3]. The

advanced reader can skip this part. In order to define an HMM completely, the following elements are needed:

- The number of states of the model, N.

- The number of observation symbols in the alphabet, M. If the observations are continuous then M is infinite.

- A set of state transition probabilities, A = {a_ij}:

    a_ij = p{q_{t+1} = j | q_t = i},    1 ≤ i, j ≤ N    (2.4)

  where q_t denotes the state index at time t and a_ij is the transition probability from state i to state j. Transition probabilities must satisfy the normal stochastic constraints,

    a_ij ≥ 0,    1 ≤ i, j ≤ N    (2.5)

  and

    sum_{j=1..N} a_ij = 1,    1 ≤ i ≤ N    (2.6)

- A probability distribution in each of the states, B = {b_j(k)}:

    b_j(k) = p{o_t = v_k | q_t = j},    1 ≤ j ≤ N, 1 ≤ k ≤ M    (2.7)

  where v_k denotes the k-th observation symbol in the alphabet and o_t the current observation vector. The following stochastic constraints must be satisfied:

    b_j(k) ≥ 0,    1 ≤ j ≤ N, 1 ≤ k ≤ M    (2.8)

  and

    sum_{k=1..M} b_j(k) = 1,    1 ≤ j ≤ N    (2.9)

  If the observations are continuous, we have to use a continuous probability density function instead of a set of discrete probabilities, and we specify the parameters of that density. Usually the probability density is approximated by a weighted sum of M Gaussian distributions,

    b_j(o_t) = sum_{m=1..M} c_jm N(µ_jm, Σ_jm, o_t)    (2.10)

  where c_jm are the mixture weights for the m-th mixture of the j-th state, µ_jm the mean vectors, and Σ_jm the covariance matrices. The c_jm must satisfy the stochastic constraints

    c_jm ≥ 0,    1 ≤ j ≤ N, 1 ≤ m ≤ M    (2.11)

  and

    sum_{m=1..M} c_jm = 1,    1 ≤ j ≤ N    (2.12)

- The initial state distribution, π = {π_i}, where

    π_i = p{q_1 = i},    1 ≤ i ≤ N    (2.13)

Therefore we can use the compact notation

    λ = (A, B, π)    (2.14)

to denote an HMM with discrete probability distributions, while

    λ = (A, c_jm, µ_jm, Σ_jm, π)    (2.15)

denotes one with continuous densities.

Assumptions in the Theory of HMMs

For the sake of mathematical and computational tractability, the following assumptions are made in the theory of HMMs.

The Markov Assumption. As given in the definition of HMMs, transition probabilities are defined as

    a_ij = p{q_{t+1} = j | q_t = i},    1 ≤ i, j ≤ N    (2.16)

In other words, it is assumed that the next state depends only upon the current state. This is called the Markov assumption, and the resulting model is actually a first-order HMM. The next state may, however, depend on the past k states; it is possible to obtain such a model, called a k-th order HMM, but a higher order HMM will have a higher complexity.

The Stationarity Assumption. Here it is assumed that the state transition probabilities are independent of the actual time at which the transitions take place. Mathematically,

    p{q_{t1+1} = j | q_{t1} = i} = p{q_{t2+1} = j | q_{t2} = i}    (2.17)

for any t1 and t2.

The Output Independence Assumption. This is the assumption that the current output (observation) is statistically independent of the previous outputs (observations). We can formulate this assumption mathematically by considering a sequence of observations,

    O = o_1, o_2, ..., o_T    (2.18)

Then, according to the assumption, for an HMM λ,

    p{O | q_1, q_2, ..., q_T, λ} = prod_{t=1..T} p(o_t | q_t, λ)    (2.19)

Unlike the other two, however, this assumption has very limited validity. In some cases it may not be fair, and it therefore becomes a severe weakness of HMMs.

Three Basic Problems of HMMs

Once we have an HMM, there are three problems of interest.

The Evaluation Problem. Given an HMM λ and a sequence of observations O = o_1, o_2, ..., o_T, what is the probability that the observations are generated by the model, p{O | λ}?

The Decoding Problem. Given a model λ and a sequence of observations O = o_1, o_2, ..., o_T, what is the most likely state sequence in the model that produced the observations?

The Learning Problem. Given a model λ and a sequence of observations O = o_1, o_2, ..., o_T, how should we adjust the model parameters (A, B, π) in order to maximize p{O | λ}?

The evaluation problem can be used for isolated (word) recognition. The decoding problem is related to continuous recognition as well as to segmentation. The learning problem must be solved if we want to train an HMM for subsequent use in recognition tasks.

The Evaluation Problem and the Forward Algorithm

We have a model λ = (A, B, π) and a sequence of observations O = o_1, o_2, ..., o_T, and p{O | λ} must be found. If we calculate this quantity directly using simple probabilistic arguments, the number of operations is on the order of N^T, which is very large even if the length of the sequence T is small. The idea of reusing the multiplications that are common leads to an auxiliary variable, called the forward variable and denoted α_t(i).

The forward variable is defined as the probability of the partial observation sequence o_1, o_2, ..., o_t when it terminates at state i. Mathematically,

    α_t(i) = p{o_1, o_2, ..., o_t, q_t = i | λ}    (2.20)

It is easy to see that the following recursive relationship holds:

    α_{t+1}(j) = b_j(o_{t+1}) sum_{i=1..N} α_t(i) a_ij,    1 ≤ j ≤ N, 1 ≤ t ≤ T-1    (2.21)

where

    α_1(j) = π_j b_j(o_1),    1 ≤ j ≤ N    (2.22)

Using this recursion we can calculate α_T(i), 1 ≤ i ≤ N, and the required probability is then given by

    p{O | λ} = sum_{i=1..N} α_T(i)    (2.23)

The complexity of this method, known as the forward algorithm, is proportional to N^2 T, which is linear with respect to T, whereas the direct calculation has exponential complexity. In a similar way, the backward variable β_t(i) is defined as the probability of the partial observation sequence o_{t+1}, o_{t+2}, ..., o_T, given that the current state is i.

Mathematically,

    β_t(i) = p{o_{t+1}, o_{t+2}, ..., o_T | q_t = i, λ}    (2.24)

As in the case of α_t(i), there is a recursive relationship which can be used to calculate β_t(i) efficiently:

    β_t(i) = sum_{j=1..N} β_{t+1}(j) a_ij b_j(o_{t+1}),    1 ≤ i ≤ N, 1 ≤ t ≤ T-1    (2.25)

where

    β_T(i) = 1,    1 ≤ i ≤ N    (2.26)

Further, we can see that

    α_t(i) β_t(i) = p{O, q_t = i | λ},    1 ≤ i ≤ N, 1 ≤ t ≤ T    (2.27)

Therefore this gives another way to calculate p{O | λ}, using both the forward and backward variables:

    p{O | λ} = sum_{i=1..N} p{O, q_t = i | λ} = sum_{i=1..N} α_t(i) β_t(i)    (2.28)

The last equation is very useful, especially in deriving the formulas required for gradient based training.
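The forward and backward recursions of Eqs. (2.20)-(2.28) can be sketched for a discrete-output HMM. The two-state model below is a hypothetical example, used only to check that α_t · β_t gives the same p{O | λ} at every t, as Eq. (2.28) promises:

```python
import numpy as np

def forward(A, B, pi, O):
    # Eqs. (2.21)-(2.22): list of alpha_t vectors for t = 1..T.
    alphas = [pi * B[:, O[0]]]
    for o in O[1:]:
        alphas.append(B[:, o] * (alphas[-1] @ A))
    return alphas

def backward(A, B, O):
    # Eqs. (2.25)-(2.26): list of beta_t vectors for t = 1..T.
    betas = [np.ones(A.shape[0])]
    for o in reversed(O[1:]):
        betas.insert(0, A @ (B[:, o] * betas[0]))
    return betas

# Hypothetical two-state model with a two-symbol alphabet.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
pi = np.array([0.6, 0.4])
O = [0, 1, 1]                      # observation-symbol indices

alphas, betas = forward(A, B, pi, O), backward(A, B, O)
p = alphas[-1].sum()               # Eq. (2.23): p{O | lambda}
# Eq. (2.28): alpha_t . beta_t is the same p{O | lambda} at every t.
assert all(abs(a @ b - p) < 1e-12 for a, b in zip(alphas, betas))
```

The matrix product `alphas[-1] @ A` computes the inner sum of Eq. (2.21) for all j at once, which is where the N^2 T complexity comes from.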

The Decoding Problem and the Viterbi Algorithm

In this case we want to find the most likely state sequence for a given sequence of observations O = o_1, o_2, ..., o_T and a model λ = (A, B, π). The solution to this problem depends upon the way the most likely state sequence is defined. One approach is to find the most likely state q_t at each time t and to concatenate all such q_t's; but sometimes this method does not give a physically meaningful state sequence. We therefore need another method which has no such problems. In this method, commonly known as the Viterbi algorithm [13], the whole state sequence with the maximum likelihood is found. In order to facilitate the computation we define an auxiliary variable,

    δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} p{q_1, q_2, ..., q_{t-1}, q_t = i, o_1, o_2, ..., o_t | λ}    (2.29)

which gives the highest probability that the partial observation sequence and state sequence up to time t can have, when the current state is i. It is easy to observe that the following recursive relationship holds:

    δ_{t+1}(j) = b_j(o_{t+1}) max_{1 ≤ i ≤ N} δ_t(i) a_ij,    1 ≤ j ≤ N, 1 ≤ t ≤ T-1    (2.30)

where

    δ_1(j) = π_j b_j(o_1),    1 ≤ j ≤ N    (2.31)

So the procedure to find the most likely state sequence starts from the calculation of δ_T(j), 1 ≤ j ≤ N, using the recursion in Eq. (2.30), while always keeping a pointer to the winning state in the maximum-finding operation. Finally the state j* is found, where

    j* = argmax_{1 ≤ j ≤ N} δ_T(j)    (2.32)

and, starting from this state, the sequence of states is back-tracked by following the pointer kept at each step.
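The Viterbi recursion and the back-tracking of Eqs. (2.29)-(2.32) can be sketched for a discrete-output HMM; the two-state model in the example call is hypothetical:

```python
import numpy as np

def viterbi(A, B, pi, O):
    # Eqs. (2.29)-(2.32): most likely state sequence for observation
    # indices O under a discrete-output HMM (A, B, pi).
    T = len(O)
    delta = pi * B[:, O[0]]                    # Eq. (2.31)
    psi = np.zeros((T, A.shape[0]), dtype=int) # back-pointers
    for t in range(1, T):
        trans = delta[:, None] * A             # delta_t(i) * a_ij
        psi[t] = trans.argmax(axis=0)          # winning predecessor per j
        delta = B[:, O[t]] * trans.max(axis=0) # Eq. (2.30)
    states = [int(delta.argmax())]             # Eq. (2.32): j*
    for t in range(T - 1, 0, -1):              # back-track the pointers
        states.insert(0, int(psi[t][states[0]]))
    return states

# A sticky two-state model: the decoded path follows the observations.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 0, 1, 1]))  # -> [0, 0, 1, 1]
```

In practice the recursion is run on log probabilities so that the products of Eq. (2.30) become sums and do not underflow for long observation sequences.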


Speech: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction : A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction Urmila Shrawankar Dept. of Information Technology Govt. Polytechnic, Nagpur Institute Sadar, Nagpur 440001 (INDIA)

More information

Emotion Detection from Speech

Emotion Detection from Speech Emotion Detection from Speech 1. Introduction Although emotion detection from speech is a relatively new field of research, it has many potential applications. In human-computer or human-human interaction

More information

Thirukkural - A Text-to-Speech Synthesis System

Thirukkural - A Text-to-Speech Synthesis System Thirukkural - A Text-to-Speech Synthesis System G. L. Jayavardhana Rama, A. G. Ramakrishnan, M Vijay Venkatesh, R. Murali Shankar Department of Electrical Engg, Indian Institute of Science, Bangalore 560012,

More information

CONATION: English Command Input/Output System for Computers

CONATION: English Command Input/Output System for Computers CONATION: English Command Input/Output System for Computers Kamlesh Sharma* and Dr. T. V. Prasad** * Research Scholar, ** Professor & Head Dept. of Comp. Sc. & Engg., Lingaya s University, Faridabad, India

More information

Solutions to Exam in Speech Signal Processing EN2300

Solutions to Exam in Speech Signal Processing EN2300 Solutions to Exam in Speech Signal Processing EN23 Date: Thursday, Dec 2, 8: 3: Place: Allowed: Grades: Language: Solutions: Q34, Q36 Beta Math Handbook (or corresponding), calculator with empty memory.

More information

Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations

Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations C. Wright, L. Ballard, S. Coull, F. Monrose, G. Masson Talk held by Goran Doychev Selected Topics in Information Security and

More information

Lecture 1-10: Spectrograms

Lecture 1-10: Spectrograms Lecture 1-10: Spectrograms Overview 1. Spectra of dynamic signals: like many real world signals, speech changes in quality with time. But so far the only spectral analysis we have performed has assumed

More information

Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition

Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition Tim Morris School of Computer Science, University of Manchester 1 Introduction to speech recognition 1.1 The

More information

Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus

Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus Yousef Ajami Alotaibi 1, Mansour Alghamdi 2, and Fahad Alotaiby 3 1 Computer Engineering Department, King Saud University,

More information

Artificial Neural Network for Speech Recognition

Artificial Neural Network for Speech Recognition Artificial Neural Network for Speech Recognition Austin Marshall March 3, 2005 2nd Annual Student Research Showcase Overview Presenting an Artificial Neural Network to recognize and classify speech Spoken

More information

SPEAKER IDENTIFICATION FROM YOUTUBE OBTAINED DATA

SPEAKER IDENTIFICATION FROM YOUTUBE OBTAINED DATA SPEAKER IDENTIFICATION FROM YOUTUBE OBTAINED DATA Nitesh Kumar Chaudhary 1 and Shraddha Srivastav 2 1 Department of Electronics & Communication Engineering, LNMIIT, Jaipur, India 2 Bharti School Of Telecommunication,

More information

Image Compression through DCT and Huffman Coding Technique

Image Compression through DCT and Huffman Coding Technique International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Automatic Evaluation Software for Contact Centre Agents voice Handling Performance

Automatic Evaluation Software for Contact Centre Agents voice Handling Performance International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 Automatic Evaluation Software for Contact Centre Agents voice Handling Performance K.K.A. Nipuni N. Perera,

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Available from Deakin Research Online:

Available from Deakin Research Online: This is the authors final peered reviewed (post print) version of the item published as: Adibi,S 2014, A low overhead scaled equalized harmonic-based voice authentication system, Telematics and informatics,

More information

A Sound Analysis and Synthesis System for Generating an Instrumental Piri Song

A Sound Analysis and Synthesis System for Generating an Instrumental Piri Song , pp.347-354 http://dx.doi.org/10.14257/ijmue.2014.9.8.32 A Sound Analysis and Synthesis System for Generating an Instrumental Piri Song Myeongsu Kang and Jong-Myon Kim School of Electrical Engineering,

More information

Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers

Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers Sung-won ark and Jose Trevino Texas A&M University-Kingsville, EE/CS Department, MSC 92, Kingsville, TX 78363 TEL (36) 593-2638, FAX

More information

Coding and decoding with convolutional codes. The Viterbi Algor

Coding and decoding with convolutional codes. The Viterbi Algor Coding and decoding with convolutional codes. The Viterbi Algorithm. 8 Block codes: main ideas Principles st point of view: infinite length block code nd point of view: convolutions Some examples Repetition

More information

Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN

Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN PAGE 30 Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN Sung-Joon Park, Kyung-Ae Jang, Jae-In Kim, Myoung-Wan Koo, Chu-Shik Jhon Service Development Laboratory, KT,

More information

Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics

Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics Anastasis Kounoudes 1, Anixi Antonakoudi 1, Vasilis Kekatos 2 1 The Philips College, Computing and Information Systems

More information

An Arabic Text-To-Speech System Based on Artificial Neural Networks

An Arabic Text-To-Speech System Based on Artificial Neural Networks Journal of Computer Science 5 (3): 207-213, 2009 ISSN 1549-3636 2009 Science Publications An Arabic Text-To-Speech System Based on Artificial Neural Networks Ghadeer Al-Said and Moussa Abdallah Department

More information

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29. Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet

More information

Voice---is analog in character and moves in the form of waves. 3-important wave-characteristics:

Voice---is analog in character and moves in the form of waves. 3-important wave-characteristics: Voice Transmission --Basic Concepts-- Voice---is analog in character and moves in the form of waves. 3-important wave-characteristics: Amplitude Frequency Phase Voice Digitization in the POTS Traditional

More information

Speech Signal Processing: An Overview

Speech Signal Processing: An Overview Speech Signal Processing: An Overview S. R. M. Prasanna Department of Electronics and Electrical Engineering Indian Institute of Technology Guwahati December, 2012 Prasanna (EMST Lab, EEE, IITG) Speech

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Establishing the Uniqueness of the Human Voice for Security Applications

Establishing the Uniqueness of the Human Voice for Security Applications Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 Establishing the Uniqueness of the Human Voice for Security Applications Naresh P. Trilok, Sung-Hyuk Cha, and Charles C.

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Lecture 12: An Overview of Speech Recognition

Lecture 12: An Overview of Speech Recognition Lecture : An Overview of peech Recognition. Introduction We can classify speech recognition tasks and systems along a set of dimensions that produce various tradeoffs in applicability and robustness. Isolated

More information

RANDOM VIBRATION AN OVERVIEW by Barry Controls, Hopkinton, MA

RANDOM VIBRATION AN OVERVIEW by Barry Controls, Hopkinton, MA RANDOM VIBRATION AN OVERVIEW by Barry Controls, Hopkinton, MA ABSTRACT Random vibration is becoming increasingly recognized as the most realistic method of simulating the dynamic environment of military

More information

BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION

BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION P. Vanroose Katholieke Universiteit Leuven, div. ESAT/PSI Kasteelpark Arenberg 10, B 3001 Heverlee, Belgium Peter.Vanroose@esat.kuleuven.ac.be

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

Spectrum Level and Band Level

Spectrum Level and Band Level Spectrum Level and Band Level ntensity, ntensity Level, and ntensity Spectrum Level As a review, earlier we talked about the intensity of a sound wave. We related the intensity of a sound wave to the acoustic

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014 LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

More information

Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition

Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition , Lisbon Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition Wolfgang Macherey Lars Haferkamp Ralf Schlüter Hermann Ney Human Language Technology

More information

Data a systematic approach

Data a systematic approach Pattern Discovery on Australian Medical Claims Data a systematic approach Ah Chung Tsoi Senior Member, IEEE, Shu Zhang, Markus Hagenbuchner Member, IEEE Abstract The national health insurance system in

More information

Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

More information

Introduction to Engineering System Dynamics

Introduction to Engineering System Dynamics CHAPTER 0 Introduction to Engineering System Dynamics 0.1 INTRODUCTION The objective of an engineering analysis of a dynamic system is prediction of its behaviour or performance. Real dynamic systems are

More information

Final Year Project Progress Report. Frequency-Domain Adaptive Filtering. Myles Friel. Supervisor: Dr.Edward Jones

Final Year Project Progress Report. Frequency-Domain Adaptive Filtering. Myles Friel. Supervisor: Dr.Edward Jones Final Year Project Progress Report Frequency-Domain Adaptive Filtering Myles Friel 01510401 Supervisor: Dr.Edward Jones Abstract The Final Year Project is an important part of the final year of the Electronic

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Probability and Random Variables. Generation of random variables (r.v.)

Probability and Random Variables. Generation of random variables (r.v.) Probability and Random Variables Method for generating random variables with a specified probability distribution function. Gaussian And Markov Processes Characterization of Stationary Random Process Linearly

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

Audio Engineering Society. Convention Paper. Presented at the 129th Convention 2010 November 4 7 San Francisco, CA, USA

Audio Engineering Society. Convention Paper. Presented at the 129th Convention 2010 November 4 7 San Francisco, CA, USA Audio Engineering Society Convention Paper Presented at the 129th Convention 2010 November 4 7 San Francisco, CA, USA The papers at this Convention have been selected on the basis of a submitted abstract

More information

Time Domain and Frequency Domain Techniques For Multi Shaker Time Waveform Replication

Time Domain and Frequency Domain Techniques For Multi Shaker Time Waveform Replication Time Domain and Frequency Domain Techniques For Multi Shaker Time Waveform Replication Thomas Reilly Data Physics Corporation 1741 Technology Drive, Suite 260 San Jose, CA 95110 (408) 216-8440 This paper

More information

Abant Izzet Baysal University

Abant Izzet Baysal University TÜRKÇE SESLERİN İSTATİSTİKSEL ANALİZİ Pakize ERDOGMUS (1) Ali ÖZTÜRK (2) Abant Izzet Baysal University Technical Education Faculty Electronic and Comp. Education Dept. Asit. Prof. Abant İzzet Baysal Üniversitesi

More information

Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

More information

SOFTWARE FOR GENERATION OF SPECTRUM COMPATIBLE TIME HISTORY

SOFTWARE FOR GENERATION OF SPECTRUM COMPATIBLE TIME HISTORY 3 th World Conference on Earthquake Engineering Vancouver, B.C., Canada August -6, 24 Paper No. 296 SOFTWARE FOR GENERATION OF SPECTRUM COMPATIBLE TIME HISTORY ASHOK KUMAR SUMMARY One of the important

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS 1. Bandwidth: The bandwidth of a communication link, or in general any system, was loosely defined as the width of

More information

A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

More information

Speech recognition for human computer interaction

Speech recognition for human computer interaction Speech recognition for human computer interaction Ubiquitous computing seminar FS2014 Student report Niklas Hofmann ETH Zurich hofmannn@student.ethz.ch ABSTRACT The widespread usage of small mobile devices

More information

How To Recognize Voice Over Ip On Pc Or Mac Or Ip On A Pc Or Ip (Ip) On A Microsoft Computer Or Ip Computer On A Mac Or Mac (Ip Or Ip) On An Ip Computer Or Mac Computer On An Mp3

How To Recognize Voice Over Ip On Pc Or Mac Or Ip On A Pc Or Ip (Ip) On A Microsoft Computer Or Ip Computer On A Mac Or Mac (Ip Or Ip) On An Ip Computer Or Mac Computer On An Mp3 Recognizing Voice Over IP: A Robust Front-End for Speech Recognition on the World Wide Web. By C.Moreno, A. Antolin and F.Diaz-de-Maria. Summary By Maheshwar Jayaraman 1 1. Introduction Voice Over IP is

More information

Developing acoustics models for automatic speech recognition

Developing acoustics models for automatic speech recognition Developing acoustics models for automatic speech recognition GIAMPIERO SALVI Master s Thesis at TMH Supervisor: Håkan Melin Examiner: Rolf Carlson TRITA xxx yyyy-nn iii Abstract This thesis is concerned

More information

Workshop Perceptual Effects of Filtering and Masking Introduction to Filtering and Masking

Workshop Perceptual Effects of Filtering and Masking Introduction to Filtering and Masking Workshop Perceptual Effects of Filtering and Masking Introduction to Filtering and Masking The perception and correct identification of speech sounds as phonemes depends on the listener extracting various

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

A Soft Computing Based Approach for Multi-Accent Classification in IVR Systems

A Soft Computing Based Approach for Multi-Accent Classification in IVR Systems A Soft Computing Based Approach for Multi-Accent Classification in IVR Systems by Sameeh Ullah A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of

More information

1 Example of Time Series Analysis by SSA 1

1 Example of Time Series Analysis by SSA 1 1 Example of Time Series Analysis by SSA 1 Let us illustrate the 'Caterpillar'-SSA technique [1] by the example of time series analysis. Consider the time series FORT (monthly volumes of fortied wine sales

More information

Time series Forecasting using Holt-Winters Exponential Smoothing

Time series Forecasting using Holt-Winters Exponential Smoothing Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract

More information

Department of Electrical and Computer Engineering Ben-Gurion University of the Negev. LAB 1 - Introduction to USRP

Department of Electrical and Computer Engineering Ben-Gurion University of the Negev. LAB 1 - Introduction to USRP Department of Electrical and Computer Engineering Ben-Gurion University of the Negev LAB 1 - Introduction to USRP - 1-1 Introduction In this lab you will use software reconfigurable RF hardware from National

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

5 Signal Design for Bandlimited Channels

5 Signal Design for Bandlimited Channels 225 5 Signal Design for Bandlimited Channels So far, we have not imposed any bandwidth constraints on the transmitted passband signal, or equivalently, on the transmitted baseband signal s b (t) I[k]g

More information

Speech recognition technology for mobile phones

Speech recognition technology for mobile phones Speech recognition technology for mobile phones Stefan Dobler Following the introduction of mobile phones using voice commands, speech recognition is becoming standard on mobile handsets. Features such

More information

Information Leakage in Encrypted Network Traffic

Information Leakage in Encrypted Network Traffic Information Leakage in Encrypted Network Traffic Attacks and Countermeasures Scott Coull RedJack Joint work with: Charles Wright (MIT LL) Lucas Ballard (Google) Fabian Monrose (UNC) Gerald Masson (JHU)

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

PHASE ESTIMATION ALGORITHM FOR FREQUENCY HOPPED BINARY PSK AND DPSK WAVEFORMS WITH SMALL NUMBER OF REFERENCE SYMBOLS

PHASE ESTIMATION ALGORITHM FOR FREQUENCY HOPPED BINARY PSK AND DPSK WAVEFORMS WITH SMALL NUMBER OF REFERENCE SYMBOLS PHASE ESTIMATION ALGORITHM FOR FREQUENCY HOPPED BINARY PSK AND DPSK WAVEFORMS WITH SMALL NUM OF REFERENCE SYMBOLS Benjamin R. Wiederholt The MITRE Corporation Bedford, MA and Mario A. Blanco The MITRE

More information

The CUSUM algorithm a small review. Pierre Granjon

The CUSUM algorithm a small review. Pierre Granjon The CUSUM algorithm a small review Pierre Granjon June, 1 Contents 1 The CUSUM algorithm 1.1 Algorithm............................... 1.1.1 The problem......................... 1.1. The different steps......................

More information

Non-Data Aided Carrier Offset Compensation for SDR Implementation

Non-Data Aided Carrier Offset Compensation for SDR Implementation Non-Data Aided Carrier Offset Compensation for SDR Implementation Anders Riis Jensen 1, Niels Terp Kjeldgaard Jørgensen 1 Kim Laugesen 1, Yannick Le Moullec 1,2 1 Department of Electronic Systems, 2 Center

More information

AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

More information

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. Title Transcription of polyphonic signals using fast filter bank( Accepted version ) Author(s) Foo, Say Wei;

More information

203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

More information

A Segmentation Algorithm for Zebra Finch Song at the Note Level. Ping Du and Todd W. Troyer

A Segmentation Algorithm for Zebra Finch Song at the Note Level. Ping Du and Todd W. Troyer A Segmentation Algorithm for Zebra Finch Song at the Note Level Ping Du and Todd W. Troyer Neuroscience and Cognitive Science Program, Dept. of Psychology University of Maryland, College Park, MD 20742

More information

Computer Networks and Internets, 5e Chapter 6 Information Sources and Signals. Introduction

Computer Networks and Internets, 5e Chapter 6 Information Sources and Signals. Introduction Computer Networks and Internets, 5e Chapter 6 Information Sources and Signals Modified from the lecture slides of Lami Kaya (LKaya@ieee.org) for use CECS 474, Fall 2008. 2009 Pearson Education Inc., Upper

More information

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing. Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative

More information

The Effect of Network Cabling on Bit Error Rate Performance. By Paul Kish NORDX/CDT
