
KEYWORD SPOTTING USING HIDDEN MARKOV MODELS

by

Şevket Duran

B.S. in E.E., Boğaziçi University, 1997

Submitted to the Institute for Graduate Studies in Science and Engineering in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

Boğaziçi University
2001

ACKNOWLEDGEMENTS

To Dr. Levent M. Arslan: Thank you for the sacrifices of your personal time that you have made unselfishly to help me prepare this thesis, and thank you for encouraging me to study in the area of speech processing. It is a privilege to be your student.

Şevket Duran

ABSTRACT

KEYWORD SPOTTING USING HIDDEN MARKOV MODELS

The aim of a keyword spotting system is to detect a small set of keywords in continuous speech. The challenge is to obtain the highest possible keyword detection rate without increasing the number of false insertions. Modeling only the keywords is not enough: to separate keywords from non-keywords, models for out-of-vocabulary words are needed as well. Out-of-vocabulary modeling is done with garbage models, and the structure and type of the garbage model have a great effect on overall system performance. The subject of this M.S. thesis is to examine context-independent phonemes as garbage models and to evaluate the performance of different criteria as confidence measures for out-of-vocabulary word rejection. Two databases were collected over telephone lines, one for keyword spotting and one for isolated word recognition experiments. For keyword spotting, the use of monophone models together with a one-state general garbage model gives the best performance. For confidence measures, using average phoneme likelihoods together with phoneme durations performs best.

ÖZET

SAKLI MARKOV MODELLERİ KULLANILARAK ANAHTAR KELİME YAKALAMA

Anahtar kelime yakalama sisteminin amacı sürekli bir sesin içinde barınan küçük bir anahtar kelimeler grubunu ortaya çıkarmaktır. Bu sistemde önemli olan, kelime olmadığı halde hata verme oranını artırmaksızın olası en yüksek anahtar kelime bulma oranını elde etmektir. Bunun için sadece anahtar kelimeleri modelleme yapmak yeterli değildir. Anahtar kelimeleri, olmayanlardan ayırmak için, sözlük dışı kelimelerin modellemesi de gerekmektedir. Bu modelleme, yapısı ve türü itibarıyla tüm sistem performansı üzerinde büyük etkisi bulunan garbage modellemesi ile yapılmaktadır. Bu tezin konusu garbage modelleri olarak bağımsız içerikli sesbirimleri (monophone) incelemek ve sözlük dışı kelime dışlamaları için güvenilirlik oranları bazında değişik kriterlerin performansını değerlendirmektir. Anahtar kelime yakalama ve telefon üzerinden yalıtılmış ses tanıma denemeleri için iki veritabanı oluşturuldu. Anahtar kelime bulma için en iyi performansı tek fazlı genel garbage modelleme ile birlikte tek-sesbirimsel modellerin kullanılması verdi. Güvenilirlik oranları içinse süreleri ile ortalama sesbirim benzeşmelerinin birlikte kullanımı en iyi performansı gösterdi.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
ÖZET
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
2. BACKGROUND
   2.1. Speech Recognition Problem
   2.2. Speech Recognition Process
      2.2.1. Gathering Digital Speech Input
      2.2.2. Feature Extraction
   2.3. Hidden Markov Model
      2.3.1. Assumptions in the Theory of HMMs
         2.3.1.1. The Markov Assumption
         2.3.1.2. The Stationarity Assumption
         2.3.1.3. The Output Independence Assumption
      2.3.2. Three Basic Problems of HMMs
         2.3.2.1. The Evaluation Problem
         2.3.2.2. The Decoding Problem
         2.3.2.3. The Learning Problem
      2.3.3. The Evaluation Problem and the Forward Algorithm
      2.3.4. The Decoding Problem and the Viterbi Algorithm
      2.3.5. The Learning Problem
         2.3.5.1. Maximum Likelihood (ML) Criterion
         2.3.5.2. Baum-Welch Algorithm
      2.3.6. Types of Hidden Markov Models
   2.4. Use of HMMs in Speech Recognition
      2.4.1. Subword Unit Selection
      2.4.2. Word Networks
      2.4.3. Training of the HMMs
      2.4.4. Recognition
         2.4.4.1. Viterbi Based Recognition
         2.4.4.2. N-Best Search
   2.5. Keyword Spotting Problem
3. PROPOSED KEYWORD SPOTTING ALGORITHM
   3.1. Introduction
   3.2. Experiment Data
   3.3. Performance of a System
   3.4. System Structure
   3.5. Performance of Monophone Models for Isolated Word Recognition
4. CONFIDENCE MEASURES FOR ISOLATED WORD RECOGNITION
   4.1. Introduction
   4.2. Experiment Data
   4.3. Minimum Edit Distance
   4.4. Phoneme Durations
   4.5. Garbage Model Using Same 512 Mixtures
   4.6. Comparison of Confidence Measures
5. CONCLUSION
APPENDIX A: SENTENCES USED FOR KEYWORD SPOTTING
APPENDIX B: MINIMUM EDIT DISTANCE ALGORITHM
REFERENCES

LIST OF FIGURES

Figure 2.1. The waveform and spectrogram of ev and ben eve
Figure 2.2. The waveform and spectrogram of okul and holding
Figure 2.3. Components of a typical recognition system
Figure 2.4. The spectrogram of /S/ sound and /e/ sound in word Sevket
Figure 2.5. Flowchart of deriving Mel Frequency Cepstrum Coefficients
Figure 2.6. A simple isolated speech unit recognizer that uses null-grammar
Figure 2.7. The expanded network using the best match triphones
Figure 2.8. The null-grammar network showing the underlying states
Figure 3.1. General structure of the proposed keyword spotter
Figure 3.2. ROC points for different alternatives for garbage model
Figure 3.3. ROC points for different number of keywords for keyword spotting
Figure 3.4. Network structure for the keyword spotter used as a post-processor for isolated word recognizer
Figure 3.5. ROC curves for monophone and garbage model based out-of-vocabulary word rejection
Figure 4.1. ROC curves before/after applying Minimum Edit Distance Revision
Figure 4.2. Forced alignment of the waveform for keyword iszbankasizkurucu
Figure 4.3. Forced alignment of the waveform for keyword milpa
Figure 4.4. ROC curves for phoneme duration based confidence measure
Figure 4.5. Likelihood profiles for ceylangiyim and the base garbage model proposed
Figure 4.6. ROC curves for different emphasis values for power value
Figure 4.7. ROC curves for different power values with emphasis set to
Figure 4.8. ROC curve for phoneme duration based confidence measure and confidence measure with likelihood ratio scoring included

LIST OF TABLES

Table 3.1. Database used for keyword spotting
Table 3.2. Number of occurrences of the keywords used for keyword spotting tests
Table 3.3. Results for monophone model based out-of-vocabulary word rejection for isolated word recognition
Table 3.4. Results for general garbage model based out-of-vocabulary word rejection for isolated word recognition
Table 4.1. Average phoneme durations in Turkish
Table 4.2. Computation time required with/without phoneme duration evaluation

1. INTRODUCTION

Communication between people and computers through more natural interfaces is an important issue if computers are to become part of our daily lives. To interact with a computer you almost always have to use your hands, whether the device is a keyboard, a mouse, or the dialing pad of your phone when you access information on a computer over a telephone line. A more natural input interface is speech. Human-computer interaction via speech involves speech recognition [1, 2, 3] and speech synthesis [4]. Speech recognition is the conversion of a speech signal into text; synthesis is the opposite. Speech recognition may range from understanding simple commands to extracting all the information in the speech signal, such as every word, the meaning, and the emotional state of the speaker. After many years of work, speech recognition has reached a level mature enough to be used in practical applications, thanks to the algorithms developed and the increase in computational power.

Speech recognition may be speaker dependent or speaker independent. If the application is for home use, where the same person uses the same microphone in the same place, the problem is simple and you do not need a robust algorithm. If the application must recognize speech over a public telephone network, where the speaker and the channel the speech passes through differ from call to call, you do need a robust algorithm. If recognition of isolated words or phrases is the task, the problem remains manageable as long as speakers give only the required input. If speakers also utter other words in addition to the required keywords, you need to perform keyword spotting, that is, recognizing the keywords among other non-keyword filler words. Going further, recognition from a large vocabulary where every word must be recognized is called dictation, which is a harder task still. This thesis deals with the keyword spotting problem.

For speech recognition, the digitized speech signal, which is in the time domain, must be transformed into another domain. Typically a portion of the speech is taken and a feature vector is derived to represent it. These feature vectors are then used to guess the sequence of words that generated the speech signal. We need algorithms that account for the variability in the speech signal. The most common technique for acoustic modeling is hidden Markov modeling (HMM), and it is the model used in this thesis.

In order to have an operating-system-independent notation, we prefer not to use the non-ANSI characters of the Turkish character set. We use lower case letters for characters in the ANSI character set and upper case letters for Turkish characters: /S/ instead of /ş/, /U/ instead of /ü/, and so on. We use /Z/ for interword silence, so the phrase savaş alanı is represented as savaszalani.

In this thesis we investigate candidate garbage models for keyword spotting and try to find a good confidence measure for detecting out-of-vocabulary words in an isolated word recognizer. In Chapter 2 we present the theory behind each step of the speech recognition process and detail the techniques we used. In Chapter 3 we study the keyword-spotting algorithm we propose and conclude that using the monophone models of the words together with a one-state, 16-mixture general garbage model with different bonus values gives the best performance; we also evaluate monophone models as garbage models for isolated word recognition. In Chapter 4 we evaluate several measures and decide that both likelihood and phoneme duration are important for obtaining a good confidence measure. Finally, in Chapter 5 we give our conclusions from these experiments and suggest directions for future study.

2. BACKGROUND

2.1. Speech Recognition Problem

The speech signal differs depending on whether the input is given as isolated words or as continuous speech. If the speaker knows that a computer will try to recognize the speech, he or she may pause between words. In continuous speech, however, some sounds disappear and sometimes there is no silence between words. It may even be hard to pronounce a word in a particular context; an exaggerated example is a tongue twister like SemsiZpaSaZpasajIndaZsesiZbUzUSesiceler. Even in normal cases there are large differences in the characteristics of the speech signal. Figure 2.1 shows the same /e/ sound in ev and ben eve. The waveforms are shown at the top of the figure. The spectrograms at the bottom show the energy at different frequencies versus time, with darkness indicating amplitude. The effect of context on the characteristics of the /e/ sound is clearly visible, and it leads us to model each phoneme according to its neighboring phonemes.

Figure 2.1. The waveform and spectrogram of ev (on the left) and ben eve (on the right)

Spontaneous speech may also contain fillers that are not words, such as ee or himm, which is another difficulty in continuous speech recognition. The task should be known while designing the algorithm: if a recognizer is to be used for continuous speech, the training data should consist of continuous speech as well.

The main difficulty of speech recognition comes from the variability of the signal source. First, the characteristics of phonemes, the smallest sound units, depend on the context in which they appear. An example of phonetic variability is the acoustic difference of the phoneme /o/ in the Turkish words okul and holding; see Figure 2.2, where the marked region corresponds to the /o/ sound.

Figure 2.2. The waveform and spectrogram of okul (on the left) and holding (on the right)

The environment also causes variability. The same speaker will say a word differently according to his physical and emotional state, speaking rate, or voice quality, and differences in vocal tract size and shape between people cause further variability. The problem is to find the meaningful information in the speech signal. The meaningful information differs between speech recognition and speaker recognition: the same information in the speech signal may be necessary for one application and redundant for

another. For speaker-independent speech recognition we have to remove as many speaker-related features as possible.

2.2. Speech Recognition Process

Figure 2.3 shows the main components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10 milliseconds. These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.

Figure 2.3. Components of a typical speech recognition system

2.2.1. Gathering Digital Speech Input

Speech recognition is the process of converting a digital speech signal into words. To capture the speech signal we need a device that converts the physical speech wave into a digital signal. This may be a microphone that converts the speech into an analog signal together with a sound card acting as an A/D converter. Another way of obtaining digital speech input is to use a telephone card that converts the analog

signal coming from the telephone line into a digital signal. There are also devices that accept the digital signal coming from E-1 or T-1 lines directly; Dialogic has the JCT LS240 and JCT LS300 for T-1 and E-1 lines respectively. We have been using a JCT LS120 card, a speech-processing card for 12 analog lines. We used an 8 kHz sampling rate, which is the sampling rate of telephone lines, and converted the µ-law encoded signal into a 16-bit linear encoded signal before processing.

2.2.2. Feature Extraction

To remove the redundancies in the speech signal mentioned earlier, we represent the signal using only the perceptually most important, speaker-independent features [5]. The speech signal is produced by passing the excitation signal generated by the larynx through the vocal tract, and we are interested in the properties of the speech imposed by the overall shape of the vocal tract. To distinguish phonemes better (the voiced/unvoiced distinction), we examine whether the vocal folds are vibrating but ignore variations in the frequency of vibration. The spectrum of voiced sounds has several sharp peaks, called formant frequencies, while the spectrum of unvoiced sounds looks like a white noise spectrum. Figure 2.4 shows the spectrum of the unvoiced sound /S/ and the voiced sound /e/.

Figure 2.4. The spectrum (found using a 256-point FFT) of the /S/ sound (on the left) and the /e/ sound (on the right) in the word Sevket

Since our ears are insensitive to phase effects, we use the power spectrum as the basis of the speech recognition front-end. The power spectrum is represented on a log scale. When

the overall gain of the signal varies, the shape of the log power spectrum stays the same but shifts up or down. The convolutional effects of the telephone line multiply the signal in the linear power spectrum; in the log power spectrum the effect is additive. Since a voiced speech waveform corresponds to the convolution of a quasi-periodic excitation signal with a time-varying filter (the shape of the vocal tract), the two can be separated in the log power spectrum. Assigning a lower limit to the log function solves the problem of low energy levels in some parts of the spectrum. Before computing short-term power spectra, the waveform is processed by a simple pre-emphasis filter that gives a 6 dB/octave increase in gain, which makes the average speech spectrum roughly flat.

We have to extract the effects caused by the shape of the vocal tract. One method is to predict the coefficients of the filter that corresponds to the vocal tract shape. The vocal tract is assumed to be a combination of lossless tubes of different radii; the number of parameters derived corresponds to the number of tubes assumed, and the filter is assumed to be an all-pole linear filter. The parameters are called Linear Predictive Coding (LPC) parameters and the procedure is known as LPC analysis. There are different methods to calculate these coefficients [6].

To calculate the short-term spectra we take overlapping portions of the waveform. We take a frame of 25 milliseconds and multiply it by a window function to avoid artificial high frequencies; we use a Hamming window. Then we apply the Fourier transform. We have to remove the harmonic structure at multiples of the fundamental frequency f_0, because it is the effect of the excitation signal. The smoothed spectrum without the effect of the excitation signal corresponds to the Fourier transform of the LPC parameters. We use a different method: components of the power spectrum are grouped into frequency bands. The grouping is not linear; human ear sensitivity is taken into account. The bands are linear up to 1 kHz and logarithmic at higher frequencies, so the frequency bands are broader at higher frequencies. The positions of the bands are set according to the mel frequency scale [7].

The relation between the mel frequency scale and the linear frequency scale is as follows:

Mel(f) = 2595 log_10(1 + f/700)    (2.1)

Figure 2.5. Flowchart for deriving Mel Frequency Cepstrum Coefficients

To calculate the filterbank coefficients, the magnitude coefficients of the spectrum are accumulated after windowing with triangular windows. The triangular filters are

spread over the whole frequency range from zero up to the Nyquist frequency. We have chosen 16 filter banks.

Since the shape of the spectrum imposed by the vocal tract is smooth, energy levels in adjacent bands are correlated. We have to remove this correlation, since in further statistical analysis we assume that feature vector elements are uncorrelated and use a diagonal variance vector. Removing the correlation also allows the number of parameters to be reduced without loss of useful information. The discrete cosine transform (a version of the Fourier transform using only cosine basis functions) converts the set of log energies to a set of cepstral coefficients, which are largely uncorrelated. The formula for the Discrete Cosine Transform is:

c_i = sqrt(2/N) Σ_{j=1}^{N} m_j cos(πi(j - 0.5)/N),   i = 1, ..., P    (2.2)

where the m_j are the log filterbank amplitudes, N is the number of filterbank channels, which we set to 16, and P is the required number of cepstral coefficients, which we set to 12. Figure 2.5 shows the steps in obtaining Mel Frequency Cepstrum Coefficients (MFCCs).

Many systems use the rate of change of the short-term power spectrum as additional information. The simplest way to obtain this dynamic information is to take the difference between consecutive frames, but this is too sensitive to random interframe variations. So linear trends are estimated over sequences of typically five or seven frames [8]. We use five frames, so there is a delay of two frame steps in real-time operation:

d_t = G (2c_{t+2} + c_{t+1} - c_{t-1} - 2c_{t-2})    (2.3)

where d_t is the difference evaluated at time t, and c_{t+2}, c_{t+1}, c_{t-1}, c_{t-2} are the coefficients at times t+2, t+1, t-1 and t-2, respectively. G is a gain factor. Some systems use acceleration features as well as linear rates of change; these second-order dynamic features need longer sequences of frames for reliable estimation [9].
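As a rough illustration of this front-end, the following numpy sketch computes MFCCs for a single frame. The 16 filter banks and 12 cepstral coefficients match the values above, but the pre-emphasis coefficient of 0.97, the FFT size of 256, and the evenly mel-spaced filter edges are assumptions for the sketch, not values taken from the thesis.

```python
import numpy as np

def mel(f):
    # Eq. (2.1): linear frequency (Hz) to mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters=16, n_fft=256, fs=8000):
    # Triangular filters from 0 Hz up to the Nyquist frequency,
    # with edges spaced evenly on the mel scale
    edges_mel = np.linspace(0.0, mel(fs / 2.0), n_filters + 2)
    edges_hz = 700.0 * (10.0 ** (edges_mel / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * edges_hz / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fb[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    return fb

def mfcc(frame, fb, n_ceps=12):
    # Pre-emphasis (~6 dB/octave boost), Hamming window, power spectrum
    x = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    power = np.abs(np.fft.rfft(x * np.hamming(len(x)), 256)) ** 2
    # Log filterbank energies; the floor avoids the log of tiny energies
    m = np.log(np.maximum(fb @ power, 1e-10))
    n = len(m)
    # Eq. (2.2): DCT of the log energies gives the cepstral coefficients
    i = np.arange(1, n_ceps + 1)[:, None]
    j = np.arange(1, n + 1)[None, :]
    return np.sqrt(2.0 / n) * (np.cos(np.pi * i * (j - 0.5) / n) @ m)

def delta(c, G):
    # Eq. (2.3): linear trend over five frames; c is (T, P), G is the gain
    d = np.zeros_like(c)
    d[2:-2] = G * (2 * c[4:] + c[3:-1] - c[1:-3] - 2 * c[:-4])
    return d
```

A 25 ms frame at 8 kHz is 200 samples, so mfcc(signal[:200], mel_filterbank()) yields the 12 coefficients described above.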

Since cepstral coefficients are largely uncorrelated, probability estimation is easier in further analysis: we can simply calculate Euclidean distances from reference model vectors. Statistically based methods weight coefficients by the inverse of their standard deviations computed around their overall means. Current representations concentrate on the spectrum envelope and ignore fundamental frequency, although we know that even in isolated word recognition fundamental frequency contours carry important information.

At the acoustic-phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of training data. Effects of context at the acoustic-phonetic level are handled by training separate models for phonemes in different contexts; this is called context dependent acoustic modeling. Word level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Another technique is to add different pronunciations to the network, because after pruning common nodes the network corresponds to different pronunciations of the same word.

2.3. Hidden Markov Model

The most widely used recognition algorithm of the past fifteen years is the Hidden Markov Model (HMM) [10, 11, 12]. Although there have been some attempts at using neural networks, those have not been very successful. A Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transition probabilities are assigned to the transitions among the states. In a particular state, an outcome or observation can be generated according to the associated probability distribution. An external observer can only see the outcome, not the state that produced it; the states are therefore hidden to the outside. The following part is the theory of HMMs taken from the tutorial [3]; the

advanced reader may skip this part. In order to define an HMM completely, the following elements are needed:

The number of states of the model, N.

The number of observation symbols in the alphabet, M. If the observations are continuous, M is infinite.

A set of state transition probabilities, A = {a_ij}:

a_ij = p{q_{t+1} = j | q_t = i},   1 ≤ i, j ≤ N    (2.4)

where q_t denotes the state index at time t and a_ij corresponds to the transition probability from state i to state j. Transition probabilities should satisfy the normal stochastic constraints,

a_ij ≥ 0,   1 ≤ i, j ≤ N    (2.5)

and

Σ_{j=1}^{N} a_ij = 1,   1 ≤ i ≤ N    (2.6)

A probability distribution in each of the states, B = {b_j(k)}:

b_j(k) = p{o_t = ν_k | q_t = j},   1 ≤ j ≤ N, 1 ≤ k ≤ M    (2.7)

where ν_k denotes the k-th observation symbol in the alphabet and o_t the current observation vector. The following stochastic constraints must be satisfied:

b_j(k) ≥ 0,   1 ≤ j ≤ N, 1 ≤ k ≤ M    (2.8)

and

Σ_{k=1}^{M} b_j(k) = 1,   1 ≤ j ≤ N    (2.9)

If the observations are continuous, we have to use a continuous probability density function instead of a set of discrete probabilities. In this case we specify the parameters of the probability density function. Usually the probability density is approximated by a weighted sum of M Gaussian distributions,

b_j(o_t) = Σ_{m=1}^{M} c_jm N(µ_jm, Σ_jm, o_t)    (2.10)

where c_jm are the mixture weights for the j-th state's m-th mixture, µ_jm are the mean vectors, and Σ_jm are the covariance matrices. The c_jm should satisfy the stochastic constraints

c_jm ≥ 0,   1 ≤ j ≤ N, 1 ≤ m ≤ M    (2.11)

and

Σ_{m=1}^{M} c_jm = 1,   1 ≤ j ≤ N    (2.12)

The initial state distribution, π = {π_i}, where

π_i = p{q_1 = i},   1 ≤ i ≤ N    (2.13)
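For intuition, here is a small numpy sketch of the mixture output density of Eq. (2.10), assuming diagonal covariance matrices (consistent with the diagonal variance vectors used in the feature extraction discussion above); it is an illustration, not the thesis implementation.

```python
import numpy as np

def diag_gaussian(o, mu, var):
    # N(o; mu, diag(var)): multivariate Gaussian with a diagonal covariance
    z = ((o - mu) ** 2 / var).sum()
    return np.exp(-0.5 * z) / np.sqrt((2 * np.pi) ** len(o) * var.prod())

def output_density(o, c, mu, var):
    """Eq. (2.10): b_j(o) = sum_m c_jm N(o; mu_jm, Sigma_jm).
    c: (M,) mixture weights, mu: (M, D) means, var: (M, D) diagonal variances."""
    return sum(c[m] * diag_gaussian(o, mu[m], var[m]) for m in range(len(c)))
```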

Therefore the compact notation

λ = (A, B, π)    (2.14)

can be used to denote an HMM with discrete probability distributions, while

λ = (A, c_jm, µ_jm, Σ_jm, π)    (2.15)

denotes one with continuous densities.

2.3.1. Assumptions in the Theory of HMMs

For the sake of mathematical and computational tractability, the following assumptions are made in the theory of HMMs.

2.3.1.1. The Markov Assumption. As given in the definition of HMMs, transition probabilities are defined as

a_ij = p{q_{t+1} = j | q_t = i},   1 ≤ i, j ≤ N    (2.16)

In other words, it is assumed that the next state depends only upon the current state. This is called the Markov assumption, and the resulting model is actually a first order HMM. The next state may, however, depend on the past k states, and it is possible to build such a model, called a k-th order HMM, but a higher order HMM has higher complexity.

2.3.1.2. The Stationarity Assumption. Here it is assumed that the state transition probabilities are independent of the actual time at which the transitions take place. Mathematically,

p{q_{t1+1} = j | q_t1 = i} = p{q_{t2+1} = j | q_t2 = i}    (2.17)

for any t1 and t2.

2.3.1.3. The Output Independence Assumption. This is the assumption that the current output (observation) is statistically independent of the previous outputs (observations). We can formulate this assumption mathematically by considering a sequence of observations,

O = o_1, o_2, ..., o_T    (2.18)

Then, according to the assumption, for an HMM λ,

p{O | q_1, q_2, ..., q_T, λ} = Π_{t=1}^{T} p(o_t | q_t, λ)    (2.19)

Unlike the other two, however, this assumption has very limited validity. In some cases it is not fair enough and therefore becomes a severe weakness of HMMs.

2.3.2. Three Basic Problems of HMMs

Once we have an HMM, there are three problems of interest.

2.3.2.1. The Evaluation Problem. Given an HMM λ and a sequence of observations O = o_1, o_2, ..., o_T, what is the probability that the observations are generated by the model, p{O | λ}?

2.3.2.2. The Decoding Problem. Given a model λ and a sequence of observations O = o_1, o_2, ..., o_T, what is the most likely state sequence in the model that produced the observations?

2.3.2.3. The Learning Problem. Given a model λ and a sequence of observations O = o_1, o_2, ..., o_T, how should we adjust the model parameters (A, B, π) in order to maximize p{O | λ}?

The evaluation problem can be used for isolated (word) recognition. The decoding problem is related to continuous recognition as well as to segmentation. The learning problem must be solved if we want to train an HMM for subsequent use in recognition tasks.

2.3.3. The Evaluation Problem and the Forward Algorithm

We have a model λ = (A, B, π) and a sequence of observations O = o_1, o_2, ..., o_T, and p{O | λ} must be found. If we calculate this quantity using simple probabilistic arguments, the number of operations is on the order of N^T. This is very large even if the length of the sequence T is small. The idea of reusing the multiplications that are common leads to an auxiliary variable called the forward variable, denoted α_t(i).

The forward variable is defined as the probability of the partial observation sequence o_1, o_2, ..., o_t when it terminates at state i. Mathematically,

α_t(i) = p{o_1, o_2, ..., o_t, q_t = i | λ}    (2.20)

Then it is easy to see that the following recursive relationship holds:

α_{t+1}(j) = b_j(o_{t+1}) Σ_{i=1}^{N} α_t(i) a_ij,   1 ≤ j ≤ N, 1 ≤ t ≤ T-1    (2.21)

where

α_1(j) = π_j b_j(o_1),   1 ≤ j ≤ N    (2.22)

Using this recursion we can calculate α_T(i), 1 ≤ i ≤ N, and then the required probability is given by

p{O | λ} = Σ_{i=1}^{N} α_T(i)    (2.23)

The complexity of this method, known as the forward algorithm, is proportional to N²T, which is linear with respect to T, whereas the direct calculation has exponential complexity.

In a similar way, the backward variable β_t(i) is defined as the probability of the partial observation sequence o_{t+1}, o_{t+2}, ..., o_T, given that the current state is i.

Mathematically,

β_t(i) = p{o_{t+1}, o_{t+2}, ..., o_T | q_t = i, λ}    (2.24)

As in the case of α_t(i), there is a recursive relationship that can be used to calculate β_t(i) efficiently:

β_t(i) = Σ_{j=1}^{N} β_{t+1}(j) a_ij b_j(o_{t+1}),   1 ≤ i ≤ N, 1 ≤ t ≤ T-1    (2.25)

where

β_T(i) = 1,   1 ≤ i ≤ N    (2.26)

Further, we can see that

α_t(i) β_t(i) = p{O, q_t = i | λ},   1 ≤ i ≤ N, 1 ≤ t ≤ T    (2.27)

Therefore this gives another way to calculate p{O | λ}, using both forward and backward variables:

p{O | λ} = Σ_{i=1}^{N} p{O, q_t = i | λ} = Σ_{i=1}^{N} α_t(i) β_t(i)    (2.28)

The last equation is very useful, especially in deriving the formulas required for gradient based training.
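A compact numpy sketch of the two recursions follows, with the state likelihoods b_j(o_t) assumed precomputed into a matrix B. It illustrates Eqs. (2.21) to (2.26) and is not the thesis implementation.

```python
import numpy as np

def forward(A, B, pi):
    """Forward recursion, Eqs. (2.21)-(2.23): returns p{O | lambda} and alpha.
    A: (N, N) transitions; B: (T, N) with B[t, j] = b_j(o_t); pi: (N,)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                          # Eq. (2.22)
    for t in range(T - 1):
        alpha[t + 1] = B[t + 1] * (alpha[t] @ A)  # Eq. (2.21)
    return alpha[-1].sum(), alpha                 # Eq. (2.23)

def backward(A, B):
    """Backward recursion, Eqs. (2.25)-(2.26)."""
    T, N = B.shape
    beta = np.ones((T, N))                        # Eq. (2.26): beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])    # Eq. (2.25)
    return beta
```

In practice the variables are scaled per frame, or the recursions are run in the log domain, to avoid numerical underflow for long observation sequences.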

2.3.4. The Decoding Problem and the Viterbi Algorithm

In this case we want to find the most likely state sequence for a given sequence of observations O = o_1, o_2, ..., o_T and a model λ = (A, B, π). The solution to this problem depends on how the most likely state sequence is defined. One approach is to find the most likely state q_t at each time t and to concatenate all such q_t's, but sometimes this method does not give a physically meaningful state sequence. We therefore need another method which has no such problems. In this method, commonly known as the Viterbi algorithm [13], the whole state sequence with the maximum likelihood is found. In order to facilitate the computation we define an auxiliary variable,

δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} p{q_1, q_2, ..., q_{t-1}, q_t = i, o_1, o_2, ..., o_t | λ}    (2.29)

which gives the highest probability that the partial observation sequence and state sequence up to time t can have when the current state is i. It is easy to observe that the following recursive relationship holds:

δ_{t+1}(j) = b_j(o_{t+1}) max_{1≤i≤N} δ_t(i) a_ij,   1 ≤ j ≤ N, 1 ≤ t ≤ T-1    (2.30)

where

δ_1(j) = π_j b_j(o_1),   1 ≤ j ≤ N    (2.31)

The procedure for finding the most likely state sequence starts by calculating δ_T(j), 1 ≤ j ≤ N, using the recursion in Eq. (2.30), while always keeping a pointer to the winning state in the maximum finding operation. Finally the state j* is found, where

j* = arg max_{1≤j≤N} δ_T(j)    (2.32)

and starting from this state, the sequence of states is back-tracked as the pointer in each state indicates. This gives the required set of states. The whole algorithm can be interpreted as a search in a graph whose nodes are formed by the states of the HMM at each time instant t, 1 ≤ t ≤ T.
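The same matrix conventions as in the forward sketch give a short Viterbi illustration; again this is a sketch of Eqs. (2.29) to (2.32), not the recognizer used in the thesis.

```python
import numpy as np

def viterbi(A, B, pi):
    """Viterbi decoding, Eqs. (2.29)-(2.32): best state path and its score.
    A: (N, N) transitions; B: (T, N) with B[t, j] = b_j(o_t); pi: (N,)."""
    T, N = B.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)          # back-pointers to winning states
    delta[0] = pi * B[0]                       # Eq. (2.31)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A     # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = B[t] * scores.max(axis=0)   # Eq. (2.30)
    path = [int(delta[-1].argmax())]           # j* of Eq. (2.32)
    for t in range(T - 1, 0, -1):              # back-track along the pointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```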

2.3.5. The Learning Problem

Generally, the learning problem is how to adjust the HMM parameters so that the given set of observations (called the training set) is represented by the model in the best way for the intended application. The quantity we wish to optimize during the learning process can therefore differ from application to application; in other words, there may be several optimization criteria for learning. The criterion we will be using is Maximum Likelihood (ML).

2.3.5.1. Maximum Likelihood (ML) Criterion. In ML we try to maximize the probability of a given sequence of observations O^w, belonging to a given class w, given the HMM λ_w of that class, with respect to the parameters of the model λ_w. This probability is the total likelihood of the observations and can be expressed mathematically as

L_tot = p{O^w | λ_w}    (2.33)

However, since we consider only one class w at a time, we can drop the subscript and superscript w's. The ML criterion can then be given as

L_tot = p{O | λ}    (2.34)

There is no known way to analytically solve for the model λ = (A, B, π) that maximizes the quantity L_tot, but we can choose model parameters such that it is locally maximized, using an iterative procedure like the Baum-Welch method [12].

2.3.5.2. Baum-Welch Algorithm. This method can be derived using simple occurrence counting arguments or using calculus to maximize the auxiliary quantity

Q(λ, λ') = Σ_q p{q | O, λ} log p{O, q | λ'}    (2.35)

over λ' [3]. A special feature of the algorithm is its guaranteed convergence. To describe the Baum-Welch algorithm (also known as the Forward-Backward algorithm), we need to define two more auxiliary variables, in addition to the forward and backward variables defined in a previous section. Both can, however, be expressed in terms of the forward and backward variables. The first is defined as the probability of being in state i at time t and in state j at time t+1. Formally,

ξ_t(i, j) = p{q_t = i, q_{t+1} = j | O, λ}    (2.36)

This is the same as

ξ_t(i, j) = p{q_t = i, q_{t+1} = j, O | λ} / p{O | λ}    (2.37)

Using forward and backward variables this can be expressed as

ξ_t(i, j) = α_t(i) a_ij β_{t+1}(j) b_j(o_{t+1}) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij β_{t+1}(j) b_j(o_{t+1})    (2.38)

The second variable is the a posteriori probability

γ_t(i) = p{q_t = i | O, λ}    (2.39)

that is, the probability of being in state i at time t, given the observation sequence and the model. In terms of forward and backward variables this can be expressed as

γ_t(i) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)    (2.40)

It can be seen that the relationship between γ_t(i) and ξ_t(i, j) is given by

γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j),   1 ≤ i ≤ N, 1 ≤ t ≤ T-1    (2.41)

Now it is possible to describe the Baum-Welch learning process, in which the parameters of the HMM are updated in such a way as to maximize the quantity p{O | λ}. Assuming a starting model λ = (A, B, π), we calculate the α's and β's using the recursions (2.21) and (2.25), and then the ξ's and γ's using (2.38) and (2.41). The next step is to update the HMM parameters according to Eqs. (2.42) to (2.44), known as the re-estimation formulas:

π'_i = γ_1(i),   1 ≤ i ≤ N    (2.42)

a'_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i),   1 ≤ i ≤ N, 1 ≤ j ≤ N    (2.43)

b'_j(k) = Σ_{t: o_t = ν_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j),   1 ≤ j ≤ N, 1 ≤ k ≤ M    (2.44)
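A self-contained numpy sketch of one transition-matrix update follows; it runs the forward and backward passes inline and, like the sketches above, it is an unscaled illustration of Eqs. (2.38), (2.40) and (2.43) rather than a production trainer.

```python
import numpy as np

def reestimate_A(A, B, pi):
    """One Baum-Welch update of the transition matrix, Eqs. (2.38)-(2.43).
    A: (N, N) transitions; B: (T, N) with B[t, j] = b_j(o_t); pi: (N,)."""
    T, N = B.shape
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[0]
    for t in range(T - 1):                        # forward pass, Eq. (2.21)
        alpha[t + 1] = B[t + 1] * (alpha[t] @ A)
    for t in range(T - 2, -1, -1):                # backward pass, Eq. (2.25)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    p_obs = alpha[-1].sum()                       # p{O | lambda}, Eq. (2.23)
    gamma = alpha * beta / p_obs                  # Eq. (2.40)
    # Expected i -> j transition counts: xi_t(i, j) of Eq. (2.38), summed over t
    xi_sum = sum(np.outer(alpha[t], B[t + 1] * beta[t + 1]) * A
                 for t in range(T - 1)) / p_obs
    return xi_sum / gamma[:-1].sum(axis=0)[:, None]   # Eq. (2.43)
```

Each row of the returned matrix sums to one, since summing ξ_t(i, j) over j gives γ_t(i), exactly as Eq. (2.41) states.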

2.3.6. Types of Hidden Markov Models

HMMs can be classified according to the nature of the elements of the B matrix, which are distribution functions.

In discrete HMMs, distributions are defined on finite spaces: observations are vectors of symbols in a finite alphabet of N different elements. For each of the Q vector components, a discrete density {w(k), k = 1, ..., N} is defined, and the distribution is obtained by multiplying the probabilities of each component.

Another possibility is to define distributions as probability densities on continuous observation spaces. In this case, the functional form of the distributions has to have certain characteristics in order to have a manageable number of statistical parameters to estimate. The density functions are usually Gaussian or Laplacian, and the statistics can be characterized by the mean vector and the covariance matrix. HMMs with these kinds of distributions are usually referred to as continuous HMMs. A large number of base densities has to be used in every mixture. Since most of the time the training data is not enough, different models share the same distributions: different models are expressed in terms of the shared base distribution functions using different weights. This type of HMM is called a semi-continuous HMM [11]. Base densities are assumed to be statistically independent, so the distributions associated with model transitions are products of the component density functions. Parameters of statistical models are estimated using iterative learning algorithms [14]; the likelihood of a set of training data increases at each step.

2.4. Use of HMMs in Speech Recognition

In a statistical framework, a set of elementary probabilistic models of basic linguistic units (e.g., phonemes) is used to build word representations. A sequence of acoustic parameters extracted from a speech signal is seen as the output of an HMM formed by concatenating elementary processes. The underlying state sequence corresponds to the meaningful combinations of the phonemes. The transitions between the states of a phoneme model correspond to the variability in duration, and the stochastic observable

outputs correspond to the spectral variability. It is not practical to train a separate network for each word, so words are usually represented as networks of phonemes, with each path in a word network representing a pronunciation of the word.

2.4.1. Subword Unit Selection

The same phoneme can have different acoustic distributions of observations if pronounced in different contexts. Allophone models of a phoneme are models of that phoneme in different contexts. The decision as to how many allophones should be considered for a given phoneme may depend on many factors, e.g., the availability of enough training data to determine the model parameters. A conceptually interesting approach is the use of polyphones [15]. In principle, an allophone should be considered for every different word in which a phoneme appears. If the vocabulary is large, it is unlikely that there are enough data to train all of these allophone models, so allophone models are considered at a coarser level of detail (word, syllable, triphone, diphone, context independent phoneme). We have been using 790 triphones during the tests, selected to cover most of the triphones in Turkish [16]. We use capital letters for Turkish characters that are not in the ANSI character set (for example, /C/ instead of /ç/) and /Z/ for silence.

Another approach consists of choosing allophones by clustering possible contexts. This choice can be made automatically with Classification and Regression Trees (CART). A CART is a binary tree having a phoneme at the root and, associated with each node n_i, a question Q_i about the context. Questions Q_i are of the type "Is the previous phoneme a nasal consonant?" For each possible answer (YES or NO) there is a link to another node with which other questions are associated. There are algorithms for growing and pruning CARTs based on automatically assigning questions to a node from a manually determined pool of questions. The leaves of the tree may simply be labeled by an allophone symbol [17, 18].

We use a score to find the best match for a triphone that is not in the HMM list. According to spectral similarity, a score is assigned to each pair of phonemes; spectrally close phonemes such as /m/ and /n/, for example, receive a high similarity score. The score is calculated automatically using the spectral distance, so it is easy to derive for a new language. The total similarity score is calculated by summing the weighted pairwise scores: the center phoneme has a weight of 1 and the weight decreases exponentially with distance from the center,

S(X, H) = Σ_{i=-ctx}^{ctx} s(x_i, h_i) W^|i|    (2.45)

where S(X, H) is the similarity score between the unseen triphone X and the triphone H in the list, s(x_i, h_i) is the similarity score between the phonemes at position i, ctx is the context level, and W is a weighting factor, which we choose as 0.1. So if a triphone is in the HMM list, there will be an exact match with the maximum score. We have expanded some words in terms of the best match triphones below:

penguen: Z-p+e t-e+n e-n+i y-g+u g-u+l l-e+n e-n+z
milpa: Z-m+e m-i+l i-l+g Z-p+a s-a+z
bossa: Z-b+o b-o+r k-s+t k-s+a s-a+z
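A minimal sketch of Eq. (2.45) for triphones, where the pairwise similarity function s is assumed to be supplied (e.g., a lookup into the spectral-distance table mentioned above):

```python
def triphone_similarity(x, h, s, W=0.1):
    """Eq. (2.45) for triphones (ctx = 1): x and h are (left, center, right)
    phoneme triples; s(a, b) is the pairwise phoneme similarity; W = 0.1."""
    ctx = 1
    return sum(s(x[i], h[i]) * W ** abs(i - ctx) for i in range(2 * ctx + 1))
```

With W = 0.1, the center phoneme dominates the score and each context position contributes an order of magnitude less, matching the exponential decay described above.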

Each allophone model is an HMM made of states, transitions and probability distributions. In order to improve the estimation of the statistical parameters of these models, some distributions can be shared, or tied. For example, the distributions for the central portion of the allophones of a given phoneme can be tied, reflecting the fact that they represent the stable (context-independent) physical realization of the central part of the phoneme, uttered with a stationary configuration of the vocal tract.

Another approach consists of having clusters of distributions characterized by the same set of Gaussian probability density functions. Allophone distributions are then built by considering mixtures with the same components but with different weights [14].

2.4.2. Word Networks

Isolated recognition in general means recognition of speech based on any kind of isolated speech unit, which can be a word, a subword, or even a concatenation of words; however, only isolated word recognition has direct practical applications. In a simple isolated speech unit recognition task where the vocabulary contains N speech units, we can use the system depicted in Figure 2.6.

Figure 2.6. A simple isolated speech unit recognizer that uses null-grammar

We have a finite network, since only one of the words can be spoken, and the speech contains an initial silence and a final silence. This simple network used for isolated word recognition should be expanded, because we do not have models for each different word: we generate networks using the HMMs of the best match triphones. We have two silence models, Z-Z+Z for interword silence and X-X+X for the silence at the beginning and end of the word. The X-X+X model differs in its self-transition probability: since the self-transition probability is high, it can compensate for loose end-pointing. The expanded network will be as in Figure 2.7.

Figure 2.7. The expanded network using the best match triphones

Figure 2.8. The null-grammar network showing the underlying states

For the silence models we have one-state HMMs, and for triphones we have three-state HMMs. The reason for choosing three-state models is that we can model the transitions to the adjacent phonemes with the first and third states, while the second state stands for the steady state of the phoneme. There are actually two more states at the beginning and end of each model that produce no observations; they are not shown in the figure for the sake of simplicity. Figure 2.8 shows the actual HMM network, which contains 3-state models for the triphones and a 1-state model for the silence.

2.4.3. Training of the HMMs

We use semi-continuous HMMs, which means every output probability distribution of each state is a linear combination of Gaussian density functions. 512 mixtures have been proposed for optimal performance [16]. Using tied mixtures makes recognition easier, because only the probabilities of the 512 Gaussian density functions need to be calculated for each feature vector; the output probability of each triphone state is a combination of these pdfs. In training we need to estimate the means and variances of the Gaussian densities, the mixture coefficients, and the state transition matrix for each triphone.

The core process in training a set of subword models (phonemes) is embedded training, for which we use a program called train. Continuously spoken utterances are parametrized as the training data. In embedded training, re-estimates of the complete set of subword HMMs are done simultaneously. For each input utterance we need a transcription of the speech, but labeling, that is, specifying the boundaries of the phonemes in the utterance, is not required. A single composite HMM is built for each input utterance using the transcription of the input speech and the initial HMMs; this composite HMM collects statistics for the re-estimation of the parameters. When all of the training utterances have been processed, the total set of accumulated statistics is used to re-estimate the parameters of all of the phone HMMs.

To find the initial parameters for the HMMs before embedded re-estimation we have two choices. One method is to assign a global mean and variance to all Gaussian distributions in all HMMs. The other is to begin with a small set of hand-labeled training data to initialize the mean and variance of each Gaussian density. The Baum-Welch method is then used to update the mean and/or variance of each Gaussian density.

2.4.4. Recognition

To find the most probable sequence we need a search space, and that search space is represented by a word network. In the case of isolated word recognition it is a simple network of N words between start and end nodes; in the case of keyword spotting we need a different kind of network, which we discuss in the next chapter. After building the network using the nearest triphones, we have to find the optimal path for a given observation sequence; it is then possible to trace the corresponding speech unit sequence via the state sequence. To calculate the optimal state sequence q* we can use the Viterbi algorithm directly, or the level building method, which is a variant of the Viterbi algorithm. Since Viterbi based recognition is suboptimal unless each speech unit corresponds to a single HMM state, some attempts have been made to develop efficient methods for calculating the sentence likelihoods. The N-best algorithm is one of these.

2.4.4.1. Viterbi Based Recognition. The Viterbi score δ_t(i) can be computed for all the states in the language model Λ at time t and then advanced to time t+1 in an inductive manner, as formulated in [13]. This procedure is known as time synchronous Viterbi search because it completely processes time t before going on to time t+1. Finally, a backtracking pass gives the required state sequences.

Viterbi search can be very expensive if the number of states is large. When the number of states is large, at every time instant a large portion of the states have an accumulated likelihood much lower than the highest one, so a path passing through one of these states is not expected to become the best path at the end of the utterance. This consideration leads to a complexity reduction technique called beam search [19]. Beam search neglects states whose accumulated score is lower than the best one minus a given threshold; pruning unlikely paths avoids extra computation. However, if the pruning threshold is chosen poorly, the best path can be lost. In practice, good tuning of the beam threshold results in a gain in speed of an order of magnitude while introducing a negligible number of search errors.
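The pruning rule itself is tiny; a hypothetical sketch over a dictionary of accumulated log scores, just to make the thresholding concrete:

```python
def beam_prune(scores, beam):
    """Beam search pruning: drop states whose accumulated log score falls
    more than `beam` below the best score at the current frame."""
    best = max(scores.values())
    return {state: s for state, s in scores.items() if s >= best - beam}

# Example: with beam=50.0, only the state within the beam survives
active = beam_prune({"s1": -10.0, "s2": -90.0}, beam=50.0)   # keeps only "s1"
```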

2.4.4.2. N-best Search. The N-best search algorithm is very similar to the time synchronous Viterbi search. Since the purpose of the N-best method is to find the optimum speech unit sequence instead of the optimum state sequence, a summing operation should be done instead of the maximum finding operation. However, if we completely drop the maximum finding operation it becomes the forward algorithm, and we go back to where we started. Therefore pruning is performed at every state (in addition to the pruning of the beam), keeping only the first N paths with the highest scores. Even this algorithm therefore does not give the theoretically optimum sentence. At the end, the algorithm gives the N most likely sentences; for a simple task without post-processors, N=1 is enough. We used the token passing paradigm [20].

2.5. Keyword Spotting Problem

In the keyword spotting problem, a continuously spoken utterance is tested for the existence of a keyword. The speech signal may contain any combination of silences, keywords and non-keywords. The words that are not keywords are called garbage or out-of-vocabulary words. We have discussed how to model a word using HMMs for phonemes; we also need to model the out-of-vocabulary words in some form. Since we use tied mixtures for modeling the output produced at each state of the network, we will be using Gaussian mixtures for the garbage models as well. We may choose to use fewer mixtures to model garbage words; these Gaussian mixtures will have greater variances, which means they are more general models.

3. PROPOSED KEYWORD SPOTTING ALGORITHM

3.1. Introduction

The best we can do in detecting keywords in a speech signal is to recognize all of the words using a Large Vocabulary Continuous Speech Recognizer (LVCSR). However, the cost is very high, and when the context of the keywords is unknown it is impossible to use an LVCSR system. Instead, it is common to model all of the out-of-vocabulary words with garbage models.

The garbage model we used is a 16-mixture model. The reason for using a small number of mixtures is to model the general properties of the speech signal: a small number of mixtures means greater variance in the Gaussian distribution functions. We used the notation J-J+J for the garbage model.

Another idea we tried can be described as follows. If we create a network with the triphone models and the monophone models of the same word, the triphone model of the word should get the best score if the keyword exists; if the keyword does not exist, the monophone or the garbage model will get the best score. The reason for this hypothesis is that the monophone models represent context independent phonemes and the triphone models represent context dependent phonemes. Monophone models have greater variances, modeling all phonemes more generally; we used 32-mixture monophone models.

In order to favor some of the models, we added a bonus value to all transition probabilities of the HMMs on that path. Although the probability of an event cannot be greater than one and the probabilities of exclusive events should sum to 1, adding some value to the transition probabilities works fine in practice, because it simply increases the probability of passing through that path. This corresponds to a one-pass implementation of Likelihood Ratio Scoring.

3.2. Experiment Data

The database we used for keyword spotting is described in Table 3.1 and Table 3.2. We asked 12 speakers to read 20 different sentences from a sheet over the telephone. Since some speakers made mistakes during recording, 44 of the sentences were removed from the database, leaving a total of 196 sentences for our simulations. The sentences are given in Appendix A. In the simulations we tested keyword sets of 1, 3, 5 and 10 keywords; the keywords in each set and their numbers of occurrences are listed in Table 3.2.

Table 3.1. Database used for keyword spotting

Number of sentences: 196
Number of words: 3,391
Number of speakers: 12
Sound quality: 8 kHz µ-law encoded telephony signal
Total record time: 15 minutes

Table 3.2. Number of occurrences of the keywords used for keyword spotting tests

Medya Holding: 12
Ankara + Yargıtay + Fenerbahçe: 36
Pejo + Sabah + Ankara + Medya Holding + Yargıtay: 86
Türkiye + Pejo + Pirelli + Devalüasyon + Fenerbahçe + rüşvet + Sabah + Yargıtay + Medya Holding + Rahmi Koç

3.3. Performance of a System

For a keyword spotting system there are two kinds of errors: false alarms and misses. If there is no keyword and the keyword spotter detects one, it is called a false alarm. If a keyword is present but the keyword spotter cannot detect it, it is called a miss. The same recognizer cannot reduce false alarms without increasing the miss rate.

Receiver Operating Characteristic (ROC) curves show the probability of detection versus the probability of false alarm [20]. The probability of detection is obtained by dividing the number of detections by the number of keywords. The probability of false alarm is obtained by dividing the total number of false alarms by the total number of words that are not keywords. However, since we do not know much about the out-of-vocabulary words, false alarms per hour is a much more reasonable measure than the probability of false alarm.

3.4. System Structure

Figure 3.1 shows the general network structure that can be used to detect the keyword sabah. It is an infinite-state network: as long as input data is available, the same path may be used multiple times. Since the silence model Z-Z+Z is always available, this grammar attempts to model each word in the speech signal as either the keyword or garbage.

Figure 3.1. General structure of the proposed keyword spotter

We tested the performance of the keyword spotter using

a. the triphone model of the word and one garbage model
b. the triphone model and the monophone model of the word
c. the triphone model and the monophone model of the word and a garbage model

In order to obtain a Receiver Operating Characteristic (ROC) we measured the probability of detection and the number of false alarms per hour.
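The two coordinates of an operating point reduce to simple ratios; a trivial helper (hypothetical, for illustration only) makes the bookkeeping explicit:

```python
def roc_point(n_detected, n_keywords, n_false_alarms, record_hours):
    """One ROC operating point: (false alarms per hour, detection probability)."""
    return n_false_alarms / record_hours, n_detected / n_keywords
```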

Each time the test was performed, we used a different combination of bonus values for the monophone model of the word and the garbage model; a program counts the false alarm/miss rates automatically. We performed the test using either only the monophone models of the keywords as the garbage model or a 1-state 16-mixture general garbage model. Figure 3.2 shows the ROC points for the tests we performed. The operating point of the ROC is determined by the bonus values of the monophone models and the garbage model; since these bonus values are related, the points obtained using different combinations do not form a curve, so we prefer to call these figures ROC points.

Figure 3.2. ROC points for different alternatives for garbage model

As can be seen, using the monophone models of the keywords as the garbage model gives approximately the same performance as using the 1-state general garbage model. However, when we use a combination of these two alternatives with different bonus values, we get better performance. Monophones are in fact good candidates for garbage modeling. The recognizer cannot model very short utterances using the monophone model of the word, so a better alternative may be to use an ergodic model for the monophone

model of the keyword. In an ergodic model, a transition from each phoneme to every other phoneme is possible, unlike the left-to-right models we have been using; however, the cost of using an ergodic model is too high. So the monophone model of the word coupled with a general 1-state garbage model is the better alternative.

We tested the system for different numbers of keywords. As the number of keywords increases, the performance of the system decreases, as expected. Figure 3.3 shows the ROC points for 1, 3, 5 and 10 keywords; all operating points are shown together in this figure regardless of the structure used. Using the monophone models of the keywords as the garbage model gave better results for multiple keywords, too.

Figure 3.3. ROC points for different number of keywords for keyword spotting simulations

3.5. Performance of Monophone Models for Isolated Word Recognition

Monophone models did not increase performance when used in the keyword-spotting task; the reason may be the infinite-state network used for keyword spotting. We applied the same idea to isolated word recognition, deciding to use our keyword spotter as a post-processor for the isolated word recognizer. The output of the isolated word recognizer is used as the keyword for the keyword spotter, which creates a temporary network using the triphone model and the monophone model of that word.

Figure 3.4. Network structure for the keyword spotter used as a post-processor for isolated word recognizer

The network is a finite-state network, as shown in Figure 3.4. If the triphone model of the output gets the higher likelihood, the post-processor decides that the word is from the vocabulary;

otherwise we assume that it is garbage. We performed the experiment using different bonus values. Table 3.3 and Table 3.4 show the results of the experiments, and Figure 3.5 shows the ROC curves for the monophone models and the garbage model.

Table 3.3. Results for monophone model based out-of-vocabulary word rejection for isolated word recognition (columns: bonus value, probability of detection, probability of false alarm)

Table 3.4. Results for general garbage model based out-of-vocabulary word rejection for isolated word recognition (columns: bonus value, probability of detection, probability of false alarm)

Figure 3.5. ROC curves for monophone and garbage model based out-of-vocabulary word rejection

4. CONFIDENCE MEASURES FOR ISOLATED WORD RECOGNITION

4.1. Introduction

Out-of-vocabulary word rejection is an important issue in isolated word recognition as well as in keyword spotting. If somebody dials an automated information retrieval system and asks about a subject that is irrelevant to that service, the system will interpret the request as one of the known inputs. The user will of course not realize that he uttered an out-of-vocabulary word and will complain about the service; telling a person who wants information about a hospital about the weather is nonsense. The user may also give an input accidentally, especially to a barge-in-enabled system, which stops playing the prompt as soon as it detects voice activity of sufficient duration. For a pleasing service, out-of-vocabulary word rejection may be as important as recognizing the true item from the vocabulary.

A recognition system may produce a likelihood score instead of simply rejecting or accepting a hypothesis. The confidence value is a score that shows how confident the recognizer is about the recognition result. However, a raw confidence value is not suitable for an average system designer; it should be converted into a more meaningful value like percent confidence. Percent confidence is a number between zero and 100 that reflects how likely the result is to be correct: if you group the results that have a percent confidence of 75, the recognition rate within that group will be 75 per cent. A recognizer cannot assign 100 per cent confidence to a result.

The confidence value is especially useful when you have the chance to ask the speaker to say the word again. If the confidence value is too low, the system may tell the user that it could not understand. If the confidence value is high, the system may advance directly to the next step in the scenario. If the confidence value is medium, the system may need to confirm the recognition result. A higher percent confidence may be required for critical applications. Using different threshold values for the confidence value corresponds to changing the operating point on a ROC curve.
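The low/medium/high policy described above might look like the following sketch; the 40 and 75 per cent thresholds are illustrative assumptions, not values from the thesis.

```python
def act_on_confidence(percent_confidence, low=40, high=75):
    """Dialog policy sketched in the text: re-prompt on low confidence,
    confirm on medium, proceed on high. Thresholds are illustrative only."""
    if percent_confidence < low:
        return "reprompt"   # "I could not understand, please repeat"
    if percent_confidence < high:
        return "confirm"    # "Did you say X?"
    return "proceed"        # advance to the next step in the scenario
```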

4.2. Experiment Data

We have collected 1,176 recordings of isolated words/phrases from 4 speakers. Each speaker spoke the names of the stocks exchanged at İMKB as isolated phrases once. Some of the recordings are of very poor quality and some have end-pointing errors, but we have kept them for an objective evaluation. We set aside half of the words and used only the remaining ones in the recognition vocabulary.

4.3. Minimum Edit Distance

Robustness requires that the recognizer operate in different situations without great degradation in performance. Due to the variability of speech there is no strict decision boundary between speech units, so it is natural that the recognizer will match a very close word when the spoken word is not in the vocabulary. For example, our HMM based recognizer can make the distinction between the words kardemirzbe and kardemirzde if both of them are in the vocabulary. If one of them is removed from the vocabulary and the incoming speech signal corresponds to the removed word, the recognizer will match the remaining word. This behavior can even be useful, for instance when a speaker says sabancolding instead of sabancizholding. So we have decided not to penalize the recognizer if the recognized word and the actual word are very close to each other.

To decide the similarity of two words we have used the minimum edit distance criterion. The minimum edit distance between two strings is defined as the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other [22]. The minimum edit distance algorithm is an example of the class of dynamic programming algorithms and is given in Appendix B. The costs of insertion, deletion and substitution may be assigned according to the specific task; for speech recognition, assigning a cost of 1 for insertion and deletion and 2 for substitution is reasonable [23]. If the recognizer does not reject an out-of-vocabulary word that is very close to a word in the vocabulary, we counted it as a detection and called this Minimum Edit Distance Revision (MEDR). If the minimum edit distance is less than or equal to 3, we assumed the result to be correct.
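The sketch below is a compact Python rendering of the dynamic-programming recurrence of Appendix B with the cost assignment above (1 for insertion and deletion, 2 for substitution), together with the MEDR acceptance rule; the function names are ours.

    def min_edit_distance(target: str, source: str) -> int:
        """Minimum edit distance with insertion/deletion cost 1, substitution cost 2."""
        n, m = len(target), len(source)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i                               # i insertions
        for j in range(1, m + 1):
            d[0][j] = j                               # j deletions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                subst = 0 if target[i - 1] == source[j - 1] else 2
                d[i][j] = min(d[i - 1][j] + 1,        # insert target[i-1]
                              d[i][j - 1] + 1,        # delete source[j-1]
                              d[i - 1][j - 1] + subst)
        return d[n][m]

    def medr_accept(recognized: str, spoken: str) -> bool:
        """Minimum Edit Distance Revision: count near misses as detections."""
        return min_edit_distance(recognized, spoken) <= 3

    # e.g. medr_accept("kardemirzbe", "kardemirzde") -> True (one substitution, cost 2)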

The results got better, as expected. Figure 4.1 shows the results with and without Minimum Edit Distance Revision; since the number of detections is increased, the ROC curve moves closer to the upper left corner. The remaining figures are based on the MEDR-applied results unless otherwise stated.

Figure 4.1. ROC curves before/after applying Minimum Edit Distance Revision (MEDR)

4.4. Phoneme Durations

The HMMs we have used assign transition probabilities to each state to model temporal variability. This requires that each state be occupied during at least two adjacent frames in order to be part of the best path; however, it does not impose any maximum duration constraint on the states or the phonemes. We have developed a visual tool that displays the likelihood of a given word using the input speech signal and the state sequence.

During the N-best search we may choose to keep a list of phoneme information (phoneme index, end time and probability at that instant) for each token accumulated. Keeping phoneme information for all paths (some of which will be pruned after a few frames) would require extra memory and CPU time, so we have preferred to find the phoneme boundaries after the recognition has ended and the best path has been found. After finding the best path for the input speech signal, we create a temporary network containing only the recognized word and tell the recognizer to keep the phoneme information; as a result, the recognizer performs forced alignment for the specified word. We then compare the phoneme durations found with the average duration of each phoneme, shown in Table 4.1. If the difference is not small enough, we assign a penalty score to the phoneme. The formula is empirical and can be summarized as follows:

    difference = phoneme duration - average duration
    ratio = difference / average duration

    ratio > 4.0:          penalty score = difference * (multiplier not recovered)
    2.5 < ratio < 4.0:    penalty score = difference * 5000
    1.5 < ratio < 2.5:    penalty score = difference * 2000
    ratio < 0.2:          penalty score = difference * 2000
    0.2 < ratio < 1.5:    penalty score = 0

The penalty scores of the individual phonemes are added up, and this total gives us a confidence measure.

Table 4.1. Average phoneme durations in Turkish, for the phonemes /C/, /G/, /I/, /O/, /S/, /U/, /Z/, /a/, /b/, /c/, /d/, /e/, /f/, /g/, /h/, /i/, /j/, /k/, /l/, /m/, /n/, /o/, /p/, /r/, /s/, /t/, /u/, /v/, /y/, /z/ (the duration values were not recovered)
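Because the extracted formula above is damaged (the definition of the ratio and the multiplier of the top branch are ambiguous), the sketch below fixes one self-consistent reading: the ratio is taken as duration over average (so that very short and very long phonemes are penalized while near-average ones are not), the absolute deviation is used as the difference, and the ratio > 4.0 multiplier is an assumed placeholder. Treat it as illustrative rather than as the exact implementation.

    # Sketch of the empirical duration penalty. The extracted formula is
    # damaged, so three choices here are assumptions: ratio = duration/average,
    # difference = |duration - average|, and HIGH_RATIO_MULT standing in for
    # the lost ratio > 4.0 multiplier.
    HIGH_RATIO_MULT = 10000          # placeholder; the original value was lost

    def duration_penalty(duration: float, average: float) -> float:
        difference = abs(duration - average)
        ratio = duration / average
        if ratio > 4.0:
            return difference * HIGH_RATIO_MULT
        if 2.5 < ratio <= 4.0:
            return difference * 5000
        if 1.5 < ratio <= 2.5:
            return difference * 2000
        if ratio < 0.2:
            return difference * 2000
        return 0.0                   # 0.2 <= ratio <= 1.5: duration looks normal

    def word_duration_penalty(durations, averages):
        """Per-phoneme penalties are summed into the word-level score."""
        return sum(duration_penalty(d, a) for d, a in zip(durations, averages))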

Sometimes low penalty scores are found for meaningless results. This happens when a long word is matched to a shorter utterance: if the recognizer cannot find enough data to proceed to the end of the path, it completes the word anyway, a behavior intended to compensate for end-point errors. So at this point we have added a simple additional confidence measure: the total number of phonemes forwarded without input data, with a threshold denoted N. The situation can be seen in Figure 4.2: the recognizer claims that the speech signal corresponds to the word iszbankasizkurucu, but the waveform is unlikely to be that of such a long word. The number of forwarded phonemes is 9 in this case.

Figure 4.2. Forced alignment of the waveform for keyword iszbankasizkurucu

Figure 4.3 shows a typical case where the recognition result deserves a high penalty. Although the number of forwarded phonemes is only 1, the phoneme durations do not seem reasonable, and the recognizer has assigned a penalty score of 1571 to this recognition.

To model phoneme durations properly we should consider the effect of context, since the same phoneme may have different statistics in different contexts. Using the mean and variance of the phoneme durations is one choice, with the deviation from the average duration as the measure; however, the median of the durations may be a better statistic, and the percentile rank may be used instead of the standard deviation. These statistics should be derived from the durations found using forced alignment.

Figure 4.3. Forced alignment of the waveform for keyword milpa

Figure 4.4 shows ROC curves for various values of N, the threshold on the number of forwarded phonemes.

Figure 4.4. ROC curves for the phoneme duration based confidence measure
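Operationally, the forwarded-phoneme count and the duration penalty combine into a simple accept/reject rule. A sketch follows; both default thresholds are illustrative assumptions (N is tuned on the ROC curves of Figure 4.4, and the penalty cutoff is ours).

    # Sketch of a combined rejection rule; the default thresholds are
    # illustrative assumptions, not tuned values from the thesis.
    def accept_alignment(num_forwarded: int, duration_penalty_score: float,
                         max_forwarded: int = 5,
                         max_penalty: float = 1000.0) -> bool:
        """Reject if too many phonemes were forwarded without input data,
        or if the accumulated duration penalty is too high."""
        return (num_forwarded <= max_forwarded
                and duration_penalty_score <= max_penalty)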

If the recognizer is to be operated at low false alarm rates, a small threshold N should be selected. However, increasing the threshold beyond 5 does not improve the performance much. The optimum operating point for many applications is N = …

4.5. Garbage Model Using Same 512 Mixtures

During the N-best search the likelihoods of the triphones are calculated as weighted sums of the individual Gaussian mixture probabilities. Since we already have the probabilities of these mixtures, we can derive a garbage probability at each frame. This garbage modeling technique has been used for speaker recognition [10]. The garbage likelihood and the likelihood of ceylangiyim are shown in Figure 4.5. The shaded area between the two likelihood profiles may give an indication of a confidence measure; the question is how to convert this area into one.

Figure 4.5. Likelihood profiles for ceylangiyim and the proposed base garbage model

We have tried a few things. First of all, we should deal only with phonemes that are meaningful for us, so we do not use the triphones X-X+X or Z-Z+Z in our calculations; these correspond to long silence and short silence (inter-word silence), respectively. If the forced alignment is unsuccessful, we will not attempt to assign a confidence measure; instead we will directly assign a low number.
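A sketch of deriving the per-frame garbage score from the shared mixture pool follows. The pooling rule (taking the best-scoring mixture at each frame) is an assumption; the text only states that a garbage probability is obtained from the mixture probabilities already computed during the search.

    import numpy as np

    # Sketch: per-frame garbage log-likelihood pooled from the same 512 shared
    # Gaussian mixtures used by the triphone states. Max-pooling over mixtures
    # is an assumed choice; the exact pooling rule is not spelled out here.
    def garbage_loglik(mixture_logliks: np.ndarray) -> np.ndarray:
        """mixture_logliks: array of shape (num_frames, 512) holding the
        log-likelihood of each shared mixture at each frame.
        Returns one garbage log-likelihood per frame."""
        return mixture_logliks.max(axis=1)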

Since a wrong alignment leads us to wrong decisions about the likelihood ratio, we again penalize too much deviation from the average phoneme duration; the penalty method chosen here is to decrease the probability of that phoneme artificially. Since the shaded area corresponds to the log difference of the two likelihood profiles shown, it is reasonable to take the per-frame differences and add them up; this gives the likelihood ratio of the whole path. Still, we have tried a few variations to check that this is really the best measure: we take the difference at each frame, weight it according to the characteristics of the phoneme, raise it to some power, sum over all frames, and finally normalize by the accumulated weights. Mathematically,

S = \frac{\sum_{i=1}^{T} \gamma_i \left( p(w_i) - p(g_i) \right)^{\beta}}{\sum_{i=1}^{T} \gamma_i}    (4.1)

where T is the total number of frames, p(w_i) and p(g_i) are the word and garbage log likelihoods at frame i, and γ_i is a weighting coefficient equal to γ for the phonemes /a/, /e/, /i/, /I/, /o/, /O/, /u/, /U/, /S/, /s/, /z/ and to 1 for all other phonemes. We expect the recognizer to discriminate these phonemes better than the rest, so we have given them γ values greater than or equal to 1. Unfortunately we did not obtain any improvement; Figure 4.6 illustrates this result. The ROC curves correspond to the results obtained by selecting β = 1 and trying the values 1, 2 and 3 for γ. All curves are very close.
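A sketch of the score of Equation (4.1) follows; the per-frame γ weights are assumed to be expanded from the phoneme alignment before the call, and β is kept to the integer values tried in the experiments.

    import numpy as np

    # Sketch of Equation (4.1): a gamma-weighted, beta-powered, normalized sum
    # of per-frame log-likelihood differences. gamma holds one weight per
    # frame, expanded from the phoneme alignment (gamma for the emphasized
    # phonemes, 1 elsewhere); beta in {1, 2, 3} as in the experiments.
    def confidence_score(p_word: np.ndarray, p_garbage: np.ndarray,
                         gamma: np.ndarray, beta: int = 1) -> float:
        diff = p_word - p_garbage    # per-frame log-likelihood ratio
        return float(np.sum(gamma * diff ** beta) / np.sum(gamma))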

Figure 4.6. ROC curves for different emphasis values (γ) with the power value β = 1

Figure 4.7. ROC curves for different power values (β) with the emphasis set to γ = 1

After this experiment we attempted to change the value of β while keeping γ = 1, since changing the value of γ did not matter. The best value was again β = 1, which means using the difference according to its real meaning (a difference of log likelihoods is a likelihood ratio). Figure 4.7 illustrates this case.

4.6. Comparison of Confidence Measures

Finally, we compare the confidence measure obtained using phoneme durations alone with the one that combines the 512-mixture garbage model with phoneme durations. Since the latter method adds the likelihood ratio score on top of the duration constraint, it gives a better result; Figure 4.8 clearly shows the increase in performance.

Figure 4.8. ROC curves for the phoneme duration based confidence measure and for the confidence measure with likelihood ratio scoring included

Performing the phoneme alignment after the recognition step costs extra time. The average recognition times required are shown in Table 4.2.

Table 4.2. Computation time required with/without phoneme duration evaluation

    Recognition                                  1.35 seconds
    Recognition + monophone based rejection     1.62 seconds
    Recognition + phoneme duration evaluation   1.62 seconds

5. CONCLUSION

In this study, we have examined the performance of monophone models of the keywords as garbage models. The improvement over the one-state general garbage model is very small in keyword spotting tasks. However, there is a significant increase in the performance of the monophone models of the keywords over garbage models when the finite-state keyword spotter is used as a post-processor to an isolated word recognizer.

Several confidence measures were evaluated for isolated word recognition. The best performance was achieved when phoneme duration information and average phoneme likelihoods were used together, with the same pool of Gaussian functions reused for the garbage likelihood.

The duration based confidence measure can be examined in more detail. Since the duration of a phoneme depends on the context in which it is used, the context information should be exploited. The minimum and maximum durations of each phoneme could be used instead of the average durations, and durations found using forced alignment are more appropriate for deriving the phoneme statistics. Finally, the monophone models of the phonemes could be used as the garbage model for likelihood ratio scoring.

APPENDIX A: SENTENCES USED FOR KEYWORD SPOTTING

1. Sabah grubu içinde yer alan sabah pazarlamanın sabah otomobilden ve medya holdingten çeşitli alacaklarının bulunduğu tespit edildi
2. Aradan zaman geçiyor. ne olup ne bittiğini merak eden Ankara polonya büyükelçiliğinden ses çıkmadığını görünce varşovaya soruyor
3. Taşların artık yerine oturmaya başladığını söyleyen Rahmi Koç gelecek dönemde çok daha itibarlı bir Türkiyenin ortaya çıkacağını söyledi
4. Fenerbahçe teknik direktörü rüştünün tıbba göre kupa finalinde oynaması imkansız değil dedi ve ekledi: maça daha iki gün var, son dakikaya kadar bekleyeceğim
5. Kahvenin egzersiz sırasında adelelerin tükettiği gılikojen miktarını azaltararak daha uzun süre egzersiz yapmaya yardımcı olduğu belirtiliyor
6. Konuyla ilgili araştırma haberini internetten yayınlayan riyıl eyç gurubu spor yapan insanların gerçek yaşlarından dokuz yaş daha genç kalabileceklerine değiniyor
7. Türkiye ihtiyacı olan dış katkıyı gerektiği oranda bulmak için zorlanacak
8. ikinci el otomobil piyasasında şubat ayında yaşanan devalüasyonun ardından satışlar durma noktasına geldi
9. internet üzerinden düzenlenen pejo dizayn yarışmasının kazanan tasarımcısı cenevre otomobil fuarında açıklandı
10. Oldukça cesur ve yenilikçi bir şekilde dizayn edilen model şık kelimesini hak ediyor. Dizayn halihazırda bütün bu özellikleriyle pejonun internet sitesini ziyaret eden otomobil meraklılarının kalbini kazanmayı başardı
11. Yarışmaya en fazla ilginin geldiği ülkeler arasında fıransa, rusya, italya ve kolombiyanın yanısıra Türkiye de bulunuyor
12. Barselona pirelli en ileri teknoloji ile ürettiği yeni yüksek performans lastikleri pe altı ve pe yediyi piyasaya sunuyor
13. Beyaz enerji operasyonu kapsamında gündeme gelen rüşvet çarkına adı karışanları zor günler bekliyor
14. Dekoder sahiplerinin muhatabı teleondan önce futbol federasyonu olmalıdır dedi. çünkü anlaşmanın koşullarını koyan federasyondur
15. Silah avadanlık ve mühimmat hakkında personele gerekli açıklamaların

yapılmasından sonra eğitici tatbiki olarak günlük haftalık bakımlar ile atış menzili bakımlarının uygulanmasına geçer
16. Savcı dosya münderecatına göre suçu işlediği sabit olan abuzettin beyin tecekanın bilmem kaçıncı maddesine göre cezalandırılması talep olunur dedi
17. İstanbul memorial hastanesince düzenlenen insan gen haritası projesi embriyoda veya erken gebelik döneminde genetik tanı başlıklı toplantıya katılan halifa türkiyede ilk defa embriyolarda genetik inceleme tekniği başlattıklarını belirtti
18. Yargıtay bozma kararını af yasasının çıkmasından üç gün önce vererek şahinin davanın ertelenmesi olanağından faydalanmasını da önledi
19. Göstericiler vaşingtın post gazetesi binasını da çürük meyve yağmuruna tuttu ve suçlu buş katil buş sıloganları attı
20. Hazine enflasyon hedefi kadar zam önerisine memur ve işçi arasında giderek büyüyen ücret uçurumunu gerekçe gösterirken türkiş bu yaklaşımı kışkırtma olarak nitelendirdi

APPENDIX B: MINIMUM EDIT DISTANCE ALGORITHM

function MIN-EDIT-DISTANCE(target, source) returns min-distance
  n ← LENGTH(target)
  m ← LENGTH(source)
  Create a distance matrix distance[n+1, m+1]
  distance[0, 0] ← 0
  for each column i from 1 to n do
    distance[i, 0] ← distance[i-1, 0] + ins-cost(target_i)
  for each row j from 1 to m do
    distance[0, j] ← distance[0, j-1] + del-cost(source_j)
  for each column i from 1 to n do
    for each row j from 1 to m do
      distance[i, j] ← MIN( distance[i-1, j] + ins-cost(target_i),
                            distance[i-1, j-1] + subst-cost(source_j, target_i),
                            distance[i, j-1] + del-cost(source_j) )
  return distance[n, m]

REFERENCES

1. Deller, J. R., J. H. L. Hansen and J. G. Proakis, Discrete-Time Processing of Speech Signals, IEEE Press.
2. Young, S., "A Review of Large-Vocabulary Continuous-Speech Recognition," IEEE Signal Processing Magazine, September 1996.
3. Rabiner, L. R., "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, 77(2), pp. 257-286, February 1989.
4. Sproat, R., Multilingual Text-to-Speech Synthesis: The Bell Labs Approach, Chapter 5, Kluwer Academic Publishers, Massachusetts.
5. Hermansky, H., "Perceptual Linear Predictive (PLP) Analysis for Speech," Journal of the Acoustical Society of America, 87(4), pp. 1738-1752, April 1990.
6. Markel, J. D. and A. H. Gray, Linear Prediction of Speech, Springer-Verlag, Berlin, 1976.
7. Davis, S. B. and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-28, pp. 357-366, August 1980.
8. Furui, S., "Speaker-Independent Isolated Word Recognition Using Dynamic Features of the Speech Spectrum," IEEE Transactions on Acoustics, Speech and Signal Processing, 29(1).
9. Applebaum, T. H. and B. A. Hanson, "Regression Features for Recognition of Speech in Quiet and in Noise," Proceedings of the 1989 International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, May 1989.
10. Furui, S., Digital Speech Processing, Synthesis and Recognition, Second Edition, Revised and Expanded, Marcel Dekker, New York.
11. Huang, X. D., Y. Ariki and M. Jack, Hidden Markov Models for Speech Recognition, Edinburgh University Press, 1990.
12. Rabiner, L. R. and B. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.
13. Forney, G. D., "The Viterbi Algorithm," Proceedings of the IEEE, Vol. 61, pp. 268-278, March 1973.
14. Digalakis, V. and H. Murveit, "Genones: Optimizing the Degree of Mixture Tying in a Large Vocabulary Hidden Markov Model Based Speech Recognizer," Proceedings of the 1994 International Conference on Acoustics, Speech, and Signal Processing, Adelaide, Australia, April 1994.
15. Shukat-Talamazzini, E. G., H. Niemann, W. Eckert, T. Kuhn and S. Rieck, "Acoustic Modeling of Sub-Word Units in the ISADORA Speech Recognizer," Proceedings of the 1992 International Conference on Acoustics, Speech, and Signal Processing, San Francisco, March 1992.
16. Yapanel, U., Garbage Modeling Techniques for a Turkish Keyword Spotting System, M.S. Thesis, Boğaziçi University.
17. Bahl, L. R., P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo and M. A. Picheny, "Decision Trees for Phonological Rules in Continuous Speech," Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing, Toronto, May 1991.
18. Hon, H.-W. and K.-F. Lee, "CMU Robust Vocabulary-Independent Speech Recognition System," Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing, Toronto, May 1991.
19. Ney, H., D. Mergel, A. Noll and A. Paesler, "Data Driven Search Organization for Continuous Speech Recognition," IEEE Transactions on Signal Processing, 40(2), February 1992.
20. Young, S. J., N. H. Russell and J. H. S. Thornton, "Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems," Technical Report, Cambridge University Engineering Department, July 1989.
21. Peterson, W. W., T. G. Birdsall and W. C. Fox, "The Theory of Signal Detectability," IRE Transactions on Information Theory, PGIT-4, pp. 171-212, September 1954.
22. Jurafsky, D. and J. H. Martin, Speech and Language Processing, Prentice-Hall, Upper Saddle River, NJ, 2000.
23. Levenshtein, V. I., "Binary Codes Capable of Correcting Deletions, Insertions, and Reversals," Cybernetics and Control Theory, 10(8), pp. 707-710, 1966.
