ACOUSTIC KEYWORD SPOTTING IN SPEECH WITH APPLICATIONS TO DATA MINING


Speech and Audio Research Laboratory of the SAIVT program
Centre for Built Environment and Engineering Research

A. J. Kishan Thambiratnam
BE(Electronics)/BInfTech

Submitted as a requirement of the degree of Doctor of Philosophy at Queensland University of Technology, Brisbane, Queensland

9 March 2005


Keywords

Keyword Spotting, Wordspotting, Data Mining, Audio Indexing, Keyword Verification, Confidence Scoring, Speech Recognition, Utterance Verification


Abstract

Keyword Spotting is the task of detecting keywords of interest within continuous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unrestricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have suffered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability.

This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the field of keyword spotting.

The first major contribution is the development of a novel keyword verification method named Cohort Word Verification. This method combines high level linguistic information with cohort-based verification techniques to obtain dramatic improvements in verification performance, in particular for the problematic short duration target word class.

The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique augments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains a significant improvement in detection rate over lattice-based audio document indexing while still maintaining extremely fast search speeds.

The third major contribution is the study of multiple verifier fusion for the task of keyword verification. The reported experiments demonstrate that substantial improvements in verification performance can be obtained through the fusion of multiple keyword verifiers. The research focuses on combinations of speech background model based verifiers and cohort word verifiers.

The final major contribution is a comprehensive study of the effects of limited training data for keyword spotting. This study is performed with consideration as to how these effects impact the immediate development and deployment of speech technologies for non-English languages.

Contents

Keywords
Abstract
List of Tables
List of Figures
List of Abbreviations
Authorship
Acknowledgments

1 Introduction
    Overview
    Aims and Objectives
    Research Scope
    Thesis Organisation
    Major Contributions of this Research
    List of Publications

2 A Review of Keyword Spotting
    Introduction
    2.2 The keyword spotting problem
    Applications of keyword spotting
    Keyword monitoring applications
    Audio document indexing
    Command controlled devices
    Dialogue systems
    The development of keyword spotting
    Sliding window approaches
    Non-keyword model approaches
    Hidden Markov Model approaches
    Further developments
    Performance Measures
    The reference and result sets
    The hit operator
    Miss rate
    False alarm rate
    False acceptance rate
    Execution time
    Figure of Merit
    Equal Error Rate
    Receiver Operating Characteristic Curves
    Detection Error Trade-off Plots
    Unconstrained vocabulary spotting
    HMM-based approach
    Neural Network Approaches
    Approaches to non-keyword modeling
    Speech background model
    Phone models
    2.7.3 Uniform distribution
    Online garbage model
    Constrained vocabulary spotting
    Language model approaches
    Event spotting
    Keyword verification
    A formal definition
    Combining keyword spotting and verification
    The problem of short duration keywords
    Likelihood ratio based approaches
    Alternate Information Sources
    Audio Document Indexing
    Limitations of the Speech-to-Text Transcription approach
    Reverse dictionary lookup searches
    Indexed reverse dictionary lookup searches
    Lattice based searches

3 HMM-based spotting and verification
    Introduction
    The confusability circle framework
    Analysis of non-keyword models
    All-speech models
    SBM methods
    Phone-set methods
    Target-word-excluding methods
    Evaluation of keyword spotting techniques
    Experiment setup
    3.4.2 Results
    Tuning the phone set non-keyword model
    Output score thresholding for SBM spotting
    Performance across keyword length
    Evaluation sets
    Results
    HMM-based keyword verification
    Evaluation set
    Evaluation procedure
    Results
    Discriminative background model KV
    System architecture
    Results
    Summary and Conclusions

4 Cohort word keyword verification
    Introduction
    Foundational concepts
    Cohort-based scoring
    The use of language information
    Overview of the cohort word technique
    Cohort word set construction
    The choice of d_min and d_max
    Cohort word set downsampling
    Distance function
    Classification approach
    class classification approach
    4.5.2 Hybrid N-class approach
    Summary of the cohort word algorithm
    Comparison of classifier approaches
    Evaluation set
    Recogniser parameters
    Cohort word selection
    Evaluation procedure
    Results
    Performance across target keyword length
    Evaluation set
    Recogniser parameters
    Results
    Analysis of poor 8-phone performance
    Conclusions
    Effects of selection parameters
    Cohort word set downsampling
    Cohort word selection range
    MED cost parameters
    Conclusions
    Fused cohort word systems
    Training dataset
    Neural network architecture
    Experimental procedure
    Baseline unfused results
    Fused SBM-CW experiments
    Fused CW-CW experiments
    Comparison of fused and unfused systems
    Conclusions and Summary

5 Dynamic Match Lattice Spotting
    Introduction
    Motivation
    Dynamic Match Lattice Spotting method
    Basic method
    Optimised Dynamic Match Lattice Search
    Evaluation of DMLS performance
    Evaluation set
    Recogniser parameters
    Lattice building
    Query-time processing
    Baseline systems
    Evaluation procedure
    Results
    Analysis of dynamic match rules
    System configurations
    Results
    Analysis of DMLS algorithm parameters
    Number of lattice generation tokens
    Pruning beamwidth
    Number of lattice traversal tokens
    MED cost threshold
    Tuned systems
    Conclusions
    Conversational telephone speech experiments
    Evaluation set
    Recogniser parameters
    5.7.3 Results
    Non-destructive optimisations
    Prefix sequence optimisation
    Early stopping optimisation
    Combining optimisations
    Optimised system timings
    Experimental procedure
    Results
    Summary

6 Non-English Spotting
    Introduction
    The issue of limited resources
    The role of keyword spotting
    Experiment setup
    Database design
    Model architectures
    Evaluation set design
    Evaluation procedure
    English and Spanish stage 1 evaluations
    English and Spanish post keyword verification
    Indonesian spotting and verification
    Extrapolating Indonesian performance
    Summary and Conclusions

7 Summary, Conclusions and Future Work
    HMM-based Spotting and Verification
    Conclusions
    7.1.2 Future Work
    Cohort Word Verification
    Conclusions
    Future Work
    Dynamic Match Lattice Spotting
    Conclusions
    Future Work
    Non-English Spotting
    Conclusions
    Final Comments

Bibliography

A The Levenstein Distance
    A.1 Introduction
    A.2 Applications
    A.3 Algorithm

List of Tables

3.1 Keyword spotting performance of baseline systems on Switchboard 1 data
Effect of target word insertion penalty on PM-KS performance
Equal error rates of unnormalised and duration normalised output score thresholding applied to SBM-KS
Details of phone-length dependent evaluation sets
SBM-KS performance on Switchboard 1 data for different phone-length target words
Statistics for keyword verification evaluation sets
Equal error rates for SBM-based keyword verification
Equal error rates for SBM and MLP-SBM keyword verification
Evaluated cohort word selection parameters
Performance of selected cohort word KV systems on TIMIT evaluation set. Cohort word systems are qualified with the appropriate cohort word selection parameters using a tag in the format {d_min, d_max, ψ_d, ψ_i}
Performance of SBM-KV and selected cohort word systems on the SWB1 evaluation sets. Cohort word selection parameters are specified with each system in the format {d_min, d_max, ψ_d, ψ_i}
4.4 Mean and standard deviation of the number of cohort words used in the 3 best performing cohort word KV methods for the SWB1 evaluation set
Performance of baseline SBM-KV and best cohort word systems on the SWB1 evaluation sets
Performance of the best fused SBM-cohort systems on the SWB1 evaluation sets
Performance of the best fused cohort-cohort systems on the SWB1 evaluation sets
Correlation analysis of fused EER and individual unfused EER
Summary of best performing systems
Phone substitution costs for DMLS
Baseline keyword spotting results evaluated on TIMIT
TIMIT performance when isolating various DP rules
Effect of adjusting number of lattice generation tokens
Effect of adjusting pruning beamwidth
Effect of adjusting number of traversal tokens
Effect of adjusting MED cost threshold S_max
Optimised DMLS configurations evaluated on TIMIT
Keyword spotting results on SWB
Relative speeds of optimised DMLS systems
Performance of a fully optimised DMLS system on Switchboard data
Summary of key results
Summary of training data sets
Codes used to refer to model architectures
Summary of evaluation data sets
Stage 1 spotting rates for various model sets and database sizes
6.5 Equal error rates after keyword verification for various model sets and training database sizes
Stage 1 spotting and stage 2 post verification results for S1I experiments


List of Figures

2.1 An example of a Receiver Operating Characteristic curve
An example of a Detection Error Trade-off plot
Recognition grammar for HMM-based keyword spotting
Sample recognition grammar for small non-keyword vocabulary keyword spotting
System architecture for HMM keyword spotting using a Speech Background Model as the non-keyword model
System architecture for HMM keyword spotting using a composite non-keyword model constructed from phone models
Constructing a recognition network for constrained vocabulary keyword spotting
An optimised constrained vocabulary keyword spotting recognition network (language model probabilities omitted)
An event spotting network for detecting occurrences of times [16]
Likelihood ratio based keyword occurrence verification with multiple verifier fusion
Applying reverse dictionary searches to the detection of the word ACQUIRE in a phone stream
Example of indexed reverse dictionary searching for the detection of the word ACQUIRE
2.13 Using lattice based searching to locate instances of the word ACQUIRE within a phone lattice
Confusability circle for the target word STOCK
Example of the shared subevent confusable acoustic region for the keyword STOCK
Incorporating target word insertion penalty into HMM-based keyword spotting
DET plots for unnormalised and duration normalised output score thresholding applied to SBM-KS
DET plots for duration normalised output score thresholding applied to SBM-KS for keyword length dependent evaluation sets
DET plots for different target keyword lengths for SBM-KV on Switchboard 1 evaluation sets
System architecture for MLP background model based KV
DET plots for SBM and MLP-SBM systems for 4-phone words
DET plots for SBM and MLP-SBM systems for 6-phone words
DET plots for SBM and MLP-SBM systems for 8-phone words
Controlling the degree of CAR region modeling d_min and d_max tuning
A N-class classifier approach to cohort word verification for the keyword w and cohort word set R(w)
DET plot for best cohort word and SBM-KV systems on SWB1 4-phone length evaluation set
DET plot for best cohort word and SBM-KV systems on SWB1 6-phone length evaluation set
Equal error rate versus mean number of cohort words
Trends in equal error rate with changes in cohort word set downsampling size
4.7 Trends in equal error rate with changes in cohort word selection range for 4-phone length cohort word KV
Trends in equal error rate with changes in cohort word selection range for 6-phone length cohort word KV
Trends in equal error rate with changes in cohort word selection range for 8-phone length cohort word KV
Trends in equal error rate with changes in MED cost parameters
Correlation between unfused system performances and fused system performances
Boxplot of EERs for all evaluated architectures and phone-lengths
Boxplot of log(EERs) for all evaluated architectures and phone-lengths
Segment of phone lattice for an instance of the word STOCK
Effect of lattice traversal token parameter
Trends in miss rate and FA/kw rate performance for various types of tuning
Plot of miss rate versus FA/kw rate for HMM, CLS and DMLS systems evaluated on Switchboard
The relationship between cost matrices for subsequences
Demonstration of the MED prefix optimisation algorithm
Effect of training dataset size on speech recognition [24]
Trends in miss rate across training database size
Trends in FA/kw rate across training database size
DET plot for T16 experiments. 1=T16S3E, 2=T16S2E, 3=T16S1E, 4=T16S2S, 5=T16S1S
DET plot for M16 experiments. 1=M16S3E, 2=M16S2E, 3=M16S1E, 4=M16S2S, 5=M16S1S
6.6 DET plot for M32 experiments. 1=M32S3E, 2=M32S2E, 3=M32S1E, 4=M32S2S, 5=M32S1S
Trends in EER across training dataset size
DET plot for S2S experiments. 1=T16S2S, 2=M16S2S, 3=M32S2S
DET plot for S1I experiments. 1=T16S1I, 2=M16S1I, 3=M32S1I
Extrapolations of Indonesian keyword spotting performance using larger sized databases
A.1 Example of cost matrix calculated using Levenstein algorithm for transforming deranged to hanged. Cost of substitutions, deletions and insertions all fixed at 1, cost of match fixed at

List of Abbreviations

ADI      Audio Document Indexing
CAR      Confusable Acoustic Region
CLS      Conventional Lattice-based Spotting
CMS      Cepstral Mean Subtraction
CW       Cohort Word
DAR      Disparate Acoustic Region
DET      Detection Error Trade-off
DMLS     Dynamic Match Lattice Spotting
EER      Equal Error Rate
FA       False Alarm
GMM      Gaussian Mixture Model
HMM      Hidden Markov Model
IRDL     Indexed Reverse Dictionary Lookup
KS       Keyword Spotting
KV       Keyword Verification
LVCSR    Large Vocabulary Continuous Speech Recognition
MED      Minimum Edit Distance
MLP      Multi-Layer Perceptron
PLP      Perceptual Linear Prediction
RDL      Reverse Dictionary Lookup
ROC      Receiver Operating Characteristic
SBM      Speech Background Model
SBM-KS   Speech Background Model based Keyword Spotting
SBM-KV   Speech Background Model based Keyword Verification
STT      Speech-to-Text Transcription
SWB1     Switchboard-1
TAR      Target Acoustic Region
WSJ1     Wall Street Journal 1

Authorship

The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher educational institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed:
Date:


Acknowledgments

Foremost I would like to acknowledge my Lord and Saviour Jesus Christ. It is by His grace that I was given the opportunity and necessary abilities to partake in this research.

I would also like to thank my beautiful wife, Melenie, who has been a constant source of support and inspiration. Your words of encouragement have seen me through the more difficult and frustrating times of this work.

To my supervisor, Professor Sridha Sridharan, I would like to offer my heartfelt gratitude for your unrelenting support in bringing this research to completion. Your positive words and guidance have been a true blessing.

I would also like to offer a special thanks for the friendship of the members of the QUT Speech Research Labs. In particular, I would like to thank Terry Martin, Robbie Vogt, Michael Mason and Brendan Baker for their constructive criticism as well as their constant joviality.

Finally, I would like to thank my loving two families for believing in and supporting me during this long venture, and my wonderful dogs for always giving me a reason to smile.

Kit Thambiratnam
Queensland University of Technology
February 2005


Chapter 1

Introduction

1.1 Overview

Keyword Spotting (KS) is the automated task of detecting keywords of interest within continuous speech. This technology has been used in a variety of applications, ranging from telephone call centre systems to covert surveillance applications.

Keyword spotting is closely related to the task of speech transcription, but offers many advantages for certain applications. Primarily, keyword spotting is well suited to data mining tasks that process large amounts of speech. This is because keyword spotting requires significantly less processing power than transcription, and can therefore run at considerably faster speeds.

Real-time stream monitoring is one example where this speed is required. These applications monitor audio in real-time and flag occurrences of segments of interest, such as news stories related to a specific topic. Clearly, the majority of the stream does not require attention, and therefore a keyword spotting solution that simply detects occurrences of topical keywords will be more efficient than a fully-fledged large vocabulary transcription engine.

Keyword spotting is also an excellent technology for audio search applications, such as audio document indexing. In particular, recent developments in KS including lattice-based searching and reverse dictionary lookup methods have made possible the development of unrestricted vocabulary audio document database search engines that can search hours of data in seconds.

However, many keyword spotting technologies are encumbered by poor detection performance or slow search speeds. There is a trade-off between accuracy and speed that needs to be managed, and unfortunately, to date, many practical keyword spotting applications are forced to sacrifice detection performance to realise the execution speeds required for commercial deployment. One has only to use speech-recognition-enabled telephony services such as telephone banking to conclude that these systems are far from perfect.

Nevertheless, keyword spotting is a powerful and relevant technology. Used appropriately, a keyword spotting solution brings with it reduced computational requirements, increased scalability and potentially higher accuracies than a large vocabulary transcription system.

Aims and Objectives

This work specifically examines the application of keyword spotting technologies to two data mining tasks: real-time keyword monitoring and large audio document database indexing. With the ever-increasing amounts of audio and multimedia being generated daily, the ability to extract information from audio streams at high speeds while maintaining good detection rates is paramount.

A desirable feature of data mining applications is support for unrestricted vocabulary keyword queries. However, a significant portion of past keyword spotting research has dealt primarily with restricted vocabulary methods. Although these approaches offer advantages in terms of detection and false alarm performance, they limit the flexibility of queries. As such, this work concerns itself solely with the study of unrestricted vocabulary keyword spotting techniques.

Data throughput is another major consideration when dealing with large amounts of data. Although the cost of computing is constantly falling, it is nevertheless beneficial to run at high speeds. This is particularly true for audio indexing applications, where literally hundreds of hours may need to be interactively searched by a user. Unfortunately, many published KS works neglect to consider execution time during experimentation. This research will therefore give considerable attention to the issue of processing speed.

The primary objectives of this thesis are as follows:

1. To review and investigate current state-of-the-art keyword spotting techniques that are relevant to the tasks of real-time keyword monitoring and audio document indexing

2. To assess and evaluate the performance of these techniques with regard to crucial performance metrics relevant to the target applications, and as such, identify potential issues that need to be addressed

3. To investigate and develop novel techniques that can be used to improve the performance of keyword spotting techniques for data mining applications

4. To investigate the application of keyword spotting technologies for non-English data mining

Research Scope

Keyword spotting encompasses a plethora of speech recognition research topics that unfortunately cannot be fully addressed in a single work. As such, the scope of this research was limited to issues that were directly related to the application of keyword spotting to real-time keyword monitoring and audio document indexing.

Additionally, the following restrictions and constraints were applied to this research:

1. Primarily this work concerns itself with the application of HMM-based speech recognition techniques to the keyword spotting task. Alternate statistical modeling approaches, such as neural network techniques, have been proposed and demonstrated to be suitable for keyword spotting. However, it is believed that the HMM-based approach provides a greater degree of flexibility, particularly with regard to unrestricted vocabulary tasks, and as such is the modeling architecture of choice for this research.

2. Experiments reported within this work are limited to single keyword detection. Although most practical applications of keyword spotting use multi-word detection during a single pass, it is believed that research constrained to single keyword detection offers a number of advantages. Primarily, it allows ease of comparison between results in this thesis and other published works. Additionally, the variability in performance due to different mixtures of words within a multi-word keyword set can be avoided, thereby ensuring greater consistency between experiments. Finally, it is believed that trends in single keyword spotting across methods will easily translate to multi-word keyword spotting tasks, and as such, this constraint does not limit the value of this research.

1.2 Thesis Organisation

An overview of the organisation of this thesis is given below:

Chapter 2 - A Review of Keyword Spotting presents a thorough review of keyword spotting and associated technologies. A formal definition of the keyword spotting problem is given, as well as a discussion of its primary applications. This is followed by an overview of the key performance metrics that are relevant to evaluating and understanding keyword spotting methodology. A detailed review of KS literature is then presented covering the topics of unrestricted and restricted spotting techniques, non-keyword modeling architectures, keyword verification and confidence scoring methods, and audio indexing approaches.

Chapter 3 - HMM-based Spotting and Verification discusses and evaluates existing HMM-based keyword spotting and verification techniques. Such methods have a strong following within the keyword spotting community. However, to date, there has been little published work that compares the performances of the various approaches. What little has been published has primarily focused on measuring performance for simplistic domains such as read microphone speech. A number of HMM-based techniques are evaluated in this chapter and the strengths and weaknesses of these methods are discussed.

Chapter 4 - Cohort Word Verification proposes a novel keyword verification approach that combines high level linguistic information with cohort-based verification techniques to yield improved performance. A number of experiments are reported to measure the performance of this method for the conversational telephone speech and read microphone speech domains. The results demonstrate that significant gains can be obtained, particularly for the difficult task of short-word keyword verification. In addition, experiments are performed using a fused architecture that combines cohort word verification with traditional background model based verification. Further gains in performance are obtained using this approach.

Chapter 5 - Dynamic Match Lattice Spotting presents and evaluates a novel audio indexing technique. Although existing unrestricted audio indexing methods are capable of very fast search speeds, they are encumbered by very poor miss rate performance. It is argued here that this poor miss rate is a result of inherent phone recogniser errors that are not accommodated by these techniques. As such, a new method of lattice-based searching is proposed that incorporates dynamic sequence matching methods to provide robustness against erroneous lattice realisations. The results demonstrate that dramatic gains in performance can be obtained while still maintaining extremely fast search speeds.

Chapter 6 - Non-English Spotting studies the application of keyword spotting technologies to non-English languages. In particular, it examines the effects of limited training data on keyword spotting performance. The lack of availability of non-English training data has greatly hindered the development of other speech technologies such as large vocabulary speech transcribers. However, keyword spotting is a significantly more constrained task, and therefore may be less affected by reduced amounts of training data. If so, this may allow the immediate development of speech technologies for non-English languages without the need for the costly task of creating large training databases.

Chapter 7 - Summary, Conclusions and Future Work presents the summary and conclusions of this work as well as a discussion of future research directions.

1.3 Major Contributions of this Research

This work has generated a number of novel contributions to the field of keyword spotting. These are:

1. The development of the novel Cohort Word Verification technique. This method combines high level linguistic knowledge with cohort-based verification techniques to yield significant improvements, particularly for the problematic area of short-word keyword verification.

2. The use of multiple keyword verifier fusion, in particular applied to the combination of cohort word verification with existing HMM-based techniques. It is demonstrated that such fusion techniques allow the strengths of individual verifiers to be combined to yield considerable improvements in verification performance.

3. The development of the novel Dynamic Match Lattice Spotting approach. This technique augments existing lattice-based audio indexing techniques with dynamic sequence matching to improve robustness to erroneous lattice realisation. The resulting algorithm is capable of searching hours of speech using seconds of processor time while maintaining good miss and false alarm rates.

4. A detailed study of the effects of limited training data for keyword spotting, as well as how this impacts the immediate development and deployment of speech technologies for non-English languages.

1.4 List of Publications

The research presented in this thesis has resulted in the publication of a number of fully referenced, peer-reviewed works.

1. K. Thambiratnam and S. Sridharan. Isolated word verification using Cohort Word-level Verification, in Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), (Geneva, Switzerland), 2003

2. K. Thambiratnam and S. Sridharan. A study on the effects of limited training data for English, Spanish and Indonesian keyword spotting, in Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST), (Sydney, Australia)

3. T. Martin, K. Thambiratnam and S. Sridharan. Target Structured Cross Language Model Refinement, in Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST), (Sydney, Australia)

4. K. Thambiratnam and S. Sridharan. Fusion of cohort-word and speech background model based confidence scores for improved keyword confidence scoring and verification, in Proceedings of the IEEE 3rd International Conference on Sciences of Electronic, Technologies of Information and Telecommunications, (Susa, Tunisia)

5. K. Thambiratnam and S. Sridharan. Dynamic match phone-lattice searches for very fast and accurate unrestricted vocabulary keyword spotting, in Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), (Philadelphia, USA), 2005

Chapter 2

A Review of Keyword Spotting

2.1 Introduction

This chapter presents a comprehensive review of keyword spotting technologies to date. Section 2.2 gives a formal definition of the keyword spotting problem and is followed by a discussion of the various applications of keyword spotting in section 2.3. A brief synopsis of the development of keyword spotting research is provided in section 2.4, and a detailed description of how keyword spotting performance is measured is given in section 2.5.

Subsequent sections discuss the current methods of keyword spotting with respect to their key applications. Section 2.6 discusses a number of algorithms for unconstrained vocabulary keyword spotting. This is followed by a description of the various approaches to non-keyword modeling in section 2.7. Approaches to constrained vocabulary keyword spotting are then presented in section 2.8, followed by methods for keyword occurrence verification in section 2.9. Finally, methods of applying KS to the task of audio document indexing are discussed.

2.2 The keyword spotting problem

Keyword spotting can be viewed as a special case of Speech-to-Text Transcription (STT), in which the transcription vocabulary is restricted to keywords of interest plus a non-keyword symbol that is used to represent all other words in the target application domain.

Let $O$ be an observation sequence, $V$ be the vocabulary of the target application domain, $Q$ be the set of keywords of interest and $\Omega$ be the non-keyword symbol. If STT is represented as the transformation $W = \mathrm{Transcribe}(O, V)$, where $W = \{w_1, w_2, \ldots\}$ is the resulting hypothesised word sequence, then the keyword spotting task can be defined as

\[
\mathrm{KS}(O, V, Q) = f(\mathrm{Transcribe}(O, V), Q) \tag{2.1}
\]

where $f(W, Q)$ is a transformation applied to the output of STT and is given by

\[
f(W, Q) =
\begin{cases}
W & |W| = 1,\ w_1 \in Q \\
\{\Omega\} & |W| = 1,\ w_1 \notin Q \\
\{w_1, f(\mathrm{Tail}(W), Q)\} & |W| > 1,\ w_1 \in Q \\
\{\Omega, f(\mathrm{Tail}(W), Q)\} & |W| > 1,\ w_1 \notin Q,\ w_2 \in Q \\
f(\mathrm{Tail}(W), Q) & \text{otherwise}
\end{cases}
\]

and

\[
\mathrm{Tail}(\{x_i\}_{i=1}^{N}) = \{x_i\}_{i=2}^{N}
\]

$f(W, Q)$ essentially replaces every sequence of non-keywords in the word sequence output by the transcriber with a single non-keyword symbol $\Omega$.

Although valid, this formulation of keyword spotting is inefficient as it requires full transcription using a vocabulary of size $|V|$. Typically keyword spotting is only interested in occurrences of a much smaller set of words defined by $Q$. Given this simplification, a more practical and efficient formulation of keyword spotting is

\[
\mathrm{KS}(O, V, Q) = \mathrm{Transcribe}(O, g(Q)) \tag{2.2}
\]

where

\[
g(Q) = Q \cup \{\Omega\}
\]

This alternate approach requires transcription using a much smaller vocabulary of size $|Q| + 1$. Clearly, this is a considerably less computationally intensive task than transcription using the formulation in equation 2.1. However, it introduces the additional burden of an acoustic model representation of the non-keyword symbol $\Omega$. Definition of the non-keyword symbol is one of the active areas of keyword spotting research and is discussed further in section 2.7.

2.3 Applications of keyword spotting

Keyword spotting lends itself to a plethora of speech-enabled applications. Keyword spotting is particularly well suited to applications where large amounts of speech need to be processed. This is because it offers a significant speed benefit over a large vocabulary STT approach. Four major applications of this technology are keyword monitoring, audio document indexing, command controlled devices and dialogue systems.

Keyword monitoring applications

Keyword monitoring applications are required to continuously monitor a real-time stream of audio and to flag any occurrences of a keyword in the query set. Specific keyword monitoring applications include telephone tapping, listening device monitoring and broadcast monitoring.
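Flagging keywords in a monitored stream amounts to applying the transformation f(W, Q) of equation 2.1 to a hypothesised word sequence: keywords pass through, and each run of non-keywords collapses to a single non-keyword symbol. The following is a minimal Python sketch of that case analysis, not code from the thesis. The names `OMEGA`, `tail`, `f` and `g`, the string used for the non-keyword symbol, and the handling of an empty sequence (which the definition does not cover) are assumptions made for illustration.

```python
OMEGA = "<non-keyword>"  # hypothetical stand-in for the non-keyword symbol


def tail(seq):
    """Tail of a sequence: every element except the first."""
    return seq[1:]


def f(words, keywords):
    """Collapse each run of non-keywords in `words` into a single OMEGA."""
    if not words:
        # Assumption: the definition starts at |W| = 1, so an empty
        # sequence is mapped to an empty output here.
        return []
    w1 = words[0]
    if len(words) == 1:
        # |W| = 1: keep a keyword, replace a non-keyword with OMEGA.
        return [w1] if w1 in keywords else [OMEGA]
    if w1 in keywords:
        # |W| > 1, w1 in Q: emit the keyword and recurse on the tail.
        return [w1] + f(tail(words), keywords)
    if words[1] in keywords:
        # |W| > 1, w1 not in Q, w2 in Q: a non-keyword run ends here.
        return [OMEGA] + f(tail(words), keywords)
    # Otherwise: still inside a non-keyword run, so emit nothing yet.
    return f(tail(words), keywords)


def g(keywords):
    """Reduced vocabulary of equation 2.2: the keywords plus OMEGA."""
    return set(keywords) | {OMEGA}


transcript = ["the", "stock", "price", "rose", "stock"]
print(f(transcript, {"stock"}))
# ['<non-keyword>', 'stock', '<non-keyword>', 'stock']
print(len(g({"stock"})))  # 2, i.e. |Q| + 1
```

The recursion mirrors the definition line by line for readability. For long sequences an iterative loop would be the more practical choice in Python, since deep recursion can exceed the interpreter's recursion limit.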

Telephone tapping and listening device monitoring are used extensively by security organisations to detect criminal or malicious activity. Keyword spotting provides a fast and automatic solution to this task and potentially a higher detection accuracy than human monitoring, particularly when a very large number of audio streams needs to be monitored. However, these applications create a considerable challenge for keyword spotting because of the noisy nature of the speech being monitored. Telephone conversations may be plagued with significant background noise, multiple languages and even multiple speakers, providing challenges for acoustic modeling. Listening device audio may suffer from very low signal-to-noise ratios, a difficulty for any speech processing application.

Broadcast monitoring is actively performed by commercial broadcast monitoring companies to locate segments that may be of interest to a client. For example, a senator may be interested in all news stories in which he or she is mentioned; broadcast monitoring organisations provide such a service for a fee. A significant challenge of broadcast monitoring is the amount of audio that needs to be processed daily. Broadcast monitoring clients may be interested in stories from a comprehensive set of broadcast sources, including free-to-air television, cable television, commercial radio and community radio. The sheer number of these sources, combined with the fact that many of them broadcast continually, 24 hours a day, 7 days a week, makes broadcast monitoring a very data intensive problem.

Keyword spotting provides an excellent solution to all these keyword monitoring tasks. Faster-than-real-time keyword spotting technologies are likely to process audio faster than a human processor.
Additionally, the accuracy of an automatic system is likely to exceed that of a human processor, since computers do not suffer from the fatigue and mental distractions that plague a human processor. Keyword spotting is particularly well suited to the broadcast monitoring task since audio quality in this domain is usually much higher than that of telephone and listening device audio.

Audio document indexing

Audio document indexing is the task of rapidly searching an audio document database for keywords and topics of interest. This functionality is analogous to traditional text document indexing systems such as the Google [11] Internet search engine, but operates on audio documents instead. The need for efficient and fast audio document indexing is paramount in a world where audio and multimedia documents play a greater role in everyday life.

STT systems are one solution to the audio document indexing problem. Audio is first transcribed to text that can then be rapidly searched at query time. However, many applications of audio document indexing, such as news database searching, require support for proper noun queries such as names, places and foreign words, terms that in many cases are not part of the transcription system's vocabulary. As such, alternatives to the STT-based approach that do not constrain the query vocabulary are required.

A keyword spotting solution does provide support for unrestricted vocabulary queries. The trade-off is a reduction in query speed, since most KS approaches are nowhere near as fast as text-based searching methods. Nevertheless, the support for unrestricted vocabulary queries is important, and as such, a keyword spotting system can be used to augment an STT-based system to provide very fast queries for in-vocabulary words while still supporting out-of-vocabulary queries.

Command controlled devices

Command controlled devices monitor the ambient audio and react when they detect specific command words. Examples of command controlled devices are speech-enabled mobile phones, voice-controlled VCRs and command-controlled factory machinery.

Although generic keyword monitoring technologies can be used for command controlled devices, they typically place too high a processing or memory requirement to be feasible, especially in the case of DSP-based or embedded applications. Additionally, the query terms of command controlled devices tend to be fixed, allowing more application-specific information to be incorporated into the keyword detection process. This includes query word linguistic context information and environmental noise conditions. Hence command controlled device KS lends itself to the development of custom solutions. Though many of these solutions may be based on existing keyword spotting approaches, significant enhancements and modifications are made to provide maximum performance for the intended application.

Dialogue systems

Automated dialogue systems are becoming more common in the commercial environment as a viable alternative to human-operated call centres. A dialogue system mimics a human call-centre operator by playing voice prompts to a caller and then attempting to detect keywords that indicate the response of the caller.

Since the volume of calls processed by a call centre can be very large, large vocabulary STT approaches have proven infeasible due to their high computational requirements. Instead, restricted grammar speech recognisers or keyword spotting technologies are used to interpret the response of callers. Keyword spotting approaches offer a benefit over restricted grammar speech recogniser approaches because they allow greater flexibility in the response of the speaker. This is because KS accommodates out-of-vocabulary words by means of non-keyword modeling. However, a cleverly constructed restricted grammar speech recogniser can better understand the intention of a caller using contextual information, and therefore may prove more appropriate for certain applications.

2.4 The development of keyword spotting

In a similar fashion to general speech recognition theory, keyword spotting has undergone a number of generations of development. Early approaches were limited by low computing resources and hence KS research was limited to simpler tasks such as isolated keyword detection. As speech recognition technology matured, more advanced tasks were explored, such as the detection of keywords embedded in noise or continuous speech.

Sliding window approaches

Initial methods focused on using sliding window approaches such as the dynamic time warping approaches proposed by Sakoe and Chiba [29] and Bridle [6], or the sliding window based neural network method prescribed by Zeppenfeld [40]. Such techniques yielded acceptable results in isolated keyword spotting tasks, but suffered from considerable drops in performance when spotting keywords embedded in continuous speech.

A major reason for this drop in performance was that sliding window approaches did not model non-keywords either implicitly or explicitly. Spotting of keywords in continuous speech is essentially a 2-class discrimination task, attempting to classify regions as either a keyword or a non-keyword instance. Since the traditional sliding window approaches did not model non-keywords, they were essentially attempting discrimination with knowledge of the target class only. This was analogous to making measurements without a point of reference: all observations were purely relative and therefore provided little confidence for making absolute decisions.
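The limitation described above can be illustrated with a minimal sliding window sketch based on dynamic time warping (DTW). It is a simplified illustration under stated assumptions and not the specific algorithm of any of the cited works: features are scalar values rather than multi-dimensional acoustic feature vectors, the window length is fixed, and the function names `dtw_distance` and `sliding_window_scores` are hypothetical.

```python
from typing import List, Sequence, Tuple


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """DTW alignment cost between two 1-D sequences, normalised by the
    combined length (assumption: absolute difference as the local cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j] holds the best cumulative cost aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] / (n + m)


def sliding_window_scores(
    template: Sequence[float],
    utterance: Sequence[float],
    window: int,
    hop: int = 1,
) -> List[Tuple[int, float]]:
    """Score every window of the utterance against the keyword template.
    Returns (start_index, distance) pairs; lower distance is a better match."""
    return [
        (start, dtw_distance(template, utterance[start:start + window]))
        for start in range(0, len(utterance) - window + 1, hop)
    ]


if __name__ == "__main__":
    template = [1.0, 3.0, 1.0]                            # toy keyword template
    utterance = [0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0]  # toy feature stream
    scores = sliding_window_scores(template, utterance, window=3)
    print(min(scores, key=lambda s: s[1]))
```

Note that the sketch only ranks windows by their distance to the keyword template. Accepting or rejecting a window would require an absolute threshold on that distance, and nothing in the score indicates how well the same window would be explained by non-keyword speech.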

Non-keyword model approaches

To address the lack of knowledge of the non-target class, the concept of non-keyword models (also known as filler models) was introduced into keyword spotting. Non-keyword models attempted to model all speech that did not form a part of the target keyword speech. For example, in a closed vocabulary system, a non-keyword model would attempt to model all words in the vocabulary except for the target keywords. Using a non-keyword model provided more confidence when accepting or rejecting putative instances of target keywords compared to the sliding window approaches because a comparison was being made between the target keyword model and the non-keyword model.

One of the initial approaches used to incorporate non-keyword models was proposed by Higgins and Wohlford [13]. Here a DTW-based continuous speech recogniser was modified to use filler models to represent non-keyword speech. The modified speech recogniser was then used to transcribe continuous speech into regions of keywords and non-keywords. Finally, a likelihood ratio was used to normalise keyword likelihoods by the corresponding likelihood of the non-keyword model over the same observation sequence. Non-keyword models in this particular approach were constructed from pieces and subsequences of the target keyword.

The introduction of non-keyword models into keyword spotting saw the fusion of continuous speech recognition research with keyword spotting techniques. Whereas previously KS approaches had exclusively used sliding window techniques, the use of non-keyword models required a paradigm shift into the speech recognition context. Specifically, keyword spotting could be simply viewed as a special case of continuous speech recognition, where all non-keyword speech was labeled with a single non-keyword tag. Operating within the speech recognition framework allowed the latest developments in continuous speech recognition, such as advances in modeling techniques, to be transferred to the KS domain. Hence, keyword spotting research began to more closely follow the trends of speech recognition research.

Hidden Markov Model approaches

The advent of Hidden Markov Model (HMM) based speech recognition led to the introduction of HMM-based keyword spotting techniques. As for DTW-based keyword spotting, HMM-based keyword spotting could be viewed as a special case of HMM-based speech recognition, where all non-target words were represented by a non-keyword model. One common approach was to use a word loop consisting of all target keywords in parallel with the non-keyword model. Target keywords were typically modeled using either word models or sub-word models, while non-keyword speech was modeled using a plethora of architectures, including a high-order Gaussian Mixture Model as prescribed by Wilpon et al. [35] or a monophone model set as suggested by Rose and Paul [28]. This led to the development of better performing KS systems, paving the way to more complex keyword spotting applications.

Further developments

Advances in high-level linguistic modeling through recognition grammars and language modeling were also incorporated into keyword spotting. These advances were motivated by the need to reduce false alarm rates of KS systems through the use of contextual information, specifically to reduce or constrain the emission of false putative keyword occurrences. Kenji et al. [18] and Gou et al. [12] both described techniques of incorporating finite state grammars into the spotting process. The reported experiments demonstrated significant gains in performance for simple recognition grammar applications compared to non-grammar-constrained