Automatic Creation and Tuning of Context Free

Transcription

1 Proceeding ofnlp - 'E0 5 Automatic Creation and Tuning of Context Free Grammars for Interactive Voice Response Systems Mithun Balakrishna and Dan Moldovan Human Language Technology Research Institute The University of Texas at Dallas Dallas, Texas mithun,moldovan@hlt.utdallas.edu Ellis K. Cave Waterview Parkway Dallas, Texas Skip.Cave@intervoice.com Abstract-In this paper, we shall focus on the design of an AUTOCFGPROCESSOR procedure to automatically create and tune Context Free Grammars (CFGs) for directed dialog speech applications without the use of any domain specific text corpora. A reranking mechanism is used to post-process the Large Vocabulary Continuous Speech Recognizer (LVCSR) n- best lists with additional phonetic and higher level linguistic knowledge for transcribing the user utterances with improved Word Error Rate (WER). We also shall depict the classification of LVCSR transcriptions into semantic categories and the use of a statistical ifitering mechanism on the valid LVCSRtranscriptions for the CFG creation and tuning tasks. We also illustrate the importance of the additional improvements gained by using semantic classification strength in a feedback loop to the transcription mechanism. I. INTRODUCTION The current generation of directed dialog speech applications (DDSAs) predominantly use Context-Free Grammar (CFG) instead of a statistical n-gram based language model [1] i.e. the Automatic Speech Recognizer (ASR) embedded in speech applications rely on a CFG based language model rather than a statistical n-gram based language model to restrict the options presented to the ASR decoder by the ASR acoustic model. The preference for CFG in telephonic Interactive Voice Response (IVR) systems can be attributed to the very tight constraint placed on the ASR's response time to a user's request and the limited availability of text corpora for a wide range of application domains. The need for only the accurate semantic tag and its corresponding arguments associated with the user response rather than the entire set of words spoken by the user also justifies this preference. In the conventional grammar based IVR, the user response to a particular IVR prompt is fed to a CFG-based ASR. The ASR uses the CFGs to decide whether the user response is valid (Match) or invalid (No-Match) and the IVR's response is based on this decision. For example, the prompt "do you want your account balance or your cleared checks?" might have answers that match the words "checks" or "balance." If the user responds "what is the price of pizza?," the system simply puts the answer down as a no-match and repeats the prompt. Every user prompt in a DDSA needs a CFG to accurately recognize and semantically classify the user's response for that particular prompt. For the IVR to work with maximum accuracy, the IVR CFGs should cover the most probable responses that are expected from the user at every prompt in the application call-flow [1]. For the previous example prompt, the person can easily respond with "the sum of my account" or "the total of my account." These responses are out of the grammar that was written for this prompt, but still good answers. To correct this problem in the current IVRs, once prompts are created, a grammar designer must read the prompts and write what they believe to be all the possible answers for each prompt. The success of well-designed CFG-based DDSAs has resulted in the very negligible deployment of their statistical n-gram language model based counterparts. A CFG for a prompt in the IVR call-flow contains (refer [1] for detailed explanations about CFGs in IVR systems): 1) Universals - words for navigation or accessing assistance at any point in the call-flow of the speech application e.g. "help," "repeat," "start over," "operator." 2) The most probable words or phrases that are expected from the user at that particular prompt. The main purpose of this paper is to describe a method of automatically creating and tuning finite state grammars representing users' responses to prompts in IVR systems. For example, to an IVR prompt "do you want your account balance or your cleared checks?," users might make replies of the form "Total in my account" or "Sum of my account"; and our goal is to produce an efficient grammar containing the full set of actual responses. It should be noted that the goal is not to produce a grammar containing all the possible user response altematives to a particular prompt but to produce one containing only the most probable user responses required to improve the IVR performance. In this paper, we will present improvements of techniques presented in earlier work [2] for determining which category of possible response (as specified earlier by the speech application designer) is exemplified by a actual user response. These categorization techniques involve lexical chain paths [3] computed in a lexical resource like WordNet [4] between words in users' actual responses and words in previously prepared semantic descriptions of the possible responses. We will also present new statistically based techniques for adding new user /05/ $ IEEE 158

2 responses to the grammar of responses, and (importantly) for removing dormant word sequences from the grammar. A. Previous Work Reference [5] proposes a semantically structured model to reduce the manual labor in developing IVR system CFGs. This process creates an IVR system grammar by combining statistical n-grams and CFGs to reduce the understanding error rate by half of that obtained from a manually created grammar. The proposed method however requires a partially labeled (manually performed) text corpus in the IVR's domain for training the semantically structured model. Due to the high deployment demand for directed dialog systems in a wide variety of domains and the lack of any respectable size text corpora in these areas, the traditional manual processes of creating and tuning CFGs for each of the designed user prompts uses only the collected test user speech utterances. Until recently, the CFG creation process and the subsequent tuning process, based on collected user utterances, were primarily manual. A speech application designer first analyzes the system prompts and a sample CFG is created (an initial guess of the most probable user words or phrases) for each IVR prompt. This sample CFG is loaded into the IVR and a set of test users call and respond to the prompts. Their responses are manually transcribed and analyzed to modify the initial sample CFG, increasing its coverage of expected user responses and, hence, improving the IVR performance. Adding a large set of user response alternatives does not necessarily imply improved IVR performance. On the contrary, we have noticed severe IVR performance degradation due to increase in the recognition confusion. In this paper, we present improvements of techniques presented in earlier grammar automation work [2]. Since our corpus of user responses is spoken, a necessary first step in the automation process is transcribing the responses more accurately than has previously been possible. To transcribe the user responses, we will use a Large Vocabulary Continuous Speech Recognizer (LVCSR) system with an extemal reranking mechanism made of phonetic, lexical, syntactic and semantic features. We propose a WordNet based semantic categorizer to determine which category of possible response goals (as manually specified in advance by the system designer) is exemplified by the user utterance transcription. We also propose a statistical selection mechanism which not only adds altemative user responses to the CFG but also removes dormant word sequences from the CFG (to completely justify the observation in [2] that bigger CFGs do not necessarily lead to a better IVR performance, as more alternatives increase the utterance recognition confusion). Finally, we will show how constructive feedback can be facilitated between the improved LVCSR results and the grammar creation/tuning process. A multiple loop feedback mechanism between the semantic categorizer and reranking module using lexical chain strength measurements not only improves the transcription accuracy of the LVCSR system but also improves the IVR Fig. 1. Automatic creation and tuning of CFG performance by presenting better response altematives to the statistical selection mechanism. II. AUTOCFGPROCESSOR The goal of the system is to capture the most probable user responses, which have not yet been adequately represented by the semantic tags (creation task) or the previously created CFG (tuning task). For the automatic CFG creation process, the speech application designer just needs to conceive a semantic tag for each option in a prompt i.e. design the application call-flow with its prompts and semantic categories. The use of a "Wizard-of-oz" procedure [2] for guiding the test callers through the call flow and collecting their responses is expensive. Therefore, we recommend using a IVR loaded with a skeletal CFG (containing only the word sequences in the prompt options) to guide the callers. The automatic CFG tuning process modifies the CFGs (manually or automatically created) using a set of caller responses (previously created CFGs guide the callers through the IVR call-flow) to improve the IVR performance. Fig. 1 depicts our proposed design. A set of test callers respond to the IVR loaded with the manually created prompts and previously created CFGs (for CFG tuning). Irrespective of the task (creation or tuning), the user responses are recorded and transcribed using a LVCSR with the reranking mechanism. The transcriptions need to be as accurate as possible to realize the AUTOCFGPROCESSOR goal and hence the LVCSR with the reranking mechanism plays a major role in improving the Word Error Rate (WER). The LVCSR-transcribed responses are grouped based on the prompt for which they were uttered. For the CFG creation task, the semantic categorizer tries to map the user utterance transcription to the best semantic label from the set corresponding to each prompt. Utterance transcription and semantic label pairs are then statistically validated to create the appropriate CFG. For the CFG tuning 159

3 task, the semantic categorizer uses a set of semantic labels and their corresponding response alternatives (from the CFG to be tuned) and assigns the best semantic label to the user utterance. In this paper, the (utterance transcription, semantic label) pairs are then statistically validated to not only add some valid alternatives but also remove some statistically invalid entries from the old CFG. An automatically created/tuned CFG is loaded into the IVR and the entire process is repeated iteratively until no further improvement is achieved. A. LVCSR N-Best List Reranking Based Transcription Process The majority of the current LVCSRs compute the posterior probability using a Hidden Markov Model (HMM) based acoustic model (conditional probability), and an n-gram based language model (prior probability). Though a LVCSR based on such an architecture performs recognition with reasonable accuracy, a significant amount of untapped WER improvements remains hidden within the LVCSR n-best lists and wordlattices. Reference [2] presents an analysis for the HUB test corpus, showing that the choice of the best hypothesis from a 200-best list gives a 12.4% absolute WVER reduction over the baseline LVCSR hypothesis (produced by a 3-pass recognition using Gender-dependent acoustic models, Vocal Tract Length Normalization (VTLN) + Speaker Adaptive Trained (SAT) acoustic models + Maximum Likelihood Linear Regression (MLLR) and, MLLR(2) adaptation), while the word lattice gives a 13.0% absolute WER reduction. Results show that majority of the most accurate hypotheses are captured within the first 80 to 90 hypotheses for each utterance. These experimental results prove that big improvements can be gained by applying a strong reranking mechanism, even at a small n-best list depth. Hence, AUTOCFGPROCESSOR uses a LVCSR along with the n-best list reranking mechanism [2] for transcribing the test user utterance with considerable accuracy (ignoring the small additional improvements present in word lattices). The external reranking mechanism uses additional domain independent phonetic, lexical, syntactic and semantic knowledge in tandem, complementing each other to improve the WER. The reranking score assigned to the LVCSR n-best hypotheses is a simple linear weighed combination of the individual scores of each participating knowledge source [2]. For every n-best list hypothesis, the confidence score is computed in the following manner: if Ph = ph,, ph2,...,phm is the phoneme sequence corresponding to the n-best list hypothesis word sequence Wh = w1, w2,.w.,wng predicted by the LVCSR for the acoustic frames A of an utterance, then Score(Wh) = w1 * P(CategoriesAA)*P(Phi(CategoriesAA)) + +W2*(1 * P(PhJPh')*P(Phs)) P(PhlP()*(P) + W3 *(PLM(Wh)*H1n Count(w'JwV-boundary)) + W4 * (P(rr, Wh)C 1Count(Slots-Filled(j)Avi) rq Count(Slots-Filled(j)Avi)) + W5 * (Z1 Count(vi) Etj=1 Count(vj ) (W1,W2),W3,W4,W5 represent the weights assigned to the phonetic, lexical, syntactic and semantic features respectively. The remaining section will briefly describe the individual knowledge scores involved in the reranking score calculation equation (please refer to [2] for a detailed description). If Categories is the acoustic property based phoneme group sequence (grouping the phonemes into 13 categories based on acoustic properties avoids huge computation overhead) to which the phoneme sequence Ph belongs, then P(CategoriesIA) represents the Support Vector Machine (SVM) posterior probability of the acoustic frames belonging to Categories containing the original LVCSR phoneme sequence. P(PhlCategoriesAA) represents the SVM class posterior probability of the acoustic frames belonging to the original LVCSR phoneme sequence given the Categories. Let PhS = ph', ph',...,ph' be the best phoneme sequence predicted by SVM, for A, not containing phonemes present in W's phoneme sequence. (1 P(PhIPP1I)*P(Ph8)) - equation models the LVCSR phoneme classification accuracy probability with respect to Phs. The equation returns a high value if the original LVCSR phoneme sequence matches the best SVM phoneme sequence i.e. P(Ph) is high while P(PhS) is low. The probability P(PhIPh8) is computed using a phoneme confusion matrix. This matrix (error counts from training the acoustic model and SVM models) provides the probability of a phoneme being misclassified as ph,, instead of ph'. The numerator of P hjph)p(ph') is assigned a high value and its denominator a low value, if SVM disagrees with the LVCSR phoneme classification, resulting in an low overall LVCSR phoneme classification accuracy score. But this score is high if SVM agrees with the LVCSR phoneme classification as the numerator has a low value and the denominator has a high value. For calculating the lexical score, PLM(Wh) is the trigram score associated with the n-best list hypothesis containing n words. The second part of the lexical confidence score is equivalent to computing the unigram probabilities of the words in each hypothesis, given their word boundaries, using the set of words occurring in the n-best list as a reference i.e. the probability of the word w' occurring in a particular hypothesis time-frame with respect to all the words W occurring in that particular hypothesis time-frame in all the hypotheses of the n-best list. The syntactic score P(7r, Wh) is the probability of the best parse (7r), for the hypothesis Wh, generated by the Charniak's parser [6] as a lexicalized syntactic model. S Count(Slots-Filted(j)Avi) Eq Count(Slots-Filled(j)Av )) The semantic score 5r=- =' Count(vE) rep- Z 1 Count(vj) resents the extent of the predicate arguments' coverage in the hypothesis, where p is the number of arguments for predicate vi in hypothesis Wh, q is the total number of arguments possible for predicate vi in PropBank [7] and t is the total number of predicates identified in PropBank. We use the ASSERT semantic parser [8] to extract the predicate-argument 160

4 1) Grammar: Cable Account Change 4 2) User Utterance: "I'd like to speak to a live person please" 3) LVCSR Sequence: "I'd likelvb to speak/vb to a live/jj person/nn of these" 4) Target Semantic Tag: "Customer Service" 5) Target Grammar Entry: "Human/JJ/#2 Operator/NN/#2 " 6) Best 2 Lexical Chains of 8: a) operator/nn/#2 TO speak/vb: (n-operator#2, manipulator#1) HYPONYM (n-telephoneoperator#1, telephonist#1, switchboardcoperator#l) GLOSS (v-get#14) HYPERNYM (v-communicate#2, intercommunicate#2) HYPONYM (v-talk#1, speak#2) VALUE: 9.22 b) human/jj/#2 TO person/nn: (a-human#2) GLOSS (n-person#1, individual#1, someone#1, somebody#1, mortal#1, human#1, soul#2) VALUE: ) Lexical Chain Strength: Fig. 2. An example illustrating the semantic categorization mapping process relations defined in PropBank. B. Semantic Categorizer and Feedback Reference [3] presents a methodology for finding topically related words by increasing the connectivity between WordNet [4] synsets using the information from WordNet glosses. We use lexical chains to classify the LVCSR-transcription into one of the designer conceived semantic tags. For the CFG creation task, the semantic categorizer tries to map the user utterance transcription to the best semantic label from the set corresponding to each prompt. For the CFG tuning task, the semantic categorizer tries to map the user utterance transcription to a set of semantic labels and their corresponding response alternatives (from the CFG to be tuned) and assigns the best semantic label to the utterance. For each utterance, the top ten hypotheses from the reranked n-best list are selected for the classification task. By finding the best lexical chain between the word sequence and the designer conceived semantic label (for tuning, corresponding word sequence alternatives in the previously created CFG are also used), each candidate word sequence is mapped to a semantic tag. We used a majority voting mechanism to select the appropriate semantic tag for a user response. It is very important to note that designers spend the bulk of their time not only adding alternate word sequences but also rejecting invalid (NO-MATCH label in our experiments) or unnecessary user utterances. Fig. 2 illustrates the mapping process. Lexical chains are found between each content word in the LVCSR Sequence (LVCSR-S) and each content word in the Target Grammar Entry (TGE) (semantic label or, word sequences from the previously created CFG representing the Target Semantic Tag (TST)). Prior to the mapping process, all TGE words are manually assigned their POS tags and WordNet2.0 sense numbers. The LVCSR-S words are not sense disambiguated. A LVCSR-S to TGE mapping is valid if and only if there exists a lexical chain between every word in TGE and at least one word in the LVCSR-S. The Lexical Chain Strength (LCS) is the sum of the semantic similarity values of the best lexical chains from every TGE word. In our example, the mapping from LVCSR-S to TGE is valid because human/nn/#1 and operator/nn/#2 map to at least one word in LVCSR-S (person/nn and speak/ VB). The LCS is (the sum of the best lexical chain values) i.e (operator/nn/#2 to speak/vb, Value= 9.22) and (human/jj/#1 to person/nn, Value= 6.68). The first lexical chain (a) links the second WordNet2.0 sense of the noun operator with the verb speak. The lexical chain implementation [3] does not require a word sense for speak. It finds the best path to the second WordNet2.0 sense of speak and assigns a value of 9.22 to the found lexical path (higher values indicate stronger paths). Fig. 3 presents the algorithm used to map an user utterance transcription into the best semantic category for the prompt. For each user utterance, the top ten LVCSR transcriptions are selected and mapped to one of the various semantic categories available for the prompt using the previously illustrated lexical chain mechanism. A majority voting mechanism is then used to find the best semantic category from the various semantic categories proposed by the top ten LVCSR transcriptions. The confidence score associated with the semantic category chosen from the majority voting procedure is used to rerank the top ten hypotheses again and the process continues until the algorithm settles down to a particular transcription and semantic category. The procedure ValidMapping() identifies the mapping between a given (LVCSR-S, TGE) pair and returns true if and only if the validity condition for mapping holds. For such a valid pair, BestLexicalChain() computes the LCS. For each A in utterances of Tuning Task For each B in top ten LVCSR-S of n-best For each C in TSTs of Tuning Task For each D in TGEs of C If ValidMapping(B, D) BestLexicalChain(B, D, LCS[B, D]) else LCS[B,D]= NO-MATCH endif Value[C] = TGE-Max(LCS[B,D]) TST[B] = TST_Max(C,Value[C]) Sem-Cat(A) = Majority(TST[B]) Fig. 3. Algorithm to map the LVCSR n-best list transcriptions into the best semantic category for the IVR prompt 161

5 TGE-Max( ) finds the best TGE (based on LCS) for every TST and TSTMax( ) returns the best TGT (based on the LCS of the best TGE) for the top ten LVCSR-S of the n-best list. Ma j ority ( ) selects as the utterance semantic category, the TGT with the majority LVCSR-S votes. The selection of the word sequence representing the TST is as important as the selection of the correct TST. The LCS of a hypothesis, associated with the chosen TGT for the utterance, is used as a confidence score to rerank the top ten hypotheses again and the process continues until the system settles down to a particular transcription and semantic label. There are some transcription words which are currently unavailable in WordNet2.0. These missing transcription words cannot take part in the semantic categorization process as the lexical chains procedure relies on the availability of the word in WordNet2.0 to map it to previously prepared descriptions of possible responses. In this paper, the missing transcription words are ignored since we have found their frequency of occurrence in our actual user utterance transcriptions to be negligible. C. Statistical Selection Mechanism We propose an improved statistical selection mechanism to not only add user response alternatives but also remove dormant word sequences from the CFGs to improve the performance of the IVR. We run the following filters on valid LVCSR-transcriptions (labeled with semantic category) to find sequences that should be added/removed from the CFG: Occurrence Probability: Each response alternative inside the CFG contains a value indicating the occurrence probability of the word sequence for a particular semantic category in each prompt. This value indicates the probability of the word sequence being uttered by the user for that semantic category. A high value indicates that the users are more probable to use the corresponding word sequence in response to the system prompt. We prune the CFG to prevent the degradation in the IVR performance due to presence of low probability word sequences already added previously into the CFG (pruning is done only in CFG tuning). We remove response alternatives with occurrence probability lower than 0.05 and then re-adjust the occurrence probabilities of all the remaining alternatives for that particular prompt semantic category (the sum of the occurrence probability of all the alternatives for a semantic category is 1). The remaining rules provide a filtering mechanism to add valid LVCSR-transcriptions into the CFG and assign an occurrence probability to each alternative added into the CFG. Frequency: For a particular semantic category, add only widely used LVCSR-transcriptions (frequency greater than 1% of total valid LVCSR-transcriptions) e.g. A specific transcription like "Account number five one four three" for the "Billing Account Type Choice" prompt should not be added, though the user wants the IVR to map the account to a semantic category (Cable, Cellular Phone, etc.) in the prompt. The occurrence probability of all the response alternatives in the corresponding semantic category is recalculated. The Occurrence Probability rule is then applied to remove low probability response alternatives. Smallest Sequence: If a LVCSR-transcription Si is a substring of S2 (both with a frequency greater than the threshold and both mapped to the same semantic tag), then only the smaller sequence Si should be added to the CFG. e.g. If Si = "billing address" and S2 = "get my billing address," add Si to the CFG. The occurrence probability of all the response alternatives in the corresponding semantic category is recalculated with the count of the added smallest sequence (count of S1+ count of S2). The Occurrence Probability rule is then applied to remove low probability response alternatives. Sub-sequences: The above rules add high frequency transcriptions into the CFG. We also need to consider the fact that the entire LVCSR-transcription might not be valid due to LVCSR errors or due to invalid user uttered words. We might have low frequency LVCSR-transcriptions containing important, high frequency word sub-sequences e.g. the utterance "Ah Can I just speak to someone alive" is transcribed by the LVCSR as "that called religious speak someone by" or, utterance "I want to cancel an appointment would you get me somebody that is not a computer." We find valid LVCSR-transcription sub-sequences by extracting the words used to compute the LCS, for its corresponding semantic tag, in each low frequency LVCSR-transcription. If the count of a particular sub-sequence in a semantic tag class is greater than the frequency threshold, then the corresponding subsequence is added to the CFG for that semantic tag. We are basically trying to filter the worthless words and find useful high frequency sub-sequences. The occurrence probability of all the response alternatives in the corresponding semantic category is recalculated with the count of the added subsequence (count of all the LVCSR-transcriptions containing the sub-sequence). The Occurrence Probability rule is then applied to remove low probability response alternatives. III. RESULTS AND EXPERIMENTAL SETTINGS We use SONIC [9], a LVCSR system from the University of Colorado at Boulder, to produce n-best lists for the reranking mechanism. We trained the LVCSR acoustic models for the telephone transcription task using 160 CallHome English corpus conversation sides and 4826 Switchboard-I Release 2 corpus conversation sides. The SRI HUB model is used as a back-off tri-gram language model. We considered a set of 8013 collected test user utterances (live recordings from IVR systems deployed by Intervoice Inc.) for 4 prompts in 3 different applications: Change Cable Account Details (CCAD) (2452 utterances, 12 semantic categories), Billing Account Type Choice (BATC) (1328 utterances, 12 semantic categories), Bank Future Payment Update Options (BFPUO) (224 utterances, 9 semantic categories), Wireless Account Change Options (WACO) (4009 utterances, 13 semantic categories). Table I presents the transcription results obtained on a 30-best list reranking (our analysis similar to the one presented in [2] showed insignificant improvements beyond the 30-best hypotheses list) using the best set of knowledge 162

6 TABLE I VARIOUS WER RESULTS OBTAINED FOR THE AUTOCFGPROCESSOR TRANSCRIPTION TASK. Test User Utterance Set (8013 Utterances) wj=13,w2=18,w3=14 Error (%) Total W4=26,w5=29 Sub Del Ins Total Correct(%) Baseline Hypothesis Best List Reranking Best List Reranking + LCS Score Feedback TABLE II RESULTS OBTAINED FOR THE AUTOCFGPROCESSOR DIRECTED DIALOG SPEECH APPLICATION TASK. Prompt System Collected Test User Utterance Set (4006 Utterances) (Size) Error (%) Total(%) MisCat InCFG OutCFG Total Correct CCAD Original (1226) Manual Autol _ Auto S3 BATC Original (664) Manual Autol ir Auto BFPUO Original (112) Manual Autol Auto WACO Original (2004) Manual Autol Auto Tg.47U J weights obtained by running WER testing trials on 40 HUB Switchboard-I conversation sides using various weight combinations. Using the best weight combination, we achieve 6.1% absolute WER reduction (12.78% relative reduction). We got the best result (8.6% absolute WER reduction and 18.02% relative WER reduction) by using the recurring LCS score feedback to propose the best hypothesis from the reranked top 10 n-best list hypotheses. For the CFG tuning task, the AUTOCFGPROCESSOR performance is presented in Table III. We used the manually created CFGs for the previously mentioned set of 4 prompts as baseline. An application designer manually listed the semantic categories for each of the 4 prompts. The 8013 utterances set is divided equally, for each prompt, into a training and test set. The Original system results are obtained by running the test set using the original manually created CFGs. The system results (Auto], Auto2 and Manual) represent the quality of each prompt CFGs (tuned using the training set) against the test set. The Manual system prompt CFGs are obtained by tuning the original CFGs with word sequences proposed by a speech application designer, who manually analyzed the training set. The Auto2 (AUTOCFGPROCESSOR) system tunes the original CFGs with the word sequences obtained by using the baseline LVCSR, reranking mechanism, semantic categorizer and semantic categorizer feedback while Auto] is the implementation of [2]. MisCat errors are due to mismatches between the category proposed by the IVR and the actual utterance category. InCFG errors are due to the IVR proposing a category while the utterance's actual category is a NO-MATCH. OutCFG errors are due to the IVR proposing a NO-MATCH while the utterance actually has a valid category. Table Ill shows that the AUTOCFGPROCESSOR can successfully add good alternatives to the baseline CFG and improve the IVR performance. Autol consistently comes close to matching the performance of manual additions, though it adds more alternatives and hence achieves less MisCat errors and more InCFG errors than the manually tuned CFGs. Using the improved statistical selection mechanism to remove dormant word sequences from the CFGs, (Auto2) improves the performance of the IVR even in terms of InCFG errors. The multiple loop feedback using lexical chain strength presents better alternatives and hence, reduces the MisCat errors. IV. CONCLUSIONS AND FUTURE WORK This paper presented a novel method to automatically generate and tune grammars for directed dialog speech applications. The results show improved IVR performance closely matching the manual tuning performance. Non-domain-specific information sources based reranking, lexical chain based semantic category classification and, feedback based on semantic classification strength play an important role in improving the IVR performance. One of the future goals is to enable the proposed mechanism to handle grammar automation for user responses containing variable fields (dates, credit card number, etc.). The handling of transcription words missing from WordNet2.0 for the semantic categorization process will also be the focus of future work in this area. ACKNOWLEDGMENT We are grateful to Marta Tatu for many thoughtful suggestions and comments; Steven Ray, Mark Hittinger and Cathie Ranta for their priceless encouragement and support; and to the anonymous reviewers for their valuable feedback. REFERENCES [1] Intervoice Training Document - Voice User Interface Design - Speechworks 7.0 OSS/OSR and Naunce Speech Forms, Intervoice Inc., [2] M. Balakrishna, D. Moldovan, and E. K. Cave, "Higher level phonetic and linguistic knowledge to improve asr accuracy and its relevance in interactive voice response systems," in Proceedings of the AAAI Workshop on Spoken Language Understanding, [3] D. Moldovan and A. Novischi, "Lexical chains for question answering," in Proceedings of COLING 2002, [4] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. MIT Press, [5] A. Acero, Y. Y. Wang, and K. Wang, "A semantically structured language model," in Special Workshop in Maui (SWIM), [6] E. Chamiak, "Immediate-head parsing for language models," in Proceedings of ACL 2001, [7] P. Kingsbury, M. Palmer, and M. Marcus, "Adding semantic annotation to the penn treebank," in Proceedings of HLT 2002, [8] S. Pradhan, W. Ward, K. Hacioglu, J. H. Martin, and D. Jurafsky, "Shallow semantic parsing using support vector machines," in Proceedings of HLT-NAACL 2004, [9] B. Pellom, SONIC: The University of Colorado Continuous Speech Recognizer, University of Colorado, Boulder, Colorado, March