Strongly Incremental Repair Detection

Julian Hough 1,2
1 Dialogue Systems Group, Faculty of Linguistics and Literature, Bielefeld University
julian.hough@uni-bielefeld.de

Matthew Purver 2
2 Cognitive Science Research Group, School of Electronic Engineering and Computer Science, Queen Mary University of London
m.purver@qmul.ac.uk

Abstract

We present STIR (STrongly Incremental Repair detection), a system that detects speech repairs and edit terms on transcripts incrementally with minimal latency. STIR uses information-theoretic measures from n-gram models as its principal decision features in a pipeline of classifiers detecting the different stages of repairs. Results on the Switchboard disfluency tagged corpus show utterance-final accuracy on a par with state-of-the-art incremental repair detection methods, but with better incremental accuracy, faster time-to-detection and less computational overhead. We evaluate its performance using incremental metrics and propose new repair processing evaluation standards.

1 Introduction

Self-repairs in spontaneous speech are annotated according to a well established three-phase structure from (Shriberg, 1994) onwards, and as described in Meteer et al. (1995)'s Switchboard corpus annotation handbook:

    John [ likes + {uh} loves ] Mary    (1)

where "likes" is the reparandum, "{uh}" the interregnum and "loves" the repair.

From a dialogue systems perspective, detecting repairs and assigning them the appropriate structure is vital for robust natural language understanding (NLU) in interactive systems. Downgrading the commitment of reparandum phases and assigning appropriate interregnum and repair phases permits computation of the user's intended meaning. Furthermore, the recent focus on incremental dialogue systems (see e.g. (Rieser and Schlangen, 2011)) means that repair detection should operate without unnecessary processing overhead, and function efficiently within an incremental framework. However, such left-to-right operability on its own is not sufficient: in line with the principle of strong incremental interpretation (Milward, 1991), a repair detector should give the best results possible as early as possible. With one exception (Zwarts et al., 2010), there has been no focus on evaluating or improving the incremental performance of repair detection.

In this paper we present STIR (STrongly Incremental Repair detection), a system which addresses the challenges of incremental accuracy, computational complexity and latency in self-repair detection, by making local decisions based on relatively simple measures of fluency and similarity. Section 2 reviews state-of-the-art methods; Section 3 summarizes the challenges and explains our general approach; Section 4 explains STIR in detail; Section 5 explains our experimental set-up and novel evaluation metrics; Section 6 presents and discusses our results; and Section 7 concludes.

2 Previous work

Qian and Liu (2013) achieve the state of the art in Switchboard corpus self-repair detection, with the best published F-score for detecting reparandum words, using a three-step weighted Max-Margin Markov network approach. Similarly, Georgila (2009) uses Integer Linear Programming post-processing of a CRF to achieve F-scores over 0.8 for reparandum start and repair start detection. However, neither approach can operate incrementally.
Recently, there has been increased interest in left-to-right repair detection: Rasooli and Tetreault (2014) and Honnibal and Johnson (2014) present dependency parsing systems with reparandum detection which perform similarly, the latter equalling Qian and Liu (2013)'s F-score. However, while operating left-to-right, these systems are not designed or evaluated for their incremental performance.

The use of beam search over different repair hypotheses in (Honnibal and Johnson, 2014) is likely to lead to unstable repair label sequences, and they report repair hypothesis jitter. Both of these systems use a non-monotonic dependency parsing approach that immediately removes the reparandum from the linguistic analysis of the utterance, in terms of its dependency structure and repair-reparandum correspondence, which from a downstream NLU module's perspective is undesirable. Heeman and Allen (1999) and Miller and Schuler (2008) present earlier left-to-right operational detectors which are less accurate and again give no indication of the incremental performance of their systems. While Heeman and Allen (1999) rely on repair structure template detection coupled with a multi-knowledge-source language model, the rarity of the tail of repair structures is likely to be the reason for lower performance: Hough and Purver (2013) show that only 39% of repair alignment structures appear at least twice in Switchboard, supported by the 29% reported by Heeman and Allen (1999) on the smaller TRAINS corpus. Miller and Schuler (2008)'s encoding of repairs into a grammar also causes sparsity in training: repair is a general processing strategy not restricted to certain lexical items or POS tag sequences.

The model we consider most suitable for incremental dialogue systems so far is Zwarts et al. (2010)'s incremental version of Johnson and Charniak (2004)'s noisy channel repair detector, as it incrementally applies structural repair analyses (rather than just identifying reparanda) and is evaluated for its incremental properties. Following (Johnson and Charniak, 2004), their system uses an n-gram language model trained on roughly 100K utterances of reparandum-excised ("clean") Switchboard data. Its channel model is a statistically-trained S-TAG parser whose grammar has simple reparandum-repair alignment rule categories for its non-terminals (copy, delete, insert, substitute) and words for its terminals. The parser hypothesises all possible repair structures for the string consumed so far in a chart, before pruning the unlikely ones. It performs equally well to the non-incremental model by the end of each utterance (F-score = 0.778), and can make detections early via the addition of a speculative next-word repair completion category to their S-TAG non-terminals.

In terms of incremental performance, they report the novel evaluation metric of time-to-detection for correctly identified repairs, achieving an average of 7.5 words from the start of the reparandum and 4.6 from the start of the repair phase. They also introduce delay accuracy, a word-by-word evaluation against gold-standard disfluency tags up to the word before the current word being consumed (in their terms, the prefix boundary), giving a measure of the stability of the repair hypotheses. They report the F-score at one word back from the current prefix boundary, increasing word-by-word until 6 words back. These results are the point of departure for our work.

3 Challenges and Approach

In this section we summarize the challenges for incremental repair detection: computational complexity, repair hypothesis stability, latency of detection, and repair structure identification. In 3.1 we explain how we address these.
Computational complexity. Approaches to detecting repair structures often use chart storage (Zwarts et al., 2010; Johnson and Charniak, 2004; Heeman and Allen, 1999), which poses a computational overhead: if considering all possible boundary points for a repair structure's 3 phases beginning on any word, for prefixes of length n the number of hypotheses can grow in the order O(n^4). Exploring a subset of this space is necessary for assigning entire repair structures as in (1) above, rather than just detecting reparanda: the (Johnson and Charniak, 2004; Zwarts et al., 2010) noisy-channel detector is the only system that applies such structures, but the potential runtime complexity of decoding these with their S-TAG repair parser is O(n^5). In their approach, complexity is mitigated by imposing a maximum repair length (12 words), and also by using beam search with re-ranking (Lease et al., 2006; Zwarts and Johnson, 2011). If we wish to include full decoding of the repair's structure (argued by Hough and Purver (2013) to be necessary for full interpretation) whilst taking a strictly incremental and time-critical perspective, reducing this complexity by minimizing the size of this search space is crucial.

Stability of repair hypotheses and latency. Using a beam search of n-best hypotheses on a word-by-word basis can cause jitter in the detector's output. While utterance-final accuracy is desired,

for a truly incremental system good intermediate results are equally important. Zwarts et al. (2010)'s time-to-detection results show their system is only certain about a detection after processing the entire repair. This may be due to the string-alignment-inspired S-TAG that matches repair and reparanda: a rough copy dependency only becomes likely once the entire repair has been consumed. The latency of 4.6 words to detection and a relatively slow rise to utterance-final accuracy up to 6 words back is undesirable given repairs have a mean reparandum length of 1.5 words (Hough and Purver, 2013; Shriberg and Stolcke, 1998).

Structural identification. Classifying repairs has been ignored in repair processing, despite the presence of distinct categories (e.g. repeats, substitutions, deletes) with different pragmatic effects (Hough and Purver, 2013).[1] This is perhaps due to lack of clarity in definition: even for human annotators, verbatim repeats notwithstanding, agreement is often poor (Hough and Purver, 2013; Shriberg, 1994). Assigning and evaluating repair (not just reparandum) structures will allow repair interpretation in future; however, work to date evaluates only reparandum detection.

3.1 Our approach

To address the above, we propose an alternative to (Johnson and Charniak, 2004; Zwarts et al., 2010)'s noisy channel model. While the model elegantly captures intuitions about parallelism in repairs and modelling fluency, it relies on string-matching, motivated in a similar way to automatic spelling correction (Brill and Moore, 2000): it assumes a speaker chooses to utter fluent utterance X according to some prior distribution P(X), but a noisy channel causes them instead to utter a noisy Y according to channel model P(Y|X). Estimating P(Y|X) directly from observed data is difficult due to sparsity of repair instances, so a transducer is trained on the rough copy alignments between reparandum and repair. This approach succeeds because repetition and simple substitution repairs are very common; but repair as a psychological process is not driven by string alignment, and deletes, restarts and rarer substitution forms are not captured. Furthermore, the noisy channel model assumes an inherently utterance-global process for generating (and therefore finding) an underlying clean string, much as similar spelling correction models are word-global; we instead take a very local perspective here.

In accordance with psycholinguistic evidence (Brennan and Schober, 2001), we assume characteristics of the repair onset allow hearers to detect it very quickly and solve the continuation problem (Levelt, 1983) of integrating the repair into their linguistic context immediately, before processing or even hearing the end of the repair phase. While repair onsets may take the form of interregna, this is not a reliable signal, occurring in only 15% of repairs (Hough and Purver, 2013; Heeman and Allen, 1999). Our repair onset detection is therefore driven by departures from fluency, via information-theoretic features derived incrementally from a language model, in line with recent psycholinguistic accounts of incremental parsing; see (Keller, 2004; Jaeger and Tily, 2011).

[1] Though see (Germesin et al., 2008) for one approach, albeit using idiosyncratic repair categories.
Considering the time-linear way a repair is processed, and the fact that speakers are exponentially less likely to trace one word further back in a repair as utterance length increases (Shriberg and Stolcke, 1998), backwards search seems to be the most efficient reparandum extent detection method.[2] Features determining the detection of the reparandum extent in the backwards search can also be information-theoretic: entropy measures of distributional parallelism can characterize not only rough copy dependencies, but also distributionally similar or dissimilar correspondences between sequences. Finally, when detecting the repair end and structure, distributional information allows computation of the similarity between reparandum and repair. We argue a local-detection-with-backtracking approach is more cognitively plausible than string-based left-to-right repair labelling, and using this insight should allow an improvement in incremental accuracy, stability and time-to-detection over string-alignment driven approaches to repair detection.

[2] We acknowledge a purely position-based model for reparandum extent detection under-estimates prepositions, which speakers favour as the retrace start, and over-estimates verbs, which speakers tend to avoid retracing back to, preferring to begin the utterance again, as (Healey et al., 2011)'s experiments also demonstrate.

4 STIR: Strongly Incremental Repair detection

Our system, STIR (STrongly Incremental Repair detection), therefore takes a local incremental approach to detecting repairs and isolated edit terms, assigning words the structures in (2). We include interregnum recognition in the process, due to the inclusion of interregnum vocabulary within edit term vocabulary (Ginzburg, 2012; Hough and Purver, 2013), a useful feature for repair detection (Lease et al., 2006; Qian and Liu, 2013).

    John  likes  uh  loves  Mary
    ...  [rm_start/rm_end  + {e}  rp_end]  ...    (2)

Rather than detecting the repair structure in its left-to-right string order as above, STIR functions as in Figure 1: first detecting edit terms (possibly interregna) at step T1; then detecting repair onsets at T2; if one is found, backwards searching to find rm_start at T3; then finally finding the repair end rp_end at T4. Step T1 relies mainly on lexical probabilities from an edit term language model; T2 exploits features of divergence from a fluent language model; T3 uses the fluency of hypothesised repairs; and T4 the similarity between distributions after reparandum and repair. However, each stage integrates these basic insights via multiple related features in a statistical classifier.

Figure 1: Strongly Incremental Repair Detection, shown word-by-word on the utterance "John likes uh loves Mary" (incremental word graph states S_0-S_5, detection steps T0-T5).

4.1 Enriched incremental language models

We derive the basic information-theoretic features required using n-gram language models, as they have a long history of information-theoretic analysis (Shannon, 1948) and provide reproducible results without forcing commitment to one particular grammar formalism. Following recent work on modelling grammaticality judgements (Clark et al., 2013), we implement several modifications to standard language models to develop our basic measures of fluency and uncertainty.

For our main fluent language models we train a trigram model with Kneser-Ney smoothing (Kneser and Ney, 1995) on the words and POS tags of the standard Switchboard training data (all files with conversation numbers beginning sw2*, sw3* in the Penn Treebank III release), consisting of 100K utterances and 600K words. We follow (Johnson and Charniak, 2004) by cleaning the data of disfluencies (i.e. edit terms and reparanda) to approximate a fluent language model. We call these probabilities p_kn^lex and p_kn^pos below.[3]

[3] We suppress the pos and lex superscripts below where we refer to measures from either model.
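The paper does not specify which toolkit was used to train these models, so the following is only a minimal sketch of the setup, assuming NLTK's language-model package; the corpus-loading helper `load_cleaned_switchboard` and its path argument are hypothetical placeholders, not part of STIR.

```python
# Sketch (not the authors' code): train the "fluent" Kneser-Ney trigram model
# on disfluency-cleaned Switchboard utterances with NLTK's lm package.
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

def train_fluent_lm(utterances, order=3):
    """utterances: iterable of token lists with reparanda and edit terms removed,
    e.g. [['john', 'loves', 'mary'], ...]."""
    train_ngrams, vocab = padded_everygram_pipeline(order, utterances)
    lm = KneserNeyInterpolated(order)
    lm.fit(train_ngrams, vocab)
    return lm

# cleaned = load_cleaned_switchboard("swbd_train_sw2_sw3/")  # hypothetical loader
# lex_lm = train_fluent_lm(cleaned)
# log (base 2) probability of a word given its trigram history:
# lex_lm.logscore("loves", ["john", "likes"])
```

The same routine would be run a second time over the POS tag sequences to obtain the p_kn^pos model.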

We then derive surprisal as our principal default lexical uncertainty measurement s (equation (3)) in both models; and, following (Clark et al., 2013), the unigram-Weighted Mean Log trigram probability (WML, eq. (4)): the trigram log probability of the sequence divided by the inverse summed log probability of the component unigrams (apart from the first two words in the sequence, which serve as the first trigram history). As here we use a local approach, we restrict the WML measures to single trigrams (weighted by the inverse log probability of the final word). While use of standard n-gram probability conflates syntactic with lexical probability, WML gives us an approximation to incremental syntactic probability by factoring out lexical frequency.

s(w_{i-2} \dots w_i) = -\log_2 p_{kn}(w_i \mid w_{i-2}, w_{i-1})    (3)

WML(w_0 \dots w_n) = \frac{\sum_{i=2}^{n} \log_2 p_{kn}(w_i \mid w_{i-2}, w_{i-1})}{\sum_{j=2}^{n} \log_2 p_{kn}(w_j)}    (4)

Distributional measures. To approximate uncertainty, we also derive the entropy H(w | c) of the possible word continuations w given a context c, from p(w_i | c) for all words w_i in the vocabulary; see (5). Calculating distributions over the entire lexicon incrementally is costly, so we approximate this by constraining the calculation to words which are observed at least once in context c in training, w_c = {w : count(c, w) >= 1}, assuming a uniform distribution over the unseen continuations by using the appropriate smoothing constant; see eq. (6). Manual inspection showed this approximation to be very close, and the trie structure of our n-gram models allows efficient calculation. We also make use of the Zipfian distribution of n-grams in corpora by storing entropy values for the 20% most common trigram contexts observed in training, leaving entropy values of rare or unseen contexts to be computed at decoding time with little search cost due to their small or empty w_c sets.

H(w \mid c) = -\sum_{w \in Vocab} p_{kn}(w \mid c) \log_2 p_{kn}(w \mid c)    (5)

H(w \mid c) \approx -\left[ \sum_{w \in w_c} p_{kn}(w \mid c) \log_2 p_{kn}(w \mid c) + n \lambda \log_2 \lambda \right], \quad n = |Vocab| - |w_c|, \quad \lambda = \frac{1 - \sum_{w \in w_c} p_{kn}(w \mid c)}{n}    (6)

Given entropy estimates, we can also similarly approximate the Kullback-Leibler (KL) divergence (relative entropy) between the continuation distributions in two different contexts c_1 and c_2, i.e. \theta(w \mid c_1) and \theta(w \mid c_2), by pair-wise computing p(w \mid c_1) \log_2 \frac{p(w \mid c_1)}{p(w \mid c_2)} only for words observed in context c_1 or c_2, then approximating unseen values by assuming uniform distributions. Using smoothed p_kn estimates rather than raw maximum likelihood estimates avoids infinite KL divergence values. Again, we found this approximation sufficiently close to the real values for our purposes.

All such probability and distribution values are stored in incrementally constructed directed acyclic graph (DAG) structures (see Figure 1), exploiting the Markov assumption of n-gram models to allow efficient calculation by avoiding re-computation.

4.2 Individual classifiers

This section details the features used by the 4 individual classifiers. To investigate the utility of the features used in each classifier we obtained values on the standard Switchboard heldout data (PTB III files sw4[5-9]*: 6.4K utterances, 49K words).

Edit term detection

In the first component, we utilise the well-known observation that edit terms have a distinctive vocabulary (Ginzburg, 2012), training a bigram model on a corpus of all edit words annotated in Switchboard's training data. The classifier simply uses the surprisal s^lex from this edit word model, and the trigram surprisal s^lex from the standard fluent model of Section 4.1.
At the current position w_n, one, both or none of the words w_n and w_{n-1} are classified as edits. We found this simple approach effective and stable, although some delayed decisions occur in cases where s^lex and WML^lex are high in both models before the end of the edit term, e.g. "I like I {like} want ...". Words classified as edits are removed from the incremental processing graph (indicated by the dotted line transition in Figure 1) and the repair hypothesis stack updated if repair hypotheses are cancelled due to a delayed edit hypothesis of w_n.

Repair start detection

Repair onset detection is arguably the most crucial component: the greater its accuracy, the better the input for downstream components and the lesser the overhead of filtering false positives required.
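Before detailing the repair onset features, here is a minimal sketch of the information-theoretic measures of Section 4.1 that all the classifiers draw on. The `logprob`, `unigram_logprob` and `prob` callables, the `observed` continuation set and `vocab_size` are caller-supplied assumptions (e.g. thin wrappers over the trigram models sketched above), not STIR's actual interface.

```python
import math

def surprisal(logprob, w, context):
    """Eq. (3): s = -log2 p(w | context); `logprob` returns log2 p(w | context)."""
    return -logprob(w, context)

def wml_single_trigram(logprob, unigram_logprob, w, context):
    """Eq. (4) restricted to one trigram: the trigram log-prob weighted by the
    inverse unigram log-prob of the final word."""
    return logprob(w, context) / unigram_logprob(w)

def entropy_restricted(prob, context, observed, vocab_size):
    """Eq. (6): continuation entropy over words observed in this context, with
    the remaining probability mass spread uniformly over unseen words."""
    h = -sum(prob(w, context) * math.log2(prob(w, context)) for w in observed)
    n_unseen = vocab_size - len(observed)
    if n_unseen > 0:
        lam = max(1.0 - sum(prob(w, context) for w in observed), 0.0) / n_unseen
        if lam > 0:
            h -= n_unseen * lam * math.log2(lam)
    return h

def kl_restricted(prob, c1, c2, observed):
    """Approximate KL(theta(.|c1) || theta(.|c2)) over words observed in either
    context; smoothed probabilities keep the ratio finite."""
    return sum(prob(w, c1) * math.log2(prob(w, c1) / prob(w, c2)) for w in observed)
```

Features such as ΔWML, ΔH and WMLboost below are then simple differences of these quantities at neighbouring word positions or between alternative backtrack hypotheses.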

Figure 2: WML^lex values for trigrams in the repair utterance "i havent had any good really very good experience with child care", exhibiting the drop at the repair onset.

We use Section 4.1's information-theoretic features s, WML and H for words and POS, and introduce 5 additional information-theoretic features: ΔWML is the difference between the WML values at w_{n-1} and w_n; ΔH is the difference in entropy between w_{n-1} and w_n; InformationGain is the difference between the expected entropy at w_{n-1} and the observed surprisal s at w_n, a measure that factors out the effect of naturally high entropy contexts; BestEntropyReduce is the best reduction in entropy possible by an early rough hypothesis of reparandum onsets within 3 words; and BestWMLBoost similarly speculates on the best improvement of WML possible by positing rm_start positions up to 3 words back. We also include simple alignment features: binary features which indicate whether the word w_{i-x} is identical to the current word w_i, for x in {1, 2, 3}. With 6 alignment features, 16 n-gram features and a single logical feature "edit" which indicates the presence of an edit word at position w_{i-1}, repair onset detection uses 23 features; see Table 1.

We hypothesised repair onsets would have significantly lower p^lex (lower lexical-syntactic probability) and WML^lex (lower syntactic probability) than other fluent trigrams. This was the case in the Switchboard heldout data for both measures, with the biggest difference obtained for WML^lex (non-repair-onsets: sd=0.359; repair onsets: sd=0.359). In the POS model, the entropy of continuations H^pos was the strongest feature (non-repair-onsets: sd=0.769; repair onsets: sd=0.899). The trigram WML^lex measure for the repair utterance "I haven't had any [ good + really very good ] experience with child care" can be seen in Figure 2. The steep drop at the repair onset shows the usefulness of WML features as fluency measures.

To compare n-gram measures against other local features, we ranked the features by Information Gain using 10-fold cross-validation over the Switchboard heldout data; see Table 1. The language model features are far more discriminative than the alignment features, showing the potential of a general information-theoretic approach.

Reparandum start detection

In detecting rm_start positions given a repair onset hypothesis (stage T3 in Figure 1), we use the noisy channel intuition that removing the reparandum (from rm_start up to the repair onset) increases the fluency of the utterance, expressed here as the WMLboost described above. When using gold standard input we found this was the case on the heldout data, with a positive mean WMLboost for reparandum onsets (sd=0.267) and a negative mean for other words in the 6-word history (sd=0.224): the negative boost for non-reparandum words captures the intuition that backtracking from those points would make the utterance less grammatical, and conversely the boost afforded by the correct rm_start detection helps solve the continuation problem for the listener (and our detector). Parallelism in the onsets of the repair and the reparandum can also help solve the continuation problem, and in fact the KL divergence between the POS continuation distributions θ^pos at the reparandum onset and at the repair onset is the second most useful feature in cross-validation.

The highest ranked feature is ΔWML (0.437), which here encodes the drop in the WMLboost from one backtrack position to the next. In ranking the 32 features we use, again the information-theoretic ones are ranked higher than the logical features.

average rank | attribute
1    | H pos
2    | WML pos
3.4  | WML lex
4    | s pos
5.9  | w_{i-1} = w_i
5.9  | BestWMLBoost lex
5.9  | InformationGain pos
7.9  | BestWMLBoost pos
9    | H lex
10.4 | WML pos (merit 0.08)
10.6 | H pos (merit 0.08)
12   | POS_{i-1} = POS_i
13.1 | s lex
14.2 | WML lex
14.7 | BestEntropyReduce pos
16.3 | InformationGain lex
16.7 | BestEntropyReduce lex
18   | H lex
19   | w_{i-2} = w_i
20   | POS_{i-2} = POS_i (merit 0.01)
21   | w_{i-3} = w_i
22   | edit
23   | POS_{i-3} = POS_i

Table 1: Feature ranking by Information Gain for repair onset detection, 10-fold cross-validation on the Switchboard heldout data.

Repair end detection and structure classification

For rp_end detection, using the notion of parallelism, we hypothesise an effect of divergence between θ^lex at the reparandum-final word rm_end and the repair-final word rp_end: for repetition repairs, the KL divergence will trivially be 0; for substitutions, it will be higher; for deletes, higher still. Upon inspection of our feature ranking this KL measure ranked 5th out of 23 features. We introduce another feature encoding parallelism, ReparandumRepairDifference: the difference in probability between the utterance cleaned of the reparandum and the utterance with its repair phase substituting its reparandum. In both the POS (merit=0.366) and word (merit=0.352) LMs, this was the most discriminative feature.

4.3 Classifier pipeline

STIR effects a pipeline of classifiers as in Figure 3, where the edit term classifier only permits non-edit words to be passed on to repair onset classification and to rp_end classification of the active repair hypotheses, maintained in a stack. The repair onset classifier passes positive repair onset hypotheses to the rm_start classifier, which backwards-searches up to 7 words back in the utterance. If an rm_start is classified, the output is passed on for rp_end classification at the end of the pipeline, and if not rejected there it is pushed onto the repair stack. Repair hypotheses are popped off when the string is 7 words beyond their onset position. Putting limits on the stack's storage space is a way of controlling processing overhead and complexity. Embedded repairs whose rm_start coincides with another's are easily dealt with, as they are added to the stack as separate hypotheses.[4]

Classifiers. Classifiers are implemented using Random Forests (Breiman, 2001), and we use different error functions for each stage using MetaCost (Domingos, 1999). The flexibility afforded by implementing adjustable error functions in a pipelined incremental processor allows control of the trade-off of immediate accuracy against runtime and stability of the sequence classification.

Processing complexity. This pipeline avoids an exhaustive search over all repair hypotheses. If we limit the search to a single repair onset hypothesis per word, with a backwards search over the possible rm_start positions, the number of repair hypotheses grows in the triangular number series, i.e. approximately n(n+1)/2, a nested loop over previous words as n gets incremented, which in terms of its complexity class is quadratic, O(n^2). If we allow more than one rm_start hypothesis per word, the complexity goes up to O(n^3); however, in the tests that we describe below, we are able to achieve good detection results without permitting this extra search space.
Under our assumption that reparandum onset detection is only triggered after repair onset detection, and repair extent detection is dependent on positive reparandum onset detection, a pipeline with accurate components will allow us to limit processing to a small subset of this search space.

[4] We constrain the problem not to include embedded deletes which may share their onset word with another repair; these are in practice very rare.
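The control flow of Sections 4.2-4.3 can be summarised in a short sketch. The classifier bundle `clf` and its method names are hypothetical stand-ins for STIR's trained Random Forest components; only the stack handling and the 7-word backtracking window follow the description above.

```python
# Minimal pipeline sketch (not the released STIR code): edit filter -> repair
# onset detection -> backwards search for rm_start -> rp_end classification,
# with a stack of pending repair hypotheses bounded by a 7-word window.
MAX_BACKTRACK = 7

class RepairHypothesis:
    def __init__(self, rp_start, rm_start=None, rp_end=None):
        self.rp_start, self.rm_start, self.rp_end = rp_start, rm_start, rp_end

def process_word(i, words, stack, clf):
    """clf bundles the four (hypothetical) classifiers: is_edit, is_rp_start,
    is_rm_start and is_rp_end, each wrapping a trained model."""
    if clf.is_edit(words, i):               # T1: edit terms never enter the pipeline
        return
    if clf.is_rp_start(words, i):           # T2: departure-from-fluency features
        hyp = RepairHypothesis(rp_start=i)
        # T3: backwards search for the reparandum onset, at most 7 words back
        for j in range(i - 1, max(i - 1 - MAX_BACKTRACK, -1), -1):
            if clf.is_rm_start(words, j, hyp):
                hyp.rm_start = j
                break
        if hyp.rm_start is not None:
            stack.append(hyp)
    for hyp in list(stack):                  # T4: try to close open hypotheses
        if clf.is_rp_end(words, i, hyp):
            hyp.rp_end = i
            stack.remove(hyp)
        elif i - hyp.rp_start > MAX_BACKTRACK:
            stack.remove(hyp)                # prune stale hypotheses (bounded stack)
```

Under the 1-best condition the inner loops touch at most a constant number of positions per word, which is what keeps the per-word processing overhead low.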

Figure 3: Classifier pipeline.

5 Experimental set-up

We train STIR on the Switchboard data described above, and test it on the standard Switchboard test data (PTB III files 4[0-1]*). In order to avoid overfitting the classifiers to the basic language models, we use a cross-fold training approach: we divide the corpus into 10 folds and use language models trained on 9 folds to obtain feature values for the 10th fold, repeating for all 10. Classifiers are then trained as standard on the resulting feature-annotated corpus. This resulted in better feature utility for the n-grams and better F-score results for detection in all components, in the order of 5-6%.[5]

Training the classifiers. Each Random Forest classifier was limited to 20 trees of maximum depth 4 nodes, putting a ceiling on decoding time. In making the classifiers cost-sensitive, MetaCost resamples the data in accordance with the cost functions: we found using 10 iterations over a resample of 25% of the training data gave the most effective trade-off between training time and accuracy.[6] We use 8 different cost functions in repair onset classification with differing costs for false negatives and false positives, of the form below, where R is a repair element word and F is a fluent word:

             R_hyp   F_hyp
    R_gold     0       2

We adopt a similar technique in rm_start classification using 5 different cost functions and in rp_end classification using 8 different settings, which when combined gives a total of 320 different cost function configurations. We hypothesise that higher recall permitted in the pipeline's first components would result in better overall accuracy as these hypotheses become refined, though at the cost of the stability of the hypothesis sequence and extra downstream processing in pruning false positives.

We also experiment with the number of repair hypotheses permitted per word, using limits of 1-best and 2-best hypotheses. We expect that allowing 2 hypotheses to be explored per repair onset should allow greater final accuracy, but with the trade-off of greater decoding and training complexity, and possible incremental instability. As we wish to explore the incrementality versus final accuracy trade-off that STIR can achieve, we now describe the evaluation metrics we employ.

[5] Zwarts and Johnson (2011) take a similar approach on Switchboard data to train a re-ranker of repair analyses.
[6] As (Domingos, 1999) demonstrated, there are only relatively small accuracy gains when using more than this, with training time increasing in the order of the re-sample size.

5.1 Incremental evaluation metrics

Following (Baumann et al., 2011) we divide our evaluation metrics into similarity metrics (measures of equality with or similarity to a gold standard), timing metrics (measures of the timing of relevant phenomena detected against the gold standard) and diachronic metrics (the evolution of incremental hypotheses over time).

Similarity metrics. For direct comparison to previous approaches we use the standard measure of overall accuracy, the F-score over reparandum words, which we abbreviate F_rm; see (7):

precision = \frac{|rm_{correct}|}{|rm_{hyp}|}, \quad recall = \frac{|rm_{correct}|}{|rm_{gold}|}, \quad F_{rm} = \frac{2 \cdot precision \cdot recall}{precision + recall}    (7)

As we are also interested in repair structural classification, we also measure the F-score over all repair components (rm words, edit words acting as interregna, and rp words), a metric we abbreviate F_s. This is not measured in standard repair detection work on Switchboard.
To investigate incremental accuracy we evaluate the delay accuracy (DA) introduced by (Zwarts et al., 2010), as described in Section 2, against the utterance-final gold standard disfluency annotations, and use the mean of the 6 word-by-word F-scores (from 1 to 6 words back from the prefix boundary).
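A minimal sketch of these two similarity metrics follows, assuming disfluency tags are represented as per-word label sequences ("rm", "e", "rp" or "f" for fluent); this label inventory and the helper names are illustrative assumptions rather than the Switchboard annotation scheme.

```python
# Sketch of the similarity metrics: F-score over reparandum words (eq. (7)) and
# delay accuracy, i.e. word-by-word F-scores k words back from the prefix
# boundary, averaged over k = 1..6.
def f_score_rm(hyp_tags, gold_tags, target="rm"):
    hyp = {i for i, t in enumerate(hyp_tags) if t == target}
    gold = {i for i, t in enumerate(gold_tags) if t == target}
    if not hyp or not gold:
        return 0.0
    correct = len(hyp & gold)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(hyp), correct / len(gold)
    return 2 * precision * recall / (precision + recall)

def delay_accuracy(incremental_outputs, gold_tags, max_delay=6):
    """incremental_outputs[n] is the tag sequence hypothesised after consuming
    word n; the tag at position n-k is compared against the gold standard."""
    scores = []
    for k in range(1, max_delay + 1):
        hyp_at_k, gold_at_k = [], []
        for n, output in enumerate(incremental_outputs):
            if n - k >= 0:
                hyp_at_k.append(output[n - k])
                gold_at_k.append(gold_tags[n - k])
        scores.append(f_score_rm(hyp_at_k, gold_at_k))
    return sum(scores) / len(scores)
```

In practice these counts would be accumulated over the whole test corpus rather than per utterance, but the per-utterance version shows the shape of the computation.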

Figure 4: Edit Overhead example for the incremental input "John likes uh loves Mary": 4 unnecessary edits.

Timing and resource metrics. Again for comparative purposes we use Zwarts et al.'s time-to-detection metrics, that is, the two average distances (in numbers of words) consumed before first detection of gold standard repairs, one from rm_start (TD_rm) and one from the repair onset (TD_rp). In our 1-best detection system, before evaluation we know a priori that TD_rp will be 1 token, and TD_rm will be 1 more than the average length of the rm_start to rp_start spans correctly detected. However, when we introduce a beam where multiple rm_starts are possible per repair onset, with the most likely hypothesis committed as the current output, the latency may begin to increase: the initially most probable hypothesis may not be the correct one. In addition to output timing metrics, we account for intrinsic processing complexity with the metric processing overhead (PO), which is the number of classifications made by all components per word of input.

Diachronic metrics. To measure the stability of repair hypotheses over time we use (Baumann et al., 2011)'s edit overhead (EO) metric. EO measures the proportion of edits (add, revoke, substitute) applied to a processor's output structure that are unnecessary. STIR's output is the repair label sequence shown in Figure 1; however, rather than evaluating its EO against the current gold standard labels, we use a new mark-up we term the incremental repair gold standard: this does not penalise lack of detection of a reparandum word rm as a bad edit until the corresponding repair onset of that rm has been consumed. While F_rm, F_s and DA evaluate against what Baumann et al. (2011) call the current gold standard, the incremental gold standard reflects the repair processing approach we set out in Section 3 (a sketch of the EO computation is given below). An example of a repair utterance with an EO of 44% (4/9) can be seen in Figure 4: of the 9 edits (7 repair annotations and 2 correct fluent words), 4 are unnecessary (bracketed). Note the final rm edit is not counted as a bad edit for the reasons just given.

Figure 6: Delay Accuracy curves.

6 Results and Discussion

We evaluate on the Switchboard test data; Table 2 shows results of the best performing settings for each of the metrics described above, together with the setting achieving the highest total score (TS), the average percentage achieved of the best performing system's result in each metric.[7] The settings found to achieve the highest F_rm (the metric standardly used in disfluency detection), and those found to achieve the highest TS, for each stage in the pipeline, are shown in Figure 5.

Our experiments show that different system settings perform better on different metrics, and no individual setting achieved the best result in all of them. Our best utterance-final F_rm reaches 0.779, marginally though not significantly exceeding (Zwarts et al., 2010)'s measure, and STIR also reports results on the previously unevaluated F_s. The setting with the best DA improves significantly on (Zwarts et al., 2010)'s result in terms of mean values (0.718 for STIR), and also in terms of the steepness of the curves (Figure 6). The fastest average time to detection is 1 word for TD_rp and 2.6 words for TD_rm (Table 3), improving dramatically on the noisy channel model's 4.6 and 7.5 words.
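As flagged above, the edit overhead computation can be sketched as follows, diffing successive incremental label sequences and counting edits that the final output did not need. The per-position diffing scheme is a simplifying assumption, not Baumann et al.'s exact formulation or the incremental repair gold standard variant.

```python
# Sketch of edit overhead (EO): the proportion of edits applied to the
# incremental output that are unnecessary, i.e. not needed to reach the final
# label sequence. Edits here are per-position label changes between successive
# outputs (a simplification of add/revoke/substitute).
def edits_between(prev, curr):
    """Label changes from one incremental output to the next, including labels
    for newly consumed words (prev is padded with None)."""
    padded = list(prev) + [None] * (len(curr) - len(prev))
    return [(i, curr[i]) for i in range(len(curr)) if padded[i] != curr[i]]

def edit_overhead(incremental_outputs):
    final = incremental_outputs[-1]
    total, unnecessary = 0, 0
    prev = []
    for curr in incremental_outputs:
        for position, label in edits_between(prev, curr):
            total += 1
            if label != final[position]:   # an edit later revoked or overwritten
                unnecessary += 1
        prev = curr
    return unnecessary / total if total else 0.0

# e.g. labels per word after each new word of "John likes uh loves Mary":
# outputs = [["f"], ["f", "f"], ["f", "f", "e"], ["f", "rm", "e", "rp"],
#            ["f", "rm", "e", "rp", "f"]]
# edit_overhead(outputs)
```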
Incrementality versus accuracy trade-off. We aim to investigate how well a system can do in terms of achieving both good final accuracy and incremental performance, and while the best F_rm setting had a large PO and a relatively slow DA increase, we find STIR can find a good trade-off setting.

[7] We do not include time-to-detection scores in TS as they did not vary enough between settings to be significant; however, there was a difference in this measure between the 1-best stack condition and the 2-best stack condition; see below.

Figure 5: The cost function settings for the MetaCost classifiers for each component, for the best F_rm setting (false negative costs of 64 for repair onsets, 8 for rm_start and 2 for rp_end; stack depth = 2) and the best total score (TS) setting (false negative costs of 2 for repair onsets, 16 for rm_start and 8 for rp_end; stack depth = 1).

Table 2: Comparison of the best performing system settings on each measure: best final rm F-score (F_rm), best final repair structure F-score (F_s), best delay accuracy of rm (DA), best (lowest) edit overhead (EO), best (lowest) processing overhead (PO), and best total score (mean % of best scores, TS).

Table 3: Comparison of performance of systems with different stack capacities (1-best vs. 2-best rm_start hypotheses) on F_rm, F_s, DA, EO, PO, TD_rp and TD_rm.

The highest TS scoring setting achieves a competitive F_rm whilst also exhibiting a very good DA (0.711), over 98% of the best recorded score, and low PO and EO rates, over 96% of the best recorded scores; see the bottom row of Table 2. As can be seen in Figure 5, the cost functions for these winning settings are different in nature. The best non-incremental F_rm setting requires high recall for the rest of the pipeline to work on, using the highest cost, 64, for false negative repair onset words and the highest stack depth of 2 (similar to a wider beam); but the best overall TS scoring system uses a less permissive setting to increase incremental performance.

We make a preliminary investigation into the effect of increasing the stack capacity by comparing stacks with 1-best rm_start hypotheses per repair onset and 2-best stacks. The average differences between the two conditions are shown in Table 3. Moving to the 2-best stack condition results in a gain in overall accuracy in F_rm and F_s, but at the cost of EO and also the time-to-detection scores TD_rm and TD_rp. The extent to which the stack can be increased without increasing jitter, latency and complexity will be investigated in future work.

7 Conclusion

We have presented STIR, an incremental repair detector that can be used to experiment with incremental performance and accuracy trade-offs. In future work we plan to include probabilistic and distributional features from a top-down incremental parser, e.g. Roark et al. (2009), and to use STIR's distributional features to classify repair type.

Acknowledgements

We thank the three anonymous EMNLP reviewers for their helpful comments. Hough is supported by the DUEL project, financially supported by the Agence Nationale de la Recherche (grant number ANR-13-FRAL-0001) and the Deutsche Forschungsgemeinschaft. Much of the work was carried out with support from an EPSRC DTA scholarship at Queen Mary University of London. Purver is partly supported by ConCreTe: the project ConCreTe acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under an FET grant.

References

T. Baumann, O. Buß, and D. Schlangen. 2011. Evaluation and optimisation of incremental processors. Dialogue & Discourse, 2(1).

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5-32.

S. E. Brennan and M. F. Schober. 2001. How listeners compensate for disfluencies in spontaneous speech. Journal of Memory and Language, 44(2).

Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Alexander Clark, Gianluca Giorgolo, and Shalom Lappin. 2013. Statistical representation of grammaticality judgements: the limits of n-gram models. In Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL), pages 28-36, Sofia, Bulgaria, August. Association for Computational Linguistics.

Pedro Domingos. 1999. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.

Kallirroi Georgila. 2009. Using integer linear programming for detecting speech disfluencies. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. Association for Computational Linguistics.

Sebastian Germesin, Tilman Becker, and Peter Poller. 2008. Hybrid multi-step disfluency detection. In Machine Learning for Multimodal Interaction. Springer.

Jonathan Ginzburg. 2012. The Interactive Stance: Meaning for Conversation. Oxford University Press.

P. G. T. Healey, Arash Eshghi, Christine Howes, and Matthew Purver. 2011. Making a contribution: Processing clarification requests in dialogue. In Proceedings of the 21st Annual Meeting of the Society for Text and Discourse, Poitiers, July.

Peter Heeman and James Allen. 1999. Speech repairs, intonational phrases, and discourse markers: modeling speakers' utterances in spoken dialogue. Computational Linguistics, 25(4).

Matthew Honnibal and Mark Johnson. 2014. Joint incremental disfluency detection and dependency parsing. Transactions of the Association for Computational Linguistics (TACL), 2.

Julian Hough and Matthew Purver. 2013. Modelling expectation in the self-repair processing of annotat-, um, listeners. In Proceedings of the 17th SemDial Workshop on the Semantics and Pragmatics of Dialogue (DialDam), Amsterdam, December.

T. Florian Jaeger and Harry Tily. 2011. On language utility: Processing complexity and communicative efficiency. Wiley Interdisciplinary Reviews: Cognitive Science, 2(3).

Mark Johnson and Eugene Charniak. 2004. A TAG-based noisy channel model of speech repairs. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 33-39, Barcelona. Association for Computational Linguistics.

Frank Keller. 2004. The entropy rate principle as a predictor of processing effort: An evaluation against eye-tracking data. In EMNLP.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), volume 1. IEEE.

Matthew Lease, Mark Johnson, and Eugene Charniak. 2006. Recognizing disfluencies in conversational speech. IEEE Transactions on Audio, Speech, and Language Processing, 14(5).

W. J. M. Levelt. 1983. Monitoring and self-repair in speech. Cognition, 14(1).

M. Meteer, A. Taylor, R. MacIntyre, and R. Iyer. 1995. Disfluency annotation stylebook for the Switchboard corpus. Technical report, Department of Computer and Information Science, University of Pennsylvania.

Tim Miller and William Schuler. 2008. A syntactic time-series model for parsing fluent and disfluent speech. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING), Volume 1. Association for Computational Linguistics.

David Milward. 1991. Axiomatic Grammar, Non-Constituent Coordination and Incremental Interpretation. Ph.D. thesis, University of Cambridge.

Xian Qian and Yang Liu. 2013. Disfluency detection using multi-step stacked learning. In Proceedings of NAACL-HLT.

Mohammad Sadegh Rasooli and Joel Tetreault. 2014. Non-monotonic parsing of fluent umm I mean disfluent sentences. In EACL 2014.

Hannes Rieser and David Schlangen. 2011. Introduction to the special issue on incremental processing in dialogue. Dialogue & Discourse, 2(1).

Brian Roark, Asaf Bachrach, Carlos Cardenas, and Christophe Pallier. 2009. Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 1. Association for Computational Linguistics.

Claude E. Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal.

Elizabeth Shriberg and Andreas Stolcke. 1998. How far do speakers back up in repairs? A quantitative model. In Proceedings of the International Conference on Spoken Language Processing.

Elizabeth Shriberg. 1994. Preliminaries to a Theory of Speech Disfluencies. Ph.D. thesis, University of California, Berkeley.

Simon Zwarts and Mark Johnson. 2011. The impact of language models and loss functions on repair disfluency detection. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT '11), Volume 1, Stroudsburg, PA, USA. Association for Computational Linguistics.

Simon Zwarts, Mark Johnson, and Robert Dale. 2010. Detecting speech repairs incrementally using a noisy channel approach. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), Stroudsburg, PA, USA. Association for Computational Linguistics.


More information

THUTR: A Translation Retrieval System

THUTR: A Translation Retrieval System THUTR: A Translation Retrieval System Chunyang Liu, Qi Liu, Yang Liu, and Maosong Sun Department of Computer Science and Technology State Key Lab on Intelligent Technology and Systems National Lab for

More information

English Grammar Checker

English Grammar Checker International l Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 English Grammar Checker Pratik Ghosalkar 1*, Sarvesh Malagi 2, Vatsal Nagda 3,

More information

Tracking Lexical and Syntactic Alignment in Conversation

Tracking Lexical and Syntactic Alignment in Conversation Tracking Lexical and Syntactic Alignment in Conversation Christine Howes, Patrick G. T. Healey and Matthew Purver {chrizba,ph,mpurver}@dcs.qmul.ac.uk Queen Mary University of London Interaction, Media

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

More information

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks Firoj Alam 1, Anna Corazza 2, Alberto Lavelli 3, and Roberto Zanoli 3 1 Dept. of Information Eng. and Computer Science, University of Trento,

More information

A chart generator for the Dutch Alpino grammar

A chart generator for the Dutch Alpino grammar June 10, 2009 Introduction Parsing: determining the grammatical structure of a sentence. Semantics: a parser can build a representation of meaning (semantics) as a side-effect of parsing a sentence. Generation:

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

PTE Academic Preparation Course Outline

PTE Academic Preparation Course Outline PTE Academic Preparation Course Outline August 2011 V2 Pearson Education Ltd 2011. No part of this publication may be reproduced without the prior permission of Pearson Education Ltd. Introduction The

More information

Clustering Connectionist and Statistical Language Processing

Clustering Connectionist and Statistical Language Processing Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised

More information

Terminology Extraction from Log Files

Terminology Extraction from Log Files Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier

More information

Reading Assistant: Technology for Guided Oral Reading

Reading Assistant: Technology for Guided Oral Reading A Scientific Learning Whitepaper 300 Frank H. Ogawa Plaza, Ste. 600 Oakland, CA 94612 888-358-0212 www.scilearn.com Reading Assistant: Technology for Guided Oral Reading Valerie Beattie, Ph.D. Director

More information

Strategies for Training Large Scale Neural Network Language Models

Strategies for Training Large Scale Neural Network Language Models Strategies for Training Large Scale Neural Network Language Models Tomáš Mikolov #1, Anoop Deoras 2, Daniel Povey 3, Lukáš Burget #4, Jan Honza Černocký #5 # Brno University of Technology, Speech@FIT,

More information

A Systematic Cross-Comparison of Sequence Classifiers

A Systematic Cross-Comparison of Sequence Classifiers A Systematic Cross-Comparison of Sequence Classifiers Binyamin Rozenfeld, Ronen Feldman, Moshe Fresko Bar-Ilan University, Computer Science Department, Israel grurgrur@gmail.com, feldman@cs.biu.ac.il,

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Automated Content Analysis of Discussion Transcripts

Automated Content Analysis of Discussion Transcripts Automated Content Analysis of Discussion Transcripts Vitomir Kovanović v.kovanovic@ed.ac.uk Dragan Gašević dgasevic@acm.org School of Informatics, University of Edinburgh Edinburgh, United Kingdom v.kovanovic@ed.ac.uk

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language

More information

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg March 1, 2007 The catalogue is organized into sections of (1) obligatory modules ( Basismodule ) that

More information

Twitter sentiment vs. Stock price!

Twitter sentiment vs. Stock price! Twitter sentiment vs. Stock price! Background! On April 24 th 2013, the Twitter account belonging to Associated Press was hacked. Fake posts about the Whitehouse being bombed and the President being injured

More information

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation: CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

More information

Semantic parsing with Structured SVM Ensemble Classification Models

Semantic parsing with Structured SVM Ensemble Classification Models Semantic parsing with Structured SVM Ensemble Classification Models Le-Minh Nguyen, Akira Shimazu, and Xuan-Hieu Phan Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa,

More information

Reading errors made by skilled and unskilled readers: evaluating a system that generates reports for people with poor literacy

Reading errors made by skilled and unskilled readers: evaluating a system that generates reports for people with poor literacy University of Aberdeen Department of Computing Science Technical Report AUCS/TR040 Reading errors made by skilled and unskilled readers: evaluating a system that generates reports for people with poor

More information

PCFGs with Syntactic and Prosodic Indicators of Speech Repairs

PCFGs with Syntactic and Prosodic Indicators of Speech Repairs PCFGs with Syntactic and Prosodic Indicators of Speech Repairs John Hale a Izhak Shafran b Lisa Yung c Bonnie Dorr d Mary Harper de Anna Krasnyanskaya f Matthew Lease g Yang Liu h Brian Roark i Matthew

More information

Applying Repair Processing in Chinese Homophone Disambiguation

Applying Repair Processing in Chinese Homophone Disambiguation Applying Repair Processing in Chinese Homophone Disambiguation Yue-Shi Lee and Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan, R.O.C.

More information

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014 Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

More information

Inference of Probability Distributions for Trust and Security applications

Inference of Probability Distributions for Trust and Security applications Inference of Probability Distributions for Trust and Security applications Vladimiro Sassone Based on joint work with Mogens Nielsen & Catuscia Palamidessi Outline 2 Outline Motivations 2 Outline Motivations

More information

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction. Günter Neumann, DFKI, 2012 Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

More information

Scaling Shrinkage-Based Language Models

Scaling Shrinkage-Based Language Models Scaling Shrinkage-Based Language Models Stanley F. Chen, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, Abhinav Sethy IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights, NY 10598 USA {stanchen,mangu,bhuvana,sarikaya,asethy}@us.ibm.com

More information

II. RELATED WORK. Sentiment Mining

II. RELATED WORK. Sentiment Mining Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

More information

2014/02/13 Sphinx Lunch

2014/02/13 Sphinx Lunch 2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue

More information

AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS

AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS PIERRE LANCHANTIN, ANDREW C. MORRIS, XAVIER RODET, CHRISTOPHE VEAUX Very high quality text-to-speech synthesis can be achieved by unit selection

More information

31 Case Studies: Java Natural Language Tools Available on the Web

31 Case Studies: Java Natural Language Tools Available on the Web 31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software

More information

Natural Language to Relational Query by Using Parsing Compiler

Natural Language to Relational Query by Using Parsing Compiler Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Data Deduplication in Slovak Corpora

Data Deduplication in Slovak Corpora Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain

More information

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

More information

Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

More information

Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

More information

The OpenGrm open-source finite-state grammar software libraries

The OpenGrm open-source finite-state grammar software libraries The OpenGrm open-source finite-state grammar software libraries Brian Roark Richard Sproat Cyril Allauzen Michael Riley Jeffrey Sorensen & Terry Tai Oregon Health & Science University, Portland, Oregon

More information

Log-Linear Models. Michael Collins

Log-Linear Models. Michael Collins Log-Linear Models Michael Collins 1 Introduction This note describes log-linear models, which are very widely used in natural language processing. A key advantage of log-linear models is their flexibility:

More information

Perplexity Method on the N-gram Language Model Based on Hadoop Framework

Perplexity Method on the N-gram Language Model Based on Hadoop Framework 94 International Arab Journal of e-technology, Vol. 4, No. 2, June 2015 Perplexity Method on the N-gram Language Model Based on Hadoop Framework Tahani Mahmoud Allam 1, Hatem Abdelkader 2 and Elsayed Sallam

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

PoS-tagging Italian texts with CORISTagger

PoS-tagging Italian texts with CORISTagger PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy fabio.tamburini@unibo.it Abstract. This paper presents an evolution of CORISTagger [1], an high-performance

More information

Turkish Radiology Dictation System

Turkish Radiology Dictation System Turkish Radiology Dictation System Ebru Arısoy, Levent M. Arslan Boaziçi University, Electrical and Electronic Engineering Department, 34342, Bebek, stanbul, Turkey arisoyeb@boun.edu.tr, arslanle@boun.edu.tr

More information

Package syuzhet. February 22, 2015

Package syuzhet. February 22, 2015 Type Package Package syuzhet February 22, 2015 Title Extracts Sentiment and Sentiment-Derived Plot Arcs from Text Version 0.2.0 Date 2015-01-20 Maintainer Matthew Jockers Extracts

More information

Identifying SPAM with Predictive Models

Identifying SPAM with Predictive Models Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to

More information

Applying Co-Training Methods to Statistical Parsing. Anoop Sarkar http://www.cis.upenn.edu/ anoop/ anoop@linc.cis.upenn.edu

Applying Co-Training Methods to Statistical Parsing. Anoop Sarkar http://www.cis.upenn.edu/ anoop/ anoop@linc.cis.upenn.edu Applying Co-Training Methods to Statistical Parsing Anoop Sarkar http://www.cis.upenn.edu/ anoop/ anoop@linc.cis.upenn.edu 1 Statistical Parsing: the company s clinical trials of both its animal and human-based

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Bootstrapping Big Data

Bootstrapping Big Data Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,

More information

TREC 2007 ciqa Task: University of Maryland

TREC 2007 ciqa Task: University of Maryland TREC 2007 ciqa Task: University of Maryland Nitin Madnani, Jimmy Lin, and Bonnie Dorr University of Maryland College Park, Maryland, USA nmadnani,jimmylin,bonnie@umiacs.umd.edu 1 The ciqa Task Information

More information