Dynamic Finite-State Transducer Composition with Look-Ahead for Very-Large-Scale Speech Recognition

Cyril Allauzen - allauzen@google.com
Ciprian Chelba - ciprianchelba@google.com
Boulos Harb - harb@google.com
Michael Riley - riley@google.com
Johan Schalkwyk - johans@google.com

Aug 19, 2010
Weighted Finite-State Transducers in Speech Recognition - I

WFSTs are a general and efficient representation for many speech and NLP problems; see: Mohri et al., "Speech recognition with weighted finite-state transducers," in Handbook of Speech Processing, Springer, 2008.

In ASR, they have been used to:

Represent models:
- G: n-gram language model (automaton over words)
- L: pronunciation lexicon (transducer from CI phones to words)
- C: context dependency (transducer from CD phones to CI phones)

Combine and optimize models:
- Composition: computes the relational composition of two transducers.
- Epsilon removal: finds an equivalent WFST with no ε-transitions.
- Determinization: finds an equivalent WFST that has no identically-labeled transitions leaving a state.
- Minimization: finds an equivalent deterministic WFST with the fewest states and arcs.
Weighted Finite-State Transducers in Speech Recognition - II

Advantages:
- Uniform data representation
- General, efficient, mathematically well-defined and reusable combination and optimization operations
- Variant systems realized in data, not code

OpenFst, an open-source finite-state transducer library, was used for this work (http://www.openfst.org). Released under the Apache license; used in many speech and NLP applications.
Weighted Acceptors

Finite automata with labels and weights.

Example: word pronunciation acceptor for "data":
[Figure: acceptor with states 0-4 and transitions 0 -d/1-> 1, 1 -ey/0.5-> 2, 1 -ae/0.5-> 2, 2 -t/0.3-> 3, 2 -dx/0.7-> 3, 3 -ax/1-> 4]
Weighted Transducers

Finite automata with input labels, output labels, and weights.

Example: word pronunciation transducer:
[Figure: transducer with paths 0 -d:data/1-> 1 -ey:ε/0.5 | ae:ε/0.5-> 2 -t:ε/0.3 | dx:ε/0.7-> 3 -ax:ε/1-> 4/0 and 0 -d:dew/1-> 5 -uw:ε/1-> 6/0]

L: closed union of the V word pronunciation transducers.
G: an n-gram model is a WFSA with (at most) V^(n-1) states.
Context-Dependent Triphone Transducer C

[Figure: triphone transducer over the phone alphabet {x, y}; states encode context pairs such as (ε,*), (x,x), (x,y), (y,x), (y,y), (x,ε), (y,ε), with transitions such as x:x/y_x relating phone x to the context-dependent unit x/y_x (x in left context y and right context x).]
Recognition Transducer Construction

The models C, L, G can be combined and optimized with weighted finite-state composition and determinization as:

    C ∘ det(L ∘ G)    (1)

An alternative construction, producing an equivalent transducer, is:

    (C ∘ det(L)) ∘ G    (2)

If G is deterministic, Eq. 2 could be as efficient as Eq. 1 and avoids the determinization of L ∘ G, greatly saving time and memory and allowing fast dynamic combination (useful in applications).

However, standard composition presents three problems with Eq. 2:
1. Determinization of L moves back word labels, creating delay in matching and creating (possibly very many) useless composition paths.
2. The delayed word labels in L produce a much larger composed machine when G is an n-gram LM.
3. The delayed word labels push back the grammar weights along paths in the composed machine, to the detriment of ASR pruning.
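For concreteness, here is a minimal OpenFst sketch of the two constructions. The model file names are hypothetical, error handling is omitted, and determinizability of L (and of L ∘ G for Eq. 1) is assumed, e.g. via auxiliary disambiguation symbols:

    #include <fst/fstlib.h>
    using namespace fst;

    int main() {
      // Hypothetical model files: context dependency C, lexicon L, LM G.
      StdVectorFst *c = StdVectorFst::Read("c.fst");
      StdVectorFst *l = StdVectorFst::Read("l.fst");
      StdVectorFst *g = StdVectorFst::Read("g.fst");
      ArcSort(c, OLabelCompare<StdArc>());  // composition matches output side
      ArcSort(l, OLabelCompare<StdArc>());

      // Eq. (1): C o det(L o G). Determinizing L o G is the costly step.
      StdVectorFst lg, det_lg, clg;
      Compose(*l, *g, &lg);
      Determinize(lg, &det_lg);   // assumes L o G is determinizable
      Compose(*c, det_lg, &clg);

      // Eq. (2): (C o det(L)) o G. Only L is determinized; the outer
      // composition can stay dynamic (on-demand) via ComposeFst.
      StdVectorFst det_l, cl;
      Determinize(*l, &det_l);    // assumes L is determinizable (e.g.,
                                  // homophones disambiguated)
      Compose(*c, det_l, &cl);
      ArcSort(&cl, OLabelCompare<StdArc>());
      StdComposeFst clg_dynamic(cl, *g);  // states expanded lazily in search

      delete c; delete l; delete g;
      return 0;
    }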
Composition Example

[Figure: L, det(L), and G for the words red, read, reed, road, rode, together with L ∘ G and det(L) ∘ G, where G = {red/0.6, read/0.4}. In det(L), determinization delays the word output labels until the final d transitions (d:red, d:read, d:reed, d:road, d:rode), so in det(L) ∘ G the word identities and grammar weights appear only at the end of each path.]
Definitions and Notation - Paths

A path π has:
- origin or previous state: p[π]
- destination or next state: n[π]
- input label: i[π]
- output label: o[π]

Sets of paths:
- P(R1, R2): set of all paths from R1 ⊆ Q to R2 ⊆ Q.
- P(R1, x, R2): paths in P(R1, R2) with input label x.
- P(R1, x, y, R2): paths in P(R1, x, R2) with output label y.
Definitions and Notation - Transducers

- Alphabets: input A, output B.
- States: Q; initial states I; final states F.
- Transitions: E ⊆ Q × (A ∪ {ε}) × (B ∪ {ε}) × K × Q.
- Weight functions: initial weight function λ : I → K and final weight function ρ : F → K.

A transducer T = (A, B, Q, I, F, E, λ, ρ) defines, for all x ∈ A*, y ∈ B*:

    [[T]](x, y) = ⊕_{π ∈ P(I, x, y, F)} λ(p[π]) ⊗ w[π] ⊗ ρ(n[π])
Semirings

A semiring (K, ⊕, ⊗, 0̄, 1̄) is a ring that may lack negation.
- Sum ⊕: computes the weight of a sequence (sum of the weights of the paths labeled with that sequence).
- Product ⊗: computes the weight of a path (product of the weights of its constituent transitions).

    Semiring     Set              ⊕       ⊗    0̄     1̄
    Boolean      {0, 1}           ∨       ∧    0     1
    Probability  R+               +       ×    0     1
    Log          R ∪ {−∞, +∞}     ⊕_log   +    +∞    0
    Tropical     R ∪ {−∞, +∞}     min     +    +∞    0
    String       B* ∪ {∞}         lcp     ·    ∞     ε

with ⊕_log defined by: x ⊕_log y = −log(e^(−x) + e^(−y)).
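For example, ⊕_log can be computed in a numerically stable way; a standard trick, shown here only to make the definition concrete:

    #include <algorithm>
    #include <cmath>
    #include <limits>

    // x (+)_log y = -log(e^-x + e^-y), computed stably by factoring out
    // the smaller weight so the exponent is always <= 0.
    double LogPlus(double x, double y) {
      const double kInf = std::numeric_limits<double>::infinity();
      if (x == kInf) return y;  // +inf is the semiring zero: 0 (+) y = y
      if (y == kInf) return x;
      const double lo = std::min(x, y), hi = std::max(x, y);
      return lo - std::log1p(std::exp(lo - hi));
    }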
(ε-free) Composition Algorithm

States: (q1, q2) with q1 in T1 and q2 in T2.

Transitions: for each e1 leaving q1 and e2 leaving q2 such that o[e1] = i[e2]:

    ((q1, q2), i[e1], o[e2], w[e1] ⊗ w[e2], (n[e1], n[e2]))
Generalized Composition Algorithm

Composition filter: Φ = (T1, T2, Q3, i3, ⊥, ϕ)
- Q3: set of filter states, with i3 the initial state and ⊥ the blocking state.
- ϕ : (e1, e2, q3) → (e1′, e2′, q3′): transition filter.

Algorithm:
- States: (q1, q2, q3) with q1 in T1, q2 in T2 and q3 a filter state.
- Transitions: for each e1 leaving q1 and e2 leaving q2 such that ϕ(e1, e2, q3) = (e1′, e2′, q3′) with q3′ ≠ ⊥:

    ((q1, q2, q3), i[e1′], o[e2′], w[e1′] ⊗ w[e2′], (n[e1′], n[e2′], q3′))

Trivial filter Φ_trivial: allows all matching paths.
Q3 = {0, ⊥}, i3 = 0 and

    ϕ(e1, e2, 0) = (e1, e2, 0) if o[e1] = i[e2], (e1, e2, ⊥) otherwise

This yields the basic ε-free composition algorithm.
Pseudo-code

Weighted-Composition(T1, T2)
 1  Q ← I ← S ← I1 × I2 × {i3}
 2  while S ≠ ∅ do
 3      (q1, q2, q3) ← Head(S)
 4      Dequeue(S)
 5      if (q1, q2, q3) ∈ F1 × F2 × Q3 then
 6          F ← F ∪ {(q1, q2, q3)}
 7          ρ(q1, q2, q3) ← ρ1(q1) ⊗ ρ2(q2) ⊗ ρ3(q3)
 8      M ← {(e1, e2) ∈ E_L[q1] × E_L[q2] such that ϕ(e1, e2, q3) = (e1′, e2′, q3′) with q3′ ≠ ⊥}
 9      for each (e1, e2) ∈ M do
10          (e1′, e2′, q3′) ← ϕ(e1, e2, q3)
11          if (n[e1′], n[e2′], q3′) ∉ Q then
12              Q ← Q ∪ {(n[e1′], n[e2′], q3′)}
13              Enqueue(S, (n[e1′], n[e2′], q3′))
14          E ← E ∪ {((q1, q2, q3), i[e1′], o[e2′], w[e1′] ⊗ w[e2′], (n[e1′], n[e2′], q3′))}
15  return T
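As a concrete illustration, here is a minimal, self-contained rendering of this pseudo-code in C++ (toy types of our own, not OpenFst's; tropical semiring, trivial filter, no ε-handling):

    #include <map>
    #include <queue>
    #include <tuple>
    #include <vector>

    using Label = int;      // toy label type
    using Weight = double;  // tropical semiring: (+) = min, (x) = +
    struct Arc { Label ilabel, olabel; Weight weight; int nextstate; };
    struct Fst {
      std::vector<std::vector<Arc>> arcs;  // arcs[q]: transitions leaving q
      std::vector<Weight> final_;          // +infinity marks non-final
      int start = 0;
    };

    const int kBlock = -1;  // filter state "bottom": match disallowed

    // Trivial filter: pass (e1, e2) iff o[e1] = i[e2]; one filter state 0.
    int TrivialFilter(const Arc &e1, const Arc &e2, int /*q3*/) {
      return e1.olabel == e2.ilabel ? 0 : kBlock;
    }

    Fst Compose(const Fst &t1, const Fst &t2) {
      Fst result;
      std::map<std::tuple<int, int, int>, int> ids;  // (q1,q2,q3) -> id
      std::queue<std::tuple<int, int, int>> todo;    // the queue S
      auto AddState = [&](int q1, int q2, int q3) {
        auto it = ids.find({q1, q2, q3});
        if (it != ids.end()) return it->second;      // lines 11-13
        int id = static_cast<int>(result.arcs.size());
        ids[{q1, q2, q3}] = id;
        result.arcs.emplace_back();
        // Lines 5-7: final weight rho1(q1) (x) rho2(q2); in the tropical
        // semiring (x) is +, and +infinity propagates for non-final states.
        result.final_.push_back(t1.final_[q1] + t2.final_[q2]);
        todo.push({q1, q2, q3});
        return id;
      };
      result.start = AddState(t1.start, t2.start, 0);  // line 1
      while (!todo.empty()) {                          // line 2
        auto [q1, q2, q3] = todo.front();              // lines 3-4
        todo.pop();
        int src = ids[{q1, q2, q3}];
        for (const Arc &e1 : t1.arcs[q1])              // lines 8-9
          for (const Arc &e2 : t2.arcs[q2]) {
            int q3p = TrivialFilter(e1, e2, q3);       // line 10
            if (q3p == kBlock) continue;
            int dst = AddState(e1.nextstate, e2.nextstate, q3p);
            result.arcs[src].push_back(                // line 14
                {e1.ilabel, e2.olabel, e1.weight + e2.weight, dst});
          }
      }
      return result;
    }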
Epsilon-Matching Filter

An ε-transition in T1 (resp. T2) can be matched in T2 (resp. T1) by an ε-transition or by staying at the same state (as if there were ε self-loops at each state in T1 and T2).

Allowing all possible ε-matches causes:
- redundant ε-paths in T1 ∘ T2
- a wrong result when the semiring is non-idempotent

Filter Φ_ε-match: disallows redundant ε-paths, favoring the matching of actual ε-transitions.
Q3 = {0, 1, 2, ⊥}, i3 = 0 and ϕ(e1, e2, q3) = (e1, e2, q3′) where:

    q3′ = 0 if (o[e1], i[e2]) = (x, x) with x ∈ B,
          0 if (o[e1], i[e2]) = (ε, ε) and q3 = 0,
          1 if (o[e1], i[e2]) = (ε_L, ε) and q3 ≠ 2,
          2 if (o[e1], i[e2]) = (ε, ε_L) and q3 ≠ 1,
          ⊥ otherwise.

Here ε_L is the label of the added self-loops. This is the composition algorithm of [Mohri, Pereira and Riley, 1996].
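The case analysis translates directly into a transition-filter function; a sketch with toy types, where kEpsilon and kEpsilonLoop are our own reserved labels:

    // Epsilon-matching filter transition on the label pair (o[e1], i[e2]).
    const int kEpsilon = 0;       // a real epsilon transition
    const int kEpsilonLoop = -2;  // the added self-loop label (eps_L)
    const int kBlocked = -1;      // filter state "bottom"

    int EpsMatchFilter(int olabel1, int ilabel2, int q3) {
      if (olabel1 == ilabel2 && olabel1 != kEpsilon &&
          olabel1 != kEpsilonLoop)
        return 0;  // (x, x) with x in B: a real match resets the filter
      if (olabel1 == kEpsilon && ilabel2 == kEpsilon && q3 == 0)
        return 0;  // (eps, eps): both sides take actual eps-transitions
      if (olabel1 == kEpsilonLoop && ilabel2 == kEpsilon && q3 != 2)
        return 1;  // T1 stays on its self-loop; T2 takes an eps-transition
      if (olabel1 == kEpsilon && ilabel2 == kEpsilonLoop && q3 != 1)
        return 2;  // T1 takes an eps-transition; T2 stays on its self-loop
      return kBlocked;  // any other combination is a redundant eps-path
    }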
Label-Reachability Filter

Disallows following an ε-path in q1 that will fail to reach a non-ε output label matching some transition in q2.

Label reachability r : Q1 × B → {0, 1}:

    r(q, x) = 1 if there exists a path from q to some q′ with output label x, 0 otherwise.

Filter Φ_reach: same as Φ_trivial except when o[e1] = ε and i[e2] = ε_L; then ϕ(e1, e2, 0) = (e1, e2, q3′) with:

    q3′ = 0 if there exists e2′ leaving q2 such that r(n[e1], i[e2′]) = 1, ⊥ otherwise.

[Figure: composing det(L) with G = {red/0.6, read/0.4}: the filter blocks the ε-paths of det(L) that lead only to reed, road, rode, leaving just the states reachable through red and read.]
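A sketch of precomputing r as explicit per-state label sets (toy types; we assume the ε-output subgraph is acyclic, as for a lexicon stored as a tree or trie):

    #include <functional>
    #include <set>
    #include <vector>

    using Label = int;  // 0 = epsilon
    struct Arc { Label ilabel, olabel; double weight; int nextstate; };
    using Fst = std::vector<std::vector<Arc>>;  // arcs leaving each state

    // reach[q] = {x : r(q, x) = 1}: labels readable from q along paths
    // whose earlier outputs are all epsilon. Memoized DFS.
    std::vector<std::set<Label>> LabelReachability(const Fst &t) {
      std::vector<std::set<Label>> reach(t.size());
      std::vector<bool> done(t.size(), false);
      std::function<void(int)> dfs = [&](int q) {
        if (done[q]) return;
        done[q] = true;
        for (const Arc &arc : t[q]) {
          if (arc.olabel != 0) {
            reach[q].insert(arc.olabel);  // a non-eps label is read here
          } else {
            dfs(arc.nextstate);           // inherit through eps output
            reach[q].insert(reach[arc.nextstate].begin(),
                            reach[arc.nextstate].end());
          }
        }
      };
      for (int q = 0; q < static_cast<int>(t.size()); ++q) dfs(q);
      return reach;
    }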
Label-Reachability Filter with Label Pushing

When matching an ε-transition e1 with an ε_L-loop e2: if there exists a unique e2′ leaving q2 such that r(n[e1], i[e2′]) = 1, then allow matching e1 with e2′ instead, giving an early output of o[e2′].

Filter Φ_push-label: Q3 = B ∪ {ε, ⊥} and i3 = ε; the filter state encodes the label that has been consumed early.

[Figure: in the red/read example, the word label read is output early on the iy:read transition, the filter state becomes read, and the final transition becomes d:ε/0.4.]
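The uniqueness test behind the early output can be sketched as follows (toy types; a real implementation would use the interval representation described later):

    #include <set>
    #include <vector>

    using Label = int;
    struct Arc { Label ilabel, olabel; double weight; int nextstate; };

    const Label kNone = -1;

    // If exactly one transition at q2 has an input label reachable from
    // n[e1], return that label for early output; otherwise return kNone.
    // reach_n_e1 would come from a precomputed r (see previous sketch).
    Label UniqueEarlyLabel(const std::set<Label> &reach_n_e1,
                           const std::vector<Arc> &arcs_q2) {
      Label unique = kNone;
      for (const Arc &e2 : arcs_q2) {
        if (reach_n_e1.count(e2.ilabel)) {
          if (unique != kNone && unique != e2.ilabel)
            return kNone;  // more than one match: not unique
          unique = e2.ilabel;
        }
      }
      return unique;
    }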
Label-Reachability Filter with Weight Pushing

When matching an ε-transition e1 with an ε_L-loop e2: output early the ⊕-sum of the weights of the prospective matches.

Reachable weight:

    w_r(q1, q2) = ⊕_{e ∈ E[q2], r(q1, i[e]) = 1} w[e]

Filter Φ_push-weight: Q3 = K, i3 = 1̄ and ⊥ = 0̄; the filter state encodes the weight that has been output early. If o[e1] = ε and i[e2] = ε_L, then q3′ = w_r(n[e1], q2) and w[e2′] = q3⁻¹ ⊗ q3′.

[Figure: in the red/read example, the weight 0.4 is output early on the iy:ε transition (iy:ε/0.4) and the filter state becomes 0.4; the final d:read transition then carries no further weight.]
Implementation - Representation of r

Point representation: R_q = {x ∈ B : r(q, x) = 1}
- inefficient in time and space

Interval representation: I_q = {[x, y) : x, y ∈ N, [x, y) ⊆ R_q, x − 1 ∉ R_q, y ∉ R_q}
- efficiency depends on the number of intervals for each R_q
- one interval per state is trivial for a tree (found by DFS)
- one interval per state is possible if the consecutive ones property (C1P) holds:
  - true for a unique-pronunciation L, and preserved by determinization, minimization, closure and composition with C
  - a multiple-pronunciation L typically fails C1P
- However, a modification of Hsu's (2002) C1P test gives a greedy algorithm for minimizing the number of intervals per state.
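With R_q stored as sorted, disjoint half-open intervals, the membership test "is this label in R_q" is a binary search; a minimal sketch with toy types:

    #include <algorithm>
    #include <vector>

    struct Interval { int begin, end; };  // half-open [begin, end)

    bool Contains(const std::vector<Interval> &intervals, int label) {
      // Find the first interval whose begin exceeds the label...
      auto it = std::upper_bound(
          intervals.begin(), intervals.end(), label,
          [](int l, const Interval &iv) { return l < iv.begin; });
      if (it == intervals.begin()) return false;
      --it;  // ...then its predecessor starts at or before the label.
      return label < it->end;
    }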
Implementation - Efficient Computation of w_r

Requires fast computation of

    s_q(x, y) = ⊕_{e ∈ E[q], i[e] ∈ [x, y)} w[e]

for q in T2 and x, y in B = N.

Achieved by precomputing the cumulative weights

    c_q(x) = ⊕_{e ∈ E[q], i[e] < x} w[e]

so that s_q(x, y) = c_q(y) ⊖ c_q(x), where ⊖ undoes ⊕ (e.g., subtraction in the probability semiring).
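A sketch in the probability semiring, where ⊕ is ordinary addition and the difference c_q(y) − c_q(x) is well defined (toy types; labels assumed to be integers in [0, num_labels)):

    #include <vector>

    struct Arc { int ilabel, olabel; double prob; int nextstate; };

    // cum[x] = sum of the probabilities of arcs at q with ilabel < x,
    // so that s_q(x, y) = cum[y] - cum[x].
    std::vector<double> CumulativeWeights(const std::vector<Arc> &arcs,
                                          int num_labels) {
      std::vector<double> cum(num_labels + 1, 0.0);
      for (const Arc &arc : arcs) cum[arc.ilabel + 1] += arc.prob;
      for (int x = 1; x <= num_labels; ++x) cum[x] += cum[x - 1];
      return cum;
    }

    // Total probability of arcs at q with input label in [x, y).
    double IntervalWeight(const std::vector<double> &cum, int x, int y) {
      return cum[y] - cum[x];
    }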
Composition Design - Options

Composition options:

    typedef SortedMatcher<StdFst> SM;
    typedef SequenceComposeFilter<Arc> CF;

    ComposeFstOptions<StdArc, SM, CF> opts;
    opts.matcher1 = new SM(fst1, MATCH_NONE, kNoLabel);
    opts.matcher2 = new SM(fst2, MATCH_INPUT, kNoLabel);
    opts.filter = new CF(fst1, fst2);

    StdComposeFst cfst(fst1, fst2, opts);
Composition Filters

Predefined filters:

    Name                          Description
    SequenceComposeFilter         Requires FST1 epsilons to be read before FST2 epsilons
    AltSequenceComposeFilter      Requires FST2 epsilons to be read before FST1 epsilons
    MatchComposeFilter            Requires FST1 epsilons to be matched with FST2 epsilons
    LookAheadComposeFilter<F>     Supports lookahead in composition
    PushWeightsComposeFilter<F>   Supports pushing weights in composition
    PushLabelsComposeFilter<F>    Supports pushing labels in composition

Three lookahead composition filters, each templated on an underlying filter F, are added. All three can be used together by cascading them.
Composition: Matcher Design

Matchers can find and iterate through requested labels at FST states.

Matcher form:

    template <class F>
    class Matcher {
     public:
      typedef typename F::Arc Arc;
      void SetState(StateId s);      // Specifies current state
      bool Find(Label label);        // Checks state for match to label
      bool Done() const;             // No more matches
      const Arc &Value() const;      // Current arc
      void Next();                   // Advance to next arc
      bool LookAhead(const Fst<Arc> &fst, StateId s,
                     Weight &weight); // (Optional) lookahead
    };

A LookAhead() method, given the language (FST + initial state) to expect, is added.
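A usage sketch of this interface with OpenFst's SortedMatcher (assuming the FST is sorted on its input labels):

    #include <iostream>
    #include <fst/fstlib.h>
    using namespace fst;

    // Iterate over all arcs at state s whose input label matches `label`.
    void PrintMatches(const StdFst &fst, StdArc::StateId s,
                      StdArc::Label label) {
      SortedMatcher<StdFst> matcher(fst, MATCH_INPUT);
      matcher.SetState(s);
      if (matcher.Find(label)) {
        for (; !matcher.Done(); matcher.Next()) {
          const StdArc &arc = matcher.Value();
          std::cout << arc.ilabel << ":" << arc.olabel << "/" << arc.weight
                    << " -> " << arc.nextstate << std::endl;
        }
      }
    }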
Matchers

Predefined matchers:

    Name                        Description
    SortedMatcher               Binary search on sorted input
    RhoMatcher<M>               ρ symbol handling
    SigmaMatcher<M>             σ symbol handling
    PhiMatcher<M>               ϕ symbol handling
    LabelLookAheadMatcher<M>    Lookahead along epsilon paths
    ArcLookAheadMatcher<M>      Lookahead one transition

Two lookahead matchers, each templated on an underlying matcher M, are added.

Special symbol matchers:

                    Consumes no symbol    Consumes symbol
    Matches all     ε                     σ
    Matches rest    ϕ                     ρ
Recognition Experiments

Acoustic model:
- Broadcast News: trained on the 1996 and 1997 DARPA Hub4 AM training sets; PLP cepstra, LDA analysis, STC; triphonic, 8k tied states, 16 components per state; speaker adapted (both VTLN + CMLLR).
- Spoken Query Task: trained on > 1000 hrs of voice search queries; PLP cepstra, LDA analysis, STC; triphonic, 4k tied states, 4-128 components per state; speaker independent.

Language model:
- Broadcast News: 1996 Hub4 CSR LM training sets; 4-gram language model pruned to 8M n-grams.
- Spoken Query Task: trained on > 1B words of google.com and voice search queries; 1 million word vocabulary; Katz back-off model, pruned to various sizes.
Recognition Experiments - Precomputation Before Recognition

                                          Broadcast News             Spoken Query Task
    Construction method                   Time      RAM    Result    Time      RAM    Result
    Static
      (1) with standard composition       7 min     5.3G   0.5G      10.5 min  11.2G  1.4G
      (2) with generalized composition    2.5 min   2.9G   0.5G      4 min     5.3G   1.4G
    Dynamic
      (2) with generalized composition    none      none   0.2G      none      none   0.5G
Recognition Experiments

A small part of the recognition transducer is visited during recognition (Spoken Query Task):
- Static: number of states in the recognition transducer: 25.4M
- Dynamic: number of states visited per second: 8K

Very large language models can be used in the first pass:
[Figure: word error rate as a function of LM size on the Spoken Query Task (with Ciprian Chelba and Boulos Harb); y-axis: word error rate (17-21), x-axis: number of n-grams (1e+06 to 1e+09).]
Prior Work

Caseiro and Trancoso (IEEE Trans. on ASLP, 2006): developed a specialized composition for a pronunciation lexicon L. If pronunciations are stored in a trie, then the words readable from a node form a lexicographic interval, which can be used to disallow non-coaccessible epsilon paths.

Cheng et al. (ICASSP 2007); Oonishi et al. (Interspeech 2008): use methods apparently similar to ours, but many details are left unspecified, such as the representation of the reachable label sets. No complexities are published, and the published results show a very significant overhead for dynamic composition compared to a static recognition transducer.

Our method:
- uses a very efficient representation of the label sets
- uses a very efficient computation for weight pushing
- has a small overhead between static and dynamic composition
Conclusions

This work:
- Introduces a generalized composition filter for weighted finite-state composition.
- Presents composition filters that:
  - remove useless epsilon paths
  - push forward labels
  - push forward weights
- The combination of these filters permits composing large speech-recognition context-dependent lexicons and language models much more efficiently in time and space than before.
- Experiments on Broadcast News and a spoken query task show a 5% to 10% overhead for dynamic, runtime composition compared to static, offline composition. To our knowledge, this is the first such system with so little overhead.