Dynamic Finite-State Transducer Composition with Look-Ahead for Very-Large-Scale Speech Recognition

Cyril Allauzen - allauzen@google.com
Ciprian Chelba - ciprianchelba@google.com
Boulos Harb - harb@google.com
Michael Riley - riley@google.com
Johan Schalkwyk - johans@google.com

Aug 19, 2010
Weighted Finite-State Transducers in Speech Recognition - I

WFSTs are a general and efficient representation for many speech and NLP problems; see: Mohri et al., "Speech recognition with weighted finite-state transducers," in Handbook of Speech Processing, Springer, 2008.

In ASR, they have been used to:

Represent models:
- G: n-gram language model (automaton over words)
- L: pronunciation lexicon (transducer from CI phones to words)
- C: context dependency (transducer from CD phones to CI phones)

Combine and optimize models:
- Composition: computes the relational composition of two transducers.
- Epsilon removal: finds an equivalent WFST with no ε-transitions.
- Determinization: finds an equivalent WFST that has no identically-labeled transitions leaving a state.
- Minimization: finds an equivalent deterministic WFST with the fewest states and arcs.
Weighted Finite-State Transducers in Speech Recognition - II

Advantages:
- Uniform data representation
- General, efficient, mathematically well-defined and reusable combination and optimization operations
- Variant systems realized in data, not code

OpenFst, an open-source finite-state transducer library, was used for this work (http://www.openfst.org). Released under the Apache license; used in many speech and NLP applications.
Weighted Acceptors

Finite automata with labels and weights.

Example: word pronunciation acceptor for "data":
[Figure: acceptor with states 0-4 and transitions 0 -d/1-> 1, 1 -ey/0.5-> 2, 1 -ae/0.5-> 2, 2 -t/0.3-> 3, 2 -dx/0.7-> 3, 3 -ax/1-> 4]
Weighted Transducers

Finite automata with input labels, output labels, and weights.

Example: word pronunciation transducer:
[Figure: transducer with paths 0 -d:data/1-> 1 -ey:ε/0.5 | ae:ε/0.5-> 2 -t:ε/0.3 | dx:ε/0.7-> 3 -ax:ε/1-> 4/0 and 0 -d:dew/1-> 5 -uw:ε/1-> 6/0]

L: closed union of the V word pronunciation transducers.
G: an n-gram model is a WFSA with (at most) V^(n-1) states.
Context-Dependent Triphone Transducer C

[Figure: triphone transducer over the phone alphabet {x, y}; states encode context pairs such as (ε,*), (x,x), (x,y), (y,x), (y,y), (x,ε), (y,ε), with transitions such as x:x/y_x relating phone x to the context-dependent unit x/y_x (x in left context y and right context x).]
Recognition Transducer Construction

The models C, L, G can be combined and optimized with weighted finite-state composition and determinization as:

    C ∘ det(L ∘ G)    (1)

An alternative construction, producing an equivalent transducer, is:

    (C ∘ det(L)) ∘ G    (2)

If G is deterministic, Eq. 2 could be as efficient as Eq. 1 and avoids the determinization of L ∘ G, greatly saving time and memory and allowing fast dynamic combination (useful in applications).

However, standard composition presents three problems with Eq. 2:
1. Determinization of L moves back word labels, creating delay in matching and creating (possibly very many) useless composition paths.
2. The delayed word labels in L produce a much larger composed machine when G is an n-gram LM.
3. The delayed word labels push back the grammar weights along paths in the composed machine, to the detriment of ASR pruning.
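For concreteness, here is a minimal OpenFst sketch of the two constructions. The model file names are hypothetical, error handling is omitted, and determinizability of L (and of L ∘ G for Eq. 1) is assumed, e.g. via auxiliary disambiguation symbols:

    #include <fst/fstlib.h>
    using namespace fst;

    int main() {
      // Hypothetical model files: context dependency C, lexicon L, LM G.
      StdVectorFst *c = StdVectorFst::Read("c.fst");
      StdVectorFst *l = StdVectorFst::Read("l.fst");
      StdVectorFst *g = StdVectorFst::Read("g.fst");
      ArcSort(c, OLabelCompare<StdArc>());  // composition matches output side
      ArcSort(l, OLabelCompare<StdArc>());

      // Eq. (1): C o det(L o G). Determinizing L o G is the costly step.
      StdVectorFst lg, det_lg, clg;
      Compose(*l, *g, &lg);
      Determinize(lg, &det_lg);   // assumes L o G is determinizable
      Compose(*c, det_lg, &clg);

      // Eq. (2): (C o det(L)) o G. Only L is determinized; the outer
      // composition can stay dynamic (on-demand) via ComposeFst.
      StdVectorFst det_l, cl;
      Determinize(*l, &det_l);    // assumes L is determinizable (e.g.,
                                  // homophones disambiguated)
      Compose(*c, det_l, &cl);
      ArcSort(&cl, OLabelCompare<StdArc>());
      StdComposeFst clg_dynamic(cl, *g);  // states expanded lazily in search

      delete c; delete l; delete g;
      return 0;
    }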
Composition Example

[Figure: L, det(L), and G for the words red, read, reed, road, rode, together with L ∘ G and det(L) ∘ G, where G = {red/0.6, read/0.4}. In det(L), determinization delays the word output labels until the final d transitions (d:red, d:read, d:reed, d:road, d:rode), so in det(L) ∘ G the word identities and grammar weights appear only at the end of each path.]
Definitions and Notation - Paths

A path π has:
- origin or previous state: p[π]
- destination or next state: n[π]
- input label: i[π]
- output label: o[π]

Sets of paths:
- P(R1, R2): set of all paths from R1 ⊆ Q to R2 ⊆ Q.
- P(R1, x, R2): paths in P(R1, R2) with input label x.
- P(R1, x, y, R2): paths in P(R1, x, R2) with output label y.
Definitions and Notation - Transducers

- Alphabets: input A, output B.
- States: Q; initial states I; final states F.
- Transitions: E ⊆ Q × (A ∪ {ε}) × (B ∪ {ε}) × K × Q.
- Weight functions: initial weight function λ : I → K and final weight function ρ : F → K.

A transducer T = (A, B, Q, I, F, E, λ, ρ) defines, for all x ∈ A*, y ∈ B*:

    [[T]](x, y) = ⊕_{π ∈ P(I, x, y, F)} λ(p[π]) ⊗ w[π] ⊗ ρ(n[π])
Semirings

A semiring (K, ⊕, ⊗, 0̄, 1̄) is a ring that may lack negation.
- Sum ⊕: computes the weight of a sequence (sum of the weights of the paths labeled with that sequence).
- Product ⊗: computes the weight of a path (product of the weights of its constituent transitions).

    Semiring     Set              ⊕       ⊗    0̄     1̄
    Boolean      {0, 1}           ∨       ∧    0     1
    Probability  R+               +       ×    0     1
    Log          R ∪ {−∞, +∞}     ⊕_log   +    +∞    0
    Tropical     R ∪ {−∞, +∞}     min     +    +∞    0
    String       B* ∪ {∞}         lcp     ·    ∞     ε

with ⊕_log defined by: x ⊕_log y = −log(e^(−x) + e^(−y)).
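For example, ⊕_log can be computed in a numerically stable way; a standard trick, shown here only to make the definition concrete:

    #include <algorithm>
    #include <cmath>
    #include <limits>

    // x (+)_log y = -log(e^-x + e^-y), computed stably by factoring out
    // the smaller weight so the exponent is always <= 0.
    double LogPlus(double x, double y) {
      const double kInf = std::numeric_limits<double>::infinity();
      if (x == kInf) return y;  // +inf is the semiring zero: 0 (+) y = y
      if (y == kInf) return x;
      const double lo = std::min(x, y), hi = std::max(x, y);
      return lo - std::log1p(std::exp(lo - hi));
    }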
(ε-free) Composition Algorithm

States: (q1, q2) with q1 in T1 and q2 in T2.

Transitions: for each e1 leaving q1 and e2 leaving q2 such that o[e1] = i[e2]:

    ((q1, q2), i[e1], o[e2], w[e1] ⊗ w[e2], (n[e1], n[e2]))
Generalized Composition Algorithm

Composition filter: Φ = (T1, T2, Q3, i3, ⊥, ϕ)
- Q3: set of filter states, with i3 the initial state and ⊥ the blocking state.
- ϕ : (e1, e2, q3) → (e1′, e2′, q3′): transition filter.

Algorithm:
- States: (q1, q2, q3) with q1 in T1, q2 in T2 and q3 a filter state.
- Transitions: for each e1 leaving q1 and e2 leaving q2 such that ϕ(e1, e2, q3) = (e1′, e2′, q3′) with q3′ ≠ ⊥:

    ((q1, q2, q3), i[e1′], o[e2′], w[e1′] ⊗ w[e2′], (n[e1′], n[e2′], q3′))

Trivial filter Φ_trivial: allows all matching paths.
Q3 = {0, ⊥}, i3 = 0 and

    ϕ(e1, e2, 0) = (e1, e2, 0) if o[e1] = i[e2], (e1, e2, ⊥) otherwise

This yields the basic ε-free composition algorithm.
Pseudo-code

Weighted-Composition(T1, T2)
 1  Q ← I ← S ← I1 × I2 × {i3}
 2  while S ≠ ∅ do
 3      (q1, q2, q3) ← Head(S)
 4      Dequeue(S)
 5      if (q1, q2, q3) ∈ F1 × F2 × Q3 then
 6          F ← F ∪ {(q1, q2, q3)}
 7          ρ(q1, q2, q3) ← ρ1(q1) ⊗ ρ2(q2) ⊗ ρ3(q3)
 8      M ← {(e1, e2) ∈ E_L[q1] × E_L[q2] such that ϕ(e1, e2, q3) = (e1′, e2′, q3′) with q3′ ≠ ⊥}
 9      for each (e1, e2) ∈ M do
10          (e1′, e2′, q3′) ← ϕ(e1, e2, q3)
11          if (n[e1′], n[e2′], q3′) ∉ Q then
12              Q ← Q ∪ {(n[e1′], n[e2′], q3′)}
13              Enqueue(S, (n[e1′], n[e2′], q3′))
14          E ← E ∪ {((q1, q2, q3), i[e1′], o[e2′], w[e1′] ⊗ w[e2′], (n[e1′], n[e2′], q3′))}
15  return T
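As a concrete illustration, here is a minimal, self-contained rendering of this pseudo-code in C++ (toy types of our own, not OpenFst's; tropical semiring, trivial filter, no ε-handling):

    #include <map>
    #include <queue>
    #include <tuple>
    #include <vector>

    using Label = int;      // toy label type
    using Weight = double;  // tropical semiring: (+) = min, (x) = +
    struct Arc { Label ilabel, olabel; Weight weight; int nextstate; };
    struct Fst {
      std::vector<std::vector<Arc>> arcs;  // arcs[q]: transitions leaving q
      std::vector<Weight> final_;          // +infinity marks non-final
      int start = 0;
    };

    const int kBlock = -1;  // filter state "bottom": match disallowed

    // Trivial filter: pass (e1, e2) iff o[e1] = i[e2]; one filter state 0.
    int TrivialFilter(const Arc &e1, const Arc &e2, int /*q3*/) {
      return e1.olabel == e2.ilabel ? 0 : kBlock;
    }

    Fst Compose(const Fst &t1, const Fst &t2) {
      Fst result;
      std::map<std::tuple<int, int, int>, int> ids;  // (q1,q2,q3) -> id
      std::queue<std::tuple<int, int, int>> todo;    // the queue S
      auto AddState = [&](int q1, int q2, int q3) {
        auto it = ids.find({q1, q2, q3});
        if (it != ids.end()) return it->second;      // lines 11-13
        int id = static_cast<int>(result.arcs.size());
        ids[{q1, q2, q3}] = id;
        result.arcs.emplace_back();
        // Lines 5-7: final weight rho1(q1) (x) rho2(q2); in the tropical
        // semiring (x) is +, and +infinity propagates for non-final states.
        result.final_.push_back(t1.final_[q1] + t2.final_[q2]);
        todo.push({q1, q2, q3});
        return id;
      };
      result.start = AddState(t1.start, t2.start, 0);  // line 1
      while (!todo.empty()) {                          // line 2
        auto [q1, q2, q3] = todo.front();              // lines 3-4
        todo.pop();
        int src = ids[{q1, q2, q3}];
        for (const Arc &e1 : t1.arcs[q1])              // lines 8-9
          for (const Arc &e2 : t2.arcs[q2]) {
            int q3p = TrivialFilter(e1, e2, q3);       // line 10
            if (q3p == kBlock) continue;
            int dst = AddState(e1.nextstate, e2.nextstate, q3p);
            result.arcs[src].push_back(                // line 14
                {e1.ilabel, e2.olabel, e1.weight + e2.weight, dst});
          }
      }
      return result;
    }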
Epsilon-Matching Filter

An ε-transition in T1 (resp. T2) can be matched in T2 (resp. T1) by an ε-transition or by staying at the same state (as if there were ε self-loops at each state in T1 and T2).

Allowing all possible ε-matches causes:
- redundant ε-paths in T1 ∘ T2
- a wrong result when the semiring is non-idempotent

Filter Φ_ε-match: disallows redundant ε-paths, favoring the matching of actual ε-transitions.
Q3 = {0, 1, 2, ⊥}, i3 = 0 and ϕ(e1, e2, q3) = (e1, e2, q3′) where:

    q3′ = 0 if (o[e1], i[e2]) = (x, x) with x ∈ B,
          0 if (o[e1], i[e2]) = (ε, ε) and q3 = 0,
          1 if (o[e1], i[e2]) = (ε_L, ε) and q3 ≠ 2,
          2 if (o[e1], i[e2]) = (ε, ε_L) and q3 ≠ 1,
          ⊥ otherwise.

Here ε_L is the label of the added self-loops. This is the composition algorithm of [Mohri, Pereira and Riley, 1996].
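The case analysis translates directly into a transition-filter function; a sketch with toy types, where kEpsilon and kEpsilonLoop are our own reserved labels:

    // Epsilon-matching filter transition on the label pair (o[e1], i[e2]).
    const int kEpsilon = 0;       // a real epsilon transition
    const int kEpsilonLoop = -2;  // the added self-loop label (eps_L)
    const int kBlocked = -1;      // filter state "bottom"

    int EpsMatchFilter(int olabel1, int ilabel2, int q3) {
      if (olabel1 == ilabel2 && olabel1 != kEpsilon &&
          olabel1 != kEpsilonLoop)
        return 0;  // (x, x) with x in B: a real match resets the filter
      if (olabel1 == kEpsilon && ilabel2 == kEpsilon && q3 == 0)
        return 0;  // (eps, eps): both sides take actual eps-transitions
      if (olabel1 == kEpsilonLoop && ilabel2 == kEpsilon && q3 != 2)
        return 1;  // T1 stays on its self-loop; T2 takes an eps-transition
      if (olabel1 == kEpsilon && ilabel2 == kEpsilonLoop && q3 != 1)
        return 2;  // T1 takes an eps-transition; T2 stays on its self-loop
      return kBlocked;  // any other combination is a redundant eps-path
    }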
Label-Reachability Filter

Disallows following an ε-path in q1 that will fail to reach a non-ε output label matching some transition in q2.

Label reachability r : Q1 × B → {0, 1}:

    r(q, x) = 1 if there exists a path from q to some q′ with output label x, 0 otherwise.

Filter Φ_reach: same as Φ_trivial except when o[e1] = ε and i[e2] = ε_L; then ϕ(e1, e2, 0) = (e1, e2, q3′) with:

    q3′ = 0 if there exists e2′ leaving q2 such that r(n[e1], i[e2′]) = 1, ⊥ otherwise.

[Figure: composing det(L) with G = {red/0.6, read/0.4}: the filter blocks the ε-paths of det(L) that lead only to reed, road, rode, leaving just the states reachable through red and read.]
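A sketch of precomputing r as explicit per-state label sets (toy types; we assume the ε-output subgraph is acyclic, as for a lexicon stored as a tree or trie):

    #include <functional>
    #include <set>
    #include <vector>

    using Label = int;  // 0 = epsilon
    struct Arc { Label ilabel, olabel; double weight; int nextstate; };
    using Fst = std::vector<std::vector<Arc>>;  // arcs leaving each state

    // reach[q] = {x : r(q, x) = 1}: labels readable from q along paths
    // whose earlier outputs are all epsilon. Memoized DFS.
    std::vector<std::set<Label>> LabelReachability(const Fst &t) {
      std::vector<std::set<Label>> reach(t.size());
      std::vector<bool> done(t.size(), false);
      std::function<void(int)> dfs = [&](int q) {
        if (done[q]) return;
        done[q] = true;
        for (const Arc &arc : t[q]) {
          if (arc.olabel != 0) {
            reach[q].insert(arc.olabel);  // a non-eps label is read here
          } else {
            dfs(arc.nextstate);           // inherit through eps output
            reach[q].insert(reach[arc.nextstate].begin(),
                            reach[arc.nextstate].end());
          }
        }
      };
      for (int q = 0; q < static_cast<int>(t.size()); ++q) dfs(q);
      return reach;
    }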
Label-Reachability Filter with Label Pushing

When matching an ε-transition e1 with an ε_L-loop e2: if there exists a unique e2′ leaving q2 such that r(n[e1], i[e2′]) = 1, then allow matching e1 with e2′ instead, giving an early output of o[e2′].

Filter Φ_push-label: Q3 = B ∪ {ε, ⊥} and i3 = ε; the filter state encodes the label that has been consumed early.

[Figure: in the red/read example, the word label read is output early on the iy:read transition, the filter state becomes read, and the final transition becomes d:ε/0.4.]
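The uniqueness test behind the early output can be sketched as follows (toy types; a real implementation would use the interval representation described later):

    #include <set>
    #include <vector>

    using Label = int;
    struct Arc { Label ilabel, olabel; double weight; int nextstate; };

    const Label kNone = -1;

    // If exactly one transition at q2 has an input label reachable from
    // n[e1], return that label for early output; otherwise return kNone.
    // reach_n_e1 would come from a precomputed r (see previous sketch).
    Label UniqueEarlyLabel(const std::set<Label> &reach_n_e1,
                           const std::vector<Arc> &arcs_q2) {
      Label unique = kNone;
      for (const Arc &e2 : arcs_q2) {
        if (reach_n_e1.count(e2.ilabel)) {
          if (unique != kNone && unique != e2.ilabel)
            return kNone;  // more than one match: not unique
          unique = e2.ilabel;
        }
      }
      return unique;
    }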
Label-Reachability Filter with Weight Pushing

When matching an ε-transition e1 with an ε_L-loop e2: output early the ⊕-sum of the weights of the prospective matches.

Reachable weight:

    w_r(q1, q2) = ⊕_{e ∈ E[q2], r(q1, i[e]) = 1} w[e]

Filter Φ_push-weight: Q3 = K, i3 = 1̄ and ⊥ = 0̄; the filter state encodes the weight that has been output early. If o[e1] = ε and i[e2] = ε_L, then q3′ = w_r(n[e1], q2) and w[e2′] = q3⁻¹ ⊗ q3′.

[Figure: in the red/read example, the weight 0.4 is output early on the iy:ε transition (iy:ε/0.4) and the filter state becomes 0.4; the final d:read transition then carries no further weight.]
Implementation - Representation of r

Point representation: R_q = {x ∈ B : r(q, x) = 1}
- inefficient in time and space

Interval representation: I_q = {[x, y) : x, y ∈ N, [x, y) ⊆ R_q, x − 1 ∉ R_q, y ∉ R_q}
- efficiency depends on the number of intervals for each R_q
- one interval per state is trivial for a tree (found by DFS)
- one interval per state is possible if the consecutive ones property (C1P) holds:
  - true for a unique-pronunciation L, and preserved by determinization, minimization, closure and composition with C
  - a multiple-pronunciation L typically fails C1P
- However, a modification of Hsu's (2002) C1P test gives a greedy algorithm for minimizing the number of intervals per state.
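With R_q stored as sorted, disjoint half-open intervals, the membership test "is this label in R_q" is a binary search; a minimal sketch with toy types:

    #include <algorithm>
    #include <vector>

    struct Interval { int begin, end; };  // half-open [begin, end)

    bool Contains(const std::vector<Interval> &intervals, int label) {
      // Find the first interval whose begin exceeds the label...
      auto it = std::upper_bound(
          intervals.begin(), intervals.end(), label,
          [](int l, const Interval &iv) { return l < iv.begin; });
      if (it == intervals.begin()) return false;
      --it;  // ...then its predecessor starts at or before the label.
      return label < it->end;
    }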
Implementation - Efficient Computation of w_r

Requires fast computation of

    s_q(x, y) = ⊕_{e ∈ E[q], i[e] ∈ [x, y)} w[e]

for q in T2 and x, y in B = N.

Achieved by precomputing the cumulative weights

    c_q(x) = ⊕_{e ∈ E[q], i[e] < x} w[e]

so that s_q(x, y) = c_q(y) ⊖ c_q(x), where ⊖ undoes ⊕ (e.g., subtraction in the probability semiring).
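A sketch in the probability semiring, where ⊕ is ordinary addition and the difference c_q(y) − c_q(x) is well defined (toy types; labels assumed to be integers in [0, num_labels)):

    #include <vector>

    struct Arc { int ilabel, olabel; double prob; int nextstate; };

    // cum[x] = sum of the probabilities of arcs at q with ilabel < x,
    // so that s_q(x, y) = cum[y] - cum[x].
    std::vector<double> CumulativeWeights(const std::vector<Arc> &arcs,
                                          int num_labels) {
      std::vector<double> cum(num_labels + 1, 0.0);
      for (const Arc &arc : arcs) cum[arc.ilabel + 1] += arc.prob;
      for (int x = 1; x <= num_labels; ++x) cum[x] += cum[x - 1];
      return cum;
    }

    // Total probability of arcs at q with input label in [x, y).
    double IntervalWeight(const std::vector<double> &cum, int x, int y) {
      return cum[y] - cum[x];
    }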
Composition Design - Options

Composition options:

    typedef SortedMatcher<StdFst> SM;
    typedef SequenceComposeFilter<Arc> CF;

    ComposeFstOptions<StdArc, SM, CF> opts;
    opts.matcher1 = new SM(fst1, MATCH_NONE, kNoLabel);
    opts.matcher2 = new SM(fst2, MATCH_INPUT, kNoLabel);
    opts.filter = new CF(fst1, fst2);

    StdComposeFst cfst(fst1, fst2, opts);
Composition Filters

Predefined filters:

    Name                          Description
    SequenceComposeFilter         Requires FST1 epsilons to be read before FST2 epsilons
    AltSequenceComposeFilter      Requires FST2 epsilons to be read before FST1 epsilons
    MatchComposeFilter            Requires FST1 epsilons to be matched with FST2 epsilons
    LookAheadComposeFilter<F>     Supports lookahead in composition
    PushWeightsComposeFilter<F>   Supports pushing weights in composition
    PushLabelsComposeFilter<F>    Supports pushing labels in composition

Three lookahead composition filters, each templated on an underlying filter F, are added. All three can be used together by cascading them.
Composition: Matcher Design

Matchers can find and iterate through requested labels at FST states.

Matcher form:

    template <class F>
    class Matcher {
     public:
      typedef typename F::Arc Arc;
      void SetState(StateId s);      // Specifies current state
      bool Find(Label label);        // Checks state for match to label
      bool Done() const;             // No more matches
      const Arc &Value() const;      // Current arc
      void Next();                   // Advance to next arc
      bool LookAhead(const Fst<Arc> &fst, StateId s,
                     Weight &weight); // (Optional) lookahead
    };

A LookAhead() method, given the language (FST + initial state) to expect, is added.
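A usage sketch of this interface with OpenFst's SortedMatcher (assuming the FST is sorted on its input labels):

    #include <iostream>
    #include <fst/fstlib.h>
    using namespace fst;

    // Iterate over all arcs at state s whose input label matches `label`.
    void PrintMatches(const StdFst &fst, StdArc::StateId s,
                      StdArc::Label label) {
      SortedMatcher<StdFst> matcher(fst, MATCH_INPUT);
      matcher.SetState(s);
      if (matcher.Find(label)) {
        for (; !matcher.Done(); matcher.Next()) {
          const StdArc &arc = matcher.Value();
          std::cout << arc.ilabel << ":" << arc.olabel << "/" << arc.weight
                    << " -> " << arc.nextstate << std::endl;
        }
      }
    }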
Matchers

Predefined matchers:

    Name                        Description
    SortedMatcher               Binary search on sorted input
    RhoMatcher<M>               ρ symbol handling
    SigmaMatcher<M>             σ symbol handling
    PhiMatcher<M>               ϕ symbol handling
    LabelLookAheadMatcher<M>    Lookahead along epsilon paths
    ArcLookAheadMatcher<M>      Lookahead one transition

Two lookahead matchers, each templated on an underlying matcher M, are added.

Special symbol matchers:

                    Consumes no symbol    Consumes symbol
    Matches all     ε                     σ
    Matches rest    ϕ                     ρ
Recognition Experiments

Acoustic model:
- Broadcast News: trained on the 1996 and 1997 DARPA Hub4 AM training sets; PLP cepstra, LDA analysis, STC; triphonic, 8k tied states, 16 components per state; speaker adapted (both VTLN + CMLLR).
- Spoken Query Task: trained on > 1000 hrs of voice search queries; PLP cepstra, LDA analysis, STC; triphonic, 4k tied states, 4-128 components per state; speaker independent.

Language model:
- Broadcast News: 1996 Hub4 CSR LM training sets; 4-gram language model pruned to 8M n-grams.
- Spoken Query Task: trained on > 1B words of google.com and voice search queries; 1 million word vocabulary; Katz back-off model, pruned to various sizes.
Recognition Experiments - Precomputation Before Recognition

                                          Broadcast News             Spoken Query Task
    Construction method                   Time      RAM    Result    Time      RAM    Result
    Static
      (1) with standard composition       7 min     5.3G   0.5G      10.5 min  11.2G  1.4G
      (2) with generalized composition    2.5 min   2.9G   0.5G      4 min     5.3G   1.4G
    Dynamic
      (2) with generalized composition    none      none   0.2G      none      none   0.5G
Recognition Experiments

A small part of the recognition transducer is visited during recognition (Spoken Query Task):
- Static: number of states in the recognition transducer: 25.4M
- Dynamic: number of states visited per second: 8K

Very large language models can be used in the first pass:
[Figure: word error rate as a function of LM size on the Spoken Query Task (with Ciprian Chelba and Boulos Harb); y-axis: word error rate (17-21), x-axis: number of n-grams (1e+06 to 1e+09).]
Prior Work

Caseiro and Trancoso (IEEE Trans. on ASLP, 2006): developed a specialized composition for a pronunciation lexicon L. If pronunciations are stored in a trie, then the words readable from a node form a lexicographic interval, which can be used to disallow non-coaccessible epsilon paths.

Cheng et al. (ICASSP 2007); Oonishi et al. (Interspeech 2008): use methods apparently similar to ours, but many details are left unspecified, such as the representation of the reachable label sets. No complexities are published, and the published results show a very significant overhead for dynamic composition compared to a static recognition transducer.

Our method:
- uses a very efficient representation of the label sets
- uses a very efficient computation for weight pushing
- has a small overhead between static and dynamic composition
Conclusions

This work:
- Introduces a generalized composition filter for weighted finite-state composition.
- Presents composition filters that:
  - remove useless epsilon paths
  - push forward labels
  - push forward weights
- The combination of these filters permits composing large speech-recognition context-dependent lexicons and language models much more efficiently in time and space than before.
- Experiments on Broadcast News and a spoken query task show a 5% to 10% overhead for dynamic, runtime composition compared to static, offline composition. To our knowledge, this is the first such system with so little overhead.