Lecture 10: ASR: Sequence Recognition
1 EE E6820: Speech & Audio Processing & Recognition
Lecture 10: ASR: Sequence Recognition
- Deterministic sequence recognition: DTW
- Statistical sequence recognition: HMM
- HMM training
- Discriminant models
Dan Ellis <dpwe@ee.columbia.edu>
2 1 Sequence recognition
After feature calculation, the problem is to classify the sequences of frames:
[Block diagram: sound → Feature calculation → feature vectors → Acoustic classifier (Network weights) → phone probabilities → HMM decoder (Word models, Language model) → phone & word labeling]
Questions:
- how to represent model sequences
- how to score matches
3 Acoustic template matching
Framewise comparison of unknown word and stored templates:
[Figure: distance matrix of test frames vs. reference templates ONE, TWO, THREE, FOUR, FIVE, along time/frames]
- distance metric?
- comparison between templates?
- constraints?
4 Dynamic Time Warp (DTW)
Find the lowest-cost constrained path:
- matrix d(i,j) of distances between input frame f_i and reference frame r_j
- allowable predecessors & transition costs T_xy
Lowest cost to (i,j):
    D(i,j) = d(i,j) + min{ D(i-1,j) + T_10, D(i,j-1) + T_01, D(i-1,j-1) + T_11 }
i.e. local match cost plus best predecessor cost, including the transition cost
[Figure: grid of reference frames r_j vs. input frames f_i, with predecessors D(i-1,j), D(i,j-1), D(i-1,j-1) feeding D(i,j)]
Best path via traceback from the final state
- have to store predecessors for (almost) every (i,j)
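To make the recursion concrete, here is a minimal DTW sketch in Python/NumPy (not from the lecture). It assumes Euclidean local distances and sets all transition costs T_10 = T_01 = T_11 = 0; real systems weight these to penalize warping.

```python
import numpy as np

def dtw(test, ref):
    """Align test (I x d) against ref (J x d) frames; return cost and path.
    Minimal sketch: Euclidean local distance, all transition costs T_xy = 0."""
    I, J = len(test), len(ref)
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=-1)  # d(i,j)
    D = np.full((I, J), np.inf)
    pred = np.zeros((I, J, 2), dtype=int)          # best predecessor of (i,j)
    D[0, 0] = d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == j == 0:
                continue
            # allowable predecessors: (i-1,j), (i,j-1), (i-1,j-1)
            cands = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            cands = [(pi, pj) for pi, pj in cands if pi >= 0 and pj >= 0]
            pi, pj = min(cands, key=lambda c: D[c])
            D[i, j] = d[i, j] + D[pi, pj]          # local match + best predecessor
            pred[i, j] = (pi, pj)
    # traceback of the best path from the final state
    path, (i, j) = [], (I - 1, J - 1)
    while (i, j) != (0, 0):
        path.append((i, j))
        i, j = pred[i, j]
    path.append((0, 0))
    return D[-1, -1], path[::-1]

rng = np.random.default_rng(0)
cost, path = dtw(rng.normal(size=(6, 2)), rng.normal(size=(5, 2)))
```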
5 DTW-based recognition
Reference templates for each possible word
Isolated word:
- mark endpoints of input word
- calculate scores through each template (+ prune)
- choose best
Continuous speech:
- one matrix of template slices; special-case constraints at word ends
[Figure: stacked reference templates ONE, TWO, THREE, FOUR vs. input frames]
6 DTW-based recognition (2)
+ Successfully handles timing variation
+ Able to recognize speech at reasonable cost
- Distance metric? pseudo-Euclidean space?
- Warp penalties?
- How to choose templates?
  - several templates per word?
  - choose most representative?
  - align and average?
→ need a rigorous foundation...
7 Outline
- Dynamic Time Warp
- Hidden Markov Models
  - Statistical formulation
  - Markov Models
  - Hidden states
- HMM training
- Discriminant models
8 2 Statistical sequence recognition
DTW is limited because it's hard to optimize:
- interpretation of distance, transition costs?
Need a theoretical foundation: Probability
Formulate as a MAP choice among models:
    M* = argmax_{M_j} p(M_j | X, Θ)
- X = observed features
- M_j = word-sequence models
- Θ = all current parameters
9 Statistical formulation (2)
Can rearrange via Bayes' rule (& drop p(X)):
    M* = argmax_{M_j} p(M_j | X, Θ)
       = argmax_{M_j} p(X | M_j, Θ_A) · p(M_j | Θ_L)
- p(X | M_j, Θ_A) = likelihood of observations under model
- p(M_j | Θ_L) = prior probability of model
- Θ_A = acoustics-related model parameters
- Θ_L = language-related model parameters
Questions:
- what form of model to use for p(X | M_j, Θ_A)?
- how to find Θ_A (training)?
- how to solve for M_j (decoding)?
10 State-based modeling
Assume a discrete-state model for the speech:
- observations are divided up into time frames
- model states ↔ observations:
    states Q_k:  q_1 q_2 q_3 q_4 q_5 q_6 ...
    observed feature vectors X_1^N:  x_1 x_2 x_3 x_4 x_5 x_6 ...
Probability of observations given model M_j is:
    p(X_1^N | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)
- sum over all possible state sequences Q_k
How do observations depend on states?
How do state sequences depend on the model?
11 Generative Markov models
A (first-order) Markov model is a finite-state system whose behavior depends only on the current state
[Figure: state diagram over {S, A, B, C, E} with self-loop probabilities of 0.8 and escape probabilities of 0.1, plus the transition matrix p(q_{n+1} | q_n); example generated sequence: S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E]
Defined by:
- state set {q_i} (including start & end states S, E)
- transition matrix p(q_{n+1} | q_n): depends only on the current state
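A toy generative sketch of such a model in Python/NumPy; the transition probabilities are assumed for illustration (the slide's figure only shows a few of them), with S and E as non-emitting start/end states.

```python
import numpy as np

states = ['S', 'A', 'B', 'C', 'E']
P = np.array([                    # P[i, j] = p(q_{n+1} = j | q_n = i)
    [0.0, 1.0, 0.0, 0.0, 0.0],    # S always enters A
    [0.0, 0.8, 0.1, 0.1, 0.0],    # A mostly self-loops
    [0.0, 0.0, 0.8, 0.1, 0.1],    # B can reach the end state E
    [0.0, 0.1, 0.1, 0.8, 0.0],    # C
    [0.0, 0.0, 0.0, 0.0, 1.0],    # E is absorbing
])

def sample(rng):
    """Generate one state sequence by following p(q_{n+1} | q_n) from S to E."""
    q, seq = 0, []
    while states[q] != 'E':
        q = rng.choice(len(states), p=P[q])
        if states[q] != 'E':
            seq.append(states[q])
    return ''.join(seq)

print(sample(np.random.default_rng(0)))   # e.g. 'AAAAABBBBBBCC...'
```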
12 Hidden Markov models
Markov models where the state sequence Q = {q_n} is not directly observable (= hidden)
But the observations X do depend on Q:
- x_n is a random variable whose emission distribution p(x | q) depends on the current state
[Figure: emission distributions p(x | q) for q = A, B, C; a state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC and the corresponding observation sequence x_n vs. time step n]
- can still tell something about the state sequence
13 Hidden Markov models (2)
An HMM is specified by:
- emission distributions p(x | q^i) = b_i(x)
- transition probabilities p(q_n = j | q_{n-1} = i) = a_ij
- initial state probabilities p(q_1 = i) = π_i
Speech models M_j:
- typically left-to-right HMMs (sequence constraint)
- self-loops for time dilation
- observation & evolution are conditionally independent of the rest given the (hidden) state q_n
[Figure: left-to-right model S → ae_1 → ae_2 → ae_3 → E with self-loops; states q_1 ... q_5 emitting x_1 ... x_5]
14 Markov models for sequence recognition
Independence of observations:
- observation x_n depends only on the current state q_n
    p(X | Q) = p(x_1, x_2, ..., x_N | q_1, q_2, ..., q_N)
             = p(x_1 | q_1) · p(x_2 | q_2) ··· p(x_N | q_N)
             = Π_{n=1}^{N} p(x_n | q_n) = Π_{n=1}^{N} b_{q_n}(x_n)
Markov transitions:
- transition to the next state depends only on the current state
    p(Q | M) = p(q_1, q_2, ..., q_N | M)
             = p(q_N | q_{N-1}) · p(q_{N-1} | q_{N-2}) ··· p(q_2 | q_1) · p(q_1)
             = p(q_1) · Π_{n=2}^{N} p(q_n | q_{n-1}) = π_{q_1} · Π_{n=2}^{N} a_{q_{n-1} q_n}
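These two factorizations are enough to score any single state sequence. A minimal sketch (toy numbers, assumed for illustration only) computing log p(X, Q | M) = log p(X | Q) + log p(Q | M) in the log domain:

```python
import numpy as np

# pi, A and the per-frame log emission likelihoods logb[n, i] = log b_i(x_n)
# are assumed toy values (N=3 frames, S=2 states).
pi = np.array([1.0, 0.0])                      # pi_i = p(q_1 = i)
A = np.array([[0.7, 0.3],                      # a_ij = p(q_n = j | q_{n-1} = i)
              [0.0, 1.0]])
logb = np.log(np.array([[0.9, 0.1],
                        [0.4, 0.6],
                        [0.2, 0.8]]))

def log_joint(Q):
    """log p(X, Q) = log pi_{q1} + sum_n log a_{q_{n-1} q_n} + sum_n log b_{q_n}(x_n)."""
    lp = np.log(pi[Q[0]]) + logb[0, Q[0]]
    for n in range(1, len(Q)):
        lp += np.log(A[Q[n - 1], Q[n]]) + logb[n, Q[n]]
    return lp

print(log_joint([0, 0, 1]))   # one particular state sequence
```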
15 Model fit calculation
From state-based modeling:
    p(X_1^N | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)
For HMMs:
    p(X | Q) = Π_{n=1}^{N} b_{q_n}(x_n)
    p(Q | M) = π_{q_1} · Π_{n=2}^{N} a_{q_{n-1} q_n}
Hence, solve for M*:
- calculate p(X | M_j) · p(M_j) ∝ p(M_j | X) for each available model, scale by prior
Sum over all Q_k???
16 The forward recursion
Dynamic-programming-like technique to calculate the sum over all Q_k
Define α_n(i) as the probability of getting to state q^i at time step n (by any path):
    α_n(i) = p(x_1, x_2, ..., x_n, q_n = q^i) = p(X_1^n, q_n^i)
Then α_{n+1}(j) can be calculated recursively:
    α_{n+1}(j) = [Σ_{i=1}^{S} α_n(i) · a_ij] · b_j(x_{n+1})
[Figure: trellis of model states q^i vs. time steps n, with the α_n(i) combining through a_ij and b_j(x_{n+1}) into α_{n+1}(j)]
17 Forward recursion (2)
Initialize:
    α_1(i) = π_i · b_i(x_1)
Then the total probability is:
    p(X_1^N | M) = Σ_{i=1}^{S} α_N(i)
Practical way to solve for p(X | M_j) and hence perform recognition
[Figure: observations X scored against each model: p(X | M_1) · p(M_1), p(X | M_2) · p(M_2), ... → choose best]
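A sketch of the full forward pass, with pi, A, and the per-frame emission likelihoods b as assumed toy values (not from the lecture). Real implementations run this in the log domain, or rescale α at each frame, to avoid underflow:

```python
import numpy as np

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
b = np.array([[0.9, 0.1],      # b[n, i] = b_i(x_n), N=3 frames, S=2 states
              [0.4, 0.6],
              [0.2, 0.8]])

def forward(pi, A, b):
    N, S = b.shape
    alpha = np.zeros((N, S))
    alpha[0] = pi * b[0]                      # alpha_1(i) = pi_i b_i(x_1)
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ A) * b[n]  # sum over predecessors, then emit
    return alpha

alpha = forward(pi, A, b)
print(alpha[-1].sum())    # p(X | M) = sum_i alpha_N(i)
```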
18 Optimal path
May be interested in the actual q_n assignments:
- which state was active at each time frame
- e.g. phone labelling (for training?)
The total probability is over all paths, but can also solve for the single best path = Viterbi state sequence
Probability along the best path to state q^j at time n+1:
    α*_{n+1}(j) = max_i { α*_n(i) · a_ij } · b_j(x_{n+1})
- backtrack from the final state to get the best path
- the final probability is a product only (no sum) → log-domain calculation is just summation
The total probability is often dominated by the best path:
    p(X, Q* | M) ≈ p(X | M)
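The Viterbi recursion is the forward recursion with the sum replaced by a max, plus backpointers for the traceback; here is a log-domain sketch with the same assumed toy parameters:

```python
import numpy as np

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
b = np.array([[0.9, 0.1],
              [0.4, 0.6],
              [0.2, 0.8]])

def viterbi(pi, A, b):
    N, S = b.shape
    logA = np.log(A + 1e-300)                 # guard against log(0)
    delta = np.log(pi + 1e-300) + np.log(b[0])
    back = np.zeros((N, S), dtype=int)
    for n in range(1, N):
        scores = delta[:, None] + logA        # scores[i, j]: come from i, go to j
        back[n] = scores.argmax(axis=0)       # best predecessor of each state j
        delta = scores.max(axis=0) + np.log(b[n])
    # traceback from the best final state
    q = [int(delta.argmax())]
    for n in range(N - 1, 0, -1):
        q.append(int(back[n, q[-1]]))
    return q[::-1], delta.max()               # best path, log p(X, Q* | M)

print(viterbi(pi, A, b))
```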
19 Interpreting the Viterbi path
The Viterbi path assigns each x_n to a state q^i:
- performing classification based on b_i(x)
- ... at the same time as applying the transition constraints a_ij
[Figure: inferred classification of observations x_n; Viterbi labels: AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]
Can be used for segmentation:
- train an HMM with garbage and target states
- decode on new data to find targets, boundaries
Can be used for (heuristic) training:
- e.g. train classifiers based on the labels...
20 HMM summary
HMM M_j is a hypothesized generator of the observations X
During generation, the behavior of the model depends only on the current state q_n:
- transition probabilities p(q_{n+1} | q_n) = a_ij
- emission distributions p(x_n | q_n) = b_i(x)
Given just the observed emissions X, we can still calculate the probability that they came from M_j:
    p(X | M) = Σ_{all Q} p(X | Q, M) · p(Q | M)
Can also infer the most likely state sequence Q = {q_1, q_2, ..., q_N}:
    Q* = argmax_Q p(X, Q | M = M*)
[Figure: M assumed, Q hidden, X observed, Q* inferred]
21 Validity of HMM assumptions
The key assumption is conditional independence:
Given q_i, future evolution & observation distribution are independent of previous events
- duration behavior: self-loops (probability γ) imply an exponential (geometric) duration distribution,
    p(N = n) = γ^{n-1} · (1 - γ)
- independence of successive x_n's:
    p(X | Q) = Π_n p(x_n | q_i) ?
[Figure: exponentially decaying duration distribution; emission distribution p(x_n | q_i) vs. successive observations x_n]
22 Outline
- Dynamic Time Warp
- Hidden Markov Models
- HMM training
  - EM for HMM parameters
  - State occupancy probabilities
  - Acoustic parameters
- Discriminant models
23 3 Training models
We know how to use an HMM, but where does it come from?
In fact, the HMM's assumption of stationarity within states seems a worse model than DTW templates
The objection to DTW was the lack of a principled optimization procedure
The probabilistic foundation of HMMs supports optimization:
- maximum likelihood: adjust Θ to maximize the likelihood of the training data under the model
- (really want minimum error rate or ...)
24 EM for HMMs
Expectation-Maximization (EM) was introduced for Gaussian mixture models (GMMs):
- fit GMMs to arbitrary data
- EM finds locally-optimal model parameters Θ that maximize the data likelihood p(x_train | Θ)
- makes (some) sense for decision rules like p(X | M_j) · p(M_j)
Principle: adjust Θ to maximize the expected log likelihood of the known x & unknown u:
    E[log p(x, u | Θ)] = Σ_u p(u | x, Θ_old) · log[p(x | u, Θ) · p(u | Θ)]
- for GMMs, the unknowns are the mixture assignments k
- for HMMs, the unknowns are the hidden states q_n (take Θ to include M_j)
25 What EM does
Maximize the data likelihood by repeatedly estimating the unknowns and re-maximizing the expected log likelihood:
[Figure: data log likelihood log p(X | Θ) vs. successive parameter estimates Θ, climbing to a local optimum by alternating "estimate unknowns p(q_n | X, Θ)" and "adjust model params Θ to maximize expected log likelihood"]
26 EM for HMMs (2)
Expected log likelihood for an HMM:
    Σ_{all Q_k} p(Q_k | X, Θ_old) · log[p(X | Q_k, Θ) · p(Q_k | Θ)]
    = Σ_{i=1}^{S} p(q_1^i | X, Θ_old) · log p(q_1^i | Θ)
      + Σ_{n=1}^{N} Σ_{i=1}^{S} p(q_n^i | X, Θ_old) · log p(x_n | q_n^i, Θ)
      + Σ_{n=2}^{N} Σ_{i=1}^{S} Σ_{j=1}^{S} p(q_{n-1}^i, q_n^j | X, Θ_old) · log p(q_n^j | q_{n-1}^i, Θ)
- closed-form maximization by differentiation etc.
27 EM update equations
For the acoustic model (e.g. a 1-D Gaussian):
    µ_i = Σ_n p(q_n^i | X, Θ_old) · x_n / Σ_n p(q_n^i | X, Θ_old)
For the transition probabilities:
    p(q_n^j | q_{n-1}^i) = Σ_n p(q_{n-1}^i, q_n^j | X, Θ_old) / Σ_n p(q_{n-1}^i | X, Θ_old)
Normalized expectations, as for GMMs
Require the state occupancy probabilities p(q_n^i | X_1^N, Θ_old)
28 The forward-backward algorithm
We need p(q_n^i | X_1^N) for the EM updates (Θ implied)
The forward algorithm gave us α_n(i) = p(q_n^i, X_1^n)
- excludes the influence of the remaining data X_{n+1}^N
Hence, define
    β_n(i) = p(X_{n+1}^N | q_n^i)
so that
    α_n(i) · β_n(i) = p(q_n^i, X_1^N)
and then
    p(q_n^i | X_1^N) = α_n(i) · β_n(i) / Σ_j α_n(j) · β_n(j)
Recursive definition for β:
    β_n(i) = Σ_j β_{n+1}(j) · a_ij · b_j(x_{n+1})
- recurses backwards from the final state
29 Estimating a_ij from α & β
From the EM equations:
    a_ij^new = p(q_n^j | q_{n-1}^i) = Σ_n p(q_{n-1}^i, q_n^j | X, Θ_old) / Σ_n p(q_{n-1}^i | X, Θ_old)
- probability of the transition, normalized by the probability of being in the start state
Obtain the pairwise terms from:
    p(q_{n-1}^i, q_n^j | X, Θ_old) ∝ α_{n-1}(i) · a_ij · b_j(x_n) · β_n(j)
[Figure: trellis fragment q^i_{n-1} → q^j_n, showing α_{n-1}(i), the arc a_ij · b_j(x_n), and β_n(j)]
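Putting slides 27-29 together, here is a single Baum-Welch-style sketch (toy parameters assumed, as in the earlier sketches): it runs the forward and backward recursions, forms the occupancies γ_n(i) and pairwise posteriors ξ_n(i,j), and re-estimates a_ij:

```python
import numpy as np

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
b = np.array([[0.9, 0.1],      # b[n, i] = b_i(x_n)
              [0.4, 0.6],
              [0.2, 0.8]])
N, S = b.shape

alpha = np.zeros((N, S)); beta = np.ones((N, S))
alpha[0] = pi * b[0]
for n in range(1, N):
    alpha[n] = (alpha[n - 1] @ A) * b[n]           # forward
for n in range(N - 2, -1, -1):
    beta[n] = A @ (b[n + 1] * beta[n + 1])         # backward

px = alpha[-1].sum()                               # p(X | M)
gamma = alpha * beta / px                          # gamma[n, i] = p(q_n^i | X_1^N)

# xi[n, i, j] = p(q_n^i, q_{n+1}^j | X) = alpha_n(i) a_ij b_j(x_{n+1}) beta_{n+1}(j) / p(X)
xi = alpha[:-1, :, None] * A[None] * (b[1:] * beta[1:])[:, None, :] / px
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # EM update of a_ij
print(A_new)
```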
30 GMM-HMMs in practice
GMMs as acoustic models: train by including the mixture indices as additional unknowns
- just more complicated equations...
Practical GMMs:
- 9 to 39 feature dimensions
- 2 to 64 Gaussians per mixture, depending on the number of training examples
Training constraint: define the state sequence
- decide words & pronunciations for the training data
- EM will then figure out the boundaries
Lots of data → can model more classes
- e.g. context-independent (CI): q_i = ae aa ax ...
  vs. context-dependent (CD): q_i = b-ae-b b-ae-k ...
31 HMM training in practice
EM only finds a local optimum → critically dependent on initialization
- approximate parameters / rough alignment
[Figure: model inventory (ae_1, ae_2, ae_3, dh_1, dh_2, ...) plus labelled training data ("dh ax k ae t s ae t aa n") → uniform initialization alignments → initialization parameters Θ_init]
Repeat until convergence:
- E-step: probabilities of the unknowns, p(q_n^i | X_1^N, Θ_old)
- M-step: maximize via the parameters, Θ: max E[log p(X, Q | Θ)]
Applicable to more than just words...
32 Outline
- Dynamic Time Warp
- Hidden Markov Models
- HMM training
- Discriminant models
  - Being discriminant
  - Neural-net recognizers
33 4 Discriminant models
EM training of HMMs is maximum likelihood
- i.e. locally maximizes p(X_trn | Θ)
- converges to the Bayes optimum ... in the limit
The decision rule is max p(X | M) · p(M):
- training will increase p(X | M_correct)
- may also increase p(X | M_wrong) (more!)
Discriminant training tries directly to increase the discrimination between right & wrong models
- e.g. Maximum Mutual Information (MMI):
    I(M_j, X | Θ) = log [ p(M_j, X | Θ) / (p(M_j | Θ) · p(X | Θ)) ]
                  = log [ p(X | M_j, Θ) / Σ_k p(X | M_k, Θ) · p(M_k | Θ) ]
34 Neural Network Acoustic Models
A single model generates posteriors directly for all classes at once = frame-discriminant
Use a regular HMM decoder for recognition:
- set b_i(x_n) = p(x_n | q_i) ∝ p(q_i | x_n) / p(q_i)
Nets are less sensitive to the input representation:
- skewed feature distributions
- correlated features
Can use a temporal context window to let the net see feature dynamics:
[Figure: feature calculation giving C_0, C_1, C_2, ..., C_k over frames t_n ... t_{n+w}, fed to a net producing posteriors p(q_i | X) for phone classes h#, pcl, bcl, tcl, dcl, ...]
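The posterior-to-scaled-likelihood step is a one-liner; a sketch with assumed toy posteriors and priors follows. The per-frame p(x_n) term is dropped since it is constant across states at each frame and so cannot change the decoded path:

```python
import numpy as np

# Hybrid trick: turn net posteriors p(q_i | x_n) into scaled likelihoods
# b_i(x_n) ∝ p(q_i | x_n) / p(q_i) for the HMM decoder.  The values below
# are assumed toy numbers; in practice the priors are the relative
# frequencies of each phone class in the training labels.
posteriors = np.array([[0.7, 0.2, 0.1],    # one row of p(q_i | x_n) per frame
                       [0.1, 0.8, 0.1]])
priors = np.array([0.5, 0.3, 0.2])         # p(q_i)
log_scaled_lik = np.log(posteriors) - np.log(priors)   # log b_i(x_n) + const
print(log_scaled_lik)
```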
35 Forced alignment
Best (Viterbi) labeling given an existing classifier, constrained by the known word sequence
[Figure: feature vectors → existing classifier → phone posterior probabilities; known word sequence + dictionary ("ow th r iy ... ow th r iy n s") → constrained alignment → training targets ("ow th r iy") → classifier training]
36 Neural nets: Practicalities
Typical net sizes:
- input layer: 9 frames × 9-40 features ≈ 300 units
- hidden layer: number of units depends on the training data
- output layer: context-independent phones
Hard to make context dependent
- problems training many classes that are similar?
The representation is opaque:
[Figure: Input → Hidden weights for hidden unit #187 (feature index vs. time frame), and Hidden → Output weights into the output layer (phones)]
37 Recap: Recognizer Structure
[Block diagram: sound → Feature calculation → feature vectors → Acoustic classifier (Network weights) → phone probabilities → HMM decoder (Word models, Language model) → phone & word labeling]
- Know how to execute each stage
- Know how to train each stage ... except the language/word models
38 Summary
Speech is modeled as a sequence of features:
- need a temporal aspect to recognition
- best time-alignment of templates = DTW
Hidden Markov models are the rigorous solution:
- self-loops allow temporal dilation
- exact, efficient likelihood calculations
HMMs can be trained via EM:
- guaranteed to maximize the data likelihood (locally)
- acoustic models (GMMs) can be trained too
Discriminant training considers the wrong models:
- recognition requires improved margin
- frame-discriminant neural nets give better model discrimination?