Using Segmentation Constraints in an Implicit Segmentation Scheme for Online Word Recognition
Émilie CAILLAULT, LASL, ULCO/University of Littoral (Calais)
Christian VIARD-GAUDIN, IRCCyN, University of Nantes
Aim
Problem: training of a neuro-Markovian system in the context of cursive handwritten word recognition.
Difficulty: bootstrapping the system so that it can be trained from scratch.
Proposed solution: an implicit segmentation procedure, with the definition of a segmentation control function and of a segmentation quality measure.
How to train a neuro-Markovian system?
Character-level training (teaching signal: the character): various hybrid classifiers built on an HMM, e.g. HMM-HMM, NN-HMM, SVM-HMM, k-NN-HMM.
Word-level training (teaching signal: the word): an NN-HMM hybrid classifier, as used in speech and off-line recognition.
Global system overview: MS-TDNN
1- MsTDNN-HMM
[Figure: the TDNN inputs are the temporal sequence of feature vectors for the word « quand »; overlapping observation windows Obs_1 … Obs_n share the same weight matrices (same convolutional window inside each observation window of the TDNN); the resulting sequence of observation vectors is processed by the HMM. With 1 state per character the TDNN has 66 output states; with 3 states per character, 198 states = 198 output neurons.]
The input word is scanned by a TDNN (temporal convolution), which produces the sequence of observation probabilities. No explicit segmentation is carried out; an overlapping window is simply shifted at regular intervals.
Advantages of the TDNN architecture: weight sharing reduces both computation time and memory requirements, and the sliding observation window allows a further reduction of the number of computations with an adapted topology of the TDNN.
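The scanning step described above can be sketched as follows; the window length `win` and shift `step` are illustrative parameters of this sketch, not values stated in the talk:

```python
import numpy as np

def sliding_windows(features, win, step):
    """Overlapping observation windows over the feature sequence, as the
    TDNN scans the word: no explicit segmentation, just a window
    regularly shifted by `step` frames."""
    T = features.shape[0]
    starts = range(0, max(T - win, 0) + 1, step)
    # Each window is a (win, n_features) slice of the input sequence.
    return np.stack([features[s:s + win] for s in starts])
```

Each returned window corresponds to one observation Obs_t fed to the shared-weight convolutional layers.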
1- MsTDNN-HMM: dynamic programming, likelihood computation
[Figure: word lexicon (un, deux, trois, quand); a word HMM is the concatenation of letter HMMs (e.g. q-u-a-n-d), using 1-state or 3-state letter models; constant transition probabilities a_ij; observation probabilities b_j(O_t); best Viterbi path through the trellis, e.g. q q q q u a a n n d d d.]
For each entry in the lexicon, a pseudo word HMM is constructed dynamically by concatenating letter HMMs. The likelihood of each word in the lexicon is then computed by multiplying the observation probabilities along the best path through the graph, found by dynamic programming with the Viterbi algorithm. The word model with the highest likelihood is the top recognition candidate.
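A minimal sketch of this likelihood computation, assuming a strict left-to-right word model with constant transitions (so only the emission probabilities matter on the best path); the `char_index` mapping and array shapes are assumptions of this sketch:

```python
import numpy as np

def word_log_likelihood(obs_probs, word, char_index, states_per_char=1):
    """Best-path (Viterbi) log-likelihood of `word`.

    obs_probs  : (T, C) array, obs_probs[t, c] = P(class c | window t)
    word       : e.g. "quand"
    char_index : maps each character to its output row in obs_probs
    """
    # Word model = concatenation of letter models (left-to-right).
    states = [char_index[ch] for ch in word for _ in range(states_per_char)]
    T, S = obs_probs.shape[0], len(states)
    log_b = np.log(obs_probs[:, states] + 1e-12)   # (T, S) emission log-probs
    dp = np.full((T, S), -np.inf)
    dp[0, 0] = log_b[0, 0]                         # must start in first state
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]
            move = dp[t - 1, s - 1] if s > 0 else -np.inf
            dp[t, s] = max(stay, move) + log_b[t, s]
    return dp[T - 1, S - 1]                        # must end in last state
```

Running this for every lexicon entry and ranking by score gives the recognition candidate list described on the slide.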
1- MsTDNN-HMM: training
[Figure: training pipeline — original data → normalisation → feature extraction → sequence of feature vectors (delta, x-coord, y-coord, x-direction, y-direction) → TDNN → sequence of observation probability vectors Obs_1 … Obs_T → dynamic programming → character likelihood → word likelihood computation against the word lexicon → list of N words ranked by their word-likelihood score; the gradient of a global cost function is backpropagated through the TDNN.]
Word-level training uses the classical backpropagation algorithm, with a cost function mixing MLE, MMI and character-level discrimination [ICDAR 2005].
Example: « one » recognized as « over ». True HMM (+): ooonneee; best HMM (−): oovveeer; best character (−): aouvneel.
2- Implicit Segmentation Control: illustration of the segmentation problem
With a random initialisation of the TDNN, all outputs are equally probable.
Example: the possible paths for the English word « of » with 10 observations (TDNN windows):
offfffffff  ooffffffff  ooofffffff  ooooffffff  ooooofffff  ooooooffff  ooooooofff  ooooooooff  ooooooooof
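These paths can be enumerated programmatically; for a word of n letters and T observations there are C(T−1, n−1) such left-to-right alignments, each letter covering at least one observation:

```python
from itertools import combinations

def segmentation_paths(word, n_obs):
    """All left-to-right alignments of `word` to `n_obs` observations,
    each letter taking at least one observation (no skips)."""
    paths = []
    # Choose the len(word)-1 letter boundaries among the n_obs-1 gaps.
    for cuts in combinations(range(1, n_obs), len(word) - 1):
        bounds = (0,) + cuts + (n_obs,)
        paths.append("".join(word[i] * (bounds[i + 1] - bounds[i])
                             for i in range(len(word))))
    return paths
```

For « of » and 10 observations this yields exactly the nine paths listed above.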
2- Implicit Segmentation Control: how to obtain the desired likelihood distribution P(Γ)?
The nine segmentation paths for « of » (offfffffff, ooffffffff, …, ooooooooof) are a priori equally likely.
Existing solutions, which require training with a character database:
- a duration model for the HMM sub-units [Penacee];
- manual segmentation constraints [Remus, NPen++].
Proposed solution:
- no modification of the HMM structure (transitions = 1);
- no training with a character database;
- integration of a duration model in the emission probabilities.
2- Implicit Segmentation Control: weighting function f_T
The TDNN outputs are weighted before likelihood computation:
Outputs(t, l) = min( Outputs_NN(t, l) × f_T(H) ; 1 )
[Figure: profile of f_T over the observation range corresponding to one letter, plotted for H = 100 and H > 1000.]
where, for the concerned letter: B = first observation position, E = last observation position, S = E − B = letter size, and H = height of the function f_T.
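A sketch of this duration weighting, under the assumption of a flat f_T profile (weight H inside [B, E], 1 outside) and of each letter receiving an equal share of the T observations; the actual f_T in the talk may be smoother near B and E:

```python
import numpy as np

def apply_duration_weighting(outputs, word, char_index, H=100.0):
    """Boost the TDNN output of each letter of `word` over its balanced
    observation range, clipping at 1 as in Outputs = min(Outputs_NN * f_T; 1).

    outputs: (T, C) TDNN observation probabilities; a weighted copy is returned.
    """
    T = outputs.shape[0]
    out = outputs.copy()
    S = T / len(word)                         # letter size: equal share of T
    for k, ch in enumerate(word):
        B, E = int(k * S), int((k + 1) * S)   # first/last obs for this letter
        l = char_index[ch]
        out[B:E, l] = np.minimum(out[B:E, l] * H, 1.0)
    return out
```

With a randomly initialised TDNN this makes the balanced segmentation path dominate, which is exactly the bootstrapping effect sought.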
2- Implicit Segmentation Control: weighting function applied, case of the word « of »
[Figure: weighting coefficient (1 to 100) as a function of the observation position (1 to 10) for the word « of », with H = 100; each letter's coefficient peaks over its own half of the word.]
2- Implicit Segmentation Control: how to evaluate the effectiveness of this weighting function?
ASR = (1/N) · Σ_{e=1..N} Π_{i=1..NLs(e)} [ nb_obs(i) / ( NLc(i) × T / NL(e) ) ]
ASR: averaged homogeneous segmentation rate;
e: index of the considered sample; N: (fixed or limited) number of training samples;
T: number of observations scanned by the TDNN;
NL(e): number of letters in the word label;
NLs(e): number of sequences of different letters; i: index of a letter sequence, from 1 to NLs(e);
NLc(i): number of identical consecutive letters in sequence i;
nb_obs(i): number of observations of sequence i in the optimal Viterbi path.
Case of « of » (T = 10, NL = 2):
obs(o) | 1    2    3    4    5    6    7    8    9
obs(f) | 9    8    7    6    5    4    3    2    1
ASR    | 0.36 0.64 0.84 0.96 1    0.96 0.84 0.64 0.36
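The per-word rate can be computed as below; the product form over letter sequences is reconstructed here so as to reproduce the « of » table (e.g. (1/5) × (9/5) = 0.36, and 1 for the perfectly balanced split):

```python
def asr_word(obs_per_seq, nlc_per_seq, T, NL):
    """Segmentation rate for one word: product over letter sequences i of
    nb_obs(i) / (NLc(i) * T / NL); equals 1 for a perfectly balanced
    Viterbi alignment."""
    r = 1.0
    for nb_obs, nlc in zip(obs_per_seq, nlc_per_seq):
        r *= nb_obs / (nlc * T / NL)
    return r
```

Averaging `asr_word` over the N training samples gives the ASR of the slide.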
3- Experiments
Training in two stages:
1) training with the segmentation-constraint function f_T for 20 epochs: the TDNN outputs are weighted by f_T (constrained sequence with duration model), and the word likelihood is computed on the true HMM, given the label of the true word;
2) training with no segmentation constraint.
[Figure: the training pipeline of the earlier slide — normalisation, feature extraction (delta, x-coord, y-coord, x-direction, y-direction), TDNN, sequence of observation probability vectors Obs_1 … Obs_T — with the weighting function f_T inserted between the TDNN outputs and the likelihood computation, and the gradient of a global cost function backpropagated.]
Results on the online IRONOFF word database
IRONOFF: English / French; lexicon of 197 words; 2/3 training: 20,898 words; 1/3 test: 10,448 words.

System                              Recognition rate (%)   Rel. improvement (%)   ASR     Rel. improvement (%)
1State-TDNN                         87.09                  -                      0.423   -
1S-TDNN + segmentation constraint   91.31                  32.6                   0.498   17.7
3State-TDNN                         90.51                  -                      0.612   -
3S-TDNN + segmentation constraint   94.81                  28.1                   0.684   11.8
Conclusions
A simple and effective process to bootstrap a hybrid system without any labeled character database.
The recognition rate is significantly increased by an initialisation stage using a duration model (the weighting function f_T) applied directly to the TDNN outputs.
A better segmentation quality, as measured by the averaged segmentation rate (ASR), yields better recognition rates.
Future work: a controlled segmentation stage
- Measure the ASR at each training iteration.
- Adapt the parameter H of the weighting function to relax the segmentation constraints according to the ASR value: H = f(ASR).
- Initialise with H = 100 (strong segmentation constraint: only balanced segmentation paths are possible), then relax the constraint on H once the ASR has reached a reasonable value.
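One plausible schedule for this adaptation is sketched below; the talk leaves H = f(ASR) open, so the linear decay, the initial height `H0` and the target `asr_target` are illustrative choices only:

```python
def adapt_H(asr, H0=100.0, asr_target=0.8):
    """Relax the segmentation constraint as segmentation quality improves:
    full strength H0 while the ASR is near 0, decaying linearly toward 1
    (i.e. no constraint) as the ASR approaches the target value."""
    if asr >= asr_target:
        return 1.0
    return 1.0 + (H0 - 1.0) * (1.0 - asr / asr_target)
```

This keeps only balanced paths possible early in training and lets the data-driven segmentation take over once the ASR is reasonable.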
Bonus
The recognition engine: the TDNN
[Figure: from the MLP (global vision) to the TDNN (local vision combined into a global one), with weight sharing.]