Automatic slide assignation for language model adaptation
Applications of Computational Linguistics
Adrià Agustí Martínez Villaronga
May 23, 2013

1 Introduction

Online multimedia repositories are rapidly growing and establishing themselves as fundamental knowledge assets. This is particularly true in education, where large repositories of video lectures are being built, making education accessible to a wide community of potential students. As with many other repositories, most lectures are not transcribed because of the lack of efficient solutions to obtain transcriptions at a reasonable level of accuracy. However, transcription of video lectures is clearly necessary to make them more accessible. It would also facilitate lecture searchability and analysis, including classification, summarisation, and plagiarism detection. In addition, people with hearing disabilities would be able to follow the lectures just by reading the transcriptions.

Manual transcription of these repositories is excessively expensive and time-consuming, and current state-of-the-art automatic speech recognition (ASR) has not yet demonstrated its potential to provide acceptable transcriptions on large-scale collections of audiovisual objects. However, in this type of video the speaker often presents with some kind of background slides, and in these cases a strong correlation can be observed between the slides and the speech. Consequently, the slides provide an interesting opportunity to adapt general-purpose ASR models with lecture-specific knowledge. In [1] we proposed an adaptation technique that obtains an adapted language model for each video using the slides, reporting an improvement of up to 3.6 absolute WER points when using slides.

For the present work we will assume that we are given a set of videos together with their slides.
The slides, however, will not be labeled, so we will not be able to directly obtain a video-adapted model for each video; we will first need to assign one of the slide sets to each video. In this work we explore the automatic assignation of slides and study its impact on the final WER, comparing the resulting transcriptions with the ones obtained using the correct slides.
In this work we will focus on the polimedia repository. polimedia [2] was created for the production and distribution of multimedia educational content at the Universitat Politècnica de València. Lecturers are able to record lectures under controlled conditions, and the lectures are distributed along with time-aligned slides.

2 Language model adaptation background

Language model adaptation in the context of video transcription consists of using a specific language model for each video (or set of videos). In the adaptation method we propose, one language model is trained from the text in the slides. This model is then interpolated [3] with models trained on other corpora (out-of-domain and in-domain) to obtain a more general and more powerful model that is still adapted to the specific lecture. All language models are standard n-gram models, smoothed using modified Kneser-Ney [4]. Training was performed with the SRILM toolkit [5], which is free for academic purposes.

2.1 Corpora description

Several corpora were used to train the language model: up to 9 out-of-domain corpora (Tables 1 and 2), and the polimedia corpus as in-domain corpus.

Table 1: Basic statistics of out-of-domain corpora

Corpus            # sentences    # words   Vocabulary
EPPS                     132K       0.9M          27K
news-commentary          183K       4.6M          78K
TED                      316K       2.3M          69K
UnitedNations            448K      10.8M         105K
Europarl-v7            2 123K      54.9M         155K
El Periódico           2 695K      45.4M         313K
news (07-11)           8 627K     217.2M         775K
UnDoc                  9 968K     318.0M         472K

Table 2: Basic statistics of Google's Ngram corpus (v1)

# unigrams    # pages    # books   Vocabulary
   45 360M       128M       521K         292K

The polimedia corpus comprises more than 100 hours of manually transcribed videos, divided into a training set, a development set and a test set. The corpus also provides manual transcriptions of the slides for the dev and test sets. Tables 3 and 4 provide detailed statistics of the polimedia corpus.
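The interpolation in [3] is a linear mixture of component models, with the mixture weight chosen to minimise perplexity on held-out text. The sketch below illustrates the idea with toy add-one-smoothed unigram models and a simple grid search; the actual system uses SRILM with modified Kneser-Ney n-grams, and all corpus strings here are invented for illustration.

```python
import math
from collections import Counter

def unigram_model(text, vocab):
    # Add-one smoothed unigram probabilities over a fixed vocabulary
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def interpolate(p_out, p_in, lam):
    # Linear interpolation: p(w) = lam * p_in(w) + (1 - lam) * p_out(w)
    return {w: lam * p_in[w] + (1 - lam) * p_out[w] for w in p_out}

def perplexity(model, text):
    words = text.split()
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))

# Toy stand-ins for an out-of-domain corpus, an in-domain corpus,
# and a held-out development text (all hypothetical)
out_domain = "the parliament approved the budget"
in_domain = "the model adapts to the lecture slides"
dev = "the lecture model"
vocab = set((out_domain + " " + in_domain + " " + dev).split())

p_out = unigram_model(out_domain, vocab)
p_in = unigram_model(in_domain, vocab)

# Grid search for the interpolation weight that minimises dev perplexity
best = min((perplexity(interpolate(p_out, p_in, lam), dev), lam)
           for lam in [i / 10 for i in range(11)])
print("best lambda:", best[1], "ppl:", round(best[0], 2))
```

In practice the weights are estimated with EM over the held-out set rather than by grid search, and there is one weight per component corpus.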
Table 3: Basic statistics of the polimedia corpus

        Videos   Time (hours)   # sentences   # words   Vocabulary
train      655             96         41.5K     96.8K          28K
dev         26            3.5          1.4K       34K         4.5K
test        23              3          1.1K     28.7K           4K

Table 4: Basic statistics of slides in the polimedia corpus

       Videos   # slides   # sentences   # words   Vocabulary
dev        26        107          1865     16.2K         3.5K
test       23        363          1796     14.5K         2.9K

The vocabulary is formed by the 50K most frequent words from all out-of-domain corpora plus all the words in the polimedia training set.

3 Automatic slide assignation

We are given a set of video transcriptions (correct or containing errors) and the slide text for all of the videos. The slides, however, are unlabeled: we do not know which slide set belongs to each video. We propose a simple yet very effective technique to assign slides to all of the videos. To decide which slide set best suits a specific video, we train a 3-gram model for each slide set and use its perplexity over the video transcription as a score. The slide set assigned to the video is the one with the lowest perplexity.

4 Experiments

The aim of this work is twofold: on the one hand, we want to check whether the proposed method assigns the slides to the right video with a low error rate; on the other hand, we want to explore the impact of this assignation on the transcription error when adapting using slides. With these goals in mind, we performed the following experiments.

4.1 Automatic assignation experiments

Using the technique described in Section 3, we carried out two types of automatic assignation experiments: assignation using correctly transcribed videos, and assignation using automatic transcriptions. For each type of transcription we tested the technique on subsets of different sizes, and for each size we repeated the experiment multiple times using different random subsets, in order to make the result independent of the chosen subset.
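The assignation rule of Section 3 reduces to an argmin over slide-set perplexities. The following is a minimal sketch, with add-one-smoothed bigram models standing in for the 3-gram SRILM models used in the paper; the slide texts and video names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(text):
    # Toy bigram model: raw counts plus vocabulary, smoothed at query time
    words = ["<s>"] + text.split()
    bigrams = Counter(zip(words, words[1:]))
    unigrams = Counter(words[:-1])
    return bigrams, unigrams, set(words)

def perplexity(model, text):
    bigrams, unigrams, vocab = model
    words = ["<s>"] + text.split()
    v = len(vocab) + 1  # +1 so out-of-vocabulary words get mass too
    log_p = 0.0
    for prev, w in zip(words, words[1:]):
        # Add-one smoothing: unseen bigrams still get non-zero probability
        log_p += math.log((bigrams[(prev, w)] + 1) / (unigrams[prev] + v))
    return math.exp(-log_p / (len(words) - 1))

def assign_slides(transcription, slide_sets):
    # Pick the slide set whose language model gives the lowest perplexity
    models = {name: train_bigram(text) for name, text in slide_sets.items()}
    return min(models, key=lambda n: perplexity(models[n], transcription))

slide_sets = {  # hypothetical slide texts for two videos
    "M01": "gradient descent learning rate convergence",
    "M02": "supply demand market equilibrium price",
}
transcription = "today we study gradient descent and its learning rate"
print(assign_slides(transcription, slide_sets))  # → M01
```

The same loop runs whether the transcription is a manual reference or ASR output, which is exactly the comparison made in the experiments below.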
Table 5 shows the assignation errors: the absolute number of incorrectly assigned videos as well as the percentage, for both correct and automatic transcriptions.
Table 5: Assignation error (absolute and relative)

                            5 videos    10 videos    25 videos    Full corpus
                           Abs.    %    Abs.    %    Abs.    %    Abs.     %
Correct transcriptions        1  0.2       1  0.1       2 0.08       3  0.06
Automatic transcriptions      1  0.2    1.33 0.13    2.66 0.11       6  0.12

We can observe that, independently of the size of the subset, it is common to find some errors in the assignation, even when the correct transcriptions are used. An analysis of the wrongly classified videos may clarify the causes of these misassignments.

Table 6: Wrongly assigned videos. Videos marked with * are wrongly assigned only when using automatic transcriptions

Video      # words in the slides
M54.B05*                     484
M03.B01*                     650
M62.B02                       17
M62.B03*                     402
M62.B04                        0
M62.B05                        0

Looking at the results in Table 6, we can observe that the slides of three of the videos are empty or almost empty. These three videos are the only assignment errors when the correct transcriptions are used.

4.2 LM adaptation with automatically assigned slides

The proposed language model adaptation techniques are measured in terms of both perplexity and WER, obtained with a state-of-the-art ASR system [6]. The acoustic model was trained on the polimedia corpus (Table 3), employing triphonemes inferred with a conventional CART with almost 3900 leaves. Each triphoneme was trained with up to 64 Gaussian mixture components per state, 4 iterations per mixture, and 3 states per phoneme with the typical left-to-right topology without skips. Additionally, speaker adaptation was performed by applying CMLLR feature normalisation (full transformation matrices).

The baseline language model was computed by interpolating all the out-of-domain corpora with the polimedia corpus. The adapted language models were computed as discussed in Section 2, by interpolating all the previous corpora with the one trained from the video slides. Table 7 shows the results in terms of WER and PPL for both adapted models (correct and automatic assignation) as well as for the baseline model.
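The WER figures reported throughout are the standard word-level edit distance (substitutions, insertions and deletions against the reference) normalised by the reference length. A minimal dynamic-programming sketch, with an invented example pair:

```python
def wer(reference, hypothesis):
    # Levenshtein distance between word sequences, as a percentage
    # of the reference length
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# Two deletions against a 6-word reference: 2/6 ≈ 33.3%
print(round(wer("the cat sat on the mat", "the cat sat mat"), 1))  # → 33.3
```

Production scoring tools additionally normalise case and punctuation before alignment, but the core metric is this edit distance.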
Results show that adaptation with automatically assigned slides is slightly worse than adaptation with the correct slides, but still significantly better than no adaptation at all.
Table 7: WER (%) and PPLs

                                            Development        Test
                                            PPL    WER     PPL    WER
(a) Baseline                              140.8   22.1   172.1   24.8
(b) Adapted model (correct assignation)    96.6   20.5   113.2   21.2
(c) Adapted model (automatic assignation)  97.8   20.6   117.9   21.4

If we look in more detail at the results for the videos whose slides were incorrectly assigned (Table 8), we can observe that in most cases the transcriptions using the real slides are slightly better than the ones using the assigned slides. However, in the cases where the slides were empty or almost empty, the differences are in general smaller; for one of the videos, the transcription is even better with the automatic assignation.

Table 8: WER comparison for the incorrectly assigned videos

Video     WER corr. sld.   WER ass. sld.   WER transcr.
M54.B05            47.92           48.49          11.47
M03.B01            33.07           34.91           7.21
M62.B02            34.74           34.94           3.61
M62.B03            28.17           30.46          11.37
M62.B04            18.92           20.58           3.84
M62.B05            22.04           21.82           1.52

5 Conclusions and future work

The described methodology has proven very effective at automatically assigning slides: despite a small number of assignation errors, transcription results are almost as good as when using the correct slides. The experiments assigning slides in differently-sized sets show that correct assignation does not depend on the size of the set, but on the properties of each slide set. An interesting experiment derived from this work would be to use the described technique to select one or more documents from the internet to train a new language model, in order to improve the adaptation.

References

[1] A. Martínez-Villaronga, M. A. del Agua, J. Andrés-Ferrer, and A. Juan, "Language model adaptation for video lectures transcription," in Proc. ICASSP, Vancouver, Canada, May 2013.

[2] polimedia: Videolectures from the Universitat Politècnica de València, http://polimedia.upv.es/catalogo/.
[3] Frederick Jelinek and Robert L. Mercer, "Interpolated estimation of Markov source parameters from sparse data," in Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands: North-Holland, May 1980, pp. 381-397.

[4] Stanley F. Chen and Joshua Goodman, "An empirical study of smoothing techniques for language modeling," Computer Speech & Language, vol. 13, no. 4, pp. 359-393, 1999.

[5] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proc. Intl. Conf. on Spoken Language Processing, Denver, Colorado, September 2002.

[6] The transLectures-UPV Team, The transLectures-UPV toolkit (TLK) for Automatic Speech Recognition, http://translectures.eu/tlk.