THE RWTH ENGLISH LECTURE RECOGNITION SYSTEM
|
|
|
- Rose Harrell
- 10 years ago
- Views:
Transcription
1 THE RWTH ENGLISH LECTURE RECOGNITION SYSTEM Simon Wiesler 1, Kazuki Irie 2,, Zoltán Tüske 1, Ralf Schlüter 1, Hermann Ney 1,2 1 Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany 2 École Centrale Paris, Paris, France 3 LIMSI CNRS, Spoken Language Processing Group, Paris, France ABSTRACT In this paper, we describe the RWTH speech recognition system for English lectures developed within the Translectures project. A difficulty in the development of an English lectures recognition system, is the high ratio of non-native speakers. We address this problem by using very effective deep bottleneck features trained on multilingual data. The acoustic model is trained on large amounts of data from different domains and with different dialects. Large improvements are obtained from unsupervised acoustic adaptation. Another challenge is the frequent use of technical terms and the wide range of topics. In our recognition system, slides, which are attached to most lectures, are used for improving lexical coverage and language model adaptation. Index Terms lecture recognition, speech recognition system, LVCSR 1. INTRODUCTION Video lectures currently receive a lot of attention. Renowned universities have made lectures available in electronic form, for example on Coursera [1] or edx [2], accompanied by additional material and interaction methods. In addition, many conferences including ICASSP record talks and make them available to a wide audience. The automatic transcription of these videos is of high interest. Transcriptions allow to perform text search in videos and facilitate access of non-native speakers and people with hearing disabilities. In this work, we describe our English lecture recognition system, which has been developed within the Translectures project [3]. The project aims at transcribing the Videolectures.NET [4] archive, which is a very large online repository for academic videos. Most of the talks are The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/ ) under grant agreement no (translectures). H. Ney was partially supported by a senior chair award from DIGITEO, a French research cluster in Ile-de- France. The author performed the work while at RWTH Aachen University. in English language and were given at computer science conferences as ICML, NIPS, and others. This means, the system described in this paper is applied to a large-scale task with real-life data. The lecture recognition task has several characteristics, which impose challenges for automatic speech recognition (ASR). One problem is the very large vocabulary, in particular the frequent use of rare technical terms. In addition, the video lectures cover a wide range of topics. Another challenge that is specific for English lectures is the large ratio of non-native speakers. Typically, performance of ASR systems already degrades when dealing with different dialects which are not covered by the acoustic training data. The effect of foreign speakers is even more severe. We address this problem by using training data with different dialects and even different languages for the training of bottleneck features. Finally, since the goal is to transcribe the complete Videolectures.NET database, our recognition system must be efficient. On the other hand, the task offers specific opportunities for improving ASR performance. First, lectures typically only have one speaker and unsupervised speaker adaptation is therefore very effective. Second, many video lectures are attached with slides, which can be used as an additional knowledge source for extending the vocabulary and language model adaptation [5]. Furthermore, many video lectures have been subtitled manually by volunteers. The subtitles are not an exact transcription as known from standard ASR tasks. Still, the data can be used as a basis for acoustic model training [6]. The topic of this paper is therefore a challenging real-life task. The task has several properties different from conventional ASR research tasks, which makes it interesting to study how well existing methods perform on it. Further, it is required to adapt techniques to the task under consideration. The next section briefly describes the Videolectures.NET database. The remaining sections give an overview of our lecture recognition system, including highly effective deep multilingual bottleneck features and our proposed language model adaptation approach.
2 Table 1. Statistics of the Videolectures.NET database. Archive Development Test Videos 9, Time 5, 900h 3.2h 3.4h Sentences - 1K 1.3K Words - 28K 34K Vocabulary - 3K 3K 2. THE VIDEOLECTURES.NET DATABASE Videolectures.NET is a free and open access video lectures repository. Within the Translectures project, all 9, 148 video lectures that have been viewed at least fifty times at project start, are automatically transcribed. A very small part of the database has been carefully transcribed for the development of ASR systems. The statistics of the development and test sets as well as the complete archive are given in Table 1. Until now, additional twenty hours of training data have been transcribed. Due to this small amount, we have not used this data so far. Most of the lectures are attached with slides. Depending on the format, the slide text can be extracted directly, or has to be generated with optical character recognition (OCR). The text extraction of the slides has already been performed within the Translectures project. We assume the slide texts to be given and do not deal with the difficulties of OCR ourselves. The quality of the extracted slide texts is quite low in general. Even if the text can be extracted without OCR, mathematical formulas and other graphical elements impose difficulties. 3. ACOUSTIC MODEL Our acoustic model is based on the tandem approach [7]. This allows using well-understood Gaussian mixture model (GMM) speaker adaptation techniques, which is of crucial importance for the task under consideration Resources The acoustic model has been trained on large amounts of acoustic data from various resources with different domains and dialects, see Table 2. The Quaero English corpus contains podcasts, mostly in British English, which has been provided within the Quaero project [8]. HUB4 and TDT4 are widely used broadcast news corpora in American English. EPPS is a collection of carefully transcribed European Parliament speeches. The TED data has been collected by ourselves. We downloaded 200 hours of subtitled videos from the TED website [9]. The subtitles are not well aligned to the audio data and contain additional annotation, for example speaker information. Repetitions and disfluencies are also not annotated in subtitles. Therefore, we applied a text postprocessing to the Table 2. Statistics of acoustic training data. BC stands for broadcast conversations, BN for broadcast news, PS for parliament speeches, and L for lectures. Corpus Domain Duration (h) #Words Quaero English BC M HUB4 BN M TDT4 BN M EPPS PS M TED L M Quaero French BC M Quaero German BC M Quaero Polish BC M subtitles and removed audio segments which could not be aligned when using a low alignment pruning threshold. In total, 962 hours of English audio data have been used for acoustic model training. Furthermore, we used data from other languages for training multilingual neural network features, see Subsection 3.3. The multilingual bottleneck features have been trained on all Quaero corpora shown in Table 2, in total 837 hours Baseline feature extraction Sixteen Mel-cepstral coefficients (MFCC) are extracted every 10 ms using a bank of 20 filters. In addition, a voicedness feature is computed. By applying a sliding window of size 9, 154-dimensional features are obtained, which are mapped by a linear discriminant analysis (LDA) to a 45-dimensional subspace Multilingual deep bottleneck features In addition to the cepstral features, bottleneck (BN) features extracted from a multilayer perceptron (MLP) are used. The neural network has already been trained within the Quaero project [10]. This illustrates a major practical advantage of the tandem approach: The bottleneck features can be shared across different tasks. In our case, the network is trained on multilingual data and the features can even be shared across languages. This approach facilitates system development strongly. In addition, we know from our experience on Quaero data that the multilingual training improves performance by about three percent relative in comparison to using only data from the target language [10]. The observed improvements from multilingual training are in the same range as reported by other groups for data-rich languages [11],[12]. The bottleneck features are extracted from a neural network with a hierarchical structure as described in [13, 14], based on MRASTA filtering [15]. The fast modulation part of the MRASTA filtering is fed to a first MLP. A second MLP is trained on the slow modulation components and the PCA transformed BN output of the first MLP. The modulation fea-
3 tures fed to the MLPs are always augmented by the critical band energies. Both MLPs have six hidden layers with 2000 hidden nodes and a 60-dimensional BN layer, which is placed before the last hidden layer. The final features are obtained by applying a PCA to the BN activations of the second MLP and reducing the dimensionality to 38. We applied the multilingual training method proposed by [16]. The MLP training data comprises the English, French, German, and Polish acoustic training data from the Quaero project - in total 837 hours. The feature vectors extracted from the joint corpus of the four languages were randomized and fed to the MLPs. Using language specific softmax outputs, backpropagation has been initiated only from the language specific subset of the output depending on the language-id of the feature vector. The MLPs were trained according to crossentropy criterion with 1500 tied-triphone states per language as outputs [17] Training The acoustic model used in this work is a GMM/HMM. The features for the GMM are obtained by concatenating the spectral features with the BN features described in the previous subsection. The acoustic model has been trained on all English audio data given in Table 2. The parameters have been optimized according to the maximum likelihood (ML) criterion with the expectation maximization algorithm (EM) with Viterbi approximation and a splitting procedure. The GMM has a globally pooled and diagonal covariance matrix and roughly 1.2M densities. It models 4, 500 generalized triphones determined by a decision-tree-based clustering (CART). Speaker adaptation is of crucial importance for the performance of a lecture recognition system, because there is a lot of data per speaker available. We use several speaker adaptation techniques in our system. First, mean and variance normalization is applied to the spectral features on segment level. Furthermore, vocal tract length normalization (VTLN) is applied to the MFCC features. The VTLN warping factors are obtained from a Gaussian classifier (fast-vtln) [18]. In addition, we perform speaker adaptation with constrained maximum likelihood linear regression (CMLLR) [19]. The CMLLR transformation has been applied to the training data and a new speaker-adapted GMM has been trained (speaker adaptive training). In recognition, the CMLLR transformations are estimated from a first recognition pass and then, a second recognition pass with the GMM from speaker adaptive training (SAT) is performed. In addition to these transformations on feature side, Gaussian mixture models can also be adapted directly with maximum likelihood linear regression (MLLR) [19]. In our experience, MLLR usually does not improve performance of tandem systems, see for example [20]. For the lecture recognition system, we found MLLR to be beneficial due to the large amount of adaptation data per speaker. Table 3. Statistics of text resources. The third column gives the weight of the data source in the interpolated LM. Corpus #Words λ Acoustic transcriptions 8M 0.27 Slide texts 97M 0.25 IWSLT M 0.13 Quaero blog M 0.12 WIT TED talks 3M 0.11 Gigaword 3B 0.08 WMT 2012 news-crawl 2.8B LEXICON Our pronunciation modeling is based on the British English Example Pronunciation (BEEP) dictionary [21]. Missing pronunciations are determined by a grapheme-to-phoneme conversion (g2p) model. We use a combination of a generative g2p model and a conditional random field based approach as described in [22]. The baseline recognition vocabulary has been determined by using the 150k most frequent words from the English Gigaword corpus, see Table 3. Many technical terms which are used in lectures are not covered by the Gigaword corpus. In order to reduce the OOV rate, we added the 50k most frequent unknown words from the slide texts to the recognition lexicon. This adds a lot of noise to the lexicon, but the OOV rate is reduced strongly, because the slides exactly correspond to the data that is recognized Resources 5. LANGUAGE MODEL The datasets for language model training are summarized in Table 3. Only small amounts of in-domain data are available: the transcriptions of the acoustic training data, the collection of Videolectures.NET slide texts, lecture data provided by the IWSLT 2013 evaluation campaign [23], and a small collection of TED talks. A large archive of blog data has been provided within the Quaero project. Gigaword and news-crawl are very large news text collections. In total, the language model is trained on 6.6 billion running words Training The texts were normalized and converted to lower case. Punctuations were discarded. For every data source, a single 4- gram LM with modified Kneser-Ney smoothing has been estimated using the SRILM toolkit [24]. These LMs were linearly interpolated by optimizing the perplexity (PPL) on a validation set. Table 3 shows the interpolation factor of each data source. It can be seen that the in-domain datasets have much more weight than the large out-of-domain sources, despite of
4 Table 4. Word error rates (in %) for the MFCC baseline system and the tandem system on the development and test set. Baseline Tandem Dev. Test Dev. Test speaker-independent CMLLR MLLR Slides/Viterbi Slides/CN their small size. The full language model (11GB) is only applied in lattice rescoring. For recognition, the language model is pruned to 726MB with entropy-pruning Adaptation using slide texts The language model adaptation is performed in a lattice rescoring framework. For each video lecture, a language model is trained on the corresponding slides. We found a careful text normalization to be very important for the noisy slide texts. In rescoring, the unadapted language model described above is dynamically interpolated with the slide language model. The interpolation weight has been chosen as the average of the weights that minimize the perplexity for each lecture in the development data. We trained N-gram models with order one to four on the corresponding slide texts. Bigrams performed about ten percent relatively better than unigrams in terms of perplexity, and marginally better than trigrams and 4-grams. 6. RECOGNITION SETUP Our system has a four-pass recognition setup. In an initial unadapted pass, a first transcription is obtained, which is used for the CMLLR-adapted recognition pass. In the subsequent MLLR-adapted recognition pass, word lattices are created. The lattices are rescored using the slide-adapted language model. Finally, a confusion network (CN) decoding is performed on the lattice [25]. In contrast to other systems developed in our group, for example the Quaero evaluation systems [20], we do not use system combination or cross-adaptation, because our aim is to apply our system to a large-scale dataset. 7. EXPERIMENTAL RESULTS Table 4 shows detailed word error rate (WER) results for an MFCC baseline system and the tandem system described above. The results highlight the importance of adaptation on this task. The relative improvement of all adaptation techniques on the tandem system is 20.7% on the test data. On the MFCC baseline system, the improvement is even 25.1% Table 5. Effect of using slide vocabulary on the OOV and WER rate. Perplexity (PPL) improvements by language model (LM) adaptation using slide texts. OOV [%] WER [%] PPL Dev. Test baseline lexicon baseline + slides lexicon baseline lexicon baseline + slides lexicon baseline LM slide-adapted LM relative. Slight additional gains are obtained by using CN decoding instead of Viterbi decoding on the final lattices. The multilingual bottleneck features strongly improve system performance from 26.6% WER to 21.2% WER on the final system. The relative improvement is even higher if less adaptation techniques are applied. Overall, the bottleneck features are clearly highly valuable. Using the subtitled videos as acoustic training data only gave a moderate improvement of 1.8% relative in WER. The system benefits from the availability of slide texts in several ways. First, the OOV rate is reduced strongly, in particular on the development data, see Table 5. It can be assumed that most of the relevant technical terms also appear on the slides attached to the videos. The reduction of the OOV rate also reduces the word error rate on the development data. Second, the slides already improve the unadapted language model. They are used as one of the LM text sources and have a high interpolation weight, see Table 3. Finally, the lecture-specific slides are used for LM adaptation. The slide adaptation gives improvements in perplexity (see Table 5) and in the recognition result, but varies with the quality of the slides. 8. CONCLUSION In this work, we described the RWTH speech recognition system for English video lectures in detail. The system described in this paper is applied to a large-scale dataset with real-life data. The lecture recognition task is challenging due to the large vocabulary and the high ratio of non-native speakers. Adaptation of the acoustic model and the language model plays an important role on this task. The tandem acoustic model has been adapted within a multipass CMLLR and MLLR framework. We obtained additional gains by using a language model adaptation method similar to [5] based on the slides which are attached to most videos. We only obtained small improvements by using subtitled videos as acoustic training data. In future work, we plan to use this data more efficiently by making use of unsupervised training techniques as described in [6].
5 9. REFERENCES [1] Coursera, [2] edx, [3] Translectures, [4] Videolectures.NET, [5] A. Martinez-Villaronga, M. A. del Agua, J. Andrés- Ferrer, and A. Juan, Language model adaptation for video lectures transcription, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013, pp [6] A. Rousseau, F. Bougares, P. Deléglise, H. Schwenk, and Y. Estève, LIUM s systems for the IWSLT 2011 speech translation tasks, in Proc. of IWSLT, San Francisco, USA, Dec [7] H. Hermansky, D. P. Ellis, and S. Sharma, Tandem connectionist feature extraction for conventional HMM systems, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Dallas, USA, Mar. 2000, pp [8] Quaero, [9] Ted, [10] Z. Tüske, R. Schlüter, and H. Ney, Multilingual hierarchical MRASTA features for ASR, in Proc. Interspeech, Lyon, France, Aug. 2013, pp [11] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, Multilingual acoustic models using distributed deep neural networks, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013, pp [12] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, Crosslanguage knowledge transfer using multilingual deep neural network with shared hidden layers, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013, pp [13] F. Valente and H. Hermansky, Hierarchical and parallel processing of modulation spectrum for ASR applications, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Las Vegas, USA, Mar. 2008, pp [15] H. Hermansky and P. Fousek, Multi-resolution RASTA filtering for TANDEM-based ASR, in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp [16] S. Scanzio, P. Laface, L. Fissore, R. Gemello, and F. Mana, On the Use of a Multilingual Neural Network Front-End, in Proc. Interspeech, Brisbane, Australia, Sep. 2008, pp [17] Z. Tüske, R. Schlüter, and H. Ney, Deep hierarchical bottleneck MRASTA features for LVCSR, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013, pp [18] L. Welling, S. Kanthak, and H. Ney, Improved methods for vocal tract normalization, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Phoenix, USA, Mar. 1999, pp [19] C. Leggetter and P. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden markov models, Computer Speech and Language, vol. 9, no. 2, p. 171, [20] M. Sundermeyer, M. Nußbaum-Thom, S. Wiesler, C. Plahl, A. El-Desoky Mousa, S. Hahn, D. Nolden, R. Schlüter, and H. Ney, The RWTH 2010 Quaero ASR Evaluation System for English, French, and German, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 2011, pp [21] T. Robinson, BEEP - The British English Example Pronunciation Dictionary, 1995, ftp://svrftp.eng.cam.ac.uk/comp.speech/dictionaries/. [22] S. Hahn, P. Lehnen, S. Wiesler, R. Schlüter, and H. Ney, Improving LVCSR with Hidden Conditional Random Fields for Grapheme-to-Phoneme Conversion, in Proc. Interspeech, Lyon, France, Aug. 2013, pp [23] IWSLT 2013, [24] A. Stolcke, SRILM - an extensible language modeling toolkit, in Proc. Int. Conf. on Spoken Language Processing, Denver, USA, Sep. 2002, pp [25] G. Evermann and P. Woodland, Posterior probability decoding, confidence estimation and system combination, in Proc. NIST Speech Transcription Workshop. Baltimore, USA, [14] C. Plahl, R. Schlüter, and H. Ney, Hierarchical Bottle Neck Features for LVCSR, in Proc. Interspeech, Makuhari, Japan, Sep. 2010, pp
Automatic slide assignation for language model adaptation
Automatic slide assignation for language model adaptation Applications of Computational Linguistics Adrià Agustí Martínez Villaronga May 23, 2013 1 Introduction Online multimedia repositories are rapidly
TED-LIUM: an Automatic Speech Recognition dedicated corpus
TED-LIUM: an Automatic Speech Recognition dedicated corpus Anthony Rousseau, Paul Deléglise, Yannick Estève Laboratoire Informatique de l Université du Maine (LIUM) University of Le Mans, France [email protected]
Strategies for Training Large Scale Neural Network Language Models
Strategies for Training Large Scale Neural Network Language Models Tomáš Mikolov #1, Anoop Deoras 2, Daniel Povey 3, Lukáš Burget #4, Jan Honza Černocký #5 # Brno University of Technology, Speech@FIT,
Turkish Radiology Dictation System
Turkish Radiology Dictation System Ebru Arısoy, Levent M. Arslan Boaziçi University, Electrical and Electronic Engineering Department, 34342, Bebek, stanbul, Turkey [email protected], [email protected]
Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition
, Lisbon Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition Wolfgang Macherey Lars Haferkamp Ralf Schlüter Hermann Ney Human Language Technology
OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH. Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane
OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane Carnegie Mellon University Language Technology Institute {ankurgan,fmetze,ahw,lane}@cs.cmu.edu
KL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION
KL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION Dong Yu 1, Kaisheng Yao 2, Hang Su 3,4, Gang Li 3, Frank Seide 3 1 Microsoft Research, Redmond,
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990
Unsupervised Language Model Adaptation for Automatic Speech Recognition of Broadcast News Using Web 2.0
Unsupervised Language Model Adaptation for Automatic Speech Recognition of Broadcast News Using Web 2.0 Tim Schlippe, Lukasz Gren, Ngoc Thang Vu, Tanja Schultz Cognitive Systems Lab, Karlsruhe Institute
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS PIERRE LANCHANTIN, ANDREW C. MORRIS, XAVIER RODET, CHRISTOPHE VEAUX Very high quality text-to-speech synthesis can be achieved by unit selection
LEARNING FEATURE MAPPING USING DEEP NEURAL NETWORK BOTTLENECK FEATURES FOR DISTANT LARGE VOCABULARY SPEECH RECOGNITION
LEARNING FEATURE MAPPING USING DEEP NEURAL NETWORK BOTTLENECK FEATURES FOR DISTANT LARGE VOCABULARY SPEECH RECOGNITION Ivan Himawan 1, Petr Motlicek 1, David Imseng 1, Blaise Potard 1, Namhoon Kim 2, Jaewon
AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language
AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language Hugo Meinedo, Diamantino Caseiro, João Neto, and Isabel Trancoso L 2 F Spoken Language Systems Lab INESC-ID
Building A Vocabulary Self-Learning Speech Recognition System
INTERSPEECH 2014 Building A Vocabulary Self-Learning Speech Recognition System Long Qin 1, Alexander Rudnicky 2 1 M*Modal, 1710 Murray Ave, Pittsburgh, PA, USA 2 Carnegie Mellon University, 5000 Forbes
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 Yidong Chen, Xiaodong Shi Institute of Artificial Intelligence Xiamen University P. R. China November 28, 2006 - Kyoto 13:46 1
CallSurf - Automatic transcription, indexing and structuration of call center conversational speech for knowledge extraction and query by content
CallSurf - Automatic transcription, indexing and structuration of call center conversational speech for knowledge extraction and query by content Martine Garnier-Rizet 1, Gilles Adda 2, Frederik Cailliau
Slovak Automatic Transcription and Dictation System for the Judicial Domain
Slovak Automatic Transcription and Dictation System for the Judicial Domain Milan Rusko 1, Jozef Juhár 2, Marian Trnka 1, Ján Staš 2, Sakhia Darjaa 1, Daniel Hládek 2, Miloš Cerňak 1, Marek Papco 2, Róbert
arxiv:1505.05899v1 [cs.cl] 21 May 2015
The IBM 2015 English Conversational Telephone Speech Recognition System George Saon, Hong-Kwang J. Kuo, Steven Rennie and Michael Picheny IBM T. J. Watson Research Center, Yorktown Heights, NY, 10598 [email protected]
IBM Research Report. Scaling Shrinkage-Based Language Models
RC24970 (W1004-019) April 6, 2010 Computer Science IBM Research Report Scaling Shrinkage-Based Language Models Stanley F. Chen, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, Abhinav Sethy IBM Research
7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan
7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan We explain field experiments conducted during the 2009 fiscal year in five areas of Japan. We also show the experiments of evaluation
Evaluating grapheme-to-phoneme converters in automatic speech recognition context
Evaluating grapheme-to-phoneme converters in automatic speech recognition context Denis Jouvet, Dominique Fohr, Irina Illina To cite this version: Denis Jouvet, Dominique Fohr, Irina Illina. Evaluating
Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. Alessandro Vinciarelli, Samy Bengio and Horst Bunke
1 Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models Alessandro Vinciarelli, Samy Bengio and Horst Bunke Abstract This paper presents a system for the offline
Deep Neural Network Approaches to Speaker and Language Recognition
IEEE SIGNAL PROCESSING LETTERS, VOL. 22, NO. 10, OCTOBER 2015 1671 Deep Neural Network Approaches to Speaker and Language Recognition Fred Richardson, Senior Member, IEEE, Douglas Reynolds, Fellow, IEEE,
Slovak Automatic Dictation System for Judicial Domain
Slovak Automatic Dictation System for Judicial Domain Milan Rusko 1(&), Jozef Juhár 2, Marián Trnka 1, Ján Staš 2, Sakhia Darjaa 1, Daniel Hládek 2, Róbert Sabo 1, Matúš Pleva 2, Marián Ritomský 1, and
Generating Training Data for Medical Dictations
Generating Training Data for Medical Dictations Sergey Pakhomov University of Minnesota, MN [email protected] Michael Schonwetter Linguistech Consortium, NJ [email protected] Joan Bachenko
Estonian Large Vocabulary Speech Recognition System for Radiology
Estonian Large Vocabulary Speech Recognition System for Radiology Tanel Alumäe, Einar Meister Institute of Cybernetics Tallinn University of Technology, Estonia October 8, 2010 Alumäe, Meister (TUT, Estonia)
German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings
German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings Haojin Yang, Christoph Oehlke, Christoph Meinel Hasso Plattner Institut (HPI), University of Potsdam P.O. Box
Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine George E. Dahl, Marc Aurelio Ranzato, Abdel-rahman Mohamed, and Geoffrey Hinton Department of Computer Science University of Toronto
Effect of Captioning Lecture Videos For Learning in Foreign Language 外 国 語 ( 英 語 ) 講 義 映 像 に 対 する 字 幕 提 示 の 理 解 度 効 果
Effect of Captioning Lecture Videos For Learning in Foreign Language VERI FERDIANSYAH 1 SEIICHI NAKAGAWA 1 Toyohashi University of Technology, Tenpaku-cho, Toyohashi 441-858 Japan E-mail: {veri, nakagawa}@slp.cs.tut.ac.jp
Establishing the Uniqueness of the Human Voice for Security Applications
Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 Establishing the Uniqueness of the Human Voice for Security Applications Naresh P. Trilok, Sung-Hyuk Cha, and Charles C.
MISSING FEATURE RECONSTRUCTION AND ACOUSTIC MODEL ADAPTATION COMBINED FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
MISSING FEATURE RECONSTRUCTION AND ACOUSTIC MODEL ADAPTATION COMBINED FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION Ulpu Remes, Kalle J. Palomäki, and Mikko Kurimo Adaptive Informatics Research Centre,
tance alignment and time information to create confusion networks 1 from the output of different ASR systems for the same
1222 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 7, SEPTEMBER 2008 System Combination for Machine Translation of Spoken and Written Language Evgeny Matusov, Student Member,
Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification
Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification Raphael Ullmann 1,2, Ramya Rasipuram 1, Mathew Magimai.-Doss 1, and Hervé Bourlard 1,2 1 Idiap Research Institute,
SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande
Advanced Signal Processing and Digital Noise Reduction
Advanced Signal Processing and Digital Noise Reduction Saeed V. Vaseghi Queen's University of Belfast UK WILEY HTEUBNER A Partnership between John Wiley & Sons and B. G. Teubner Publishers Chichester New
Perplexity Method on the N-gram Language Model Based on Hadoop Framework
94 International Arab Journal of e-technology, Vol. 4, No. 2, June 2015 Perplexity Method on the N-gram Language Model Based on Hadoop Framework Tahani Mahmoud Allam 1, Hatem Abdelkader 2 and Elsayed Sallam
Speech: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction
: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction Urmila Shrawankar Dept. of Information Technology Govt. Polytechnic, Nagpur Institute Sadar, Nagpur 440001 (INDIA)
SOME INSIGHTS FROM TRANSLATING CONVERSATIONAL TELEPHONE SPEECH. Gaurav Kumar, Matt Post, Daniel Povey, Sanjeev Khudanpur
SOME INSIGHTS FROM TRANSLATING CONVERSATIONAL TELEPHONE SPEECH Gaurav Kumar, Matt Post, Daniel Povey, Sanjeev Khudanpur Center for Language and Speech Processing & Human Language Technology Center of Excellence
Word Completion and Prediction in Hebrew
Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology
Segmentation and Punctuation Prediction in Speech Language Translation Using a Monolingual Translation System
Segmentation and Punctuation Prediction in Speech Language Translation Using a Monolingual Translation System Eunah Cho, Jan Niehues and Alex Waibel International Center for Advanced Communication Technologies
Robust Methods for Automatic Transcription and Alignment of Speech Signals
Robust Methods for Automatic Transcription and Alignment of Speech Signals Leif Grönqvist ([email protected]) Course in Speech Recognition January 2. 2004 Contents Contents 1 1 Introduction 2 2 Background
Transcription System Using Automatic Speech Recognition for the Japanese Parliament (Diet)
Proceedings of the Twenty-Fourth Innovative Appications of Artificial Intelligence Conference Transcription System Using Automatic Speech Recognition for the Japanese Parliament (Diet) Tatsuya Kawahara
CALL-TYPE CLASSIFICATION AND UNSUPERVISED TRAINING FOR THE CALL CENTER DOMAIN
CALL-TYPE CLASSIFICATION AND UNSUPERVISED TRAINING FOR THE CALL CENTER DOMAIN Min Tang, Bryan Pellom, Kadri Hacioglu Center for Spoken Language Research University of Colorado at Boulder Boulder, Colorado
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION P. Vanroose Katholieke Universiteit Leuven, div. ESAT/PSI Kasteelpark Arenberg 10, B 3001 Heverlee, Belgium [email protected]
LARGE SCALE DEEP NEURAL NETWORK ACOUSTIC MODELING WITH SEMI-SUPERVISED TRAINING DATA FOR YOUTUBE VIDEO TRANSCRIPTION
LARGE SCALE DEEP NEURAL NETWORK ACOUSTIC MODELING WITH SEMI-SUPERVISED TRAINING DATA FOR YOUTUBE VIDEO TRANSCRIPTION Hank Liao, Erik McDermott, and Andrew Senior Google Inc. {hankliao, erikmcd, andrewsenior}@google.com
Collaborative Machine Translation Service for Scientific texts
Collaborative Machine Translation Service for Scientific texts Patrik Lambert [email protected] Jean Senellart Systran SA [email protected] Laurent Romary Humboldt Universität Berlin
Automated Transcription of Conversational Call Center Speech with Respect to Non-verbal Acoustic Events
Automated Transcription of Conversational Call Center Speech with Respect to Non-verbal Acoustic Events Gellért Sárosi 1, Balázs Tarján 1, Tibor Fegyó 1,2, and Péter Mihajlik 1,3 1 Department of Telecommunication
Language technologies for Education: recent results by the MLLP group
Language technologies for Education: recent results by the MLLP group Alfons Juan 2nd Internet of Education Conference 2015 18 September 2015, Sarajevo Contents The MLLP research group 2 translectures
Input Support System for Medical Records Created Using a Voice Memo Recorded by a Mobile Device
International Journal of Signal Processing Systems Vol. 3, No. 2, December 2015 Input Support System for Medical Records Created Using a Voice Memo Recorded by a Mobile Device K. Kurumizawa and H. Nishizaki
SPEAKER IDENTIFICATION FROM YOUTUBE OBTAINED DATA
SPEAKER IDENTIFICATION FROM YOUTUBE OBTAINED DATA Nitesh Kumar Chaudhary 1 and Shraddha Srivastav 2 1 Department of Electronics & Communication Engineering, LNMIIT, Jaipur, India 2 Bharti School Of Telecommunication,
Developing an Isolated Word Recognition System in MATLAB
MATLAB Digest Developing an Isolated Word Recognition System in MATLAB By Daryl Ning Speech-recognition technology is embedded in voice-activated routing systems at customer call centres, voice dialling
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN
PAGE 30 Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN Sung-Joon Park, Kyung-Ae Jang, Jae-In Kim, Myoung-Wan Koo, Chu-Shik Jhon Service Development Laboratory, KT,
DynaSpeak: SRI s Scalable Speech Recognizer for Embedded and Mobile Systems
DynaSpeak: SRI s Scalable Speech Recognizer for Embedded and Mobile Systems Horacio Franco, Jing Zheng, John Butzberger, Federico Cesari, Michael Frandsen, Jim Arnold, Venkata Ramana Rao Gadde, Andreas
Weighting and Normalisation of Synchronous HMMs for Audio-Visual Speech Recognition
ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing 27 (AVSP27) Hilvarenbeek, The Netherlands August 31 - September 3, 27 Weighting and Normalisation of Synchronous HMMs for
Factored Language Model based on Recurrent Neural Network
Factored Language Model based on Recurrent Neural Network Youzheng Wu X ugang Lu Hi toshi Yamamoto Shi geki M atsuda Chiori Hori Hideki Kashioka National Institute of Information and Communications Technology
Research Article Distributed Data Mining Based on Deep Neural Network for Wireless Sensor Network
Distributed Sensor Networks Volume 2015, Article ID 157453, 7 pages http://dx.doi.org/10.1155/2015/157453 Research Article Distributed Data Mining Based on Deep Neural Network for Wireless Sensor Network
Speech Transcription
TC-STAR Final Review Meeting Luxembourg, 29 May 2007 Speech Transcription Jean-Luc Gauvain LIMSI TC-STAR Final Review Luxembourg, 29-31 May 2007 1 What Is Speech Recognition? Def: Automatic conversion
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,
Modified MMI/MPE: A Direct Evaluation of the Margin in Speech Recognition
Modified MMI/MPE: A Direct Evaluation of the Margin in Speech Recognition Georg Heigold [email protected] Thomas Deselaers [email protected] Ralf Schlüter [email protected]
