KL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION
Dong Yu 1, Kaisheng Yao 2, Hang Su 3,4, Gang Li 3, Frank Seide 3

1 Microsoft Research, Redmond, WA 98052, USA
2 Online Services Division, Microsoft Corporation, Redmond, WA 98052, USA
3 Microsoft Research Asia, Beijing, China
4 Tsinghua University, Beijing, China

ABSTRACT

We propose a novel regularized adaptation technique for context-dependent deep neural network hidden Markov models (CD-DNN-HMMs). The CD-DNN-HMM has a large output layer and many large hidden layers, each with thousands of neurons. The huge number of parameters in the CD-DNN-HMM makes adaptation a challenging task, especially when the adaptation set is small. The technique developed in this paper adapts the model conservatively by forcing the senone distribution estimated from the adapted model to be close to that from the unadapted model. This constraint is realized by adding Kullback-Leibler divergence (KLD) regularization to the adaptation criterion. We show that applying this regularization is equivalent to changing the target distribution in the conventional backpropagation algorithm. Experiments on Xbox voice search, short message dictation, and Switchboard and lecture speech transcription tasks demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against already very strong speaker-independent CD-DNN-HMM systems, using adaptation sets of different sizes under both supervised and unsupervised adaptation setups.

Index Terms: deep neural network, CD-DNN-HMM, speaker adaptation, Kullback-Leibler divergence regularization

1. INTRODUCTION

Recently, context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) outperformed discriminatively trained Gaussian mixture model (GMM) HMMs with 16% [1][2] and 33% [3] relative error reduction on the Bing voice search [4] and Switchboard (SWB) tasks, respectively. Similar improvements have been observed on other tasks such as broadcast news, Google voice search, YouTube, and Aurora4 [5]-[11]. Most recently, Kingsbury et al. [8] showed that the speaker-independent (SI) CD-DNN-HMM can outperform a well-tuned, state-of-the-art speaker-adaptive GMM system by more than 10% relative error reduction on the SWB task by applying sequence-level discriminative training to the CD-DNN-HMM. The potential of CD-DNN-HMMs, however, is yet to be fully explored. In this paper we propose a regularized adaptation technique for DNNs to further improve the recognition accuracy of CD-DNN-HMMs.

The CD-DNN-HMM is a special case of the artificial neural network (ANN) HMM hybrid system developed in the 1990s, for which several adaptation techniques have been developed. These techniques can be classified into the categories of linear transformation [12]-[20], conservative training [21]-[23], and subspace methods [24]. However, compared to the earlier ANN/HMM hybrid systems, CD-DNN-HMMs have significantly more parameters due to the wider and deeper hidden layers used and the much larger output layer designed to model senones (tied-triphone states) directly. This difference poses additional challenges to adapting CD-DNN-HMMs, especially when the adaptation set is small.

In this paper we propose a novel regularized adaptation technique for DNNs. Our proposed technique adapts the model conservatively by forcing the senone distribution estimated from the adapted model to be close to that estimated from the unadapted model.
This constraint is realized by adding Kullback-Leibler divergence (KLD) regularization to the adaptation criterion. We show that applying this regularization is equivalent to changing the target distribution in the conventional backpropagation (BP) algorithm. Experiments on Xbox voice search, short message dictation, and SWB and lecture transcription tasks demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against already very strong speaker-independent CD-DNN-HMM systems, using adaptation sets of different sizes under both supervised and unsupervised adaptation setups. This new technique also outperforms the feature discriminative linear regression (fdlr) technique proposed in [25] and the output-feature discriminative linear regression (odlr) technique proposed in [20].

The rest of the paper is organized as follows. In Section 2 we briefly review CD-DNN-HMMs. In Section 3 we describe our proposed regularized adaptation algorithm. In Section 4 related work is briefly introduced. We report the experimental results in Section 5 and conclude the paper in Section 6.

2. CD-DNN-HMM

The CD-DNN-HMM [1][3] is a special ANN/HMM hybrid system in which DNNs take the place of the shallow ANNs and are used to model senones directly. The DNN accepts an input observation $o$, which typically consists of 9-13 frames of acoustic features, and processes it through many layers of nonlinear transformation

$$v^{\ell} = \sigma\bigl(z^{\ell}(v^{\ell-1})\bigr), \qquad 0 < \ell < L, \qquad (1)$$

where $W^{\ell}$ and $a^{\ell}$ are the weight matrix and bias, respectively, at hidden layer $\ell$, $v^{\ell}_i$ is the output of the $i$-th neuron,

$$z^{\ell}(v^{\ell-1}) = W^{\ell} v^{\ell-1} + a^{\ell} \qquad (2)$$

is the excitation vector given input $v^{\ell-1}$ (with $v^{\ell-1} = o$ when $\ell = 1$), and $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid function applied element-wise. At the top layer $L$, the softmax function

$$p(y = s \mid o) = \operatorname{softmax}_s\bigl(z^{L}(v^{L-1})\bigr) = \frac{\exp\bigl(z^{L}_{s}(v^{L-1})\bigr)}{\sum_{s'} \exp\bigl(z^{L}_{s'}(v^{L-1})\bigr)} \qquad (3)$$

is used to estimate the state posterior probability $p(y = s \mid o)$, which is converted to the HMM state emission probability as

$$p(o \mid s) = \frac{p(y = s \mid o)\, p(o)}{p(s)} \propto \frac{p(y = s \mid o)}{p(s)}, \qquad (4)$$

where $s \in \{1, \ldots, S\}$ is the senone id, $S$ is the total number of senones, $p(s)$ is the prior probability of senone $s$, and $p(o)$ is independent of the state.
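To make eqs. (1)-(4) concrete, the sketch below (illustrative layer sizes and variable names, not the configuration of any system reported in this paper) runs a forward pass through a sigmoid DNN, applies the softmax output layer, and converts the senone posteriors into the scaled likelihoods used by the HMM decoder:

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic nonlinearity of eq. (1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Softmax over the output-layer excitation, eq. (3).
    z = z - np.max(z)                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def dnn_forward(o, weights, biases):
    """Map an observation o to senone posteriors p(y = s | o).

    weights[l], biases[l] play the roles of W^l and a^l in eq. (2);
    the last pair belongs to the softmax output layer.
    """
    v = o                                   # v^0 = o, the stacked acoustic frames
    for W, a in zip(weights[:-1], biases[:-1]):
        v = sigmoid(W @ v + a)              # hidden layers, eqs. (1)-(2)
    return softmax(weights[-1] @ v + biases[-1])

def emission_scores(posteriors, senone_priors):
    # Eq. (4): scaled likelihood p(o | s) proportional to p(s | o) / p(s).
    return posteriors / senone_priors

# Illustrative sizes: 9 stacked 36-dim frames in, 1509 senones out; the
# 512-unit hidden layers are placeholders, not the paper's actual sizes.
rng = np.random.default_rng(0)
dims = [9 * 36, 512, 512, 512, 1509]
Ws = [0.01 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
post = dnn_forward(rng.standard_normal(dims[0]), Ws, bs)
scores = emission_scores(post, np.full(1509, 1.0 / 1509))
```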
The parameters of DNNs are typically trained to maximize the negative cross entropy

$$D = \frac{1}{N} \sum_{t=1}^{N} \sum_{y} \tilde{p}(y \mid x_t) \log p(y \mid x_t), \qquad (5)$$

where $N$ is the number of samples in the training set and $\tilde{p}(y \mid x_t)$ is the target probability. Oftentimes we use a hard alignment from an existing system as the training label, in which case $\tilde{p}(y \mid x_t) = \delta(y = s_t)$, where $\delta$ is the Kronecker delta and $s_t$ is the label of the $t$-th sample, i.e., the $t$-th observation frame in our training corpus. The training is carried out using the BP algorithm, sped up with GPUs and minibatch updates.

3. REGULARIZED ADAPTATION

An obvious approach to adapting DNNs is adjusting all the DNN parameters with the adaptation data, starting from the SI model. However, doing so may destroy previously learned information and overfit the adaptation data, especially if the adaptation set is small. To prevent this from happening, adaptation needs to be done conservatively. The technique proposed here does exactly this.

The intuition behind our proposed approach is that the posterior senone distribution estimated from the adapted model should not deviate too far from that estimated using the unadapted model, especially when the adaptation set is small. Since the DNN outputs are probability distributions, a natural choice for measuring the deviation is the Kullback-Leibler divergence (KLD). By adding this divergence as a regularization term to eq. (5) and removing the terms unrelated to the model parameters, we obtain the regularized optimization criterion

$$\hat{D} = (1-\rho)\,\frac{1}{N} \sum_{t=1}^{N} \sum_{y} \tilde{p}(y \mid x_t) \log p(y \mid x_t) + \rho\,\frac{1}{N} \sum_{t=1}^{N} \sum_{y} p^{SI}(y \mid x_t) \log p(y \mid x_t), \qquad (6)$$

where $p^{SI}(y \mid x_t)$ is the posterior probability estimated from the SI model, computed with a forward pass through the SI model, and $\rho$ is the regularization weight. Eq. (6) can be reorganized as

$$\hat{D} = \frac{1}{N} \sum_{t=1}^{N} \sum_{y} \hat{p}(y \mid x_t) \log p(y \mid x_t), \qquad (7)$$

where we have defined

$$\hat{p}(y \mid x_t) = (1-\rho)\, \tilde{p}(y \mid x_t) + \rho\, p^{SI}(y \mid x_t). \qquad (8)$$

By comparing eqs. (5) and (7) we can see that applying the KLD regularization to the original training criterion is equivalent to changing the target probability distribution from $\tilde{p}(y \mid x_t)$ to $\hat{p}(y \mid x_t)$, which is a linear interpolation of the distribution estimated from the unadapted model and the ground-truth alignment of the adaptation data. This interpolation prevents overtraining by keeping the adapted model from straying too far from the SI model. Note that this differs from L2 regularization [23], which constrains the model parameters themselves rather than the output probabilities. This also indicates that the normal BP algorithm can be directly used to adapt the DNN. The only thing that needs to be changed is the error signal at the output layer, which is now defined based on $\hat{p}(y \mid x_t)$ instead of $\tilde{p}(y \mid x_t)$.

The interpolation weight, which is directly derived from the regularization weight $\rho$, can be adjusted, typically using a development set, based on the size of the adaptation set, the learning rate used, and whether the adaptation is supervised or unsupervised. When $\rho = 1$, we trust the unadapted model completely and ignore all new information from the adaptation data. When $\rho = 0$, we adapt the model solely on the adaptation set, ignoring information from the unadapted model except for using it as the starting point. Intuitively, we should use a large $\rho$ for a small adaptation set and a small $\rho$ for a large adaptation set.
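In implementation terms, eqs. (6)-(8) mean that standard BP code can be reused unchanged once the frame targets are replaced by the interpolated distribution $\hat{p}$. The sketch below is a minimal illustration of that target construction and of the resulting softmax/cross-entropy error signal at the output layer; the SI posteriors, labels, and $\rho$ value are placeholders, and the weight updates below the output layer follow ordinary BP.

```python
import numpy as np

def kld_regularized_targets(si_posteriors, labels, rho):
    """Interpolated targets of eq. (8): (1 - rho) * one-hot + rho * SI posterior.

    si_posteriors: (N, S) senone posteriors from a forward pass of the
                   unadapted speaker-independent (SI) model.
    labels:        (N,)  hard senone alignment of the adaptation frames.
    rho:           regularization weight; rho = 1 trusts the SI model completely,
                   rho = 0 ignores it except as the starting point.
    """
    one_hot = np.zeros_like(si_posteriors)
    one_hot[np.arange(len(labels)), labels] = 1.0       # Kronecker-delta targets
    return (1.0 - rho) * one_hot + rho * si_posteriors  # eq. (8)

def output_layer_error(adapted_posteriors, targets):
    # For a softmax output layer trained with criterion (7), the error signal
    # backpropagated from the top layer is simply (posterior - target).
    return adapted_posteriors - targets

# Toy example with 4 adaptation frames and 6 senones.
rng = np.random.default_rng(1)
p_si = rng.dirichlet(np.ones(6), size=4)       # stand-in for SI-model posteriors
p_adapted = rng.dirichlet(np.ones(6), size=4)  # stand-in for the current adapted model
labels = rng.integers(0, 6, size=4)

p_hat = kld_regularized_targets(p_si, labels, rho=0.5)
err = output_layer_error(p_adapted, p_hat)     # fed into standard BP unchanged
```

Everything below the output layer is adapted exactly as in standard BP; only the target $\hat{p}$, and hence the top-layer error signal, changes.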
4. RELATED WORK

Many ANN adaptation techniques have been developed in the past. These techniques can be classified into three categories: linear transformation, conservative training, and subspace methods.

4.1 Linear Transformation

The simplest and most popular approach to adapting ANNs is applying a linear transformation, either to the input features (as in the linear input network (LIN) [12]-[15], [17]-[19] and the very similar feature discriminative linear regression (fdlr) [25]), to the activation of a hidden layer (as in the linear hidden network (LHN) [16]), or to the softmax layer (as in the linear output network (LON) [15] and the output-feature discriminative linear regression (odlr) [20]). No matter where the linear transformation is applied, it is typically initialized to an identity weight matrix and zero bias and trained to optimize the criterion specified in eq. (5), keeping the weights of the original ANN fixed (a sketch of this setup is given at the end of this section).

4.2 Conservative Training

Another popular category of adaptation techniques is conservative training (CT) [22]. CT can be achieved by adding regularization terms to the adaptation criterion; for example, in [23] L2 regularization was used. The KLD regularization approach proposed in this paper is another regularization technique. An alternative CT technique is adapting only selected weights; e.g., in [21] only the weights connected to the hidden nodes with maximum variance (computed on the adaptation data) are adapted. Adaptation with a very small learning rate and early stopping can also be considered CT.

4.3 Subspace Methods

Subspace methods aim to find a speaker subspace and then construct the adapted ANN weights or transformations as a point in that subspace. For example, in [24] a subspace is used to estimate an affine transformation matrix by considering it as a random variable. Principal component analysis (PCA) is performed on a set of adaptation matrices to obtain the principal directions (i.e., eigenvectors) in the speaker space. Each new speaker's adaptation model is then approximated by a linear combination of the retained eigenvectors, and the linear combination weights can be estimated using BP. In [26] a speaker subspace and a speech subspace are estimated and combined using tensors. The deep tensor neural network developed in [27] is also a subspace method.
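As a rough illustration of the linear input transformation family of Section 4.1 (LIN/fdlr), the sketch below prepends a speaker-specific affine transform, initialized to the identity matrix and a zero bias, to a frozen SI network and updates only that transform with plain SGD; this is a simplified stand-in under those stated assumptions, not the exact recipe of [12]-[20] or [25].

```python
import numpy as np

class LinearInputNetwork:
    """Speaker-specific affine transform o' = A o + c inserted below a frozen SI DNN."""

    def __init__(self, input_dim):
        self.A = np.eye(input_dim)    # identity initialization
        self.c = np.zeros(input_dim)  # zero-bias initialization

    def transform(self, o):
        return self.A @ o + self.c

    def sgd_step(self, o, grad_transformed_input, lr=1e-5):
        # grad_transformed_input is d(loss)/d(o'), obtained by backpropagating the
        # cross-entropy error of eq. (5) through the *frozen* SI DNN.
        # Only A and c are updated; the original ANN weights never change.
        self.A -= lr * np.outer(grad_transformed_input, o)
        self.c -= lr * grad_transformed_input


# Usage sketch: for each adaptation frame o, feed lin.transform(o) into the
# unchanged SI DNN, backpropagate the error down to the transformed input,
# and update the transform alone.
lin = LinearInputNetwork(input_dim=792)
o = np.random.randn(792)
fake_grad = np.random.randn(792)      # placeholder for the backpropagated gradient
lin.sgd_step(o, fake_grad)
```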
5. EMPIRICAL EVALUATION

To evaluate the proposed KLD-Reg approach, we conducted a series of experiments on four different datasets. To obtain the results reported in this section we adapted all parameters in the DNNs, which always outperformed the setup where only the softmax layer is adapted on these datasets.

5.1 Xbox Voice Search

Our initial set of experiments was conducted on a small internal Xbox voice search (VS) task, which features distant-talking voice search of music catalogs, games, movies, and so on. We selected this dataset because it was used in our previous study, so that we can compare the KLD-Reg approach with fdlr [25], which performs slightly better than odlr [20] on the same dataset. The features were 13-dimensional Mel filter-bank cepstral coefficients (MFCC) with up to third derivatives, further transformed to 36 dimensions with heterogeneous linear discriminant analysis (HLDA). Per-device cepstral mean subtraction was applied.

The baseline SI models were trained using 40 hours of voice search data. The GMM-HMM model had 70k Gaussian components and 1509 senones optimized with the standard maximum likelihood estimation (MLE) procedure. The SI CD-DNN-HMM system used a 9-frame input layer, three hidden layers, and a 1509-neuron output layer. The DNN system was trained using the frame-level cross-entropy criterion and the senone alignment generated from the MLE system. The trigram language model (LM) used in evaluating both SI and adapted models was trained on the transcriptions of all the data, including the test set. This is obviously suboptimal, but was needed due to the unfortunate nature of the dataset, which has a significant out-of-vocabulary (OOV) rate; we basically trade skewing the result due to OOVs for skewing the result due to having the test set inside the LM (an overly strong LM should make the results tend toward a lower bound for the effectiveness of acoustic adaptation). Thus, when interpreting the results in this section, please be aware of this experimental shortcoming, and note that the same limitation also applies to previously reported fdlr/odlr results on this dataset. The other datasets reported in this paper do not have this problem.

The adaptation experiments were conducted on 6 speakers (excluded from the training set), for whom we have over 300 utterances, the newest of which were used as test utterances. The total number of tested words is [...]. Without adaptation, the GMM system achieved 43.6% word error rate (WER) and the DNN system obtained 34.1% WER. The goal of the experiments is to improve the recognition accuracy using each speaker's past data. We pretended we had 200, 100, 50, 25, 10, and 5 past utterances available for adaptation for each speaker by sampling from his/her past utterances; 200 utterances and 5 utterances roughly correspond to 12 minutes and 18 seconds of audio, respectively, in this dataset. In all these experiments we used the adaptation strategy optimized for fdlr, as demonstrated in [20], with 10 passes over the data and a learning rate of 1x10^-5 per sample.

Figure 1 compares the WERs obtained using the supervised fdlr and KLD-Reg adaptation techniques. From this figure we can make three observations. First, KLD-Reg helps, especially when the adaptation set is small. In fact, when only 5 or 10 utterances were used for adaptation, the KLD-Reg adapted model degraded the accuracy when the regularization weight was too small; however, once the regularization weight was increased beyond a certain value, the adapted model consistently outperformed the SI model.
Second, the WER reduction is robust to the choice of ρ once it is within the right range. In addition, when the adaptation set is large (e.g., 200 utterances), good WER reduction can be obtained even with a very small ρ. Third, KLD-Reg outperformed fdlr across all adaptation set sizes. The 5-1 cross validation indicates that the average relative WER reductions using supervised KLD-Reg are 20.7%, 17.7%, 17.5%, 11.1%, 7.0%, and 5.3% with 200, 100, 50, 25, 10, and 5 utterances of adaptation data, respectively.

Figure 1: WERs on the Xbox voice search dataset for supervised fdlr and KLD-Reg adaptation. Numbers in parentheses are regularization weights ρ. The dashed line is the baseline WER using the SI DNN model. Note the remark in the text on the LM used.

5.2 Short Message Dictation

The second experiment was conducted on a short message dictation (SMD) task, which has longer utterances than VS. The goal is to see whether KLD-Reg is effective on other tasks. The baseline SI models were trained using 300 hours of voice search and SMD data. The evaluation was conducted on data from 9 speakers, out of which 2 were used as the development set and 7 were used as the test set. The total number of test-set words is [...]. There is no overlap among the training, adaptation, and test sets.

The SI GMM-HMM acoustic model has approximately 288k Gaussian components and 5976 senones trained with the MLE procedure, followed by fMPE and BMMI. The baseline SI CD-DNN-HMM system used 24 log filter-bank features with up to second derivatives and a context window of 11 frames, forming a 792-dimensional (72x11) input vector (see the stacking sketch below). On top of the input layer there are 5 hidden layers with 2048 neurons each, and the output layer has 5976 neurons, one per senone. The DNN system was trained using the senone alignments from the GMM-HMM system.
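A minimal sketch of that context-window stacking follows, assuming the per-frame vector is the 72-dimensional static+delta+delta-delta log filter-bank feature and that utterance boundaries are padded by repeating the edge frames (an edge-handling assumption; the paper does not specify it):

```python
import numpy as np

def stack_context(features, context=5):
    """Stack each frame with +/- `context` neighbouring frames.

    features: (T, D) array of per-frame features, e.g. D = 72.
    Returns:  (T, D * (2*context + 1)) array, e.g. 72 * 11 = 792 dims.
    """
    T, D = features.shape
    padded = np.vstack([np.repeat(features[:1], context, axis=0),   # repeat first frame
                        features,
                        np.repeat(features[-1:], context, axis=0)]) # repeat last frame
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

utterance = np.random.randn(300, 72)   # 300 frames of 72-dim features
inputs = stack_context(utterance)      # shape (300, 792), one DNN input per frame
```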
The baseline SI GMM-HMM system and CD-DNN-HMM system achieved 30.4% and 23.4% WER, respectively, on the 7-speaker test set. As in the VS experiment, we varied the number of adaptation utterances from 5 (32 seconds) to 200 (22 minutes). Different from the previous experiments, though, we used the 2-speaker development set to determine the learning rate (4x10^-5 per sample) and applied it to the 7-speaker test set.

Figure 2 (a) and (b) summarize the WERs on the SMD dataset using supervised and unsupervised KLD-Reg adaptation, respectively. From these figures we can observe that KLD-Reg is very effective on this task as well. With the optimal regularization weights determined on the development set we get 30.3%, 25.2%, 18.6%, 12.6%, 8.8%, and 5.6% relative WER reduction using supervised adaptation, and 14.6%, 11.7%, 8.6%, 5.8%, 4.1%, and 2.5% using unsupervised adaptation, with 200, 100, 50, 25, 10, and 5 utterances of adaptation data, respectively. These figures also show more clearly (since the learning rate was optimized using a dev set) that larger regularization weights should be used for smaller adaptation sets and smaller regularization weights for larger adaptation sets, although the error reductions are quite robust as long as the regularization weight is within the right range on this task. Compared to supervised adaptation, the unsupervised setup benefits from larger regularization weights. We believe this is because the labels in the unsupervised adaptation setup are less reliable, and thus we should trust the output from the SI model even more during adaptation.

Figure 2: WERs on the SMD dataset using KLD-Reg adaptation for different regularization weights ρ (numbers in parentheses); (a) supervised adaptation, (b) unsupervised adaptation. The dashed line is the SI DNN baseline.

5.3 Switchboard and Lecture Transcription

In the SWB task we use unsupervised adaptation on each test conversation side (45 segments, or about 3 minutes) to improve the recognition accuracy on that conversation itself. The SI DNN model was trained using the 309-hour SWB-I training set with exactly the same configuration (11 frames of input features, seven 2048-neuron hidden layers, and a 9.3k-neuron output layer) as described in [3][25]. The evaluation was conducted on the 1831-segment SWB part of the NIST 2000 Hub5 eval set. On this task we achieved on average a 2.7% relative WER reduction, which is slightly better than the 2.0% obtained with fdlr [25]. The 2.7% reduction, however, is only half of that achieved on the SMD task using a similar amount of adaptation data (3 minutes, or 25 utterances, in SMD). This might be because the SI DNN in the SMD task was trained with a mixture of VS and SMD data.

Our last experiment was conducted on a lecture transcription task where enough adaptation data (6 lectures, 3.8 hours total) were available. The SI DNN model was trained with the 2000-hour SWB data and has seven 2048-neuron hidden layers and an 18k-neuron output layer. The development and test sets are each a single lecture, containing [...] and 8481 words, respectively. Table 1 summarizes the experimental results and indicates that good improvement can be achieved with the proposed adaptation technique. For comparison, we also optimized a speaker-dependent (SD) CD-DNN-HMM with the 6 lectures (with transcriptions) used for adaptation. On this task, the SD model outperformed the DNN baseline trained using 2000 hours of (mismatched) SWB data on both the development and test sets; however, it underperformed the adapted models. This confirms the effectiveness of the proposed adaptation technique.

Table 1. WER and relative WER reduction (in parentheses) on the lecture transcription task.

            SI DNN   Supervised       Unsupervised    SD DNN
Dev Set     16.0%    14.3% (10.6%)    14.9% (6.9%)    15.0%
Test Set    20.9%    19.1% (8.6%)     19.4% (7.2%)    20.2%

6. CONCLUSION AND DISCUSSION

In this paper we have proposed a novel conservative adaptation technique for DNNs. The basic idea of our approach is to force the senone posterior probability estimated from the adapted model not to deviate too much from that estimated using the unadapted model. To achieve this goal we applied a KLD regularization term to the adaptation criterion. We showed that this is equivalent to adjusting the target distribution of the training samples and thus can be easily incorporated into the existing BP training procedure.
Experiments on four datasets demonstrated the superiority of our approach, with 2%-30% relative WER reduction against already very strong CD-DNN-HMM systems under various adaptation setups.

We believe research on DNN adaptation has just started. A simple extension of this work would be to build a regression model, trained on many different tasks and adaptation set sizes, for determining the regularization weight; this would make it easier to apply the KLD-Reg method to new tasks. Alternatively, Bayesian optimization techniques might be used to search for the best hyper-parameters (which we did not optimize aggressively in our experiments) for different adaptation tasks [28]. Another direction of research is to build variable-parameter DNNs similar to the variable-parameter HMMs [29], in which the model parameters are functions of some control parameters such as the signal-to-noise ratio; the strength of such a model is its ability to adjust the model parameters automatically based on even continuous control parameters. A fourth direction is to factorize the effects of the speaker, environment, and channel on the model parameters so that we can find subspaces of each and combine them to estimate a new adapted model even for a speaker-environment combination that has never been observed before. This is motivated by the fact that the adapted models in the lecture transcription task, although they outperformed the unadapted DNN on both the development and test sets, increased the WER from 14.2% (baseline) to 16.0% (supervised) and 14.6% (unsupervised) when applied to the same speaker under different channel and environment conditions (mismatched to the adaptation data). The last possible direction is to use the speaker adaptive training technique [30] to separate the canonical model and the adaptation model.

7. ACKNOWLEDGEMENT

We would like to thank Drs. Yifan Gong, Li Deng, Alex Acero, Mike Seltzer, Qiang Huo, and Jasha Droppo at Microsoft Corporation for valuable discussions and suggestions.
8. REFERENCES

[1] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition," IEEE Trans. on Audio, Speech, and Language Processing, vol. 20, no. 1.
[2] D. Yu, L. Deng, and G. Dahl, "Roles of pre-training and fine-tuning in context-dependent DNN-HMMs for real-world speech recognition," in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
[3] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proc. Interspeech'11.
[4] D. Yu, Y.-C. Ju, Y.-Y. Wang, G. Zweig, and A. Acero, "Automated directory assistance system - from theory to practice," in Proc. Interspeech'07.
[5] A. Mohamed, G. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. on Audio, Speech, and Language Processing, vol. 20, no. 1.
[6] N. Jaitly, P. Nguyen, and V. Vanhoucke, "Application of pretrained deep neural networks to large vocabulary speech recognition," in Proc. Interspeech'12.
[7] T. N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, and A.-r. Mohamed, "Making deep belief networks effective for large vocabulary continuous speech recognition," in Proc. ASRU'11.
[8] B. Kingsbury, T. N. Sainath, and H. Soltau, "Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization," in Proc. Interspeech'12.
[9] H. Su, G. Li, D. Yu, and F. Seide, "Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription," in Proc. ICASSP.
[10] M. Seltzer, D. Yu, and Y. Wang, "An investigation of deep neural networks for noise robust speech recognition," in Proc. ICASSP.
[11] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine.
[12] V. Abrash, H. Franco, A. Sankar, and M. Cohen, "Connectionist speaker normalization and adaptation," in Proc. EUROSPEECH'95.
[13] J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, and T. Robinson, "Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system," in Proc. EUROSPEECH'95.
[14] D. Albesano, R. Gemello, and F. Mana, "Hybrid HMM-NN modeling of stationary-transitional units for continuous speech recognition," in Proc. NIPS'97.
[15] B. Li and K. C. Sim, "Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems," in Proc. Interspeech'10.
[16] R. Gemello, F. Mana, S. Scanzio, P. Laface, and R. De Mori, "Linear hidden transformations for adaptation of hybrid ANN/HMM models," Speech Communication, vol. 49, no. 10.
[17] X. Liu, M. J. F. Gales, and P. C. Woodland, "Improving LVCSR system combination using neural network language model cross adaptation," in Proc. Interspeech'11.
[18] Y. Xiao, Z. Zhang, S. Cai, J. Pan, and Y. Yan, "A initial attempt on task-specific adaptation for deep neural network-based large vocabulary continuous speech recognition," in Proc. Interspeech'12.
[19] J. Trmal, J. Zelinka, and L. Müller, "Adaptation of a feedforward artificial neural network using a linear transform," in Text, Speech and Dialogue, Springer Berlin/Heidelberg.
[20] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, "Adaptation of context-dependent deep neural networks for automatic speech recognition," in Proc. SLT'12.
Gong, "Adaptation of context-dependent deep neural networks for automatic speech recognition", in Proc. SLT 12, [21] J. Stadermann and G. Rigoll, Two-stage speaker adaptation of hybrid tied-posterior acoustic models, in Proc. ICASSP 05, vol. I, pp , [22] D. Albesano, R. Gemello, P. Laface, F. Mana, and S. Scanzio, "Adaptation of artificial neural networks avoiding catastrophic forgetting," in Proc. Int. Jnt. Conference on Neural Networks 2006, pp , [23] X. Li and J. Bilmes, Regularized adaptation of discriminative classifiers, in Proc. ICASSP 06, [24] S. Dupont and L. Cheboub, "Fast speaker adaptation of artificial neural networks for automatic speech recognition", in Proc. ICASSP 00, vol.3, pp , [25] F. Seide, G. Li, X. Chen, and D. Yu, Feature engineering in context-dependent deep neural networks for conversational speech transcription, in Proc. ASRU 11, [26] D. Yu, X. Chen, and L. Deng, "Factorized deep neural networks for adaptive speech recognition", International workshop on statistical machine learning for speech processing, [27] D. Yu, L. Deng, and F. Seide, "The deep tensor neural network with applications to large vocabulary speech recognition", IEEE Trans. on Audio, Speech, and Language Processing, [28] J. Snoek, H. Larochelle, and RP. Adams, "Practical bayesian optimization of machine learning algorithms," arxiv: , [29] D. Yu, L. Deng, Y. Gong, and A. Acero, "A novel framework and training algorithm for variable-parameter hidden markov models", IEEE Trans. on Audio, Speech, and Language Processing, vol 17, no. 7, pp , [30] J. Trmal, J. Zelinka, and L. Müller, "On speaker adaptive training of artificial neural networks", in Proc. Interspeech 10, pp ,