arxiv: v2 [cs.cl] 22 Jun 2016
|
|
|
- Andrea Crawford
- 9 years ago
- Views:
Transcription
1 The IBM 2016 English Conversational Telephone Speech Recognition System George Saon, Tom Sercu, Steven Rennie and Hong-Kwang J. Kuo IBM T. J. Watson Research Center, Yorktown Heights, NY, arxiv: v2 [cs.cl] 22 Jun 2016 Abstract We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub evaluation testset. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional long short-term memory nets which operate on FMLLR and i-vector features. On the language modeling side, we use an updated model M and hierarchical neural network LMs. Index Terms: recurrent neural networks, convolutional neural networks, conversational speech recognition 1. Introduction The landscape of neural network acoustic modeling is rapidly evolving. Spurred by the success of deep feed-forward neural nets for LVCSR in [1] and inspired by other research areas like image classification and natural language processing, many speech groups have looked at more sophisticated architectures such as deep convolutional nets [2, 3], deep recurrent nets [4], time-delay neural nets [5], and long-short term memory nets [6, 7, 8, 9]. The trend is to remove a lot of the complexity and human knowledge that was necessary in the past to build good ASR systems (e.g. speaker adaptation, phonetic context modeling, discriminative feature processing, etc.) and to replace them with a powerful neural network architecture that can be trained agnostically on a lot of data. With the advent of numerous neural network toolkits which can implement these sophisticated models out-of-the-box and powerful hardware based on GPUs, the barrier of entry for building high performing ASR systems has been lowered considerably. First case in point: front-end processing has been simplified considerably with the use of CNNs which treat the log-mel spectral representation as an image and don t require extra processing steps such as PLP cepstra, LDA, FMLLR, fmpe transforms, etc. Second case in point: end-to-end ASR systems such as [6, 8, 7] bypass the need of having phonetic context decision trees and HMMs altogether and directly map the sequence of acoustic features to a sequence of characters or context independent phones. Third case in point: training algorithms such as connectionist temporal classification [10] don t require an initial alignment of the training data which is typically done with a GMM-based baseline model. The above points beg the question whether, in this age of readily available NN toolkits, speech recognition expertise is still necessary or whether one can simply point a neural net to the audio and transcripts, let it train, and obtain a good acoustic model. While it is true that, as the amount of training data increases, the need for human ASR expertise is lessened, at the moment the performance of end-to-end systems ultimately remains inferior to that of more traditional, i.e. HMM and decision tree-based, approaches. Since the goal of this work is to obtain the lowest possible WER on the Switchboard dataset regardless of other practical considerations such as speed and/or simplicity, we have focused on the latter approaches. The paper is organized as follows: in section 2 we discuss acoustic and language modeling improvements and in section 3 we summarize our findings. 2. System improvements In this section we describe three different acoustic models that were trained on 2000 hours of English conversational telephone speech: recurrent nets with maxout activations and annealed dropout, very deep convolutional nets with 3 3 kernels, and bidirectional long short-term memory nets operating on FM- LLR and i-vector features. All models are used in a hybrid HMM decoding scenario by subtracting the logarithm of the HMM state priors from the log of the softmax output scores. The training and test data, frontend processing, speaker adaptation are identical to [11] and their description will be omitted. At the end of the section, we also provide an update on our vocabulary and language modeling experiments Recurrent nets with maxout activations We remind the reader that maxout nets [12] generalize ReLu units{ by employing } non-linearities of the form s i = max j C(i) w T j x + b j where the subsets of neurons C(i) are typically disjoint. In [11] we have shown that maxout DNNs and CNNs trained with annealed dropout outperform their sigmoid-based counterparts on both 300 hours and 2000 hours training regimes. What was missing there was a comparison between maxout and sigmoid for unfolded RNNs [4]. The architecture of the maxout RNNs comprises one recurrent layer with 2828 units projected to 1414 units via non-overlapping 2 1 maxout operations. This layer is followed by 4 nonrecurrent layers with 2828 units (also projected to 1414) followed by a bottleneck with units and an output layer with neurons corresponding to as many contextdependent HMM states. The number of neurons for the maxout layers have been chosen such that the weight matrices have roughly the same number of parameters as the baseline sigmoid network which has 2048 units per hidden layer. The recurrent layer is unfolded backwards in time for 6 time steps t 5... t and has 340-dimensional inputs consisting of 6 spliced right context 40-dimensional FMLLR frames (t... t + 5) to which we append a 100-dimensional speaker-based ivector. The unfolded maxout RNN architecture is depicted in Figure 1. The network is trained one hidden layer at a time with discriminative pretraining followed by 12 epochs of SGD CE training on randomized minibatches of 250 samples. The model is refined with Hessian-free sequence discriminative training [13]
2 t Figure 1: Unfolded maxout RNN architecture (right arrows in the boxes denote the maxout operation). using the state-based MBR criterion for 10 iterations. RNN sigmoid (CE) RNN maxout (CE) RNN maxout (ST) Table 1: Word error rates for sigmoid vs. Maxout RNNs trained with annealed dropout on Hub5 00 after cross-entropy training (and sequence training for Maxout). In Table 1 we report the error rates for sigmoid and maxout RNNs on the Switchboard and CallHome subsets of Hub5 00. The decodings are done with a small vocabulary of 30K words and a small 4-gram language model with 4M n-grams. Note that the sigmoid RNNs have better error rates than what was reported in [11] because they have been retrained after the data has been realigned with the best joint RNN/CNN model. We observe that the maxout RNNs are consistently better and that, by themselves, they achieve a similar WER as our previous best model which was the joint RNN/CNN with sigmoid activations Very deep convolutional networks t Very deep CNNs with small 3 3 kernels have recently been shown to achieve very strong performance as acoustic models in hybrid NN-HMM speech recognition systems. Results were provided after cross-entropy training on the 300 hours switchboard-1 dataset in [14], and results from sequence training on both switchboard-1 and the 2000 hours switchboard+fisher dataset are in [15]. The very deep convolutional networks are inspired by the VGG Net architecture introduced in [16] for the 2014 ImageNet classification challenge, with the central idea to replace large convolutional kernels by small 3 3 kernels. By stacking many of these convolutional layers with ReLU nonlinearities before pooling layers, the same receptive field is created with less parameters and more nonlinearity. Figure 2 shows the design of the networks. Note that as we go deeper in the network, the time and frequency resolution is reduced through pooling only, while the convolutions are zero-padded as to not reduce the size of the feature maps. We increase the number of feature maps gradually from 64 to 512 (indicated by the different colors). We pool right before the layer that increases the number of feature maps. Note that the indication of feature map size on the right only applies to the rightmost 2 designs. In contrast, the classical CNN architecture has only two layers, goes to 512 feature maps directly, and uses a large 9 9 kernel on the first layer. Our 10-layer CNN has about the same number of parameters as the classical CNN, converges in 5 times fewer epochs, but is computationally more expensive. Results for 3 variations of the 10-layer CNN are in table 2. For model combination, we use the version with pooling, which is the exact same model without modifications from the original paper [14]. CNN model SWB (300h) SWB (2000h) CE ST CE ST Classic sigmoid [17] Classic maxout [11] * 9.9* (a) 10-conv Pool (b) 10-conv No pool (c) 10-conv No pool, no pad Table 2: WER on the SWB part of the Hub5 00 testset, for 3 variants of the 10-convolutional-layer CNN: with pooling in time (a), without pooling in time (b), and without pooling nor padding in time (c). For more details see [15]. *New results. Our implementation was done in Torch [18]. We adopt the balanced sampling from [14], by sampling from context dependent state CD i with probability p i = f γ i j f γ. We keep γ = 0.8 j throughout the experiments during cross-entropy training. During CE training, we optimize with simple SGD or NAG, during ST we found NAG to be superior to SGD. We regularize the stochastic sequence training by adding the gradient of crossentropy loss, as proposed in [19] Bidirectional LSTMs Given the recent popularity of LSTMs for acoustic modeling [6, 7, 8, 9], we have experimented with such models on the Switchboard task using the Torch toolkit [18]. We have looked at the effect of the input features on LSTM performance, the number of layers and whether start states for the recurrent layers should be reset or carried over. We use bidirectional LSTMs that are trained on non-overlapping subsequences of 20 frames. The subsequences coming from the same utterance are contiguous so that the left-to-right final states for the current subsequence can be copied to the left-to-right start states for the next subsequence (i.e. carried over). For processing speed and in order to get good gradient estimates, we group subsequences from multiple utterances into minibatches of size 256. Regardless of the number of LSTM layers, all models use a linear bottleneck of size 256 before the softmax output layer (of size 32000). In one experiment, we compare the effect of input features on model performance. The baseline models are trained on 40-dimensional FMLLR dimensional ivector frames and have 1024 (or 512) LSTM units per layer and per direction (left-to-right and right-to-left). The forward and backward activations from the previous LSTM layer are concatenated and fed into the next LSTM layer. The contrast model is a sin-
3 2-conv (classic) 6-conv 8-conv 10-conv featuremap size (freq x time) 2 x 4 4x3 conv, 512 3x1 pool 9x9 conv, x 8 10 x x x 16 input (40x11) Figure 2: The design of the VGG nets: (1) classical CNN, (2-4) very deep CNNs from [14] with 6, 8 and 10 convolutional layers respectively. The deepest CNN (10-conv) obtains best performance. This figure corresponds to [14] Table 1. gle layer bidirectional LSTM trained on 128-dim features obtained by performing PCA on 512-dimensional bottleneck features. The features are obtained from a 6-layer DNN cross entropy trained on blocks of 11 consecutive FMLLR frames and 100-dimensional i-vectors. In Table 3, we report recognition results on Hub5 00 for these four models trained with 15 passes of cross-entropy SGD on the 300 hour (SWB-1) subset. 1-layer 1024 bottleneck layer 1024 FMLLR+ivec layer 1024 FMLLR+ivec layer 512 FMLLR+ivec Table 3: Word error rates on Hub for various LSTM models trained with cross-entropy on 300 hours. Held-out loss start states reset start states carried over Due to a bug that affected our earlier multi-layer LSTM results, we decided to go ahead with single layer bidirectional LSTMs on bottleneck features on the full 2000 hour training set. We also experimented with how to deal with the start states at the beginning of the left-to-right pass. One option is to carry them over from the previous subsequence and the other one is to reset the start states at the beginning of each subsequence. In Figure 3 we compare the cross-entropy loss on held-out data between these two models. As can be seen, the LSTM model with carried over start states is much better at predicting the correct HMM state. However, when comparing word error rates in Table 4, the LSTM with start states that are reset has a better performance. We surmise that this is because the increased memory of the LSTM with carried over start states is in conflict with the state sequence constraints imposed by the HMM topology and the language model. Additionally, we show the WERs of the DNN used for Epoch Figure 3: Single layer LSTM cross-entropy loss on held-out data with left-to-right start states which are either reset or carried over.
4 the bottleneck features and of a 4-layer 512 unit LSTM. We observe that the 4 layer LSTM is significantly better than the DNN and the two single layer LSTMs trained on bottleneck features. CE ST CE ST 6-layer DNN layer LSTM (carry-over) layer LSTM (reset) layer LSTM (reset) Table 4: Word error rates on Hub for DNN and LSTM models. All models are trained on 2000 hours with crossentropy and sequence discriminative training Model combination In Table 5 we report the performance of the individual models (RNN, VGG and 4-layer LSTM) described in the previous subsections as well as the results after frame-level score fusion. All decodings are done with a 30K word vocabulary and a small 4-gram language model with 4M n-grams. We note that RNNs and VGG nets exhibit similar performance and have a strong complementarity which improves the WER by 0.6% and 0.9% on SWB and CH, respectively. RNN maxout VGG (a) from Table LSTM (4-layer, 512) RNN+VGG VGG+LSTM RNN+VGG+LSTM Table 5: Word error rates for individual acoustic models and frame-level score fusions on Hub Language modeling experiments Our language modeling strategy largely parallels that described in [11]. For completeness, we will repeat some of the details here. The main difference is an increase in the vocabulary size from 30K words to 85K words. When comparing acoustic models in previous sections, we used a relatively small legacy language model used in previous publications: a 4M n-gram (n=4) language model with a vocabulary of 30.5K words. We wanted to increase the language model coverage in a manner that others can replicate. To this end, we increased the vocabulary size from 30.5K words to 85K words by adding the vocabulary of the publicly available Broadcast News task. We also added to the LM publicly available text data from LDC, including Switchboard, Fisher, Gigaword, and Broadcast News and Conversations. The most relevant data are the transcripts of the 1975 hour audio data used to train the acoustic model, consisting of about 24M words. For each corpus we trained a 4-gram model with modified Kneser-Ney smoothing [20]. The component LMs are linearly interpolated with weights chosen to optimize perplexity on a held-out set. Entropy pruning [21] was applied, resulting in a single 4-gram LM consisting of 36M n-grams. This new n-gram LM was used together with our best acoustic model to decode and generate word lattices for LM rescoring experiments. The first two lines of Table 6 show the improvement using this larger n-gram LM with larger vocabulary trained on more data. The WER improved by 1.0% for SWB. Part of this improvement ( %) was due to also using a larger beam for decoding and a change in vocabulary tokenization. LM WER SWB WER CH 30K vocab, 4M n-grams K vocab, 36M n-grams n-gram + model M n-gram + NNLM n-gram + model M + NNLM Table 6: Comparison of word error rates for different LMs. We used two types of LMs for LM rescoring: model M, a class-based exponential model [22] and feed-forward neural network LM (NNLM) [23, 24, 25, 26]. We built a model M LM on each corpus and interpolated the models, together with the 36M n-gram LM. As shown in Table 6, using model M results in an improvement of 0.6% on SWB. We built two NNLMs for interpolation. One was trained on just the most relevant data: the 24M word corpus (Switchboard/Fisher/CallHome acoustic transcripts). Another was trained on a 560M word subset of the LM training data: in order to speed up training for this larger set, we employed a hierarchical NNLM approximation [24, 27]. Table 6 shows that the NNLMs provided an additional 0.4% improvement over the model M result on SWB. Compared with the n-gram LM baseline, LM rescoring yielded a total improvement of 1.0% on SWB (7.6% to 6.6%) and 1.5% on CH (13.7% to 12.2%). 3. Conclusion In our previous Switchboard system paper [11] we have observed a good complementarity between recurrent nets and convolutional nets and their combination led to significant accuracy gains. In this paper we have presented an improved unfolded RNN (with maxout instead of sigmoid activations) and a stronger CNN obtained by adding more convolutional layers with smaller kernels and ReLu nonlinearities. These improved models still have good complementarity and their frame-level score combination in conjunction with a multi-layer LSTM leads to a 0.4%-0.7% decrease in WER over the LSTM. Multilayer LSTMs were the strongest performing model followed closely by the RNN and VGG nets. We also believe that LSTMs have more potential for direct sequence-to-sequence modeling and we are actively exploring this area of research. On the language modeling side, we have increased our vocabulary from 30K to 85K words and updated our component LMs. At the moment, we are less than 3% away from achieving human performance on the Switchboard data (estimated to be around 4%). Unfortunately, it looks like future improvements on this task will be considerably harder to get and will probably require a breakthrough in direct sequence-to-sequence modeling and a significant increase in training data. 4. Acknowledgment The authors wish to thank E. Marcheret, J. Cui and M. Nussbaum-Thom for useful suggestions about LSTMs.
5 5. References [1] F. Seide, G. Li, X. Chien, and D. Yu, Feature engineering in context-dependent deep neural networks for conversational speech transcription, in Proc. ASRU, [2] O. Abdel-Hamid, L. Deng, and D. Yu, Exploring convolutional neural network structures and optimization techniques for speech recognition. in INTERSPEECH, 2013, pp [3] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, Deep convolutional neural networks for lvcsr, in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp [4] G. Saon, H. Soltau, A. Emami, and M. Picheny, Unfolded recurrent neural networks for speech recognition, in Fifteenth Annual Conference of the International Speech Communication Association, [5] V. Peddinti, D. Povey, and S. Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts, in Proceedings of INTERSPEECH, [6] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., Deepspeech: Scaling up end-to-end speech recognition, arxiv preprint arxiv: , [7] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk, Learning acoustic frame labeling for speech recognition with recurrent neural networks, in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp [8] Y. Miao, M. Gowayyed, and F. Metze, Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding, arxiv preprint arxiv: , [9] A.-r. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig, and G. Penn, Deep bi-directional recurrent networks over spectral windows, in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, [10] A. Graves, A.-r. Mohamed, and G. Hinton, Speech recognition with deep recurrent neural networks, in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp [11] G. Saon, H.-K. Kuo, S. Rennie, and M. Picheny, The IBM 2015 English conversational speech recognition system, in Sixteenth Annual Conference of the International Speech Communication Association, [12] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, Maxout networks, arxiv preprint arxiv: , [13] B. Kingsbury, T. Sainath, and H. Soltau, Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization, in Proc. Interspeech, [14] T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, Very deep multilingual convolutional neural networks for lvcsr, Proc. ICASSP, [15] T. Sercu and V. Goel, Advances in very deep convolutional neural networks for lvcsr, arxiv, [16] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR arxiv: , [17] H. Soltau, G. Saon, and T. N. Sainath, Joint training of convolutional and non-convolutional neural networks, to Proc. ICASSP, [18] R. Collobert, K. Kavukcuoglu, and C. Farabet, Torch7: A matlab-like environment for machine learning, in BigLearn, NIPS Workshop, no. EPFL-CONF , [19] H. Su, G. Li, D. Yu, and F. Seide, Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription, Proc. ICASSP, [20] S. F. Chen and J. Goodman, An empirical study of smoothing techniques for language modeling, Computer Speech & Language, vol. 13, no. 4, pp , [21] A. Stolcke, Entropy-based pruning of backoff language models, in Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998, pp [22] S. F. Chen, Shrinking exponential language models, in Proc. NAACL-HLT, 2009, pp [23] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, A neural probabilistic language model, Journal of Machine Learning Research, vol. 3, pp , [24] A. Emami, A neural syntactic language model, Ph.D. dissertation, Johns Hopkins University, Baltimore, MD, USA, [25] H. Schwenk, Continuous space language models, Computer Speech & Language, vol. 21, no. 3, pp , [26] A. Emami and L. Mangu, Empirical study of neural network language models for Arabic speech recognition, in Proc. ASRU, 2007, pp [27] H.-K. J. Kuo, E. Arısoy, A. Emami, and P. Vozila, Large scale hierarchical neural network language models, in Proc. Interspeech, 2012.
arxiv:1505.05899v1 [cs.cl] 21 May 2015
The IBM 2015 English Conversational Telephone Speech Recognition System George Saon, Hong-Kwang J. Kuo, Steven Rennie and Michael Picheny IBM T. J. Watson Research Center, Yorktown Heights, NY, 10598 [email protected]
Strategies for Training Large Scale Neural Network Language Models
Strategies for Training Large Scale Neural Network Language Models Tomáš Mikolov #1, Anoop Deoras 2, Daniel Povey 3, Lukáš Burget #4, Jan Honza Černocký #5 # Brno University of Technology, Speech@FIT,
OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH. Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane
OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane Carnegie Mellon University Language Technology Institute {ankurgan,fmetze,ahw,lane}@cs.cmu.edu
KL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION
KL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION Dong Yu 1, Kaisheng Yao 2, Hang Su 3,4, Gang Li 3, Frank Seide 3 1 Microsoft Research, Redmond,
Learning to Process Natural Language in Big Data Environment
CCF ADL 2015 Nanchang Oct 11, 2015 Learning to Process Natural Language in Big Data Environment Hang Li Noah s Ark Lab Huawei Technologies Part 1: Deep Learning - Present and Future Talk Outline Overview
Denoising Convolutional Autoencoders for Noisy Speech Recognition
Denoising Convolutional Autoencoders for Noisy Speech Recognition Mike Kayser Stanford University [email protected] Victor Zhong Stanford University [email protected] Abstract We propose the use of
LSTM for Punctuation Restoration in Speech Transcripts
LSTM for Punctuation Restoration in Speech Transcripts Ottokar Tilk, Tanel Alumäe Institute of Cybernetics Tallinn University of Technology, Estonia [email protected], [email protected] Abstract
arxiv:1412.5567v2 [cs.cl] 19 Dec 2014
Deep Speech: Scaling up end-to-end speech recognition arxiv:1412.5567v2 [cs.cl] 19 Dec 2014 Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh,
Turkish Radiology Dictation System
Turkish Radiology Dictation System Ebru Arısoy, Levent M. Arslan Boaziçi University, Electrical and Electronic Engineering Department, 34342, Bebek, stanbul, Turkey [email protected], [email protected]
Automatic slide assignation for language model adaptation
Automatic slide assignation for language model adaptation Applications of Computational Linguistics Adrià Agustí Martínez Villaronga May 23, 2013 1 Introduction Online multimedia repositories are rapidly
IBM Research Report. Scaling Shrinkage-Based Language Models
RC24970 (W1004-019) April 6, 2010 Computer Science IBM Research Report Scaling Shrinkage-Based Language Models Stanley F. Chen, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, Abhinav Sethy IBM Research
Deep Neural Network Approaches to Speaker and Language Recognition
IEEE SIGNAL PROCESSING LETTERS, VOL. 22, NO. 10, OCTOBER 2015 1671 Deep Neural Network Approaches to Speaker and Language Recognition Fred Richardson, Senior Member, IEEE, Douglas Reynolds, Fellow, IEEE,
Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected]
Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected] Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual
Sense Making in an IOT World: Sensor Data Analysis with Deep Learning
Sense Making in an IOT World: Sensor Data Analysis with Deep Learning Natalia Vassilieva, PhD Senior Research Manager GTC 2016 Deep learning proof points as of today Vision Speech Text Other Search & information
Do Deep Nets Really Need to be Deep?
Do Deep Nets Really Need to be Deep? Lei Jimmy Ba University of Toronto [email protected] Rich Caruana Microsoft Research [email protected] Abstract Currently, deep neural networks are the state
LEARNING FEATURE MAPPING USING DEEP NEURAL NETWORK BOTTLENECK FEATURES FOR DISTANT LARGE VOCABULARY SPEECH RECOGNITION
LEARNING FEATURE MAPPING USING DEEP NEURAL NETWORK BOTTLENECK FEATURES FOR DISTANT LARGE VOCABULARY SPEECH RECOGNITION Ivan Himawan 1, Petr Motlicek 1, David Imseng 1, Blaise Potard 1, Namhoon Kim 2, Jaewon
Factored Language Model based on Recurrent Neural Network
Factored Language Model based on Recurrent Neural Network Youzheng Wu X ugang Lu Hi toshi Yamamoto Shi geki M atsuda Chiori Hori Hideki Kashioka National Institute of Information and Communications Technology
Introduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Deep Learning Barnabás Póczos & Aarti Singh Credits Many of the pictures, results, and other materials are taken from: Ruslan Salakhutdinov Joshua Bengio Geoffrey
THE RWTH ENGLISH LECTURE RECOGNITION SYSTEM
THE RWTH ENGLISH LECTURE RECOGNITION SYSTEM Simon Wiesler 1, Kazuki Irie 2,, Zoltán Tüske 1, Ralf Schlüter 1, Hermann Ney 1,2 1 Human Language Technology and Pattern Recognition, Computer Science Department,
Compacting ConvNets for end to end Learning
Compacting ConvNets for end to end Learning Jose M. Alvarez Joint work with Lars Pertersson, Hao Zhou, Fatih Porikli. Success of CNN Image Classification Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,
Recurrent Neural Networks
Recurrent Neural Networks Neural Computation : Lecture 12 John A. Bullinaria, 2015 1. Recurrent Neural Network Architectures 2. State Space Models and Dynamical Systems 3. Backpropagation Through Time
LARGE SCALE DEEP NEURAL NETWORK ACOUSTIC MODELING WITH SEMI-SUPERVISED TRAINING DATA FOR YOUTUBE VIDEO TRANSCRIPTION
LARGE SCALE DEEP NEURAL NETWORK ACOUSTIC MODELING WITH SEMI-SUPERVISED TRAINING DATA FOR YOUTUBE VIDEO TRANSCRIPTION Hank Liao, Erik McDermott, and Andrew Senior Google Inc. {hankliao, erikmcd, andrewsenior}@google.com
Three New Graphical Models for Statistical Language Modelling
Andriy Mnih Geoffrey Hinton Department of Computer Science, University of Toronto, Canada [email protected] [email protected] Abstract The supremacy of n-gram models in statistical language modelling
Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition
, Lisbon Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition Wolfgang Macherey Lars Haferkamp Ralf Schlüter Hermann Ney Human Language Technology
Sequence to Sequence Weather Forecasting with Long Short-Term Memory Recurrent Neural Networks
Volume 143 - No.11, June 16 Sequence to Sequence Weather Forecasting with Long Short-Term Memory Recurrent Neural Networks Mohamed Akram Zaytar Research Student Department of Computer Engineering Faculty
Image Classification for Dogs and Cats
Image Classification for Dogs and Cats Bang Liu, Yan Liu Department of Electrical and Computer Engineering {bang3,yan10}@ualberta.ca Kai Zhou Department of Computing Science [email protected] Abstract
Image and Video Understanding
Image and Video Understanding 2VO 710.095 WS Christoph Feichtenhofer, Axel Pinz Slide credits: Many thanks to all the great computer vision researchers on which this presentation relies on. Most material
Pedestrian Detection with RCNN
Pedestrian Detection with RCNN Matthew Chen Department of Computer Science Stanford University [email protected] Abstract In this paper we evaluate the effectiveness of using a Region-based Convolutional
Tattoo Detection for Soft Biometric De-Identification Based on Convolutional NeuralNetworks
1 Tattoo Detection for Soft Biometric De-Identification Based on Convolutional NeuralNetworks Tomislav Hrkać, Karla Brkić, Zoran Kalafatić Faculty of Electrical Engineering and Computing University of
Ebru Arısoy [email protected]
Ebru Arısoy [email protected] Education, Istanbul, Turkey Ph.D. in Electrical and Electronics Engineering, 2009 Title: Statistical and Discriminative Language Modeling for Turkish Large Vocabulary Continuous
Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15
Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15 GENIVI is a registered trademark of the GENIVI Alliance in the USA and other countries Copyright GENIVI Alliance
TED-LIUM: an Automatic Speech Recognition dedicated corpus
TED-LIUM: an Automatic Speech Recognition dedicated corpus Anthony Rousseau, Paul Deléglise, Yannick Estève Laboratoire Informatique de l Université du Maine (LIUM) University of Le Mans, France [email protected]
CS 1699: Intro to Computer Vision. Deep Learning. Prof. Adriana Kovashka University of Pittsburgh December 1, 2015
CS 1699: Intro to Computer Vision Deep Learning Prof. Adriana Kovashka University of Pittsburgh December 1, 2015 Today: Deep neural networks Background Architectures and basic operations Applications Visualizing
Research Article Distributed Data Mining Based on Deep Neural Network for Wireless Sensor Network
Distributed Sensor Networks Volume 2015, Article ID 157453, 7 pages http://dx.doi.org/10.1155/2015/157453 Research Article Distributed Data Mining Based on Deep Neural Network for Wireless Sensor Network
NEURAL NETWORKS A Comprehensive Foundation
NEURAL NETWORKS A Comprehensive Foundation Second Edition Simon Haykin McMaster University Hamilton, Ontario, Canada Prentice Hall Prentice Hall Upper Saddle River; New Jersey 07458 Preface xii Acknowledgments
Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine George E. Dahl, Marc Aurelio Ranzato, Abdel-rahman Mohamed, and Geoffrey Hinton Department of Computer Science University of Toronto
An Early Attempt at Applying Deep Reinforcement Learning to the Game 2048
An Early Attempt at Applying Deep Reinforcement Learning to the Game 2048 Hong Gui, Tinghan Wei, Ching-Bo Huang, I-Chen Wu 1 1 Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
IMPROVING DEEP NEURAL NETWORK ACOUSTIC MODELS USING GENERALIZED MAXOUT NETWORKS. Xiaohui Zhang, Jan Trmal, Daniel Povey, Sanjeev Khudanpur
IMPROVING DEEP NEURAL NETWORK ACOUSTIC MODELS USING GENERALIZED MAXOUT NETWORKS Xiaohui Zhang, Jan Trmal, Daniel Povey, Sanjeev Khudanpur Center for Language and Speech Processing & Human Language Technology
STA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
6.2.8 Neural networks for data mining
6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural
Lecture 6: Classification & Localization. boris. [email protected]
Lecture 6: Classification & Localization boris. [email protected] 1 Agenda ILSVRC 2014 Overfeat: integrated classification, localization, and detection Classification with Localization Detection. 2 ILSVRC-2014
Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks
This version: December 12, 2013 Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks Lawrence Takeuchi * Yu-Ying (Albert) Lee [email protected] [email protected] Abstract We
CUED-RNNLM AN OPEN-SOURCE TOOLKIT FOR EFFICIENT TRAINING AND EVALUATION OF RECURRENT NEURAL NETWORK LANGUAGE MODELS
CUED-RNNLM AN OPEN-SOURCE OOLKI FOR EFFICIEN RAINING AND EVALUAION OF RECURREN NEURAL NEWORK LANGUAGE MODELS X Chen, X Liu, Y Qian, MJF Gales, PC Woodland Cambridge University Engineering Dept, rumpington
PhD in Computer Science and Engineering Bologna, April 2016. Machine Learning. Marco Lippi. [email protected]. Marco Lippi Machine Learning 1 / 80
PhD in Computer Science and Engineering Bologna, April 2016 Machine Learning Marco Lippi [email protected] Marco Lippi Machine Learning 1 / 80 Recurrent Neural Networks Marco Lippi Machine Learning
Università degli Studi di Bologna
Università degli Studi di Bologna DEIS Biometric System Laboratory Incremental Learning by Message Passing in Hierarchical Temporal Memory Davide Maltoni Biometric System Laboratory DEIS - University of
Neural Networks and Support Vector Machines
INF5390 - Kunstig intelligens Neural Networks and Support Vector Machines Roar Fjellheim INF5390-13 Neural Networks and SVM 1 Outline Neural networks Perceptrons Neural networks Support vector machines
Supporting Online Material for
www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence
A STATE-SPACE METHOD FOR LANGUAGE MODELING. Vesa Siivola and Antti Honkela. Helsinki University of Technology Neural Networks Research Centre
A STATE-SPACE METHOD FOR LANGUAGE MODELING Vesa Siivola and Antti Honkela Helsinki University of Technology Neural Networks Research Centre [email protected], [email protected] ABSTRACT In this paper,
SIGNAL INTERPRETATION
SIGNAL INTERPRETATION Lecture 6: ConvNets February 11, 2016 Heikki Huttunen [email protected] Department of Signal Processing Tampere University of Technology CONVNETS Continued from previous slideset
arxiv:1603.03185v2 [cs.cl] 11 Mar 2016
PERSONALIZED SPEECH RECOGNITION ON MOBILE DEVICES Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Haşim Sak, Alexander Gruenstein, Françoise
Building A Vocabulary Self-Learning Speech Recognition System
INTERSPEECH 2014 Building A Vocabulary Self-Learning Speech Recognition System Long Qin 1, Alexander Rudnicky 2 1 M*Modal, 1710 Murray Ave, Pittsburgh, PA, USA 2 Carnegie Mellon University, 5000 Forbes
AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language
AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language Hugo Meinedo, Diamantino Caseiro, João Neto, and Isabel Trancoso L 2 F Spoken Language Systems Lab INESC-ID
3D Object Recognition using Convolutional Neural Networks with Transfer Learning between Input Channels
3D Object Recognition using Convolutional Neural Networks with Transfer Learning between Input Channels Luís A. Alexandre Department of Informatics and Instituto de Telecomunicações Univ. Beira Interior,
Simplified Machine Learning for CUDA. Umar Arshad @arshad_umar Arrayfire @arrayfire
Simplified Machine Learning for CUDA Umar Arshad @arshad_umar Arrayfire @arrayfire ArrayFire CUDA and OpenCL experts since 2007 Headquartered in Atlanta, GA In search for the best and the brightest Expert
The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
Forecasting of Economic Quantities using Fuzzy Autoregressive Model and Fuzzy Neural Network
Forecasting of Economic Quantities using Fuzzy Autoregressive Model and Fuzzy Neural Network Dušan Marček 1 Abstract Most models for the time series of stock prices have centered on autoregressive (AR)
Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features
Semantic Video Annotation by Mining Association Patterns from and Speech Features Vincent. S. Tseng, Ja-Hwung Su, Jhih-Hong Huang and Chih-Jen Chen Department of Computer Science and Information Engineering
LARGE-SCALE MALWARE CLASSIFICATION USING RANDOM PROJECTIONS AND NEURAL NETWORKS
LARGE-SCALE MALWARE CLASSIFICATION USING RANDOM PROJECTIONS AND NEURAL NETWORKS George E. Dahl University of Toronto Department of Computer Science Toronto, ON, Canada Jack W. Stokes, Li Deng, Dong Yu
Stochastic Pooling for Regularization of Deep Convolutional Neural Networks
Stochastic Pooling for Regularization of Deep Convolutional Neural Networks Matthew D. Zeiler Department of Computer Science Courant Institute, New York University [email protected] Rob Fergus Department
Open Access Research on Application of Neural Network in Computer Network Security Evaluation. Shujuan Jin *
Send Orders for Reprints to [email protected] 766 The Open Electrical & Electronic Engineering Journal, 2014, 8, 766-771 Open Access Research on Application of Neural Network in Computer Network
Transfer Learning for Latin and Chinese Characters with Deep Neural Networks
Transfer Learning for Latin and Chinese Characters with Deep Neural Networks Dan C. Cireşan IDSIA USI-SUPSI Manno, Switzerland, 6928 Email: [email protected] Ueli Meier IDSIA USI-SUPSI Manno, Switzerland, 6928
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College
Novelty Detection in image recognition using IRF Neural Networks properties
Novelty Detection in image recognition using IRF Neural Networks properties Philippe Smagghe, Jean-Luc Buessler, Jean-Philippe Urban Université de Haute-Alsace MIPS 4, rue des Frères Lumière, 68093 Mulhouse,
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
Efficient online learning of a non-negative sparse autoencoder
and Machine Learning. Bruges (Belgium), 28-30 April 2010, d-side publi., ISBN 2-93030-10-2. Efficient online learning of a non-negative sparse autoencoder Andre Lemme, R. Felix Reinhart and Jochen J. Steil
Lecture 6. Artificial Neural Networks
Lecture 6 Artificial Neural Networks 1 1 Artificial Neural Networks In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm
An Introduction to Deep Learning
Thought Leadership Paper Predictive Analytics An Introduction to Deep Learning Examining the Advantages of Hierarchical Learning Table of Contents 4 The Emergence of Deep Learning 7 Applying Deep-Learning
A Look Into the World of Reddit with Neural Networks
A Look Into the World of Reddit with Neural Networks Jason Ting Institute of Computational and Mathematical Engineering Stanford University Stanford, CA 9435 [email protected] Abstract Creating, placing,
SELECTING NEURAL NETWORK ARCHITECTURE FOR INVESTMENT PROFITABILITY PREDICTIONS
UDC: 004.8 Original scientific paper SELECTING NEURAL NETWORK ARCHITECTURE FOR INVESTMENT PROFITABILITY PREDICTIONS Tonimir Kišasondi, Alen Lovren i University of Zagreb, Faculty of Organization and Informatics,
DeepPlaylist: Using Recurrent Neural Networks to Predict Song Similarity
DeepPlaylist: Using Recurrent Neural Networks to Predict Song Similarity Anusha Balakrishnan Stanford University [email protected] Kalpit Dixit Stanford University [email protected] Abstract The
arxiv:1502.01852v1 [cs.cv] 6 Feb 2015
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun arxiv:1502.01852v1 [cs.cv] 6 Feb 2015 Abstract Rectified activation
IBM SPSS Neural Networks 22
IBM SPSS Neural Networks 22 Note Before using this information and the product it supports, read the information in Notices on page 21. Product Information This edition applies to version 22, release 0,
Fast R-CNN. Author: Ross Girshick Speaker: Charlie Liu Date: Oct, 13 th. Girshick, R. (2015). Fast R-CNN. arxiv preprint arxiv:1504.08083.
Fast R-CNN Author: Ross Girshick Speaker: Charlie Liu Date: Oct, 13 th Girshick, R. (2015). Fast R-CNN. arxiv preprint arxiv:1504.08083. ECS 289G 001 Paper Presentation, Prof. Lee Result 1 67% Accuracy
Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS PIERRE LANCHANTIN, ANDREW C. MORRIS, XAVIER RODET, CHRISTOPHE VEAUX Very high quality text-to-speech synthesis can be achieved by unit selection
Taking Inverse Graphics Seriously
CSC2535: 2013 Advanced Machine Learning Taking Inverse Graphics Seriously Geoffrey Hinton Department of Computer Science University of Toronto The representation used by the neural nets that work best
GLOVE-BASED GESTURE RECOGNITION SYSTEM
CLAWAR 2012 Proceedings of the Fifteenth International Conference on Climbing and Walking Robots and the Support Technologies for Mobile Machines, Baltimore, MD, USA, 23 26 July 2012 747 GLOVE-BASED GESTURE
IMPLEMENTING SRI S PASHTO SPEECH-TO-SPEECH TRANSLATION SYSTEM ON A SMART PHONE
IMPLEMENTING SRI S PASHTO SPEECH-TO-SPEECH TRANSLATION SYSTEM ON A SMART PHONE Jing Zheng, Arindam Mandal, Xin Lei 1, Michael Frandsen, Necip Fazil Ayan, Dimitra Vergyri, Wen Wang, Murat Akbacak, Kristin
Performance Evaluation of Artificial Neural. Networks for Spatial Data Analysis
Contemporary Engineering Sciences, Vol. 4, 2011, no. 4, 149-163 Performance Evaluation of Artificial Neural Networks for Spatial Data Analysis Akram A. Moustafa Department of Computer Science Al al-bayt
arxiv:1312.6034v2 [cs.cv] 19 Apr 2014
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps arxiv:1312.6034v2 [cs.cv] 19 Apr 2014 Karen Simonyan Andrea Vedaldi Andrew Zisserman Visual Geometry Group,
NEURAL NETWORK FUNDAMENTALS WITH GRAPHS, ALGORITHMS, AND APPLICATIONS
NEURAL NETWORK FUNDAMENTALS WITH GRAPHS, ALGORITHMS, AND APPLICATIONS N. K. Bose HRB-Systems Professor of Electrical Engineering The Pennsylvania State University, University Park P. Liang Associate Professor
Using Segmentation Constraints in an Implicit Segmentation Scheme for Online Word Recognition
1 Using Segmentation Constraints in an Implicit Segmentation Scheme for Online Word Recognition Émilie CAILLAULT LASL, ULCO/University of Littoral (Calais) Christian VIARD-GAUDIN IRCCyN, University of
MulticoreWare. Global Company, 250+ employees HQ = Sunnyvale, CA Other locations: US, China, India, Taiwan
1 MulticoreWare Global Company, 250+ employees HQ = Sunnyvale, CA Other locations: US, China, India, Taiwan Focused on Heterogeneous Computing Multiple verticals spawned from core competency Machine Learning
Sequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural Networks Ilya Sutskever Google [email protected] Oriol Vinyals Google [email protected] Quoc V. Le Google [email protected] Abstract Deep Neural Networks (DNNs)
NEURAL NETWORKS IN DATA MINING
NEURAL NETWORKS IN DATA MINING 1 DR. YASHPAL SINGH, 2 ALOK SINGH CHAUHAN 1 Reader, Bundelkhand Institute of Engineering & Technology, Jhansi, India 2 Lecturer, United Institute of Management, Allahabad,
Deep Learning using Linear Support Vector Machines
Yichuan Tang [email protected] Department of Computer Science, University of Toronto. Toronto, Ontario, Canada. Abstract Recently, fully-connected and convolutional neural networks have been trained
Feature Engineering in Machine Learning
Research Fellow Faculty of Information Technology, Monash University, Melbourne VIC 3800, Australia August 21, 2015 Outline A Machine Learning Primer Machine Learning and Data Science Bias-Variance Phenomenon
Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN
PAGE 30 Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN Sung-Joon Park, Kyung-Ae Jang, Jae-In Kim, Myoung-Wan Koo, Chu-Shik Jhon Service Development Laboratory, KT,
