THE BAVIECA OPEN-SOURCE SPEECH RECOGNITION TOOLKIT. Daniel Bolaños. Boulder Language Technologies (BLT), Boulder, CO, USA
ABSTRACT

This article describes the design of Bavieca, an open-source speech recognition toolkit intended for speech research and system development. The toolkit supports lattice-based discriminative training, wide phonetic contexts, efficient acoustic scoring, large n-gram language models, and the most common feature and model transformations. Bavieca is written entirely in C++ and presents a simple and modular design with an emphasis on scalability and reusability. Bavieca achieves competitive results in standard benchmarks. The toolkit is distributed under the highly unrestricted Apache 2.0 license, and is freely available on SourceForge.

Index Terms: automatic speech recognition

1. INTRODUCTION

Bavieca is an open-source speech recognition toolkit intended for speech research and as a platform for rapid development of speech-enabled solutions by non-speech experts. It supports common acoustic modeling and adaptation techniques based on continuous density hidden Markov models (CD-HMMs), including discriminative training. Bavieca includes two efficient decoders, based on dynamic and static expansion of the search space, that can operate in batch and live recognition modes. Bavieca is written entirely in C++ and freely distributed under the Apache 2.0 license. Compared to existing open-source automatic speech recognition (ASR) toolkits such as HTK [1], CMU-Sphinx [2], RWTH [3] and the more recent Kaldi [4], Bavieca is characterized by a simple and modular design that favors scalability and reusability, a small code base, a focus on real-time performance, and a highly unrestricted license. Bavieca was developed at Boulder Language Technologies (BLT) during the last three years to fulfill the needs of research projects conducted within the company.
Currently it is being used in a wide variety of projects, including conversational dialog systems and assessment tools that are deployed in formal educational settings and other real-life scenarios. This work was supported by U.S. Department of Education award number R305B070434, NSF award number , NIH award number R43 DC and NSF award number IIS .

2. DESIGN

Bavieca is written entirely in C++ and makes extensive use of the Standard Template Library (STL) and the freely available Linear Algebra PACKage (LAPACK) for linear algebra support. The code base is quite small, comprising about one hundred C++ classes and 30,000 lines of code. The toolkit consists of about 25 command-line tools that serve very specific purposes, such as accumulating sufficient statistics or estimating model parameters according to some estimation criterion. These tools present a simple interface in which many parameters concerning algorithmic details are optional and receive default values. Following the traditional design, these tools are intended to be invoked from scripting languages such as Perl or Python in order to build recognizers. This design is similar to that of the HTK [1] and Kaldi [4] toolkits; however, Bavieca provides far fewer tools. Tools are organized to make the toolkit scalable to compute clusters, and the decoding engines are available both as executables and as libraries. Most C++ classes within the toolkit hide unimportant implementation details and expose relatively simple interfaces that favor reusability across different tools. For example, the same aligner objects are used in all the accumulation tools, and the acoustic and language model interfaces are shared by the training, decoding and lattice editing tools. Execution errors and warnings are handled via exceptions, which increases the readability of the code.
Acoustic modeling is based on CD-HMMs; the HMM topology, which cannot be modified outside the code, is fixed to three states with no transition probabilities.

3. FRONT END AND MODELING TOOLS

3.1. Front-end

Speech parametrization is carried out using Mel-Frequency Cepstral Coefficients (MFCCs). A configuration file is used to specify feature extraction parameters such as the window size and width, the tapering function, filterbank characteristics, frequency cut-offs, number of cepstral parameters, energy, etc. Dynamic coefficients can be appended to the feature
vectors in the form of n-order derivatives or by concatenating static coefficients from adjacent feature vectors (spliced feature vectors). Cepstral mean normalization (CMN) and cepstral mean and variance normalization (CMVN) can be applied at either the utterance or the session level. The decoders also support a stream mode for live recognition. Feature decorrelation can be carried out using Linear Discriminant Analysis (LDA) and Heteroscedastic LDA (HLDA).

3.2. Alignments

There are four C++ aligner classes implementing graph-based and linear versions of the Viterbi and Forward-Backward algorithms. These classes can handle phonetic contexts of any length and are reused across different components of the system, such as the accumulation and adaptation tools. Graph-based alignment is a data-driven method that finds the sequence of word pronunciations and symbols that best matches the utterance while performing the actual alignment. The graph-based aligner operates on a directed acyclic graph (DAG) of HMM-states built from the reference string of words (typically the hand-made transcription) by considering alternative pronunciations and optional symbols such as silence or fillers. The alignment is carried out against all the paths in the graph simultaneously, although optional and alternative symbols may not receive any significant occupation. The construction of this graph is similar to the procedure in the Attila toolkit [5]. It comprises the following steps: a) generation of an initial graph of words from the reference, including alternative pronunciations found in the pronunciation lexicon (optionally) and optional symbols; b) transformation of the word-graph into a phone-graph; c) expansion of the phone-graph by propagating left and right phonetic context and attaching within-word position labels to each phone-arc using a queue; d) transformation of the phone-graph into a graph of HMM-states using context-clustering decision trees and the phone labels from the previous step.
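As an illustration, step a) of this construction might look like the following Python sketch. It is a hypothetical, heavily simplified version (not Bavieca's C++ implementation): nodes are integers, alternative pronunciations become parallel arcs, and optional inter-word silence is modeled with a silence arc next to an epsilon arc.

```python
# Hypothetical sketch of step (a): expand a reference word sequence into a
# DAG admitting alternative pronunciations and optional inter-word silence.
def build_word_graph(words, lexicon):
    """lexicon: word -> list of pronunciations (each a list of phones).
    Returns a list of arcs (src_node, dst_node, label)."""
    arcs = []
    node = 0
    for word in words:
        start, end = node, node + 1
        for pron in lexicon[word]:
            arcs.append((start, end, tuple(pron)))  # alternative pronunciations
        arcs.append((end, end + 1, "sil"))          # optional silence ...
        arcs.append((end, end + 1, None))           # ... bypassed by an epsilon arc
        node = end + 1
    return arcs

lexicon = {"the": [["dh", "ah"], ["dh", "iy"]], "cat": [["k", "ae", "t"]]}
arcs = build_word_graph(["the", "cat"], lexicon)
```

In the toolkit itself this word graph would subsequently be expanded into phone and HMM-state graphs (steps b-d).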
Before performing the actual alignment, the graph is minimized following a forward-backward edge-merging process that merges equivalent states and arcs (HMM-states). This process considerably reduces the graph size and thus the time and memory requirements of the alignment. The linear aligner is the method of choice when the reference is completely known in advance, i.e., when no optional or alternative symbols/words are allowed. This is the case when collecting Gaussian-specific statistics for semi-supervised adaptation using the best path from the decoder, or when collecting Gaussian statistics at the phone level for discriminative training. Alignments can be stored on disk in binary or text format. In binary format, an array of pairs [HMM-state, occupation] is kept for each time frame. Additionally, for Viterbi alignments the word alignment information is kept, which might not be recoverable otherwise.

3.3. Accumulation and estimation

Accumulation of sufficient statistics is carried out using the accumulation tools: mlaccumulator for maximum likelihood (ML) and maximum a posteriori (MAP) estimation, and dtaccumulator for standard and boosted maximum mutual information (MMI) estimation. These tools align the data using the acoustic models and the aligner objects previously described, and accumulate statistics on disk in a universal format suitable for physical and logical HMM-states and for numerator and denominator statistics. This strategy, although not as flexible as accumulating statistics at the utterance level through the use of alignment objects, is more compact. Estimation of model parameters from sufficient statistics is carried out using the estimation tools: mlestimator for ML estimation, mapestimator for MAP estimation, and dtestimator for standard and boosted MMI estimation.

3.4. Context modeling

Context clustering is carried out using logical accumulators obtained from single-Gaussian HMMs and decision trees that are either state-specific or global.
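Both the estimation tools and the context clustering consume accumulated statistics rather than raw feature vectors. A minimal Python sketch (hypothetical, not Bavieca's interface) of a diagonal-covariance Gaussian accumulator and its ML estimate:

```python
# Hypothetical sketch: sufficient statistics (occupation, first- and
# second-order sums) for one diagonal-covariance Gaussian, and the ML
# estimate derived from them without revisiting the data.
import numpy as np

class GaussianAccumulator:
    def __init__(self, dim):
        self.occ = 0.0                # zeroth-order statistic (occupation)
        self.sum = np.zeros(dim)      # first-order statistic
        self.sqsum = np.zeros(dim)    # second-order (diagonal) statistic

    def accumulate(self, x, gamma=1.0):
        """Add one feature vector x with posterior occupation gamma."""
        self.occ += gamma
        self.sum += gamma * x
        self.sqsum += gamma * x * x

    def estimate(self):
        """ML mean and diagonal variance from the accumulated statistics."""
        mean = self.sum / self.occ
        var = self.sqsum / self.occ - mean * mean
        return mean, var

acc = GaussianAccumulator(2)
for x in [np.array([1.0, 2.0]), np.array([3.0, 4.0])]:
    acc.accumulate(x)
mean, var = acc.estimate()  # mean = [2, 3], var = [1, 1]
```

Because accumulators add, statistics gathered in parallel on a compute cluster can simply be summed before estimation.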
Decision trees are generated following a standard top-down procedure that iteratively splits the data by applying binary questions under an ML criterion. Questions are asked about membership in phonetic groups (defined by hand-made phonetic rules) and about the within-word position (initial, internal or final). The splitting process is governed by two parameters: a minimum occupation count for each leaf and a minimum likelihood increase for each split. Finally, a bottom-up merging process is applied to merge those leaves which, when merged, produce a likelihood decrease below the minimum value used to allow a split.

3.5. Mixture splitting and merging

Acoustic model refinement through mixture splitting and merging can be performed after each reestimation iteration using the gmmeditor tool, the original set of HMMs and the accumulated statistics. Gaussian splitting is performed iteratively until the desired number of components in the Gaussian mixture model (GMM) is reached. At each iteration, the Gaussian to split is selected based on either the largest average covariance (default behavior) or the largest occupation. After the reestimation iteration(s) that typically follow mixture splitting, the occupation of some Gaussian components may fall below the minimum occupation needed to reliably estimate their parameters (typically 100 feature vectors). Thus, aiming for robust parameter estimation, components with an occupation below a given threshold are merged into the nearest Gaussian in the mixture, which is found using a covariance metric. After applying these tools, each GMM will typically have a variable number of components depending
on the data aligned to the HMM-state, which varies across reestimation iterations.

3.6. Speaker adaptation

Vocal Tract Length Normalization (VTLN) [6] is implemented using a piece-wise linear function with one breakpoint to scale the filterbank frequencies. Warp factors are estimated under the ML criterion using conventional acoustic models, while unvoiced phones and symbols are excluded from the estimation. Maximum Likelihood Linear Regression (MLLR) is implemented for both feature-space (fMLLR) [7] and model-space adaptation [8]. For fMLLR a single transform is used. In model-space, mean and diagonal-variance adaptation is performed using a regression tree. The regression tree is generated by clustering the Gaussian means using either K-means or expectation maximization (EM) clustering. The clustering process is governed by two parameters: the maximum number of base-classes and the minimum number of Gaussian components per base-class. Adaptation statistics are accumulated in the regression tree, and transforms are estimated conditioned on a minimum number of frames and observed Gaussian distributions per transform.

3.7. Language Modeling

The toolkit supports language models (LMs) in the ARPA n-gram format. Internally, an n-gram LM is represented as a Finite State Machine (FSM) in which each state represents a word history and each transition has a word label (or back-off symbol) and a log-likelihood attached. Given a new word, the current LM-state is updated by performing a binary search on the sorted array of transitions, which includes a transition to the back-off state. This representation makes the use of different n-gram sizes transparent to the decoder.

3.8. Lattice Processing

Lattice operations are performed using the latticeeditor tool. Operations include: word error rate (WER) computation (oracle), HMM-state and LM marking, HMM-state alignment, path insertion, determinization (merging of equivalent states and arcs), posterior probability and confidence measure computation, etc.
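The LM representation described above (a binary search over each state's sorted transitions, falling back through back-off arcs on a miss) can be sketched in Python; the data layout below is hypothetical, not Bavieca's internal format.

```python
# Hypothetical sketch of an n-gram LM stored as an FSM: each state holds a
# sorted array of (label, log_prob, dest_state) transitions; the back-off
# arc uses a reserved label that sorts before all real word ids.
import bisect

BACKOFF = -1  # reserved label; sorts before any real word id >= 0

states = [
    [(BACKOFF, -0.5, 1), (7, -1.0, 2)],  # state 0: a bigram history
    [(5, -2.0, 1), (7, -1.5, 1)],        # state 1: unigram state (no back-off)
    [(BACKOFF, -0.3, 1)],                # state 2
]

def lm_score(state, word):
    """Return (log_prob, next_state); follow back-off arcs on a miss,
    adding their penalties, until the word is found."""
    score = 0.0
    while True:
        arcs = states[state]
        labels = [a[0] for a in arcs]
        i = bisect.bisect_left(labels, word)
        if i < len(arcs) and arcs[i][0] == word:
            return score + arcs[i][1], arcs[i][2]
        # miss: take the back-off transition (first arc by construction)
        assert arcs[0][0] == BACKOFF, "no back-off from this state"
        score += arcs[0][1]
        state = arcs[0][2]

print(lm_score(0, 7))  # direct hit
print(lm_score(0, 5))  # resolved through the back-off arc
```

Since the decoder only ever asks a state for the successor of a word, the n-gram order never appears in the decoding code, which is what makes different n-gram sizes transparent.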
This tool only operates on lattices in binary format, although lattices can be converted to text format for visualization purposes.

3.9. Discriminative training

The toolkit supports model-based discriminative training under the MMI and boosted MMI (bMMI) estimation criteria, with cancellation of statistics and I-smoothing to the previous iteration [9]. Discriminative training is carried out using the tools dtaccumulator and dtestimator, which accumulate numerator and denominator statistics and estimate model parameters from the accumulators, respectively. In addition, the lattice editing tool is used to convert the lattices to a format suitable for the accumulation of statistics, which involves lattice determinization, HMM-state marking, path insertion, etc. Both numerator and denominator statistics are accumulated from a single set of lattices resulting from decoding the training data. To enable the accumulation of numerator statistics, the orthographic transcription of each utterance (with the corresponding time-alignment information) is added to each lattice if not present, and the corresponding path is marked. The accumulation of denominator statistics may imply a significant number of redundant likelihood computations across overlapping phones in the lattice. For this reason, this process is carried out by an optimized Forward-Backward aligner object that uses a hash table to cache and reuse likelihood evaluations. This simple strategy can speed up the accumulation of statistics by over a factor of 2, depending on the lattice depth.

4. DECODERS

The toolkit comprises two large-vocabulary decoders: a dynamic decoder and a WFSA (Weighted Finite State Acceptor) based decoder. The main difference between them is how the LM is applied during the search. The dynamic decoder uses a decoding network compiled and optimized from the set of physical HMM-states and the pronunciation lexicon, while the LM is applied dynamically at the token level.
On the other hand, the WFSA decoder performs recognition on a decoding network in which all sources of information, including the LM, are statically compiled and optimized. These two decoders serve complementary purposes. In the context of large-vocabulary dialog systems, we need a decoder that can handle large and potentially dynamic LMs with a small memory footprint, for live applications in resource-constrained scenarios. In turn, the WFSA decoder exhibits excellent performance for experimentation, at the expense of a large memory footprint on some tasks. The designs of both decoders present a number of similarities: a) Decoding networks are FSMs in which arcs and nodes are stored in two contiguous arrays of memory; arcs keep the offset of the destination node, and nodes keep the offset of the first outgoing arc (a node's outgoing arcs are stored contiguously). This compact layout saves memory and favors locality, which reduces cache misses. FSMs representing LMs are stored in the same fashion. b) HMM self-transitions are not explicit in the decoding network but are simulated during the search. c) Data structures frequently accessed during decoding (such as Viterbi tokens in the dynamic decoder, active nodes, word-history items, lattice tokens, etc.) are preallocated during initialization as contiguous arrays of memory to preserve locality and speed up the search. This enables
the use of offsets instead of pointers, which considerably reduces memory requirements on architectures with 64-bit memory addresses. d) Word-history items are organized as a tree; every time a word-arc is visited, a new item is appended to the tree. During decoding, a garbage collection mechanism is periodically invoked to reuse word-history items and lattice tokens that have become inactive. e) During decoding, network nodes (or tokens) are activated only when their updated path score is within the pruning beam with respect to the score of the best partial path. The implementation of lattice generation in both decoders is identical and very similar to that of the Attila toolkit [5]. In addition to the pruning thresholds, lattice depth is governed by the maximum number of distinct word sequences kept at each active node (token in the dynamic decoder), which are managed by a specialized hash table.

4.1. Dynamic decoder

This decoder is based on the token-passing paradigm; it supports triphone and pentaphone phonetic contexts and n-gram LMs of any size. The decoding network consists of a word loop built from the set of physical HMM-states and the pronunciation lexicon, while the LM is applied dynamically at the token level. The decoding network comprises initial, internal and final parts, which contain HMM-states for the first phone, internal phones and final phone of each word, respectively. In order to simplify the network building process for wide phonetic contexts, cross-word context spans only adjacent words, and only the initial and final parts are sensitive to it, while the internal part remains sensitive to intra-word phonetic context only. Initial and final parts are built by enumerating state sequences corresponding to all possible contexts, and are minimized incrementally for efficient support of wide phonetic contexts. Hash tables are used to keep auxiliary nodes that facilitate the connections to the internal part, which is built one word at a time.
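The contiguous-array network layout described in point a) above can be illustrated with a small Python sketch; the field layout below is hypothetical (Bavieca's actual structures are C++ arrays), but the offset arithmetic is the same idea.

```python
# Hypothetical sketch: a network stored as two flat arrays. Each node records
# the offset of its first outgoing arc; since a node's arcs are contiguous,
# its arc range is [first_arc[i], first_arc[i + 1]). Arcs store destination
# offsets rather than pointers, keeping the layout compact and cache-friendly.

# Node table: first-arc offsets, with one sentinel entry marking the end.
first_arc = [0, 2, 3, 3]   # node 0 owns arcs 0-1, node 1 owns arc 2, node 2 none
# Arc table: (destination node offset, label).
arcs = [(1, "a"), (2, "b"), (2, "c")]

def outgoing(node):
    """Return a node's outgoing arcs as a slice of the contiguous arc array."""
    return arcs[first_arc[node]:first_arc[node + 1]]

assert [lab for _, lab in outgoing(0)] == ["a", "b"]
assert [dst for dst, _ in outgoing(1)] == [2]
assert outgoing(2) == []
```

A 32-bit offset into a preallocated array replaces an 8-byte pointer, which is where the memory saving on 64-bit architectures comes from.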
Once the network is built, word arcs are pushed toward the initial part of the network in order to allow early token recombination. Finally, the network is globally minimized using a forward-backward merging process that combines equivalent arcs and nodes. LM look-ahead is implemented by generating a tree from the internal part of the network up to the word arcs. This tree is minimized by merging linear sequences of arcs, and pruned in order to speed up the computation of the look-ahead scores. Histogram and beam pruning are carried out globally, at word ends, and within each active node.

4.2. WFSA decoder

This decoder operates on a statically optimized WFSA that is composed incrementally following a procedure similar to the one described by Novak [10]. The network composition is carried out by the wfsabuilder tool and is limited to cross-word triphones and small-to-medium size LMs, given that the memory requirements would otherwise exceed the available physical memory. Each transition in the acceptor keeps an integer encoding the symbol type (HMM-state index, word-id or epsilon) and the actual symbol, a floating-point value with the weight, and an integer with the offset of the destination state within the array of states (12 bytes in total). States in the WFSA can only be reached by transitions with the same input symbol, which facilitates recombination. The decoder uses a specialized hash table to facilitate the activation of states from emitting and epsilon transitions. At each time frame, the active states are processed, path scores are updated with transition weights and acoustic scores, and the destination states of emitting transitions are activated. Epsilon transitions are processed in topological order until an emitting transition is found and its destination state activated. Figure 1 shows the real-time factor (RTF) of both decoders on the Wall Street Journal (WSJ) Nov 92 task using a bigram LM.
The decoders were run on a laptop computer equipped with a Core 2 Duo P8600 processor (2.4 GHz, 3 MB cache) and 3 GB of main memory (only one core was used). Acoustic models consisted of 112k Gaussian components and were trained under the MLE criterion. The WFSA decoding network had 7M transitions and 3.1M states, for a total of 92 MB. When the SSE instruction set was used for acoustic scoring, the RTF of the WFSA decoder improved by 20 to 30%. The results in Figure 1 improve on those reported by Novak et al. [11] for two WFST-based decoders, Sphinx and HTK, on the same task using a faster machine (comparing comparable systems).

Fig. 1. RTF vs Accuracy on the WSJ Nov 92 task (bigram).
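The per-frame search loop just described (path scores updated with transition weights plus acoustic scores, and token recombination keeping only the best score per destination state) can be illustrated with a toy Python sketch; it is hypothetical and deliberately ignores epsilon transitions, pruning and traceback.

```python
# Hypothetical toy sketch of a Viterbi pass over a weighted acceptor with
# emitting transitions only: at each frame, every active state relaxes its
# outgoing transitions, and recombination keeps the best score per state.
def viterbi(transitions, acoustic, start, frames):
    """transitions: {state: [(dest, symbol, weight), ...]};
    acoustic(frame, symbol) -> acoustic log score."""
    active = {start: 0.0}
    for t in range(frames):
        nxt = {}
        for state, score in active.items():
            for dest, sym, w in transitions.get(state, []):
                s = score + w + acoustic(t, sym)
                if dest not in nxt or s > nxt[dest]:
                    nxt[dest] = s  # token recombination: keep best path only
        active = nxt
    return active

trans = {0: [(1, "s1", -0.1)], 1: [(1, "s1", -0.1), (2, "s2", -0.2)]}
best = viterbi(trans, lambda t, sym: -1.0, start=0, frames=3)
```

In the real decoder the `acoustic` callback would be the cached GMM scoring described in the next section, and only states within the pruning beam would stay in `active`.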
4.3. Acoustic scoring

The computation of Gaussian log-likelihoods is usually the most CPU-intensive task within an ASR system. Several techniques have been implemented to speed up this task. The first optimization consists of using the nearest-neighbor approximation [12] in conjunction with partial distance elimination [13], in order to avoid the exponential and terminate the computation as early as possible. Additionally, the support for Single Instruction Multiple Data (SIMD) parallel computation present in the x86 architecture is used to perform floating-point arithmetic operations on up to four vector dimensions simultaneously. In particular, the SSE2 (Streaming SIMD Extensions 2) intrinsics were used. In order to take full advantage of this instruction set, the data (features and model parameters) were 16-byte aligned, because otherwise a significant portion of the gains resulting from the parallel computation was lost when loading and storing the data into the 128-bit registers. SIMD support can be disabled using a compilation flag if the SSE instruction set is not supported by the architecture. Cholesky decomposition is used to speed up the evaluation of full-covariance Gaussian distributions. Finally, during decoding multiple network nodes attached to the same physical HMM-state can be active simultaneously, so acoustic likelihoods for each HMM-state are cached and reused within each time frame.

5. EVALUATION

The system was evaluated on two tasks: the WSJ Nov 92 task (5k and 20k vocabulary sizes) and the MyST task.

5.1. WSJ

Recognition results are reported on the WSJ Nov 92 test sets. Two evaluation conditions were examined: the 5k closed-vocabulary and the 20k open-vocabulary conditions. In both cases, non-verbalized pronunciations and standard bigram and trigram LMs were used. Acoustic models were trained on 80 hours of data from the SI-284 dataset.
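The nearest-neighbor approximation with partial distance elimination from Section 4.3 can be sketched in scalar Python (hypothetical; the actual implementation is SIMD C++): the mixture score is approximated by its best component, and each component's per-dimension accumulation stops as soon as it can no longer beat the best component found so far. Early termination is safe because the accumulated score only decreases as dimensions are added.

```python
# Hypothetical sketch: nearest-neighbor GMM scoring with partial distance
# elimination. Each component is (gconst, means, inv_vars), where gconst
# folds in the log normalization term.
def best_component_score(gmm, x):
    """Approximate the mixture log-likelihood by the best component:
    max over components of gconst - 0.5 * sum((x[d]-mean[d])^2 * inv_var[d])."""
    best = float("-inf")
    for gconst, means, inv_vars in gmm:
        acc = gconst
        for d in range(len(x)):
            acc -= 0.5 * (x[d] - means[d]) ** 2 * inv_vars[d]
            if acc <= best:   # partial distance elimination: cannot win anymore
                break
        else:
            best = acc        # loop completed: this is the new best component
    return best

gmm = [(-1.0, [0.0, 0.0], [1.0, 1.0]), (-1.2, [0.1, 0.1], [1.0, 1.0])]
x = [0.0, 0.0]
score = best_component_score(gmm, x)
```

Sorting components so that a likely winner is evaluated first (the nearest-neighbor heuristic) makes the early exit fire sooner for the remaining components.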
For every speech utterance, 39-dimensional feature vectors, consisting of 12 MFCCs and energy plus first- and second-order derivatives, were extracted, and CMN was applied. Word pronunciations were extracted from the CMU dictionary (cmudict.0.7a) using the standard CMU phonetic symbol set without stress markers (39 phonetic classes plus silence). Acoustic models consisted of 4100 clustered HMM-states and about 112k Gaussian distributions with diagonal covariance. No feature transformation or speaker/gender adaptation was carried out. Table 1 shows the performance of the system in comparison with published results on the WSJ Nov 92 evaluation. WER is given for 5k and 20k vocabulary sizes using bigram (bg) and trigram (tg) LMs, first-pass decoding only. The table summarizes WERs from the HTK system described in [14], the LIMSI dictation system [15], and the more recent Kaldi toolkit [4]. WERs from Bavieca are given for ML estimation (directly comparable to results from the other systems) and after four iterations of bMMI training, in rows six and seven respectively. The last column of the table shows whether gender-dependent modeling (GD) was applied. It can be seen that Bavieca produces results comparable to those of the other ASR systems.

                      5k              20k
system            bg      tg      bg      tg      GD
HTK                                               yes
LIMSI                                             yes
Kaldi                            11.8             no
Bavieca                                           no
Bavieca+bMMI                                      no

Table 1. WER on the WSJ Nov 92 evaluation (%).

5.2. My Science Tutor

My Science Tutor (MyST) [16] is an intelligent tutoring system designed to improve science learning in elementary school students through conversational dialogs with a virtual science tutor. The MyST corpus comprises about 160 hours of conversational speech from 3rd, 4th, and 5th graders. The corpus was collected using close-talk microphones and is divided into 140 hours for training, 10 hours for development and 10 hours for testing.

5.2.1. Recognition setup

The 140 hours of training data were parameterized using 12 MFCC coefficients plus energy and the first and second derivatives.
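The feature pipeline just described (13 static coefficients extended with derivatives to 39 dimensions, then mean normalization) can be sketched in Python. The simple first differences below are a hypothetical stand-in for the regression-based derivatives typically used in MFCC front-ends.

```python
# Hypothetical sketch: append first- and second-order dynamic coefficients to
# 13 static coefficients (12 MFCCs + energy), then apply cepstral mean
# normalization (CMN) over the utterance.
import numpy as np

def add_derivatives(static):
    """static: (frames, 13) -> (frames, 39). First differences stand in for
    regression-based deltas; the first frame's delta is zero by construction."""
    delta = np.diff(static, axis=0, prepend=static[:1])
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.hstack([static, delta, delta2])

def cmn(features):
    """Cepstral mean normalization: subtract the per-dimension utterance mean."""
    return features - features.mean(axis=0)

static = np.random.RandomState(0).randn(100, 13)  # fake 100-frame utterance
feats = cmn(add_derivatives(static))              # (100, 39), zero-mean
```

CMVN would additionally divide by the per-dimension standard deviation, and in session mode the statistics would be pooled over all utterances of a session.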
CMVN was applied for each session. The phonetic symbol set comprised 39 phone classes plus 12 symbols to model non-speech events, including filled pauses and other events such as silence and breath noise. The vocabulary size was about 7k words, including a small number of alternative pronunciations. An LM for each of the four modules (measurement, magnetism and electricity, variables, and water) was trained by interpolation from a general LM built on the training transcriptions only. The CMU LM Toolkit [17] was used for training and interpolation purposes. The training process started with a set of single-Gaussian context-independent (CI) HMMs initialized to the global distribution of the data, which were reestimated for five iterations. Then a set of 4600 single-Gaussian context-dependent (CD) clustered triphones was generated using context decision trees. Pentaphone modeling did not reduce the WER and resulted in slower decoding due to the larger decoding network. After three reestimation iterations, the CD HMMs were refined through 22 iterations of Gaussian splitting and merging, resulting in about 140k Gaussian distributions (diagonal covariance). VTLN was applied according to the standard iterative procedure until the warp factors converged (5 iterations). HLDA was applied by performing single-pass retraining on a set of extended features (third derivatives were added) and estimating a full-covariance Gaussian distribution for each clustered HMM-state. Once the transform was estimated, features were decorrelated and projected down to the original dimensionality, and the HMMs were adapted to the new feature space. Model-based discriminative training was conducted using the bMMI objective function. The training data were decoded using a bigram LM and lattices were generated. The lattice WER (oracle) was 3.06%, and the lattice depth (the average number of lattice arcs containing words that cross each time frame) was 383 after making the lattices deterministic. The size of the uncompressed lattices on disk, once processed for discriminative training, was 323 GB. Acoustic scores were scaled down by the inverse of the language-model scaling factor used for decoding, and the lattices were marked with a unigram LM. The Gaussian-specific learning rate was computed setting the constant E to 3. I-smoothing to the previous iteration was carried out using τ = 100. The boosting factor b was set to 0.5. A comprehensive parameter optimization was not carried out. Table 2 shows the WER on the development set after the different training stages, for both speaker-independent (SI) and speaker-dependent (SD) systems.

                        SI      SD
initial CD models
VTLN
HLDA
bMMI
fMLLR
MLLR

Table 2. WER on the MyST task (%).

1 Results after four iterations of discriminative training; not comparable to results from the other systems, which are based on MLE solely.

6. CONCLUSIONS

Bavieca is an open-source ASR toolkit intended for speech research and system development. The toolkit is based on CD-HMMs and offers a simple and modular design with an emphasis on efficiency, scalability and reusability.
Bavieca exhibits competitive results on standard benchmarks and is being successfully used at BLT in a number of research projects addressing both read and conversational children's speech, as well as conversational adult speech. Nonetheless, the development of Bavieca is still a work in progress. Future plans include exploring generic recipes for building ASR systems, further code refactoring and testing, and the implementation of new features such as feature-space discriminative training.

7. REFERENCES

[1] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for version 3.4). Cambridge University Engineering Department.
[2] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, "Sphinx-4: A Flexible Open Source Framework for Speech Recognition," Sun Microsystems Inc., Technical Report.
[3] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, "The RWTH Aachen University Open Source Speech Recognition System," in Proc. Interspeech, 2009.
[4] D. Povey, A. Ghoshal et al., "The Kaldi Speech Recognition Toolkit," in Proc. ASRU.
[5] H. Soltau, G. Saon, and B. Kingsbury, "The IBM Attila Speech Recognition Toolkit," in Proc. IEEE SLT.
[6] L. Lee and R. C. Rose, "A frequency warping approach to speaker normalization," IEEE Trans. Speech Audio Process., vol. 6, no. 1.
[7] M. J. F. Gales, "Maximum-likelihood linear transforms for HMM-based speech recognition," Computer Speech and Language, vol. 12, no. 2.
[8] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9.
[9] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, "Boosted MMI for model and feature space discriminative training," in Proc. ICASSP, 2008.
[10] M. Novak, "Incremental composition of static decoding graphs," in Proc.
of Interspeech, 2009.
[11] J. Novak et al., "An Empirical Comparison of the T3, Juicer, HDecode and Sphinx3 Decoders," in Proc. Interspeech.
[12] F. Seide, "Fast likelihood computation for continuous-mixture densities using a tree-based nearest neighbor search," in Proc. Eurospeech, 1995.
[13] C. D. Bei and R. M. Gray, "An improvement of the minimum distortion encoding algorithm for vector quantization," IEEE Trans. Commun., vol. COM-33, Oct.
[14] P. C. Woodland, J. J. Odell, V. Valtchev, and S. J. Young, "Large vocabulary continuous speech recognition using HTK," in Proc. ICASSP, vol. 2, 1994, pp. II/125-II/128.
[15] J. L. Gauvain et al., "The LIMSI speech dictation system: evaluation on the ARPA Wall Street Journal task," in Proc. ICASSP, 1994, vol. I.
[16] W. Ward, R. Cole, D. Bolaños et al., "My Science Tutor: A conversational multimedia virtual tutor for elementary school science," ACM Trans. Speech Lang. Process., vol. 7, no. 4.
[17] R. Rosenfeld, The CMU Statistical Language Modeling Toolkit.
Comparing Open-Source Speech Recognition Toolkits Christian Gaida 1, Patrick Lange 1,2,3, Rico Petrick 2, Patrick Proba 4, Ahmed Malatawy 1,5, and David Suendermann-Oeft 1 1 DHBW, Stuttgart, Germany 2
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Ericsson T18s Voice Dialing Simulator
Ericsson T18s Voice Dialing Simulator Mauricio Aracena Kovacevic, Anna Dehlbom, Jakob Ekeberg, Guillaume Gariazzo, Eric Lästh and Vanessa Troncoso Dept. of Signals Sensors and Systems Royal Institute of
Strategies for Training Large Scale Neural Network Language Models
Strategies for Training Large Scale Neural Network Language Models Tomáš Mikolov #1, Anoop Deoras 2, Daniel Povey 3, Lukáš Burget #4, Jan Honza Černocký #5 # Brno University of Technology, Speech@FIT,
Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus
Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus Yousef Ajami Alotaibi 1, Mansour Alghamdi 2, and Fahad Alotaiby 3 1 Computer Engineering Department, King Saud University,
Modified MMI/MPE: A Direct Evaluation of the Margin in Speech Recognition
Modified MMI/MPE: A Direct Evaluation of the Margin in Speech Recognition Georg Heigold [email protected] Thomas Deselaers [email protected] Ralf Schlüter [email protected]
Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics
Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics Anastasis Kounoudes 1, Anixi Antonakoudi 1, Vasilis Kekatos 2 1 The Philips College, Computing and Information Systems
Myanmar Continuous Speech Recognition System Based on DTW and HMM
Myanmar Continuous Speech Recognition System Based on DTW and HMM Ingyin Khaing Department of Information and Technology University of Technology (Yatanarpon Cyber City),near Pyin Oo Lwin, Myanmar Abstract-
Fast Matching of Binary Features
Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been
Advanced Signal Processing and Digital Noise Reduction
Advanced Signal Processing and Digital Noise Reduction Saeed V. Vaseghi Queen's University of Belfast UK WILEY HTEUBNER A Partnership between John Wiley & Sons and B. G. Teubner Publishers Chichester New
AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language
AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language Hugo Meinedo, Diamantino Caseiro, João Neto, and Isabel Trancoso L 2 F Spoken Language Systems Lab INESC-ID
2) Write in detail the issues in the design of code generator.
COMPUTER SCIENCE AND ENGINEERING VI SEM CSE Principles of Compiler Design Unit-IV Question and answers UNIT IV CODE GENERATION 9 Issues in the design of code generator The target machine Runtime Storage
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Instruction Set Architecture (ISA)
Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine
Connected Digits Recognition Task: ISTC CNR Comparison of Open Source Tools
Connected Digits Recognition Task: ISTC CNR Comparison of Open Source Tools Piero Cosi, Mauro Nicolao Istituto di Scienze e Tecnologie della Cognizione, C.N.R. via Martiri della libertà, 2 35137 Padova
DynaSpeak: SRI s Scalable Speech Recognizer for Embedded and Mobile Systems
DynaSpeak: SRI s Scalable Speech Recognizer for Embedded and Mobile Systems Horacio Franco, Jing Zheng, John Butzberger, Federico Cesari, Michael Frandsen, Jim Arnold, Venkata Ramana Rao Gadde, Andreas
KL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION
KL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION Dong Yu 1, Kaisheng Yao 2, Hang Su 3,4, Gang Li 3, Frank Seide 3 1 Microsoft Research, Redmond,
Analysis of Compression Algorithms for Program Data
Analysis of Compression Algorithms for Program Data Matthew Simpson, Clemson University with Dr. Rajeev Barua and Surupa Biswas, University of Maryland 12 August 3 Abstract Insufficient available memory
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
Distributed Dynamic Load Balancing for Iterative-Stencil Applications
Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings
German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings Haojin Yang, Christoph Oehlke, Christoph Meinel Hasso Plattner Institut (HPI), University of Potsdam P.O. Box
A Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA [email protected] Michael A. Palis Department of Computer Science Rutgers University
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Bachelor of Games and Virtual Worlds (Programming) Subject and Course Summaries
First Semester Development 1A On completion of this subject students will be able to apply basic programming and problem solving skills in a 3 rd generation object-oriented programming language (such as
Image Compression through DCT and Huffman Coding Technique
International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul
Online Diarization of Telephone Conversations
Odyssey 2 The Speaker and Language Recognition Workshop 28 June July 2, Brno, Czech Republic Online Diarization of Telephone Conversations Oshry Ben-Harush, Itshak Lapidot, Hugo Guterman Department of
Linear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
TED-LIUM: an Automatic Speech Recognition dedicated corpus
TED-LIUM: an Automatic Speech Recognition dedicated corpus Anthony Rousseau, Paul Deléglise, Yannick Estève Laboratoire Informatique de l Université du Maine (LIUM) University of Le Mans, France [email protected]
Managed N-gram Language Model Based on Hadoop Framework and a Hbase Tables
Managed N-gram Language Model Based on Hadoop Framework and a Hbase Tables Tahani Mahmoud Allam Assistance lecture in Computer and Automatic Control Dept - Faculty of Engineering-Tanta University, Tanta,
Emotion Detection from Speech
Emotion Detection from Speech 1. Introduction Although emotion detection from speech is a relatively new field of research, it has many potential applications. In human-computer or human-human interaction
MISSING FEATURE RECONSTRUCTION AND ACOUSTIC MODEL ADAPTATION COMBINED FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
MISSING FEATURE RECONSTRUCTION AND ACOUSTIC MODEL ADAPTATION COMBINED FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION Ulpu Remes, Kalle J. Palomäki, and Mikko Kurimo Adaptive Informatics Research Centre,
Unsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION P. Vanroose Katholieke Universiteit Leuven, div. ESAT/PSI Kasteelpark Arenberg 10, B 3001 Heverlee, Belgium [email protected]
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
TPCalc : a throughput calculator for computer architecture studies
TPCalc : a throughput calculator for computer architecture studies Pierre Michaud Stijn Eyerman Wouter Rogiest IRISA/INRIA Ghent University Ghent University [email protected] [email protected]
Structural Health Monitoring Tools (SHMTools)
Structural Health Monitoring Tools (SHMTools) Getting Started LANL/UCSD Engineering Institute LA-CC-14-046 c Copyright 2014, Los Alamos National Security, LLC All rights reserved. May 30, 2014 Contents
An Arabic Text-To-Speech System Based on Artificial Neural Networks
Journal of Computer Science 5 (3): 207-213, 2009 ISSN 1549-3636 2009 Science Publications An Arabic Text-To-Speech System Based on Artificial Neural Networks Ghadeer Al-Said and Moussa Abdallah Department
GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications
GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Slovak Automatic Transcription and Dictation System for the Judicial Domain
Slovak Automatic Transcription and Dictation System for the Judicial Domain Milan Rusko 1, Jozef Juhár 2, Marian Trnka 1, Ján Staš 2, Sakhia Darjaa 1, Daniel Hládek 2, Miloš Cerňak 1, Marek Papco 2, Róbert
Perplexity Method on the N-gram Language Model Based on Hadoop Framework
94 International Arab Journal of e-technology, Vol. 4, No. 2, June 2015 Perplexity Method on the N-gram Language Model Based on Hadoop Framework Tahani Mahmoud Allam 1, Hatem Abdelkader 2 and Elsayed Sallam
COMPUTER SCIENCE (5651) Test at a Glance
COMPUTER SCIENCE (5651) Test at a Glance Test Name Computer Science Test Code 5651 Time Number of Questions Test Delivery 3 hours 100 selected-response questions Computer delivered Content Categories Approximate
Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28
Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object
L25: Ensemble learning
L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna
Bayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College
SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
A Learning Based Method for Super-Resolution of Low Resolution Images
A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 [email protected] Abstract The main objective of this project is the study of a learning based method
Weighting and Normalisation of Synchronous HMMs for Audio-Visual Speech Recognition
ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing 27 (AVSP27) Hilvarenbeek, The Netherlands August 31 - September 3, 27 Weighting and Normalisation of Synchronous HMMs for
Principles of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
GameTime: A Toolkit for Timing Analysis of Software
GameTime: A Toolkit for Timing Analysis of Software Sanjit A. Seshia and Jonathan Kotker EECS Department, UC Berkeley {sseshia,jamhoot}@eecs.berkeley.edu Abstract. Timing analysis is a key step in the
1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D.
1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D. base address 2. The memory address of fifth element of an array can be calculated
Training Universal Background Models for Speaker Recognition
Odyssey 2010 The Speaer and Language Recognition Worshop 28 June 1 July 2010, Brno, Czech Republic Training Universal Bacground Models for Speaer Recognition Mohamed Kamal Omar and Jason Pelecanos IBM
Introduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
ABSTRACT 2. SYSTEM OVERVIEW 1. INTRODUCTION. 2.1 Speech Recognition
The CU Communicator: An Architecture for Dialogue Systems 1 Bryan Pellom, Wayne Ward, Sameer Pradhan Center for Spoken Language Research University of Colorado, Boulder Boulder, Colorado 80309-0594, USA
STA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
Benchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract
Efficient Data Structures for Decision Diagrams
Artificial Intelligence Laboratory Efficient Data Structures for Decision Diagrams Master Thesis Nacereddine Ouaret Professor: Supervisors: Boi Faltings Thomas Léauté Radoslaw Szymanek Contents Introduction...
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Oracle8i Spatial: Experiences with Extensible Databases
Oracle8i Spatial: Experiences with Extensible Databases Siva Ravada and Jayant Sharma Spatial Products Division Oracle Corporation One Oracle Drive Nashua NH-03062 {sravada,jsharma}@us.oracle.com 1 Introduction
Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2
Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
SMALL INDEX LARGE INDEX (SILT)
Wayne State University ECE 7650: Scalable and Secure Internet Services and Architecture SMALL INDEX LARGE INDEX (SILT) A Memory Efficient High Performance Key Value Store QA REPORT Instructor: Dr. Song
THE goal of Speaker Diarization is to segment audio
1 The ICSI RT-09 Speaker Diarization System Gerald Friedland* Member IEEE, Adam Janin, David Imseng Student Member IEEE, Xavier Anguera Member IEEE, Luke Gottlieb, Marijn Huijbregts, Mary Tai Knox, Oriol
Neural Network Add-in
Neural Network Add-in Version 1.5 Software User s Guide Contents Overview... 2 Getting Started... 2 Working with Datasets... 2 Open a Dataset... 3 Save a Dataset... 3 Data Pre-processing... 3 Lagging...
