Comparing Open Source ASR Toolkits on Italian Children Speech

Size: px
Start display at page:

Download "Comparing Open Source ASR Toolkits on Italian Children Speech"

Transcription

Piero Cosi 1, Mauro Nicolao 1,2, Giulio Paci 1, Giacomo Sommavilla 1, Fabio Tesser 1
1 Istituto di Scienze e Tecnologie della Cognizione, CNR, Padova, Italy
2 Speech and Hearing Research Group, University of Sheffield, United Kingdom
{piero.cosi, giulio.paci, giacomo.sommavilla, fabio.tesser}@pd.istc.cnr.it, mauro.nicolao@gmail.com

Abstract

In this paper, we consider two different aspects of the automatic speech recognition task: the effectiveness of open-source ASR toolkits and the quite problematic recognition of children's speech. On this difficult task, we compare three well-established and widely available ASR toolkits, and we demonstrate the feasibility of applying the results to speech recognition and spoken dialogue system design. Although various open-source ASR toolkits are now available, we were mainly interested in evaluating the usability of the relatively new BAVIECA system in comparison to two systems (SONIC and SPHINX) for which we already had various results from past experiments on children's speech. This paper is intended to provide the reader with a simple overview of the solutions adopted by the three systems under investigation and with a demonstration of their effectiveness on children's speech. Furthermore, the paper provides suggestions for future research directions in the field.

Index Terms: Open Source, ASR, Toolkit, SONIC, SPHINX, BAVIECA, Children Speech.

1. Introduction

During the last few years, many different Automatic Speech Recognition frameworks have been developed for research purposes and, nowadays, various open-source automatic speech recognition (ASR) toolkits are available to research laboratories. Systems such as HTK [1], SONIC [2], [3], SPHINX [4], [5], RWTH [6], JULIUS [7], KALDI [8], the more recent ASR framework SIMON [9] and the relatively new system called BAVIECA [10] make up a short and probably the most famous list. Indeed, new systems such as KALDI [8] have demonstrated the effectiveness of easily incorporating Deep Neural Network (DNN) techniques [11] to improve recognition performance in almost all recognition tasks. In this paper, however, we concentrate on HMM-based systems, leaving the analysis of the newest systems for the future.

The experience we had with some of these systems at ISTC-CNR of Padova convinced us in the past that CSLR SONIC and CMU SPHINX are two versatile and powerful tools for building a state-of-the-art ASR system, so we decided to benchmark them in different contexts [12], [13]. In our earlier work, we performed a recognition test on continuous clean speech for the Arabic language [12] with a 1k-word vocabulary. Although the task was quite simple, the result was encouraging (in terms of Word Error Rate (WER), SONIC scored 1.9% and SPHINX 1.3%). The most interesting outcome of these experiments was the simplicity with which we were able to configure the systems for such a phonetically and orthographically complicated language. In our more recent work, the two systems were tested in an evaluation campaign on the automatic recognition of connected digits for Italian, named EVALITA [13]. The results were again extremely good (the SONIC word error rate was 2.7% and the SPHINX one was 4.5%), and both systems yielded the best performances among the competitors.

As indicated in a complete review of ASR technology for children's speech by Gerosa et al. [14]: "Most of this research effort, however, has been devoted to developing systems targeting adult speakers. Following the first studies that raised attention to the poor performance of speech recognition systems for children users, increasing attention has been paid to the area of robust speech recognition technologies for children's speech in a variety of application scenarios such as reading tutors, foreign language learning and multimodal human-machine interaction systems. Early spoken dialogue application prototypes that were specifically aimed at children included word games, reading aids and pronunciation tutoring [15], [16], [17]. Recently a number of systems have been implemented with advanced spoken dialogue interfaces and/or multimodal interaction capabilities [18], [19], [20], [21]." And again, according to Lee et al. [22]: "Previous studies have shown that children's speech, compared to adults' speech, exhibits higher pitch and formant frequencies, longer segmental durations, and greater temporal and spectral variability [23], [24], [25], [26], [27], [28], [29]."

From all these considerations, this paper aims at making a useful comparison among some of the best open-source ASR systems and at demonstrating their effectiveness on a difficult target task, the recognition of children's speech. Consequently, the present work concentrates on applying SPHINX and the more recent BAVIECA toolkit to the same children's speech recognition task on which SONIC had obtained good results in our past experience [30], [31], [32]. Furthermore, the paper provides suggestions for future research directions in the field.

2. System Description

The SONIC, SPHINX and BAVIECA ASR systems have been considered in our tests because they have a similar structure: all three use Hidden Markov Models (HMMs) to describe the acoustic feature space and Finite State Grammars or Markov chains to model the structure of language. Even so, they have important differences.

2.1. SONIC

CSLR SONIC [2], [3] is a complete toolkit for research and development of new algorithms for continuous speech recognition. The software was developed at the Center for Spoken Language Research (CSLR) of the University of Colorado. It allows for two modes of speech recognition: Finite-State-Grammar decoding and N-gram Language Model decoding. SONIC has not been developed further since 2005, and the most recent version is 2.0-beta5.
Nonetheless, it is still one of the most advanced and easy-to-tune toolkits available, even in comparison with a constantly developed ASR system such as SPHINX. Unfortunately, since its code is not open source, its functions and its API can be used, but they can be neither modified nor embedded in different applications. SONIC is based on Continuous Density Hidden Markov Model (CD-HMM) technology, and it also incorporates standard speaker adaptation techniques in the recognition process, such as unsupervised Maximum Likelihood Linear Regression (MLLR) and Structural MAP Linear Regression (SMAPLR). It also allows for some adaptation methods in the training process, such as Vocal Tract Length Normalization (VTLN), cepstral mean and variance normalization, and Speaker-Adaptive Training (SAT). A description of all these SONIC techniques can be found in [32]. The software version used in our experiments was actually the older beta3, because it incorporates a powerful characteristic no longer present in the following versions: an interesting noise-robust type of acoustic features, known as Perceptual Minimum Variance Distortionless Response (PMVDR) cepstral coefficients and also described in [32]. The PMVDR cepstral coefficients provide improved accuracy over the traditional MFCC parameters by better tracking the upper envelope of the speech spectrum. Our previous experience [13] showed better robustness of these features in noisy environments in comparison to MFCCs.

The SONIC Acoustic Models (AMs) are built with decision-tree state-clustered HMMs with associated gamma probability density functions to model state durations. The HMM topology is fixed to a 3-state configuration, and each state can be modelled with a variable number of multivariate mixture Gaussian distributions. The HMM training process consists of the alignment of the training audio material followed by an Expectation-Maximization (EM) step in which the HMM parameters are tuned. Means, covariances and mixture weights are estimated in the maximum-likelihood sense. The training process iterates data alignment and model estimation several times in order to gradually achieve adequate parameter estimation.

2.2. SPHINX

The CMU SPHINX system, whose basic characteristics can be found in [4], [5], is an open-source project, developed by Carnegie Mellon University (CMU) of Pittsburgh, which provides a complete set of functions to develop complex Automatic Speech Recognition systems. There is a very reliable set of functions for training the HMMs, and several different choices are available for large-vocabulary, phoneme-recognition and N-best decoding. There is a version for embedded solutions (Pocket-SPHINX), a standard versatile C version (SPHINX-3) and a Java version (SPHINX-4) suitable for web applications. Among these decoding versions, SPHINX-3 was chosen because it is C-based and fitted our testing framework better than the Java version, with no loss of performance. The training functions are shared by every decoding system and are grouped in the so-called SPHINX-Train package.

As in SONIC, the training process consists of iterations of alignment and acoustic-model parameter estimation. However, a difference between the two systems can be found in the model used for the preliminary phonetic alignment: SPHINX can start from raw AMs, estimated from a uniformly spaced audio segmentation, while SONIC always needs previously trained AMs.
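All three toolkits therefore share the same outer training recipe. The following is a minimal sketch of that iterate-align-estimate loop; uniform_segmentation, force_align and baum_welch_update are hypothetical stand-ins for toolkit internals, not actual API calls of SONIC, SPHINX or BAVIECA.

```python
# Minimal sketch of the shared iterative HMM training recipe.
# All helper functions are hypothetical stand-ins for toolkit internals.

def train_acoustic_model(corpus, transcripts, bootstrap_model=None, iters=8):
    """Iterate alignment and Baum-Welch estimation of the acoustic model."""
    if bootstrap_model is None:
        # SPHINX/BAVIECA-style flat start: uniformly spaced segmentation.
        alignments = [uniform_segmentation(audio, text)
                      for audio, text in zip(corpus, transcripts)]
    else:
        # SONIC-style bootstrap: align with a previously trained model.
        alignments = [force_align(bootstrap_model, audio, text)
                      for audio, text in zip(corpus, transcripts)]
    model = baum_welch_update(alignments)  # initial maximum-likelihood estimate
    for _ in range(iters):
        # Re-align with the current model, then re-estimate means,
        # covariances and mixture weights in the maximum-likelihood sense.
        alignments = [force_align(model, audio, text)
                      for audio, text in zip(corpus, transcripts)]
        model = baum_welch_update(alignments)
    return model
```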
Several loops of probability-density-function (pdf) estimation with the Baum-Welch algorithm and of re-alignment between training audio and transcriptions are performed. Models can be computed either for isolated phonemes (Context Independent, CI) or for each tri-phone found in the training data (Context Dependent, CD). Instead of the SONIC gamma pdf, SPHINX uses Gaussian duration models. While the number of HMM states in the training tools is potentially unlimited, the recognition software limits the topologies to 3 or 5 states and the Language Model (LM) structure to 3-grams.

The decoding process is a conventional Viterbi search through the lattice created from the HMM output scores, with some beam-search heuristics. It also uses a lexical-tree search structure in order to prune the state transitions. The MLLR, VTLN and MAP adaptation methods are also implemented in this software. These techniques are substantially the same as those in SONIC, and the results confirm that the recognition improvement is comparable.

2.3. BAVIECA

Compared to the other two open-source automatic speech recognition (ASR) toolkits, the relatively new system called BAVIECA [10] is characterized by a simple and modular design that favors scalability and reusability, a small code base, a focus on real-time performance and a highly unrestricted license. BAVIECA was developed at Boulder Language Technologies (BLT) by Daniel Bolaños over the last few years to fulfill the needs of research projects conducted within the company. Currently it is being used in a wide variety of projects, including conversational dialog systems and assessment tools that are deployed in formal educational settings and other real-life scenarios.

BAVIECA is an open-source speech recognition toolkit intended for speech research and as a platform for the rapid development of speech-enabled solutions by non-speech experts. It supports common acoustic modeling and adaptation techniques based on continuous density hidden Markov models (CD-HMMs), including discriminative training. BAVIECA includes two efficient decoders based on dynamic and static expansion of the search space that can operate in batch and live recognition modes. BAVIECA is written entirely in C++ and freely distributed under the Apache 2.0 license.
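To make the decoding discussion above concrete, here is a minimal, self-contained sketch of a Viterbi search with beam pruning over plain state log-likelihoods. This is illustrative only: the real decoders search a lexical-tree lattice of context-dependent HMM states rather than a small dense matrix.

```python
import math

def viterbi_beam(obs_loglikes, log_trans, beam=10.0):
    """obs_loglikes: T x N state log-likelihoods; log_trans: N x N log
    transition matrix. Returns the best state sequence under beam pruning."""
    T, N = len(obs_loglikes), len(obs_loglikes[0])
    scores = list(obs_loglikes[0])   # scores at t = 0
    backptrs = []
    for t in range(1, T):
        best = max(scores)
        new_scores, back = [], []
        for s in range(N):
            # Only extend hypotheses within `beam` of the current best score.
            cands = [(scores[p] + log_trans[p][s], p)
                     for p in range(N) if scores[p] >= best - beam]
            sc, bp = max(cands)
            new_scores.append(sc + obs_loglikes[t][s])
            back.append(bp)
        scores = new_scores
        backptrs.append(back)
    # Trace back the best path from the highest-scoring final state.
    state = max(range(N), key=scores.__getitem__)
    path = [state]
    for back in reversed(backptrs):
        state = back[state]
        path.append(state)
    return path[::-1]

log_trans = [[math.log(0.7), math.log(0.3)],
             [math.log(0.4), math.log(0.6)]]
obs = [[-1.0, -2.0], [-2.0, -0.5], [-0.3, -1.8]]
print(viterbi_beam(obs, log_trans))  # [0, 1, 0]
```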

3. Experimental Framework

Italian children's speech recognition is a very peculiar task, and its difficulties have been investigated several times by the ISTC-CNR research group [30], [31], [32].

3.1. Dataset

As dataset for Italian children's speech recognition, the final release of the ChildIt children's speech corpus [33] has been used. It consists of audio collected by FBK (Bruno Kessler Foundation, formerly ITC-IRST) from 171 children (85 females and 86 males) aged between 7 and 13 (from grade 2 up to grade 8), who were all native speakers from regions in the north of Italy. Each child provided a set of read sentences extracted from age-appropriate literature. According to [30], to test the systems, the corpus was divided into a training set consisting of 129 speakers (64 females and 65 males) and a test set consisting of 42 speakers (21 females and 21 males) balanced by gender and aged between 7 and 13.

Training and test sentences containing mispronunciations and noisy words have been excluded from the following experiments, while all other sentences, even those with annotated extra-linguistic phenomena, such as human noises (lip smacks, breathing, laughter, coughs, etc.) and generic noises not overlapping with speech, have been included. The orthographic transcriptions of the prompt sentences have been used for training and the automatic phonetic transcription for the test. Slightly mispronounced words, such as interrupted ones, could still be considered by modifying their phonetisation and forcing a silence tag at the end of them. This helped to prevent coarticulation between the interrupted token and the following word, which could lead to improper model training. This was also justified by human speech-error explanation theories, which affirm that, after the occurrence of an error in speech production, there is always a little pause (~20 ms) due to the time spent on the interrupting and restarting actions.

3.2. Training and Decoding Process

As in most ASR systems, a list of words with their standard phonetisation (lexicon), a list of extra-text words (fillers), a list of phonemes, a list of questions for tree clustering and the accurate transcription of each training audio file must be provided to correctly configure the three systems. The phone list and the decision-tree question structure are the same for all systems and had been previously compiled by expert Italian linguists [34]. Only orthographic transcriptions were available within the FBK-ChildIt corpus, so an automatic phonetic alignment had to be provided. This was obtained by aligning text to the audio data with a previously trained Italian AM [34] in SONIC, and with a raw AM trained by uniformly segmenting the audio data in SPHINX and BAVIECA. The latter method yields a less accurate bootstrap, but it allows for flexibility in choosing the transcriptions and the phoneme list. Apart from this, the training and decoding processes were quite similar, and their main characteristics are summarised in Figure 1 and Table 1.

Because phonetic recognition was the aim of our systems, a phoneme-level transcription, rather than a word-level one, could also have been chosen, giving a simpler AM structure that ignores the word-level layer. Although it was outside the purpose of our work, such phoneme-level training was also experimented with. The phonetic transcription was created with a statistical grapheme-to-phoneme conversion, which had been trained on a 400,000-word hand-compiled Italian lexicon. The phoneme-level training was applied exclusively to the SPHINX system, and some evidence of the recognition improvement obtained with this method is presented.

Figure 1. General scheme of the ASR training (upper) and decoding (lower) process.

4. Results

In order to score the recognition process, the Phonetic Error Rate (PER) has been used. This is defined as the sum of the percentages of deleted, substituted and inserted phonemes in the ASR output text with respect to a reference transcription.
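Written out, this is the standard edit-distance formulation, with S, D and I the numbers of substituted, deleted and inserted phonemes and N the number of phonemes in the reference:

```latex
\mathrm{PER} = \frac{S + D + I}{N} \times 100\,\%
```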
Ideally, a hand-labelled reference would have been preferred, because it would have been corrected at the phonetic level to take into account children's pronunciation mistakes. Since this was not available, the automatic phonetic sequences obtained by a Viterbi alignment of the word-level orthographic transcription have been used. The reference test transcriptions were shared by all systems and were created using the SONIC aligner with the general-purpose Italian model created in [34]. This method was chosen because it allowed for automatically selecting the best pronunciation of each word in the training data among the alternative choices available in the 400,000-word Italian lexicon.

Since each system had a distinct architecture, finding similarities between the three ASR systems in order to choose comparable configurations and set up a homogeneous test framework was not an easy task. To solve this issue, we performed several experiments and eventually decided to compare the outcomes of the best-scoring configurations. Moreover, since we chose to evaluate only the AM performance, we decided that the role of the LM, which is often crucial in an ASR system, should be minimized as much as possible. Thus, the LM weights have been set to have no influence on the computation of the output probability of each utterance. A detailed overview of the test configurations is shown in Table 2.

Deciding when a phoneme should be considered incorrectly recognized was another evaluation issue. In this test, the same phoneme set as in [31], consisting of 40 primary Acoustic Units (AUs), has been adopted. The acoustic differences between stressed and unstressed vowels in Italian are subtle and mostly related to their duration. Furthermore, most Italian people pronounce vowels according to their regional influences rather than the standard pronunciation and, in children's speech production, this sort of inaccuracy is even more common. For these reasons, recognition outputs have been evaluated using a 29-phoneme set which does not count the phonetically irrelevant distinctions between stressed and unstressed vowels, between geminates and their corresponding single phonemes, and between the specific phones /ng/ and /nf/ and the phoneme /n/.
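As an illustration of this scoring step, the snippet below collapses hypothesis and reference phone strings onto a reduced evaluation set before alignment. The merges shown are invented examples, not the exact 29-phoneme mapping used here.

```python
# Illustrative sketch of collapsing the 40-unit phoneme set onto a reduced
# evaluation set before PER scoring. Example merges only, not the exact map.

EVAL_MAP = {
    "a1": "a", "e1": "e", "i1": "i", "o1": "o", "u1": "u",  # drop stress marks
    "tt": "t", "kk": "k", "ll": "l",                        # geminates
    "ng": "n", "nf": "n",                                   # nasal variants
}

def to_eval_units(phones):
    """Map recognised or reference phones onto the reduced scoring set."""
    return [EVAL_MAP.get(p, p) for p in phones]

print(to_eval_units(["k", "a1", "nf", "tt", "o"]))  # ['k', 'a', 'n', 't', 'o']
```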

The phonetic recognition experiments were conducted using 2299 sentences from the 42 held-out speakers of the FBK-ChildIt corpus. Following [32], several configurations were prepared, and every effort was made to maintain comparability between the different configurations. All adaptations were unsupervised, which means that the reference transcriptions used to adapt the AMs were the unverified output of the adapted recognition in the previous step. The a priori known identities of the test speakers have been used to perform the adaptation for each speaker. Furthermore, two extra non-comparable tests were done (one in SONIC and one in SPHINX) to check other potentialities of the systems.

As adaptation method in SONIC, as described in [32], SMAPLR has been used: an iterative unsupervised structural MAP linear regression using the confidence-weighted phonetic recognition output. The means and variances of the system Gaussians were adapted after each decoding pass and used to obtain an improved output. VTLN, a frequency warping of the extracted features that compensates for the peculiarities (i.e. the vocal tract length) of each speaker, has also been considered, together with an extra adaptation method not available in SPHINX, Speaker-Adaptive Training (SAT), which attempts to remove speaker-specific characteristics from the training data in order to estimate a set of speaker-independent acoustic models.

In the SPHINX and BAVIECA tests, MLLR has been performed instead of the SONIC SMAPLR. This is also an unsupervised and iterative recognition scheme, in which the speaker-independent AMs are modified with a linear transformation in order to minimize the distance between the means and variances of the speaker-independent models and those of the unknown speaker. Due to timing constraints in the preparation of this paper, the VTLN method has been applied only in SPHINX and not in BAVIECA. Finally, an extra experiment has been considered for SPHINX in which, as mentioned above, instead of the standard uniform segmentation, a detailed phonetically annotated transcription (the same as SONIC's) has been used to train the baseline AM, and the same MLLR adaptation has been performed during recognition. A detailed overview of all the results of the SONIC, SPHINX and BAVIECA experiments is displayed in Figure 2.

Table 1. Main characteristics of the three ASR training configurations (differences are highlighted by colour).

Table 2. Testing configurations of the three systems (differences are highlighted by colour).
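In outline, the unsupervised iterative adaptation used in all these tests follows the same loop. In the sketch below, decode, estimate_linear_transform and apply_transform are hypothetical placeholders rather than real toolkit functions, and the comment shows the usual MLLR-style mean update.

```python
# Sketch of the unsupervised iterative speaker adaptation used in the tests.
# decode(), estimate_linear_transform() and apply_transform() are
# hypothetical stand-ins for the corresponding toolkit routines.

def adapt_unsupervised(model, speaker_audio, iters=3):
    """Each pass decodes, then re-estimates a linear transform of the
    Gaussian parameters (MLLR-style mean update: mu' = A @ mu + b) from
    the unverified hypotheses of the previous pass."""
    hyps = decode(model, speaker_audio)             # first-pass output
    for _ in range(iters):
        A, b = estimate_linear_transform(model, speaker_audio, hyps)
        model = apply_transform(model, A, b)        # adapted acoustic model
        hyps = decode(model, speaker_audio)         # improved output
    return model, hyps
```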

Figure 2. SONIC, SPHINX and BAVIECA test results: PER as a function of SMAPLR or MLLR adaptation iterations.

5. Discussion and Conclusions

Comparing the scores of the word-level-trained systems in Figure 2, it can be observed that, although all three systems performed well, SONIC obtained slightly better scores on average. Considering the baseline-relative improvements obtained by applying VTLN and model adaptation to the three systems, we can conclude that the adaptation methods of all systems have been effective, but SMAPLR has been slightly better than MLLR at adapting the baseline models. However, the main cause of the difference in absolute performance was the baseline model. The SONIC model had better results for several reasons; in particular, two factors played a decisive role. First, the PMVDR features used to describe the signal spectra in SONIC apparently react very well to highly variable signals such as children's speech. Second, there was the bootstrap of the training process: while SONIC could take advantage of a good first segmentation, SPHINX and BAVIECA had a raw, uniformly spaced one. Some further tests are already planned in order to investigate these differences more deeply.

Concerning the extra tests, a small improvement was obtained using the SPHINX model trained with phonetic transcriptions obtained by unsupervised phonetic alignment with the general-purpose SONIC AM, and the adaptation was also more effective: the improvement is slightly better than in the standard uniform-segmentation case. As for the SONIC extra test (SAT adaptation), the absolute results and the adaptation effectiveness were very similar to those of the standard case.

A marginal consideration might be added as a final conclusion. Even if SONIC turned out to have the best overall performance, it has nonetheless begun to be less attractive than SPHINX and, above all, BAVIECA for research purposes, because its libraries are a closed piece of software and there is no longer development and support from CSLR. Moreover, the apparently very efficient PMVDR features are no longer included in the standard distribution. Indeed, new systems such as JULIUS [7] and KALDI [8], which demonstrated the effectiveness of easily incorporating Deep Neural Network (DNN) techniques [11] to improve recognition performance in almost all recognition tasks, should be considered in the future for further comparison. However, in this paper we focused on HMM-based systems, leaving the analysis of the newest systems for the future. Moreover, the same MFCC and non-MFCC features will be used uniformly for all systems in future experiments in order to have a more effective comparison among them.

6. Acknowledgements

This work was partially supported by the EU FP7 ALIZ-E project.

7. References

[1] Young S., Evermann G., Gales M., Hain T., Kershaw D., Liu X., Moore G., Odell J., Ollason D., Povey D., Valtchev V. and Woodland P., The HTK Book (for version 3.4), Cambridge Univ. Eng. Dept.
[2] Pellom B., SONIC: The University of Colorado Continuous Speech Recognizer, Technical Report TR-CSLR, Center for Spoken Language Research, University of Colorado, USA.
[3] Pellom B. and Hacioglu K., Recent Improvements in the CU SONIC ASR System for Noisy Speech: The SPINE Task, in Proc. of ICASSP, Hong Kong.
[4] Lee K.F., Hon H.W. and Reddy R., An overview of the SPHINX speech recognition system, IEEE Transactions on Acoustics, Speech and Signal Processing, 38(1), 1990.
[5] Walker W., Lamere P., Kwok P., Raj B., Singh R., Gouvea E., Wolf P. and Woelfel J., Sphinx-4: A Flexible Open Source Framework for Speech Recognition, Sun Microsystems Inc., Technical Report.
[6] Rybach D., Gollan C., Heigold G., Hoffmeister B., Lööf J., Schlüter R. and Ney H., The RWTH Aachen University Open Source Speech Recognition System, in Proc. of Interspeech, 2009.
[7] Lee A., Kawahara T. and Shikano K., JULIUS - an open source real-time large vocabulary recognition engine, in Proc. of INTERSPEECH 2001.
[8] Povey D., Ghoshal A. et al., The Kaldi Speech Recognition Toolkit, in Proc. of ASRU.
[9] SIMON. Web site.
[10] Bolaños D., The Bavieca Open-Source Speech Recognition Toolkit, in Proc. of the IEEE Workshop on Spoken Language Technology (SLT), December 2-5, 2012, Miami, FL, USA.
[11] Bengio Y., Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, Vol. 2, No. 1, 2009.
[12] Cosi P., Nicolao M., Sommavilla G. and Tisato G., Sviluppo di un sistema di riconoscimento per l'Arabo: problemi e soluzioni [Development of a recognition system for Arabic: problems and solutions], in Proc. of AISV 2007, 4th Conference of AISV, Reggio Calabria, Italy.
[13] Cosi P. and Nicolao M., Connected Digits Recognition Task: ISTC-CNR Comparison of Open Source Tools, in Proc. of the 11th Conference of the Italian Association for Artificial Intelligence, Reggio Emilia, Italy.
[14] Gerosa M., Giuliani D., Narayanan S. and Potamianos A., A Review of ASR Technologies for Children's Speech, in Proc. of WOCCI '09, 2nd Workshop on Child, Computer and Interaction, Article No. 7.
[15] Strommen E.F. and Frome F.S., Talking back to Big Bird: Preschool users and a simple speech recognition system, Educational Technology Research and Development, 41:5-16.
[16] Mostow J., Roth S., Hauptmann A. and Kane M., A prototype reading coach that listens, in Proc. of the 12th Natl. Conf. on Artificial Intelligence (AAAI-94), Seattle, WA.
[17] Russell M., Brown C., Skilling A., Series R., Wallace J., Bonham B. and Barker P., Applications of automatic speech recognition to speech and language development in young children, in Proc. of ICSLP, Philadelphia, PA.
[18] Narayanan S. and Potamianos A., Creating conversational interfaces for children, IEEE Transactions on Speech and Audio Processing, 10(2):65-78.
[19] Hagen A., Pellom B. and Cole R., Children's Speech Recognition with Application to Interactive Books and Tutors, in Proc. of the IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, St. Thomas, US Virgin Islands.
[20] Bell L., Boye J., Gustafson J., Heldner M., Lindstrom A. and Wiren M., The Swedish NICE Corpus: Spoken dialogues between children and embodied characters in a computer game scenario, in Proc. of Interspeech, Lisbon, Portugal.
[21] Bell L. and Gustafson J., Children's convergence in referring expressions to graphical objects in a speech-enabled computer game, in Proc. of Interspeech, Antwerp, Belgium.
[22] Lee S., Potamianos A. and Narayanan S., Acoustics of children's speech: Developmental changes of temporal and spectral parameters, Journal of the Acoustical Society of America.
[23] Eguchi S. and Hirsh I.J., Development of speech sounds in children, Acta Oto-Laryngol., Suppl. 257, 1-51.
[24] Kent R.D., Anatomical and neuromuscular maturation of the speech mechanism: Evidence from acoustic study, J. Speech Hear. Res., 19.
[25] Kent R.D. and Forner L.L., Speech segment durations in sentence recitations by children and adults, Journal of Phonetics, 8.
[26] Smith B., Temporal aspects of English speech production: A developmental perspective, Journal of Phonetics, 6, 37-67.
[27] Smith B.L., Relationships between duration and temporal variability in children's speech, J. Acoust. Soc. Am., 91.
[28] Smith B., Kenney M.K. and Hussain S., A longitudinal investigation of duration and temporal variability in children's speech production, J. Acoust. Soc. Am., 99.
[29] Hillenbrand J., Getty L.A., Clark M.J. and Wheeler K., Acoustic characteristics of American English vowels, J. Acoust. Soc. Am., 97.
[30] Cosi P. and Pellom B., Italian Children's Speech Recognition for Advanced Interactive Literacy Tutors, in Proc. of INTERSPEECH 2005, Lisbon, Portugal, 2005.
[31] Cosi P., Recent Advances in Sonic Italian Children's Speech Recognition for Interactive Literacy Tutors, in Proc. of the 1st Workshop on Child, Computer and Interaction (WOCCI), Chania, Greece, 2008.
[32] Cosi P., On the Development of Matched and Mismatched Italian Children's Speech Recognition Systems, in Proc. of INTERSPEECH 2009, Brighton, UK, 2009.
[33] Gerosa M., Giuliani D. and Brugnara F., Acoustic variability and automatic recognition of children's speech, Speech Communication, Vol. 49.
[34] Cosi P. and Hosom J.P., High Performance General Purpose Phonetic Recognition for Italian, in Proc. of ICSLP 2000, Beijing, 2000.


More information

Evaluating grapheme-to-phoneme converters in automatic speech recognition context

Evaluating grapheme-to-phoneme converters in automatic speech recognition context Evaluating grapheme-to-phoneme converters in automatic speech recognition context Denis Jouvet, Dominique Fohr, Irina Illina To cite this version: Denis Jouvet, Dominique Fohr, Irina Illina. Evaluating

More information

SEGMENTATION AND INDEXATION OF BROADCAST NEWS

SEGMENTATION AND INDEXATION OF BROADCAST NEWS SEGMENTATION AND INDEXATION OF BROADCAST NEWS Rui Amaral 1, Isabel Trancoso 2 1 IPS/INESC ID Lisboa 2 IST/INESC ID Lisboa INESC ID Lisboa, Rua Alves Redol, 9,1000-029 Lisboa, Portugal {Rui.Amaral, Isabel.Trancoso}@inesc-id.pt

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

Automatic Classification and Transcription of Telephone Speech in Radio Broadcast Data

Automatic Classification and Transcription of Telephone Speech in Radio Broadcast Data Automatic Classification and Transcription of Telephone Speech in Radio Broadcast Data Alberto Abad, Hugo Meinedo, and João Neto L 2 F - Spoken Language Systems Lab INESC-ID / IST, Lisboa, Portugal {Alberto.Abad,Hugo.Meinedo,Joao.Neto}@l2f.inesc-id.pt

More information

Language Modeling. Chapter 1. 1.1 Introduction

Language Modeling. Chapter 1. 1.1 Introduction Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set

More information

Solutions to Exam in Speech Signal Processing EN2300

Solutions to Exam in Speech Signal Processing EN2300 Solutions to Exam in Speech Signal Processing EN23 Date: Thursday, Dec 2, 8: 3: Place: Allowed: Grades: Language: Solutions: Q34, Q36 Beta Math Handbook (or corresponding), calculator with empty memory.

More information

Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents

Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents Michael J. Witbrock and Alexander G. Hauptmann Carnegie Mellon University ABSTRACT Library

More information

SENTIMENT EXTRACTION FROM NATURAL AUDIO STREAMS. Lakshmish Kaushik, Abhijeet Sangwan, John H. L. Hansen

SENTIMENT EXTRACTION FROM NATURAL AUDIO STREAMS. Lakshmish Kaushik, Abhijeet Sangwan, John H. L. Hansen SENTIMENT EXTRACTION FROM NATURAL AUDIO STREAMS Lakshmish Kaushik, Abhijeet Sangwan, John H. L. Hansen Center for Robust Speech Systems (CRSS), Eric Jonsson School of Engineering, The University of Texas

More information

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990

More information

Learning outcomes. Knowledge and understanding. Competence and skills

Learning outcomes. Knowledge and understanding. Competence and skills Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

Developing speech recognition software for command-and-control applications

Developing speech recognition software for command-and-control applications Developing speech recognition software for command-and-control applications Author: Contact: Ivan A. Uemlianin i.uemlianin@bangor.ac.uk Contents Introduction Workflow Set up the project infrastructure

More information

D2.4: Two trained semantic decoders for the Appointment Scheduling task

D2.4: Two trained semantic decoders for the Appointment Scheduling task D2.4: Two trained semantic decoders for the Appointment Scheduling task James Henderson, François Mairesse, Lonneke van der Plas, Paola Merlo Distribution: Public CLASSiC Computational Learning in Adaptive

More information

Spoken Dialog Challenge 2010: Comparison of Live and Control Test Results

Spoken Dialog Challenge 2010: Comparison of Live and Control Test Results Spoken Dialog Challenge 2010: Comparison of Live and Control Test Results Alan W Black 1, Susanne Burger 1, Alistair Conkie 4, Helen Hastie 2, Simon Keizer 3, Oliver Lemon 2, Nicolas Merigaud 2, Gabriel

More information

CallSurf - Automatic transcription, indexing and structuration of call center conversational speech for knowledge extraction and query by content

CallSurf - Automatic transcription, indexing and structuration of call center conversational speech for knowledge extraction and query by content CallSurf - Automatic transcription, indexing and structuration of call center conversational speech for knowledge extraction and query by content Martine Garnier-Rizet 1, Gilles Adda 2, Frederik Cailliau

More information

Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features

Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features Semantic Video Annotation by Mining Association Patterns from and Speech Features Vincent. S. Tseng, Ja-Hwung Su, Jhih-Hong Huang and Chih-Jen Chen Department of Computer Science and Information Engineering

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information