MISSING FEATURE RECONSTRUCTION AND ACOUSTIC MODEL ADAPTATION COMBINED FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
Ulpu Remes, Kalle J. Palomäki, and Mikko Kurimo
Adaptive Informatics Research Centre, Helsinki University of Technology, P.O. Box 5400, FIN-02015 HUT, Finland

ABSTRACT

Methods for noise robust speech recognition are often evaluated in small vocabulary speech recognition tasks. In this work, we use missing feature reconstruction for noise compensation in a large vocabulary continuous speech recognition task with speech data recorded in noisy environments such as cafeterias. In addition, we combine missing feature reconstruction with constrained maximum likelihood linear regression (CMLLR) acoustic model adaptation and propose a new method for finding noise corrupted speech components for the missing feature approach. Using missing feature reconstruction on noisy speech is found to improve the speech recognition performance significantly. The relative error reduction of 36 % compared to the baseline is comparable to the error reductions achieved with acoustic model adaptation, and the results improve further when reconstruction and adaptation are used in parallel.

1. INTRODUCTION

Large vocabulary continuous speech recognition has reached reasonable performance levels in controlled environments, but real environments with changing and unpredictable noise conditions still remain a challenge even with current noise compensation methods. It is noteworthy that methods designed to improve statistical robustness in general, rather than to compensate for noise, are often superior in performance. Such methods include e.g. acoustic model adaptation with constrained maximum likelihood linear regression (CMLLR) [1]. In this work, the noise robustness issue is addressed with the missing feature methods proposed in [2][3].

Applying missing feature methods for noise compensation in automatic speech recognition is based on the observation that additive noise affects some spectrotemporal regions in the speech signal more than others. Thus, while the noise corrupted parts are unreliable and should not be used in speech recognition in the conventional manner, the less affected speech components are somewhat reliable and (i) may be utilised in the usual manner in speech recognition and (ii) provide information about the unreliable (missing) components. Motivation for the missing feature approach originally comes from human speech perception and auditory scene analysis (ASA) [4][5].

The missing feature methods have performed well under various noise conditions, and several different methods have been proposed for handling the unreliable spectrotemporal components [2][3][5]. The methods have mostly been tested with a limited vocabulary, and are yet to become popular in large vocabulary continuous speech recognition (LVCSR) systems. For earlier results on missing feature techniques in an LVCSR task, see for example [6], where missing feature reconstruction is used with projected spectral features and evaluated on the AURORA 4 database that contains speech with artificially added real-world noises. In this work, we use Finnish large vocabulary speech data recorded in noisy environments such as parks and cafeterias. For noise compensation, we use the cluster-based missing feature reconstruction method proposed in [3], which we combine with CMLLR acoustic model adaptation.
Reconstruction and adaptation are used in series, with adaptation following reconstruction in the same system, and in parallel, with the system outputs combined by linear weighting on the log-likelihood level as proposed in [7]. To our knowledge, no results have previously been published where a missing feature method is used together with acoustic model adaptation.

The performance of any missing feature method depends heavily on the accuracy of the spectrographic mask that partitions the speech signal into reliable and unreliable regions. We propose a new method for detecting the unreliable speech components. In this method, a noise estimate is calculated from speech pauses detected using a Gaussian mixture based speech/non-speech classifier, and the noise estimate is compared to the noisy speech signal to find the unreliable, noise dominated regions. This method is more suitable for on-line applications and changing noise conditions than the commonly used approach where the noise estimate is calculated from a fixed number of frames at the beginning of each audio file, e.g. [2]. Other mask estimation methods suited for on-line applications have been proposed in e.g. [3][6][8].

2. METHODS

2.1 Baseline system

Our large vocabulary continuous speech recognition system uses a morph-based growing n-gram language model [9] which is trained on book and newspaper data. The text data contains 14 million words. Since all words and word forms can be represented with the unsupervised morphs, the decoding vocabulary is in practice unlimited [10]. The decoder is a time-synchronous beam-pruned Viterbi token-pass system [11]. The acoustic models are state-clustered hidden Markov triphone models constructed with a decision-tree method [12]. Each state is modelled with 16 Gaussians, and the states are also associated with gamma probability functions to model the state durations [13]. Speech is represented with 12 MFCCs and a log energy feature. The features are used with their first and second order differentials, and they are processed with cepstral mean subtraction (CMS) and maximum likelihood linear transformation (MLLT) [14] learned in training. The training of the acoustic models is described in Section 3.
Figure 1: (a) Logarithmic mel spectrogram for an utterance recorded under public place noise conditions and (b) a spectrographic mask that divides the spectrogram into reliable (black) and unreliable (white) regions. (c) An estimate of the clean spectrogram constructed with the cluster-based missing feature reconstruction method [3].

2.2 Missing feature reconstruction

2.2.1 Noise mask estimation

In most real world acoustic environments, it is reasonable to assume that the most notable noise components originate from sources that are uncorrelated with speech. Uncorrelated noise corrupts speech additively in the power spectral domain. Thus, in the logarithmic mel spectral domain, when speech dominates over noise, the time-frequency components Y(τ,i) of the noisy speech signal may be considered reliable estimates of the clean speech values X(τ,i) that would have been observed if there had been no noise. The components in the noise dominated regions are, on the other hand, unreliable, and provide only an upper bound for the true values. The labels that divide the data into reliable and unreliable parts are referred to as the spectrographic mask. The partition into reliable and unreliable time-frequency regions is illustrated in Figure 1.

In this work, we propose a new on-line method for spectrographic mask estimation based on speech pause detection. The spectrographic mask is constructed from local signal-to-noise ratio (SNR) estimates, which are derived from noise estimates calculated during speech pauses detected using a Gaussian mixture based speech/non-speech classifier. The non-speech frames N(τ), i.e. the frames Y(τ) that have been classified as non-speech, are collected and temporally smoothed to produce the noise estimate. For non-speech frames, the noise estimate is calculated as

    N_ave(τ,i) = Σ_k h(k) N(τ−k,i),    (1)

where τ indicates the time frame and i the frequency channel. The temporal smoothing window is h(k) = β(40 − |k|), k = −40,...,40, where β is a constant chosen so that the window gain is 1. This yields a weighted moving average where the current time instant is given maximal weight while the past and future are linearly (in the log domain) attenuated. The shape of the window was chosen experimentally. For speech frames, the last noise estimate N_ave(τ) calculated before the speech onset is used. Time-frequency components are taken to be unreliable if their observed value Y(τ,i) does not exceed the estimated noise value N_ave(τ,i) by at least θ dB, where θ is a manually optimised SNR threshold. In this work, the threshold is optimised with the far recorded parameter optimisation data (see Section 3 for a dataset description), and θ = 3 dB.

2.2.2 Speech pause detection

The speech/non-speech classifier used in this work is a hidden Markov model (HMM) classifier where speech and non-speech are modelled as single states with 24 Gaussian components. Insertion penalties are used in decoding to exclude short speech or non-speech segments. The classifier uses the same features as the speech recognition system described in Section 2.1. The classifier training data contains several hours of television news data from the Finnish Broadcasting Company (YLE), with precise hand-annotated time marks for speech and other audio. Since the classifier is used for noise mask estimation, it is crucial that it does not misclassify any speech segments as non-speech. To avoid this, the speech model probabilities are multiplied by two.
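For illustration, the mask estimation described above can be sketched as follows. The function and variable names and the initialisation before the first detected pause are illustrative choices not specified in the paper, and the log mel features are assumed to be on a dB-compatible scale:

```python
import numpy as np

def estimate_mask(Y, is_speech, theta=3.0, halfwidth=40):
    """Spectrographic mask estimation from speech pauses (a sketch).

    Y         -- (T, F) log mel spectrogram of the noisy signal
    is_speech -- (T,) boolean labels from the speech/non-speech classifier
    theta     -- SNR threshold in dB (3 dB in this work)
    halfwidth -- half-width of the smoothing window h(k)

    Returns a (T, F) boolean mask where True marks reliable components.
    """
    T, F = Y.shape
    # Triangular window h(k) = beta * (40 - |k|), beta set for unit gain.
    k = np.arange(-halfwidth, halfwidth + 1)
    h = (halfwidth - np.abs(k)).astype(float)
    h /= h.sum()

    noise = np.zeros((T, F))
    # Initialisation before the first detected pause is not specified in
    # the paper; the global non-speech average is used here as a stand-in.
    last = Y[~is_speech].mean(axis=0) if (~is_speech).any() else Y.mean(axis=0)
    for t in range(T):
        if not is_speech[t]:
            # Weighted moving average over nearby non-speech frames, eq. (1).
            acc, w_sum = np.zeros(F), 0.0
            for dk, w in zip(k, h):
                u = t + dk
                if 0 <= u < T and not is_speech[u]:
                    acc += w * Y[u]
                    w_sum += w
            last = acc / w_sum
        # Speech frames reuse the last estimate from before the speech onset.
        noise[t] = last

    # A component is reliable if it exceeds the noise estimate by theta dB.
    return Y > noise + theta
```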
2.2.3 Cluster-based reconstruction

The missing feature methods used in automatic speech recognition are commonly divided into two categories, classifier modification and data imputation approaches. Bounded marginalisation, for example, has been found efficient when tested with a limited vocabulary (e.g. in a connected digit recognition task), but as a classifier modification method it limits the speech recogniser to using spectral features [2]. In HMM-based speech recognition systems, cepstral features are preferred [15]. In data imputation methods, the unreliable components are replaced with estimates that correspond to clean speech, so that the reconstructed spectral features may be further processed as usual and the recogniser need not be modified. This is especially important in large vocabulary continuous speech recognition, where state-of-the-art systems typically use various normalisation methods and feature transformations.

In this work, we use data imputation with the cluster-based feature reconstruction method proposed in [3]. Here, the log spectral feature vectors Y(τ) are assumed independent and identically distributed, and the unreliable spectral components in Y(τ) are reconstructed based on their statistical relation to the other components in the same vector. The clean log spectral features X(τ) are assumed to originate from a Gaussian distribution, but not necessarily all from the same one; the feature distribution model is estimated from clean training data so that the data is divided into clusters, and each cluster is modelled with a Gaussian distribution. For a description of how the unreliable features are reconstructed with the distribution model, see [3]. The feature distribution model used in this work is a GMM trained with 96 minutes of clean speech extracted from the SPEECON training set described in Section 3. Using the whole training set did not improve the reconstruction results in preliminary tests conducted with our parameter optimisation dataset. The clusters and distribution parameters were jointly estimated using the expectation-maximisation (EM) algorithm from the GMMBAYES Matlab Toolbox [16].
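As an illustration of the cluster-based imputation idea, the sketch below reconstructs the unreliable components of one frame from a GMM over clean log spectra, using the conditional Gaussian mean per cluster weighted by the cluster posteriors. The bounded estimation of [3] is simplified here to a final clip at the observed value, and all names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def reconstruct_frame(y, reliable, weights, means, covs):
    """Cluster-based reconstruction of one log mel frame (a sketch).

    y        -- (F,) observed noisy log mel vector
    reliable -- (F,) boolean mask from the mask estimation step
    weights  -- (K,) GMM cluster weights fitted on clean training speech
    means    -- (K, F) cluster means
    covs     -- (K, F, F) cluster covariances
    """
    r, u = reliable, ~reliable
    if not u.any():
        return y.copy()
    if not r.any():
        # No reliable evidence in this frame: fall back to the prior
        # mean, bounded above by the observation.
        return np.minimum(weights @ means, y)
    posts, conds = [], []
    for w, mu, S in zip(weights, means, covs):
        # Cluster posterior from the marginal over reliable components
        # (a log-domain computation would avoid underflow in practice).
        posts.append(w * multivariate_normal.pdf(y[r], mu[r], S[np.ix_(r, r)]))
        # Conditional mean of unreliable given reliable components.
        gain = S[np.ix_(u, r)] @ np.linalg.inv(S[np.ix_(r, r)])
        conds.append(mu[u] + gain @ (y[r] - mu[r]))
    posts = np.asarray(posts) / np.sum(posts)
    x = y.copy()
    x[u] = posts @ np.asarray(conds)
    # The noisy observation is an upper bound on clean speech in the
    # log spectral domain, so clip the estimates from above.
    x[u] = np.minimum(x[u], y[u])
    return x
```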
2.3 Adaptation and reconstruction combined

Regression based model adaptation methods are a common choice for speaker as well as environmental adaptation when HMM-based acoustic models with states modelled as Gaussian mixture distributions are used. Variation in environmental conditions affects the feature distribution, so with a new speaker or in a new environment, the distribution becomes mismatched with the acoustic models unless the models are adapted. Although constrained maximum likelihood linear regression (CMLLR) is essentially a model adaptation method, it can be formulated as a linear transformation on features [1]. The transformation parameters are estimated from adaptation data so that the transformed models maximise the likelihood of the data. In this work, since the noise corrupted features are reconstructed to appear equivalent to clean speech features, CMLLR transformations can be estimated based on the reconstructed features and applied to the features in the usual manner. This will be referred to as using the methods in series.

The methods are also used in parallel, embedded in separate systems that use the same acoustic models and receive the same input. The system outputs are combined using linear weighting as proposed in [7]. The acoustic model outputs are the feature log-likelihoods P(O|S), where O = {o(τ)} is the feature representation of the input speech and S = {s} denotes the states in the HMM-based acoustic model. The combined output log-likelihoods are calculated as

    P = α P(Ô|S) + (1 − α) P(Õ|S),    (2)

where Ô are the adapted features, Õ the reconstructed features, and α ∈ [0,1] is the weight that determines whether the system with adaptation or the system with reconstruction should be emphasised in the log-likelihood calculation. In this work, the weight parameter is manually optimised using the far recorded parameter optimisation data (see Section 3 for a dataset description), and α = 0.7. The log-likelihoods are sent to the decoder, where they are combined with the language model probabilities to produce the final hypothesis (the speech recognition result).
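A minimal sketch of the two combination schemes, assuming the CMLLR parameters A and b have already been estimated on the adaptation data (the estimation itself is not shown) and that the two systems' state log-likelihoods are available as arrays:

```python
import numpy as np

def cmllr_features(X, A, b):
    """CMLLR applied as a feature transformation x' = Ax + b [1].
    X is (T, D). In the series configuration, X holds the
    reconstructed features."""
    return X @ A.T + b

def parallel_loglik(ll_adapted, ll_reconstructed, alpha=0.7):
    """Parallel combination of state log-likelihoods as in eq. (2):
    P = alpha * P(O_hat|S) + (1 - alpha) * P(O_tilde|S)."""
    return alpha * ll_adapted + (1.0 - alpha) * ll_reconstructed
```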
3. DATA

The acoustic models are trained with data selected from the Finnish SPEECON database [17]. The training dataset contains 26 hours of clean speech recorded with a close-talk microphone. The data has been collected from 8 speakers, both male and female. The utterances include words, sentences and free speech. Parameter optimisation and evaluation data are also from the SPEECON database. The 72-minute parameter optimisation set has speech from 23 speakers, and the evaluation set has been collected from 32 speakers. The evaluation set does not share speakers with the parameter optimisation set or the acoustic model training data. In addition, sentences selected from the parameter optimisation set were hand-annotated for evaluating the speech/non-speech classifier performance. The utterances used for parameter optimisation and evaluation are all read sentences recorded in public places, both indoors and outdoors, where speech, footsteps, unspecified clatter etc. appear in the background. The sentences are excerpts from Internet texts and occupy a large (unlimited) vocabulary.

The proposed methods are tested under two different conditions: we use data recorded with the close-talk microphone and data recorded from 0.5–1 m distance. The recordings have been made simultaneously, so they have the same speech contents. SNR values estimated with the recording platform are on average 24 dB in the close-talk data and around 9 dB in the far recorded data. The close-talk data almost corresponds to clean speech, whereas in the far recorded data, both environmental noise and reverberation (in some environments) affect the speech signal and decrease speech recognition performance.

The data is recorded in sessions with one speaker in a certain environment. Each session is around 3 minutes. When CMLLR is applied, the transformations are estimated based on the sessions, and the system is adapted to both the speaker and the environment. In this work, we use offline adaptation where all available data is first used as adaptation data and then recognised.

Since Finnish speech data is used, speech recognition performance is measured primarily with the letter error rate (LER). For other languages, the word error rate (WER) is more common, but it is not well applicable to Finnish, where words are long. Finnish words are often concatenations of several morphemes and correspond to more than one English word, like the word kahvin+juoja+lle+kin, which translates to "also for a coffee drinker". In this work, both error rates are reported, but system comparisons are based on the letter error rates alone.
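Since the letter error rate may be less familiar than WER, one way to compute it is as the Levenshtein distance between letter sequences divided by the reference length. Whether spaces count as letters is a convention the paper does not specify; they are stripped in this sketch:

```python
def letter_error_rate(reference, hypothesis):
    """LER = (substitutions + insertions + deletions) / reference length,
    computed over letter sequences with whitespace removed."""
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    # One-row dynamic programming for the Levenshtein distance.
    row = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, hc in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1,          # deletion
                                       row[j - 1] + 1,      # insertion
                                       prev + (rc != hc))   # substitution
    return row[-1] / max(len(ref), 1)
```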
4. RESULTS

4.1 Speech/non-speech classifier

To evaluate the speech/non-speech classifier, we compared the classification results to hand-annotated speech and non-speech regions. The frame classification accuracy was 93 % for the close-talk data and 92 % for the far recorded data. In addition, we tested how speech/non-speech classification affects the speech recognition performance when the classification results are used for spectrographic mask estimation for the cluster-based missing feature reconstruction. The spectrographic mask estimates were constructed based on either automatically classified or manually set speech/non-speech segments. The SNR threshold θ was optimised for both segmentations with the far recorded data; in this experiment, the threshold was θ = 3 dB for manually segmented data and θ = 5 dB for automatically segmented data. The difference in threshold values suggests that the speech/non-speech classifier has a tendency to label noise as speech rather than the opposite. The speech recognition results are given in Table 1.

Table 1: Speech recognition results (WER and LER for the close-talk and far recorded data) over the speech/non-speech classifier evaluation data when the speech and non-speech boundaries for spectrographic mask estimation are automatically detected with the classifier or manually set. The close-talk data is almost clean speech with average SNR 24 dB and the far recorded data is noisy speech with average SNR 9 dB.

4.2 Missing feature reconstruction

Speech recognition performance under environmental noise is evaluated with (a) the baseline system and when (b) cluster-based missing feature reconstruction, (c) acoustic model adaptation with constrained maximum likelihood linear regression (CMLLR), or (d)–(e) both reconstruction and adaptation are used for noise compensation. Reconstruction and adaptation are tested (d) in series and (e) in parallel as described in Section 2.3. The results are given in Table 2.

Table 2: Speech recognition results (WER and LER for the close-talk and far recorded data) over the full evaluation dataset when missing feature reconstruction and constrained maximum likelihood linear regression (CMLLR) are used in a large vocabulary continuous speech recognition system: (a) baseline, (b) reconstruction, (c) CMLLR, (d) reconstruction and CMLLR in series, (e) reconstruction and CMLLR in parallel. The best results obtained with the data are underlined.

Missing feature reconstruction (b) does not significantly improve the speech recognition performance on the close-talk data, but with the far recorded data, the relative error reduction from missing feature reconstruction is 36 % compared to the baseline (a). The percentage of unreliable (missing) components in the frames classified as speech is on average 43 % among the utterances recorded with the close-talk microphone, and on average 63 % among the corresponding far recorded utterances.

Speaker and environmental adaptation with CMLLR (c) improves the speech recognition results from the baseline (a) so that the relative error reduction is 21 % with the close-talk data and 38 % with the far recorded data. Thus, with the far recorded data, missing feature reconstruction (b) results in performance comparable to the CMLLR performance (c). The difference between the results (b) and (c) on the far recorded data is not statistically significant according to the Wilcoxon signed rank test (see Table 3). With the close-talk data, which almost corresponds to clean speech, speaker adaptation (c) improves the results significantly more than missing feature reconstruction (b).

Table 3: Test statistics from pairwise comparisons between systems (a)–(e) on the close-talk and far recorded data. The systems are compared based on the letter error rate, using the Wilcoxon signed rank test. The difference between two systems is statistically significant (p < 0.05) if the test statistic Z > 1.96. Test statistics suggesting statistical significance are underlined. See Table 2 for the system average performance rates.

Adaptation improves the results on the close-talk data also when applied together with missing feature reconstruction, either in series or in parallel, but with the far recorded data, using reconstruction and adaptation in series (d) does not improve the results significantly compared to the results (b). Using reconstruction and adaptation in parallel (e) does, and the relative error reduction is 40 % compared to the baseline (a). The difference between the results (c) and (e) on the close-talk data is not statistically significant.
5. CONCLUSIONS AND DISCUSSION

In this work, we used cluster-based missing feature reconstruction [3] in an LVCSR system trained with clean speech. We tested the method with noisy speech data recorded in real public environments, and used it together with constrained maximum likelihood linear regression (CMLLR) [1]. In addition, we presented a new method for finding the noise corrupted speech signal components.

The results indicate that missing feature reconstruction can significantly improve noise robustness under changing and unpredictable noise conditions. With the far recorded noisy speech data, the improvements from cluster-based missing feature reconstruction were comparable to the improvements from CMLLR speaker and environmental adaptation. With the close-talk data, which almost corresponds to clean speech, the improvements were minor and not statistically significant.

CMLLR and the other methods from the linear transformation family are amongst the most popular and efficient methods for improving robustness in automatic speech recognition.
They have but one constraint: the methods need to be provided with enough adaptation data from the target speaker or environment in order for the estimated transformations to be reliable [18]. Since 2–3 utterances are sufficient, adaptation data is seldom a problem in continuous speech recognition tasks, unless, that is, the data has not been organised in sessions (e.g. broadcast news, meeting room recordings). In such cases, methods like missing feature reconstruction that do not need adaptation data have a clear advantage. Missing feature methods, however, do not use any information about the current speaker, and cannot compensate for speaker variation. LVCSR systems usually need some speaker compensation method in order to reach good performance, so missing feature reconstruction needs to be combined with e.g. CMLLR.

In the reported experiments, CMLLR did not improve the speech recognition performance when the transformations were estimated from reconstructed features. When the methods were used in parallel, the results were marginally better than the results obtained with the methods used in series. With the parallel system, the relative error reduction was 40 % compared to the baseline. The parallel system results could be further improved if, for example, an adaptive weight α = α(τ) were used. The parallel approach, however, requires more computation than using the methods in series, because state probabilities need to be calculated for two feature streams.

What should be determined now is why CMLLR did not improve the speech recognition results when applied to the reconstructed features. It is possible that under noisy conditions, CMLLR focuses on modelling the environmental noise rather than the speaker, and effectively becomes a noise compensation method. This should, however, mostly affect the parallel system performance. When CMLLR is applied to cleaned speech, i.e. after reconstruction, it should again compensate for speaker variation, but it is possible that (i) because the same speaker-independent model is used to reconstruct all features, missing feature reconstruction may have removed or smoothed speaker-specific characteristics, or (ii) missing feature reconstruction may have caused unusual variation in the features, in which case adaptation has sought to compensate for this rather than for speaker variation.

If missing feature reconstruction degrades speaker adaptation performance as suggested above, adaptation and reconstruction should be applied in a different order: adaptation before reconstruction. The reversal is not straightforward, as missing feature reconstruction operates on spectral features while adaptation is applied only after the cepstral and other feature transformations. Note that speaker normalisation methods applied in the time or spectral domain come naturally before missing feature reconstruction, so there should be no similar problems in combining missing feature reconstruction and speaker normalisation.

6. ACKNOWLEDGMENTS

This work was supported by the Academy of Finland in the projects "Auditory approaches to automatic speech recognition", "New adaptive and learning methods in speech recognition", and "Adaptive Informatics", and by the Helsinki Graduate School in Computer Science.

REFERENCES

[1] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, pp. 75–98, Apr. 1998.
[2] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Communication, vol. 34, pp. 267–285, Jun. 2001.
[3] B. Raj, M. L. Seltzer, and R. M. Stern, "Reconstruction of missing features for robust speech recognition," Speech Communication, vol. 43, pp. 275–296, Sep. 2004.
[4] A. S. Bregman, Auditory Scene Analysis. Cambridge, MA: MIT Press, 1990.
[5] J. Barker, "Robust automatic speech recognition," in D. L. Wang and G. J. Brown, editors, Computational Auditory Scene Analysis. Wiley-Interscience, 2006.
[6] M. Van Segbroeck and H. Van hamme, "Vector-quantization based mask estimation for missing data automatic speech recognition," in Proc. INTERSPEECH, Antwerp, Belgium, Aug. 2007.
[7] K. J. Palomäki, G. J. Brown, and J. P. Barker, "Recognition of reverberant speech using full cepstral features and spectral missing data," in Proc. ICASSP, Toulouse, France, May 2006.
[8] M. L. Seltzer, B. Raj, and R. M. Stern, "A Bayesian framework for spectrographic mask estimation for missing feature speech recognition," Speech Communication, vol. 43, pp. 379–393, Sep. 2004.
[9] V. Siivola and B. Pellom, "Growing an n-gram language model," in Proc. INTERSPEECH, Lisbon, Portugal, Sep. 2005.
[10] T. Hirsimäki, M. Creutz, V. Siivola, M. Kurimo, S. Virpioja, and J. Pylkkönen, "Unlimited vocabulary speech recognition with morph language models applied to Finnish," Computer Speech and Language, vol. 20, pp. 515–541, Oct. 2006.
[11] J. Pylkkönen, "An efficient one-pass decoder for Finnish large vocabulary continuous speech recognition," in Proc. 2nd Baltic Conference on Human Language Technologies, Tallinn, Estonia, Apr. 2005.
[12] J. J. Odell, The Use of Context in Large Vocabulary Speech Recognition. Ph.D. thesis, University of Cambridge, 1995.
[13] J. Pylkkönen and M. Kurimo, "Duration modeling techniques for continuous speech recognition," in Proc. INTERSPEECH, Jeju Island, Korea, Oct. 2004.
[14] M. J. F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272–281, May 1999.
[15] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 357–366, Aug. 1980.
[16] GMMBAYES Matlab Toolbox.
[17] D. Iskra, B. Grosskopf, K. Marasek, H. van den Heuvel, F. Diehl, and A. Kiessling, "SPEECON – speech databases for consumer devices: Database specification and validation," in Proc. LREC, Las Palmas, Spain, May 2002.
[18] P. C. Woodland, "Speaker adaptation for continuous density HMMs: A review," in Proc. ISCA Tutorial and Research Workshop on Adaptation Methods for Speech Recognition, Sophia Antipolis, France, Aug. 2001.