SOME ASPECTS OF ASR TRANSCRIPTION BASED UNSUPERVISED SPEAKER ADAPTATION FOR HMM SPEECH SYNTHESIS
Bálint Tóth, Tibor Fegyó, Géza Németh
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics

Abstract. Statistical parametric synthesis offers numerous techniques for creating new voices; speaker adaptation is one of the most exciting of them. However, it still requires high-quality audio data with a low signal-to-noise ratio and precise labeling. This paper presents an automatic speech recognition (ASR) based unsupervised adaptation method for Hidden Markov Model (HMM) speech synthesis, together with its quality evaluation. The adaptation technique automatically controls the number of phone mismatches. The evaluation involves eight different HMM voices, covering both supervised and unsupervised speaker adaptation, and the effects of segmentation and linguistic labeling errors in the adaptation data are also investigated. The results show that unsupervised adaptation can speed up the creation of new HMM voices with quality comparable to supervised adaptation.

Key words: HMM-based speech synthesis, unsupervised adaptation, automatic speech recognition

1 Introduction

In the last decade the primary goal of speech synthesis has been to achieve natural-sounding, high-quality voices. As the results of unit selection and statistical parametric speech synthesis improve, new challenges emerge. Creating a new voice that matches the voice characteristics of a target speaker is an attractive one. Context-independent unit selection synthesis demands a well-constructed speech database with hours of speech, its phonetic transcription and precise labeling for each new voice. This method is time consuming and requires considerable human interaction. Statistical parametric synthesis offers speaker adaptation techniques, where only a moderately sized speech database is needed to create a voice similar to the target speaker's.
Human interaction is still necessary for precise phonetic transcription and labeling.
As the quality of statistical parametric speech synthesis approaches that of state-of-the-art unit selection methods, it has become a focused research area. Usually the HMM paradigm, well known from the speech recognition domain, is used in statistical speech synthesis [1]. It has numerous advantages over unit selection: a small footprint and the possibility of creating various voices [2], emotional speech [3], and adapting the voice characteristics to a target speaker [4], [5]. Recently hybrid approaches have also been proposed, such as target cost prediction for unit selection systems by HMMs [6], smoothing the segment sequence of unit selection systems with statistical models and/or their dynamic features [7], and mixing unit selection with statistical parametric speech synthesis [8].

2 SUPERVISED AND UNSUPERVISED ADAPTATION

In HMM speech synthesis and recognition the two main techniques of speaker adaptation are maximum likelihood linear regression (MLLR) [4] and maximum a posteriori (MAP) estimation [5]. MLLR is applied when the amount of adaptation data is small; MAP requires more data, as the Gaussian distributions are updated individually. In both cases supervised speaker adaptation uses precise phonetic transcriptions, manually transcribed or automatically annotated segmentation, and linguistic labels. The advantages of unsupervised adaptation for HMM speech synthesis are quite appealing: the creation of target voices becomes automatic, which is favorable if several voices are required or if pre-processing of the speech data is not possible. Probably the most advanced method would be to build a full-context speech recognizer and train the HMMs on its output. Although no studies have been carried out, this is likely to be computationally infeasible and would probably produce inaccurate labels.
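To illustrate the linear-regression family of adaptation methods mentioned above, the following toy sketch estimates a single shared affine transform of Gaussian means from adaptation data, which is the core idea behind MLLR. This is a simplified sketch with made-up data: real MLLR uses per-frame posteriors, per-component covariances and regression classes, none of which appear here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "average voice" model: K Gaussian means in D dimensions
# (deterministic, well-conditioned values chosen for illustration).
K, D, N = 4, 2, 400
means = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Simulated target-speaker data: a true affine shift of each mean plus noise.
A_true = np.array([[1.1, 0.0], [0.0, 0.9]])
b_true = np.array([0.5, -0.3])
comp = rng.integers(0, K, size=N)                # frame-to-Gaussian assignment
x = means[comp] @ A_true.T + b_true + 0.05 * rng.normal(size=(N, D))

# MLLR-style estimate: one shared transform W = [A; b] minimizing
# sum_n ||x_n - W^T xi_n||^2 over extended means xi = [mu, 1]
# (the ML solution under identity covariances is ordinary least squares).
xi = np.hstack([means[comp], np.ones((N, 1))])   # N x (D+1)
W, *_ = np.linalg.lstsq(xi, x, rcond=None)       # (D+1) x D
A_est, b_est = W[:D].T, W[D]

# Adapt all model means with the estimated transform.
adapted_means = means @ A_est.T + b_est
```

With enough adaptation frames, the estimated transform recovers the true speaker shift closely, which is why a single linear transform works even with little per-Gaussian data.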
In Automatic Speech Recognition (ASR) systems both supervised and unsupervised adaptation are used to increase recognition accuracy. The unsupervised method requires less manual work but more adaptation data; about one hour per speaker is used in practice [9]. In [10] an interesting method of unsupervised speaker adaptation was introduced: only phonetic labels were used for adaptation, and the transformation matrices were computed from triphone models. The results of that study show that the degradation in quality and naturalness is caused mainly by limiting full-context labels to triphone labels, not by triphone mismatches. Another study [11] investigates a two-pass decision tree construction technique for unsupervised adaptation, in which the decision trees of full-context models are built in two phases: first the segmental, then the supra-segmental features are processed. According to the results of [11] there is no perceived quality difference between supervised and unsupervised adaptation; however, the average voice was trained on ASR corpora, so it produces very low quality synthetic speech (see the MOS values in [11]), which may hide the quality degradation caused by the two-pass method.
Another important aspect is described in [12], where several tests of different TTS systems were carried out with the same labels on both clean and noisy speech databases. The results of [12] show that HMM-based adaptive speech synthesis is far more robust than concatenative, speaker-dependent HMM-based, or hybrid speech synthesis approaches.

3 ASR-BASED UNSUPERVISED SPEAKER ADAPTATION

Complementing the results of [9], [10], [11], [12], our concept is to evaluate the quality of adaptation with inaccurate, noisy phonetic transcription. The consequences of inaccurate phonetic transcription are phoneme mismatches, as well as inaccurate segmentation and linguistic labels due to phoneme mismatch accumulation. Speech recognizers for a given context perform quite well, but their output still contains various mismatches.

Fig. 1. Block diagram of the proposed unsupervised adaptation method
3.1 The Proposed Method

The speech recordings from the target speaker are recognized, then phone boundaries are determined with forced alignment based on the recognition results. If the results of forced alignment do not satisfy an item-drop criterion (described in 3.3), that part of the recordings is rejected. When phone boundary detection is accepted for at least ten minutes of recordings, linguistic labeling is carried out. Finally the adaptation is applied. The block diagram of the proposed method is shown in Fig. 1.

3.2 Automatic Recognition of the Speech Corpus and Phonetic Transcription

The TTS adaptation database is transcribed automatically with an LVCSR ASR system [9]. The output will contain recognition errors, which can be significantly reduced if the content of the TTS adaptation database and the ASR training database are from the same domain. The next processing step transforms the orthographic output of the ASR system into a phonetic representation. This may be done either by dictionary lookup or by rule-based software modules.

3.3 Phone Boundary Detection

The phone boundaries in the TTS adaptation database are marked automatically, based on the phonetic transcription described in section 3.2, using the ASR system in forced alignment mode with a narrow beam only. As the word-level ASR can produce recognition errors, the recognized phone sequence is likely to be longer or shorter than the correct transcription. If, at the beginning of an audio segment, a word is misrecognized with more or fewer phones than the correct word, the forced alignment procedure probably gives bad results for the whole audio segment. If this happens at the end of an audio segment, it is less severe, because it produces only a few phone mismatches.
To avoid using adaptation data with critical phone error accumulation, the following drop criterion was introduced:

  e_accumulation = (1 / i_max) · Σ_{i=1}^{i_max} [ ((100 − p_ci) / 100) · ((i_max − i + 1) / i_max) ] ≤ ε   (1)

where i is the position of the phone, i_max is the length of the phone sequence, p_ci is the confidence that the i-th phone is correctly recognized, on the [0..100] interval (computed by the ASR), and ε is the limit of the drop criterion on the [0..1] interval (0 means there were no errors, 1 is the theoretical worst case). In this way mistakes at the beginning are weighted more heavily than those at the end, and error accumulation is avoided.
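A minimal sketch of this drop criterion in Python, under our reading of the garbled Eq. (1): each phone's error (100 − p_ci)/100 is weighted by (i_max − i + 1)/i_max, so early errors count almost fully and late errors only fractionally. The function names, epsilon value, and confidence sequences below are illustrative, not from the paper.

```python
def error_accumulation(conf):
    """Position-weighted error score of one audio segment.

    conf: list of ASR phone confidences p_ci in [0, 100], in phone order.
    Early mistakes get weight ~1, late mistakes weight ~1/i_max.
    """
    i_max = len(conf)
    return sum(
        ((100 - p) / 100) * ((i_max - i + 1) / i_max)
        for i, p in enumerate(conf, start=1)
    ) / i_max

def drop_segment(conf, epsilon=0.05):
    """Reject the segment if the accumulated error exceeds epsilon."""
    return error_accumulation(conf) > epsilon

# A segment misrecognized at the start accumulates more error than one
# misrecognized at the end, even with identical confidences overall.
early_bad = [10, 95, 95, 95, 95]
late_bad = [95, 95, 95, 95, 10]
assert error_accumulation(early_bad) > error_accumulation(late_bad)
```

This matches the motivation in section 3.3: an early misrecognition derails forced alignment for the rest of the segment, so it should push the score past the drop threshold much faster than a late one.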
4 Results

To measure the difference between the proposed method and the supervised adaptation technique, a listening test was conducted. In the experiment a modified Hungarian version of HTS [13] was used. The average voice was computed from five speakers (1.5-2 hours of phonetically balanced speech from each). The adaptation database contained 10 minutes of semi-spontaneous speech (parliamentary speeches by politicians) from each of four different speakers. For adaptation the Constrained Maximum Likelihood Linear Regression (CMLLR) method was used. For speech recognition a state-of-the-art Hungarian LVCSR system was applied [14]. The triphone-based acoustic model was trained on 5 hours of speech from 500 speakers. The training corpus of the morpheme trigram language model contained 1.2 million words in the domain of political news. The average accuracy of the system is 72%, while the average phone accuracy is above 85%. The phone-level accuracy of the recognizer on the TTS adaptation database is shown in Table 1.

Table 1. Accuracy of the recognizer for the four speakers

  Speaker      Phone accuracy
  Speaker #1   58%
  Speaker #2   79%
  Speaker #3   87%
  Speaker #4   90%

For supervised speaker adaptation a consensus manual phonetic transcription with punctuation was created, and the segmentation and linguistic labels were determined automatically. For unsupervised adaptation the phonetic transcription was determined from the recognition results; the segmentation and linguistic labels were determined in the same way as for supervised adaptation. The supervised and unsupervised adaptations of all four speakers (eight systems altogether) were involved in the test.

4.1 Experimental Conditions

The experiment consisted of three main parts: paired comparison, a Mean Opinion Score (MOS) test, and naturalness evaluation.
In the first part, test subjects had to rate how similar two synthesized samples were on a five-point scale. The text of the utterance in each pair was always the same. Altogether 24 pairs were played: 8 pairs were from the same system; 8 pairs came from the same speaker with different adaptation methods; and 8 pairs were compiled from different speakers. Pair comparison is beneficial as the first part because test subjects
get used to the synthetic voice and give consistent answers in the MOS test of the second part. There, the test subjects had to rate the quality of 32 samples, 4 from each system. In the last part, test subjects had to decide how similar the synthesized samples were to the natural voice of the original speaker. This was carried out with 40 synthesized samples (5 for each system). The order of the three parts was chosen to minimize the chance that the test subjects memorize the speakers. The samples were selected from a large set in order to obtain information about the systems rather than about individual speech samples. In every part the synthesized samples were pseudo-randomly selected from the larger sample database, keeping the distribution over samples and the eight systems even. The authors carried out a pre-test with four subjects to verify the effectiveness of the test design. The results of the pre-test were promising, so the same design was kept. Altogether 25 test subjects (19 male, 6 female) were involved in the test. The test was internet-based; the average age was 35, the youngest subject was 21 and the oldest 67 years old. Ten test subjects were speech experts.

4.2 Analysis of Results

Table 2 shows the results of the experiment. The first three columns (similarity to synthesized voice) relate to the first part of the test, the fourth column (similarity to the native voice of the same speaker) relates to the third part, and the last column (MOS) relates to the second part. The "s" rows correspond to supervised adaptation, while "u" rows refer to unsupervised adaptation. In the first and third parts 1 refers to the lowest and 5 to the highest similarity. In the MOS test 1 is the worst and 5 the best value. Except for the third column, higher values represent better results.
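The even pseudo-random selection described above can be sketched as follows: draw a fixed number of samples per system so every system is represented equally, then shuffle the presentation order. This is a hypothetical illustration; the names (`sample_db`, `select_samples`) and the seeded generator are ours, not from the paper.

```python
import random

def select_samples(sample_db, per_system, seed=42):
    """Pick per_system samples from each system, then shuffle play order.

    sample_db: dict mapping system id -> list of candidate samples.
    Returns a list of (system, sample) pairs with an even count per system.
    """
    rng = random.Random(seed)  # seeded for a reproducible playlist
    chosen = []
    for system, samples in sample_db.items():
        picks = rng.sample(samples, per_system)   # even count per system
        chosen.extend((system, s) for s in picks)
    rng.shuffle(chosen)                           # randomize presentation order
    return chosen

# 8 systems (4 speakers x supervised/unsupervised), 4 MOS samples each -> 32
db = {f"spk{i}_{m}": [f"utt{j}" for j in range(10)]
      for i in range(1, 5) for m in ("s", "u")}
playlist = select_samples(db, per_system=4)
```

Fixing the per-system count rather than sampling uniformly over the pooled database is what keeps the distribution over the eight systems even.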
Individual analysis of the results. The first two columns show that test subjects can tell if the samples were generated from the same speaker with the same method (s-s, u-u samples). There is a minor impact of using different adaptation methods: s-u and u-s samples consistently score lower than s-s and u-u pairs. The third column shows that, for these four speakers, the subjects could tell if the synthesized samples were from different speakers. Based on the values of the fourth column, both supervised and unsupervised samples are considered moderately similar to the native speakers, but they are still scored much better than different speakers. The relatively low values may result from the adaptation data being semi-spontaneous speech, including stutters, echo, coughs and hesitations. This is also the reason for the rather low MOS scores, shown in the fifth column. The standard deviations and confidence intervals (α = 0.05) are also shown in Table 2.
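The reported statistics can be reproduced as follows: mean, sample standard deviation, and a 95% confidence half-interval (α = 0.05) under the common normal approximation (z = 1.96). The scores below are made up for illustration; they are not values from Table 2.

```python
import math
import statistics

def confidence_interval(scores, z=1.96):
    """Return (mean, sample std dev, half-width of the 95% CI)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)            # sample standard deviation (n-1)
    half = z * sd / math.sqrt(len(scores))   # normal-approximation half-width
    return mean, sd, half

scores = [4, 3, 5, 4, 4, 3, 4, 5, 3, 4]     # hypothetical 1-5 listener ratings
mean, sd, half = confidence_interval(scores)
# The score is then reported as mean +/- half at alpha = 0.05.
```

With only 25 subjects per cell, a t-based interval would be slightly wider; the normal approximation is shown here for simplicity.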
Table 2. Results of the listening test (s: supervised, u: unsupervised): similarity to the synthesized voice (same speaker and method, same speaker with different methods, different speaker), similarity to the native voice of the same speaker, and MOS, for Speakers #1-#4, with standard deviations and confidence intervals (α = 0.05). (Numeric values omitted.)

Analyzing the trends of the results. Each part of the test shows that the difference between supervised and unsupervised adaptation decreases as the phone accuracy of the ASR system (see Table 1) increases. This trend can be seen by examining the following pairs: s-s and u-u samples compared to s-u and u-s samples from the same speaker; the similarity of the u and s samples of speakers #1-#4 to a different speaker; the similarity of the u and s samples of speakers #1-#4 to the native voice of the same speaker; and the MOS scores of the s and u samples. The results show that, with good phone accuracy, the proposed unsupervised adaptation method produced quality similar to supervised adaptation on semi-spontaneous adaptation data. Creating new HMM voices can be sped up by the proposed method. Even a phone accuracy as low as 58% may still allow unsupervised adaptation to create a voice comparable to the supervised one.

5 CONCLUSIONS

In this paper a method for unsupervised adaptation of HMM-based speech synthesis systems was introduced and its quality was evaluated. As the results are quite promising, further studies will be carried out. The parameters of the drop criterion (described in 3.3) will be fine-tuned and other types of drop criteria will be investigated. Unsupervised minimum generation error linear regression (MGELR) and constrained structural maximum a
posteriori linear regression (CSMAPLR) adaptation methods will be evaluated. Listening tests will be carried out using the adaptation data presented in this paper and with studio-quality data as well.

Acknowledgments. This research was supported by the TELEAUTO (OM /2007) project of the Hungarian National Office for Research and Technology and by the ETOCOM project (TAMOP /1/KMR ) through the Hungarian National Development Agency in the framework of the Social Renewal Operative Programme supported by the EU and co-financed by the European Social Fund, and by the KMOP /A project through the Hungarian National Development Agency.

References

1. Black, A., Zen, H., Tokuda, K.: Statistical parametric speech synthesis. In: ICASSP 2007 (2007)
2. Iwahashi, N., Sagisaka, Y.: Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks. Speech Communications, Vol. 16, no. 2 (1995)
3. Tachibana, M., Yamagishi, J., Masuko, T., Kobayashi, T.: Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing. IEICE Trans. Inf. Syst., Vol. E88-D, no. 11 (2005)
4. Tamura, M., Masuko, T., Tokuda, K., Kobayashi, T.: Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR. In: ICASSP 2001 (2001)
5. Ogata, K., Tachibana, M., Yamagishi, J., Kobayashi, T.: Acoustic model training based on linear transformation and MAP modification for HSMM-based speech synthesis. In: ICSLP 2006 (2006)
6. Kawai, H., Toda, T., Ni, J., Tsuzaki, M., Tokuda, K.: XIMERA: A new TTS from ATR based on corpus-based technologies. In: ISCA SSW5 (2004)
7. Plumpe, M., Acero, A., Hon, H.-W., Huang, X.-D.: HMM-based smoothing for concatenative speech synthesis. In: ICSLP 1998 (1998)
8.
Okubo, T., Mochizuki, R., Kobayashi, T.: Hybrid voice conversion of unit selection and generation using prosody dependent HMM. IEICE Trans. Inf. Syst., Vol. E89-D, no. 11 (2006)
9. Mihajlik, P., Fegyó, T., Tüske, Z., Ircing, P.: A morpho-graphemic approach for the recognition of spontaneous speech in agglutinative languages like Hungarian. In: Interspeech 2007 (2007)
10. King, S., Tokuda, K., Zen, H., Yamagishi, J.: Unsupervised adaptation for HMM-based speech synthesis. In: Interspeech 2008 (2008)
11. Gibson, M.: Two-pass decision tree construction for unsupervised adaptation of HMM-based synthesis models. In: Interspeech 2009 (2009)
12. Yamagishi, J., Ling, Z., King, S.: Robustness of HMM-based speech synthesis. In: Interspeech 2008 (2008)
13. Tóth, B., Németh, G.: Hidden Markov model based speech synthesis system in Hungarian. Infocommunications Journal, Vol. LXIII, no. 2008/7 (2008)
14. Mihajlik, P., Tarján, B., Tüske, Z., Fegyó, T.: Investigation of morph-based speech recognition improvements across speech genres. In: Interspeech 2009 (2009)
More informationFunctional Auditory Performance Indicators (FAPI)
Functional Performance Indicators (FAPI) An Integrated Approach to Skill FAPI Overview The Functional (FAPI) assesses the functional auditory skills of children with hearing loss. It can be used by parents,
More informationTranscription System Using Automatic Speech Recognition for the Japanese Parliament (Diet)
Proceedings of the Twenty-Fourth Innovative Appications of Artificial Intelligence Conference Transcription System Using Automatic Speech Recognition for the Japanese Parliament (Diet) Tatsuya Kawahara
More informationTranSegId: A System for Concurrent Speech Transcription, Speaker Segmentation and Speaker Identification
TranSegId: A System for Concurrent Speech Transcription, Speaker Segmentation and Speaker Identification Mahesh Viswanathan, Homayoon S.M. Beigi, Alain Tritschler IBM Thomas J. Watson Research Labs Research
More informationTwo Related Samples t Test
Two Related Samples t Test In this example 1 students saw five pictures of attractive people and five pictures of unattractive people. For each picture, the students rated the friendliness of the person
More informationTagging with Hidden Markov Models
Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,
More informationSecure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics
Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics Anastasis Kounoudes 1, Anixi Antonakoudi 1, Vasilis Kekatos 2 1 The Philips College, Computing and Information Systems
More informationINTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)
INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA) As with other parametric statistics, we begin the one-way ANOVA with a test of the underlying assumptions. Our first assumption is the assumption of
More informationTechnologies for Voice Portal Platform
Technologies for Voice Portal Platform V Yasushi Yamazaki V Hitoshi Iwamida V Kazuhiro Watanabe (Manuscript received November 28, 2003) The voice user interface is an important tool for realizing natural,
More informationSubjective SNR measure for quality assessment of. speech coders \A cross language study
Subjective SNR measure for quality assessment of speech coders \A cross language study Mamoru Nakatsui and Hideki Noda Communications Research Laboratory, Ministry of Posts and Telecommunications, 4-2-1,
More informationReading Competencies
Reading Competencies The Third Grade Reading Guarantee legislation within Senate Bill 21 requires reading competencies to be adopted by the State Board no later than January 31, 2014. Reading competencies
More informationTEXT TO SPEECH SYSTEM FOR KONKANI ( GOAN ) LANGUAGE
TEXT TO SPEECH SYSTEM FOR KONKANI ( GOAN ) LANGUAGE Sangam P. Borkar M.E. (Electronics)Dissertation Guided by Prof. S. P. Patil Head of Electronics Department Rajarambapu Institute of Technology Sakharale,
More informationTRAFFIC MONITORING WITH AD-HOC MICROPHONE ARRAY
4 4th International Workshop on Acoustic Signal Enhancement (IWAENC) TRAFFIC MONITORING WITH AD-HOC MICROPHONE ARRAY Takuya Toyoda, Nobutaka Ono,3, Shigeki Miyabe, Takeshi Yamada, Shoji Makino University
More informationEmotion Detection from Speech
Emotion Detection from Speech 1. Introduction Although emotion detection from speech is a relatively new field of research, it has many potential applications. In human-computer or human-human interaction
More informationUsing the Amazon Mechanical Turk for Transcription of Spoken Language
Research Showcase @ CMU Computer Science Department School of Computer Science 2010 Using the Amazon Mechanical Turk for Transcription of Spoken Language Matthew R. Marge Satanjeev Banerjee Alexander I.
More informationSWING: A tool for modelling intonational varieties of Swedish Beskow, Jonas; Bruce, Gösta; Enflo, Laura; Granström, Björn; Schötz, Susanne
SWING: A tool for modelling intonational varieties of Swedish Beskow, Jonas; Bruce, Gösta; Enflo, Laura; Granström, Björn; Schötz, Susanne Published in: Proceedings of Fonetik 2008 Published: 2008-01-01
More informationII. DISTRIBUTIONS distribution normal distribution. standard scores
Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,
More informationHardware Implementation of Probabilistic State Machine for Word Recognition
IJECT Vo l. 4, Is s u e Sp l - 5, Ju l y - Se p t 2013 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2
More informationAutomated Transcription of Conversational Call Center Speech with Respect to Non-verbal Acoustic Events
Automated Transcription of Conversational Call Center Speech with Respect to Non-verbal Acoustic Events Gellért Sárosi 1, Balázs Tarján 1, Tibor Fegyó 1,2, and Péter Mihajlik 1,3 1 Department of Telecommunication
More informationDescriptive Statistics
Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize
More informationTalking machines?! Present and future of speech technology in Hungary
INVITED PAPER Talking machines?! Present and future of speech technology in Hungary GÉZA NÉMETH, GÁBOR OLASZY, KLÁRA VICSI, TIBOR FEGYÓ Budapest University of Technology and Economics, Department of Telecommunications
More informationDIXI A Generic Text-to-Speech System for European Portuguese
DIXI A Generic Text-to-Speech System for European Portuguese Sérgio Paulo, Luís C. Oliveira, Carlos Mendes, Luís Figueira, Renato Cassaca, Céu Viana 1 and Helena Moniz 1,2 L 2 F INESC-ID/IST, 1 CLUL/FLUL,
More informationInput Support System for Medical Records Created Using a Voice Memo Recorded by a Mobile Device
International Journal of Signal Processing Systems Vol. 3, No. 2, December 2015 Input Support System for Medical Records Created Using a Voice Memo Recorded by a Mobile Device K. Kurumizawa and H. Nishizaki
More informationAdditional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
More informationSlovak Automatic Dictation System for Judicial Domain
Slovak Automatic Dictation System for Judicial Domain Milan Rusko 1(&), Jozef Juhár 2, Marián Trnka 1, Ján Staš 2, Sakhia Darjaa 1, Daniel Hládek 2, Róbert Sabo 1, Matúš Pleva 2, Marián Ritomský 1, and
More informationPresentation Video Retrieval using Automatically Recovered Slide and Spoken Text
Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text Matthew Cooper FX Palo Alto Laboratory Palo Alto, CA 94034 USA cooper@fxpal.com ABSTRACT Video is becoming a prevalent medium
More information31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
More informationSpeech Transcription
TC-STAR Final Review Meeting Luxembourg, 29 May 2007 Speech Transcription Jean-Luc Gauvain LIMSI TC-STAR Final Review Luxembourg, 29-31 May 2007 1 What Is Speech Recognition? Def: Automatic conversion
More informationUsing Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents
Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents Michael J. Witbrock and Alexander G. Hauptmann Carnegie Mellon University ABSTRACT Library
More informationGrant: LIFE08 NAT/GR/000539 Total Budget: 1,664,282.00 Life+ Contribution: 830,641.00 Year of Finance: 2008 Duration: 01 FEB 2010 to 30 JUN 2013
Coordinating Beneficiary: UOP Associated Beneficiaries: TEIC Project Coordinator: Nikos Fakotakis, Professor Wire Communications Laboratory University of Patras, Rion-Patras 26500, Greece Email: fakotaki@upatras.gr
More informationDesign and Data Collection for Spoken Polish Dialogs Database
Design and Data Collection for Spoken Polish Dialogs Database Krzysztof Marasek, Ryszard Gubrynowicz Department of Multimedia Polish-Japanese Institute of Information Technology Koszykowa st., 86, 02-008
More informationTHE RWTH ENGLISH LECTURE RECOGNITION SYSTEM
THE RWTH ENGLISH LECTURE RECOGNITION SYSTEM Simon Wiesler 1, Kazuki Irie 2,, Zoltán Tüske 1, Ralf Schlüter 1, Hermann Ney 1,2 1 Human Language Technology and Pattern Recognition, Computer Science Department,
More informationAutomatic Evaluation Software for Contact Centre Agents voice Handling Performance
International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 Automatic Evaluation Software for Contact Centre Agents voice Handling Performance K.K.A. Nipuni N. Perera,
More informationConvention Paper Presented at the 118th Convention 2005 May 28 31 Barcelona, Spain
Audio Engineering Society Convention Paper Presented at the 118th Convention 25 May 28 31 Barcelona, Spain 6431 This convention paper has been reproduced from the author s advance manuscript, without editing,
More informationTraining Ircam s Score Follower
Training Ircam s Follower Arshia Cont, Diemo Schwarz, Norbert Schnell To cite this version: Arshia Cont, Diemo Schwarz, Norbert Schnell. Training Ircam s Follower. IEEE International Conference on Acoustics,
More informationTesting Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationProgram curriculum for graduate studies in Speech and Music Communication
Program curriculum for graduate studies in Speech and Music Communication School of Computer Science and Communication, KTH (Translated version, November 2009) Common guidelines for graduate-level studies
More informationA General Evaluation Framework to Assess Spoken Language Dialogue Systems: Experience with Call Center Agent Systems
Conférence TALN 2000, Lausanne, 16-18 octobre 2000 A General Evaluation Framework to Assess Spoken Language Dialogue Systems: Experience with Call Center Agent Systems Marcela Charfuelán, Cristina Esteban
More informationKNOWLEDGE-BASED IN MEDICAL DECISION SUPPORT SYSTEM BASED ON SUBJECTIVE INTELLIGENCE
JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 22/2013, ISSN 1642-6037 medical diagnosis, ontology, subjective intelligence, reasoning, fuzzy rules Hamido FUJITA 1 KNOWLEDGE-BASED IN MEDICAL DECISION
More informationEvaluation of speech technologies
CLARA Training course on evaluation of Human Language Technologies Evaluations and Language resources Distribution Agency November 27, 2012 Evaluation of speaker identification Speech technologies Outline
More informationTranscription System for Semi-Spontaneous Estonian Speech
10 Human Language Technologies The Baltic Perspective A. Tavast et al. (Eds.) 2012 The Authors and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms
More informationBuilding A Vocabulary Self-Learning Speech Recognition System
INTERSPEECH 2014 Building A Vocabulary Self-Learning Speech Recognition System Long Qin 1, Alexander Rudnicky 2 1 M*Modal, 1710 Murray Ave, Pittsburgh, PA, USA 2 Carnegie Mellon University, 5000 Forbes
More informationABSTRACT 2. SYSTEM OVERVIEW 1. INTRODUCTION. 2.1 Speech Recognition
The CU Communicator: An Architecture for Dialogue Systems 1 Bryan Pellom, Wayne Ward, Sameer Pradhan Center for Spoken Language Research University of Colorado, Boulder Boulder, Colorado 80309-0594, USA
More informationGender Identification using MFCC for Telephone Applications A Comparative Study
Gender Identification using MFCC for Telephone Applications A Comparative Study Jamil Ahmad, Mustansar Fiaz, Soon-il Kwon, Maleerat Sodanil, Bay Vo, and * Sung Wook Baik Abstract Gender recognition is
More informationTracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object
More informationPractical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
More informationChapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS
Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple
More informationSPEAKER IDENTITY INDEXING IN AUDIO-VISUAL DOCUMENTS
SPEAKER IDENTITY INDEXING IN AUDIO-VISUAL DOCUMENTS Mbarek Charhad, Daniel Moraru, Stéphane Ayache and Georges Quénot CLIPS-IMAG BP 53, 38041 Grenoble cedex 9, France Georges.Quenot@imag.fr ABSTRACT The
More information