PhD THESIS - ABSTRACT - ROMANIAN HMM-BASED TEXT-TO-SPEECH SYNTHESIS WITH INTERACTIVE INTONATION OPTIMISATION
|
|
|
- Clara Parks
- 9 years ago
- Views:
Transcription
1 Investeşte în oameni! FONDUL SOCIAL EUROPEAN Programul Operaţional Sectorial Dezvoltarea Resurselor Umane eng. Adriana Cornelia STAN PhD THESIS - ABSTRACT - ROMANIAN HMM-BASED TEXT-TO-SPEECH SYNTHESIS WITH INTERACTIVE INTONATION OPTIMISATION Scientific advisor, Prof.dr.eng. Mircea GIURGIU PhD Thesis Evaluation Committee: PRESIDENT: MEMBERS: - Prof.dr.eng. Dorin PETREUŞ - prodean, Faculty of Electronics, Telecommunications and Information Technology, Technical University of Cluj-Napoca; - Prof.dr.eng. Mircea GIURGIU- scientific advisor, Technical University of Cluj-Napoca; - Prof.dr.eng. Corneliu BURILEANU - reviewer, Politehnica University of Bucharest; - Prof.dr.eng. Horia-Nicolai TEODORESCU, m.c.a.r. - reviewer, Gh.Asachi Technical University of Iaşi; - Prof.dr.eng. Aurel VLAICU - reviewer, Technical University of Cluj-Napoca 2011
2 Romanian HMM-based Text-to-Speech Synthesis with Interactive Intonation Optimisation 1 Introduction 1.1 Motivation Speech synthesis has emerged as an important technology in the context of human-computer interaction. Although an intensive studied domain, its language dependency makes it less accessible for most of the languages. If for English, French, Spanish or German for example, the variety of choices starts from open-source user-configurable systems to high-quality proprietary commercial systems, this is not the case for Romanian. The lack of extended freely available resources makes it hard for the researchers to develop complete text-to-speech synthesis systems and design new language-dependent enhancements. The available Romanian synthesis systems are mainly commercial or based on outdated technologies such as formant synthesis or diphone concatenation. There is also one other problem that is the main focus for the international research community, and that is the prosodic enhancement of the synthetic speech. Results of most of the speech synthesisers still have a monotone, unattractive flat intonation contour. This problem is usually solved by the use of fundamental frequency (F0) contour modelling and control of the parameters in a deterministic or statistical manner. Most of the F0 modelling or parametrisation techniques are based on extended speech corpora and manual annotation of the intonation. Some other solutions are language dependent methods, involving accent patterns or phrasing. Adaptation of these solutions to under-resourced languages is unfortunately unpractical and hard to achieve. 1.2 Objectives Given the context presented before, the main objective of this thesis is the development of a new Romanian speech synthesiser, using the latest technology available. The system should also be able to allow for intonation adaptation. This challenge requires to address four specific objectives, as described below: Objective 1: Objective 2: To develop a large high-quality speech corpus in Romanian and an associated word lexicon, which can support statistical training of the HMM models, but which can also be used for other speech-based applications. Motivation: There are no Romanian speech corpora which can be used for statistical HMM training. To create a Romanian text-to-speech system using state-of-the-art technologies, in the form of HMM-based parametric synthesis. Motivation: The available Romanian TTS systems use either formant or concatenative synthesis. These types of synthesis methods have difficulties when trying to improve the naturalness or expressivity of the synthetic speech. Objective 3: Objective 4: To design a new pitch modelling technique, which can be easily applied for intonation control. Motivation: The existing pitch modelling techniques require extensive linguistic studies, and cannot provide a language-independent application. To devise a method for interactive intonation optimisation of the synthetic speech. Motivation: Even in state-of-the-art TTS systems, the expressivity of speech cannot be tuned by non-expert users. 1
3 Adriana Cornelia STAN 2 Thesis Outline The thesis is organised as follows: Chapter 1 defines the motivation and the objectives of the thesis. structure on a chapter by chapter basis. It also outlines the thesis Chapter 2 gives an overall view of speech synthesis methods with their respective advantages and disadvantages. A list of the available Romanian speech synthesiser is also presented. Chapter specific theoretical issues are presented on a chapter by chapter basis. Chapter 3 introduces the preparation of the resources needed for a Romanian parametric speech synthesiser. After a brief introduction of the Romanian language characteristics, the chapter describes the tools and design procedures of both text and speech data. For text, the following issues are addressed: text corpus selection and preprocessing, phonetic transcription, accent positioning, syllabification and part-of-speech tagging. Speech resources include the recording of a high-quality speech corpus (about 4 hours) with the respective recording text selection and speech data segmentation. Two key features of the speech resources represent a list of semantically unpredictable sentences used for speech synthesis evaluation, and the preparation of a freely available online speech corpus (Romanian Speech Synthesis (RSS) corpus) which includes the recordings and several other information, such as HTS labels, accent positioning for the recorded text, or synthesised audio samples using RSS. Chapter 4 presents the development of a Romanian HMM-based (Hidden Markov Model) speech synthesiser starting from the resources presented in chapter 3. A short theoretical overview of the HMM models and the HTS (HMM-based Speech Synthesis System) is presented. The preparation of HTS-compliant data is then described in terms of text annotation, decision tree clustering questions and segmentation and annotation of the training corpus. Apart from the novelty of a Romanian HTS system, the chapter introduces an evaluation of some language-independent configuration parameters. The results obtained are evaluated in a 3 section listening test: naturalness, speaker similarity and intelligibility. Chapter 5 describes a novel approach to F0 parametrisation using the discrete cosine transform (DCT). The chapter starts by first analysing some of the most common F0 modelling techniques and their potential application in a system that uses no additional information, except from the text input and no complex phonological information. The DCT was chosen for its simplicity, language independency, high modelling capabilities even with a reduced number of features and the direct inverse transform useful in chapter 6. A superpositional model using the DCT is then proposed and evaluated in the context of both modelling and prediction of the F0 contour. Chapter 6 uses the results of chapter 5 to define an interactive optimisation method using evolution strategies. The method uses the phrase level DCT coefficients of the F0 contour in a interactive CMA-ES (Covariance Matrix Adaptation - Evolution Strategy) algorithm. The basics of evolutionary computation are presented with a focus on evolution strategies and CMA-ES. Evaluation of the applicability scenario is performed. This includes the analysis of the initial standard deviation of the population, number of individuals per generation and dynamic expansion of the F0 contour. Results of a naturalness and expressivity listening test are analysed. Chapter 7 summarises the main conclusions of the thesis and future development possibilities. 2
4 Romanian HMM-based Text-to-Speech Synthesis with Interactive Intonation Optimisation 3 Discussion on Thesis Results and Future Work 3.1 Resource Development for a Romanian Parametric Speech Synthesiser Resource development in a new language is an important step in creating any new speech processing system. The analysis of extended corpora can provide more accurate results. Text resources, as well as speech resources were gradually introduced within chapter 3. Although the resources cover a wide variety of aspects, they can only be viewed as a starting point for a more complex and elaborate source of information. The text resources include a newspaper text corpus, simple letter-to-sound rules, accent positioning, syllabification and part-of-speech tagging. Text resources were not the main focus of the research and therefore each have identified problems. The text corpus contains around 4000 newspaper articles. Although the mass-media language is considered to be the reference for most of the speakers, it is not necessarily an optimal source for language studies. Literary works should also be included in such a resource. The phonetic transcriber written in Festival only included a minimal set of rules, which do not cover all of the rules described by phoneticians. Although, this can be argued against with the results of the intelligibility listening test. A good resource for accent positioning is the DEX, but it is not practical to use an entire word database in a text processor. Even if Romanian does not have deterministic accent positioning rules, the accent could be derived using machine learning algorithms. The preliminary evaluation of the syllabification using MOP principle is only preliminary and its results cannot be taken for granted. A more extensive analysis in conjunction with the standard syllabification rules for Romanian should be performed. Part-of-speech tagging was also determined from an external source and cannot be fully controlled so far. The developed lexicon includes accent positioning and phonetic transcription. Although, an extended and important resource, it should be modified so as to include more information, such as syllabification for example. The speech resources developed in the context of this thesis are potentially one of the greatest contributions, considering the lack of such resources for Romanian. The design of the speech corpus makes it easy to use in many types of speech processing applications, such as automatic speech recognition, speech coding and of course speech synthesis. Its high-quality and sampling frequency are also an important feature. The inclusion of both newspaper and fairytale text in the recorded speech makes it more comprehensive. The entire speech corpus is freely available under the name of Romanian Speech Synthesis (RSS) database. A possible extension of this resource is naturally the recording of more speech data. 3.2 A High Sampling Frequency Romanian Parametric Text-to-Speech Synthesiser based on Markov Models As it has been shown in chapter 4, Romanian language lacks proper open source synthesis systems for research. The HMM-based synthesis system along with the developed resources is an important addition to the research domain. Chapter 4 included the data preparation steps from text processing to speech segmentation and annotation. A first problem of the TTS system is the lack of an optimal text processor, with full text normalisation and POS tagging. Although it has not been proven, correct POS tagging could influence the result of the synthesiser. The results of the listening test showed that the speech resources, configuration parameters and sampling frequency were appropriately selected. The evaluation of the system also included the evaluation of the amount of training data. It is commonly known that for speech synthesis, the larger the speech corpus, the better the results. However, an interesting development would be the selection of a minimum duration speech corpus with results comparable to the ones achieved herein. The listening test also showed that for Romanian, general-purpose designed semantically unpredictable sentences cannot determine significant differences between systems. Thus, a more complex method of evaluating 3
5 Adriana Cornelia STAN Romanian intelligibility should be designed. 3.3 A Language-Independent Intonation Modelling Technique Chapter 5 was an evaluation of the parametrisation and prediction capabilities of the DCT transform for F0 contours. The proposed method makes a clear separation between the syllable and phrase levels of the fundamental frequency for F0 parametrisation. Each layer is individually modelled using a limited number of DCT coefficients. The statistical analysis of the DCT coefficients showed that as the order of the coefficient increases, the relative standard deviation decreases, which means less variability. It can therefore be concluded that by extending the number of DCT coefficients, no further major improvement can be achieved. Each of the DCT coefficients is predicted separately using CART algorithms. The features used for the training vector are the ones available in the HTS label format, and thus no additional processing is required. CART algorithms are fast and efficient methods of estimation for low-complexity problems. As the results showed, their performance for high order coefficients is drastically reduced. This means that the analysis of some more advanced machine learning methods, such as neural networks or Markov models is needed. Also because of the separate estimation of the coefficients, some joint features can be overlooked. A joint estimation mechanism would probably enhance the prediction results. The attribute selection provided the means for a complexity reduction of the problem, but it did not provide accurate correspondence between the DCT coefficients and the phonological features used in the feature vector. A more elaborate analysis of this correspondence should also be performed. 3.4 Optimising the F0 Contour with Interactive Non-Expert Feedback Language-independent F0 contour optimisation is a very important aspect of the speech synthesis domain. The method and prototype system presented in chapter 6 can be easily adapted to any HMM-based synthesiser with minimum adjustment. Preliminary evaluations carried out proposed the setup parameters of such a system and have shown that the dynamic pitch expansion can be achieved even with a small number of individuals and generations. As the results obtained in this preliminary research have achieved a high-level of intonational variation and user satisfaction, a web-based application of the interactive optimisation is under-way. The application would allow the user to select the entire utterance or just parts of it i.e., phrases, words or even syllables for the optimisation process to enhance. For a full prosodic optimisation, the duration of the utterance should be included in the interactive application as well. One drawback to the solution is the lack of individual manipulation of each of the 7 DCT coefficients in the genome, unattainable in the context of the evolutionary algorithm chosen. However the coefficients statistics showed that the average standard deviation is similar and thus the choice for the initial standard deviation does not alter the higher order coefficients. An interesting development would be a user-adaptive speech synthesiser. Based on previous optimisation choices, the system could adapt in time to a certain prosodic realisation. Having set up the entire workflow, testing different types of fitness functions is also of great interest. 4
6 Romanian HMM-based Text-to-Speech Synthesis with Interactive Intonation Optimisation 4 Thesis contributions The main contributions of the thesis are organised in chapters 3,4,5 and 6 and can be summarised as follows, along with their chapter correspondence and published papers: 1. A 65,000 Romanian word lexicon with phonetic transcription and accent positioning Phonetic transcription and accent positioning represent two key aspects of a text processing module for text-to-speech synthesis. The 65,000 word lexicon represents 4.7% of the total entries of the DEX online database. The phonetic transcription was performed using the standard phoneme set for Romanian, excluding allophones and rare case exception pronunciations. Simple initial letter-tosound rules were written in Festival, and some other rules were added manually in the lexicon. Accent positioning was directly extracted from the DEX online database. The lexicon is an important linguistic resource mainly because of its dimension and contents. To the best of the author s knowledge there are no available resources of this type. The correctness of the information within was tested through the use of the lexicon in the front-end training of the Romanian speech synthesiser. This contribution is supported by the development of the following additional resources: A text corpus of 4506 short newspaper articles trawled between August 2009 and September 2009 from the online newspaper Adevărul. It contains over 1,700,000 words, and the top 65,000 most frequent were used in the lexicon; A reduced set of Romanian letter-to-sound rules written in Festival format for the initial phonetic transcription of the lexicon; 2. The Romanian Speech Synthesis (RSS) corpus: A high-quality broad application Romanian speech resource Starting from the requirements of a parametric HMM-based speech synthesiser, the development of an extended speech corpus was identified. The Romanian Speech Synthesis corpus has a duration of 4 hours and comprises the following data: Training set utterances - approx. 3.5 hours 1493 random newspaper utterances 983 diphone coverage utterances 704 fairytale utterances - the short stories Povestea lui Stan Păţitul and Ivan Turbincă by Ion Creangă Testing set utterances - approx. 0.5 hours 210 random newspaper utterances 110 random fairytale utterances 216 semantically unpredictable sentences The recordings were performed at 96kHz, 24 bits per sample and downsampled at 48kHz using professional recording equipment. The entire corpus, along with ortographic and phonetic transcription, time-aligned HTS labels, and accent positioning are freely available at and represent the most extended Romanian speech corpus. The corpus was tested through its use in the model training part of the Romanian HMM-based speech synthesiser and also in a simple unit selection concatenative system. The semantically unpredictable sentences were evaluated as part of the intelligibility section of the listening test. The fairytale utterances have been used for the adaptation of the baseline trained models, in order to achieve a more 5
7 Adriana Cornelia STAN dynamic intonation of the output speech. Statistic analysis of the recorded text within the speech corpus show similarities to the statistical distributions of the Romanian language. This contribution is supported by the development of the following additional resources: The development of a list of 216 Romanian semantically unpredictable sentences used in speech synthesis evaluation. To the best of the author s knowledge, this is the first resource of this sort; A basic Romanian text processor for the HTS format labeling of the speech corpus. 3. An evaluation of the configuration parameters for the HTS system HMM-based statistical parametric speech synthesis has become one of the mainstream methods for speech synthesis. The HTS framework offers a large number of possibilities for the parameter tunning of the generic system. The evaluation of the configuration parameters included the frequency warping scale, spectral analysis method, cepstral order, sampling frequency and amount of training data. The first three were heuristically determined based on analysis-by-synthesis methods, while the last two are evaluated within the listening test for the Romanian HTS synthesiser. The results showed that: there are no significant perceptual differences between the Bark and ERB frequency scales when using the vocoder for 48kHz input data; the data driven generalised logf0 was validated; the MGC performed better than the mel-cepstrum analysis method; the cepstral analysis order is dependent on the sampling frequency; the use of high-sampling frequency increases the quality of the output speech, but the differences between 32kHz and 48kHz are not significant; an increased dimension of the training speech corpus enhances the quality of the synthetic speech. 4. A Romanian HMM-based speech synthesiser The developed TTS system uses HMM-based statistical parametric speech synthesis, which is the latest technology available for speech synthesis. Employing the text and speech resources developed priorly, and the established configuration parameters, a number of 5 distinct systems were trained. They differ by the amount and sampling frequency of the training data. The systems have been evaluated by 54 listeners, in a Blizzard-style listening test comprising 3 sections: naturalness, speaker similarity and intelligibility and along with a minimal unit selection concatenative system and the original recordings. The results of the listening test showed an average 3.0 MOS score for all of the HTS systems built, and an average of 3.3 MOS score for the best evaluated one. Sampling frequency has influenced the speaker similarity, but not the naturalness, while the amount of the training data had an effect on both sections. The WER in the intelligibility section, for all the systems was below 10%. All of the HTS systems outperformed the unit selection system. Additionally, they have the capability to adapt to a more dynamic intonation speech corpus, as proved by the adaptation to the fairytale speech subset. An interactive demonstration of the Romanian HTS synthesiser is available at com. This contribution is also supported by the following additional elements: A set of 179 Romanian phonetic decision tree questions for context clustering in the HTS system; A basic text processing tool using the Cereproc Development Framework with minimal text normalisation and which outputs HTS format labels; 6
8 Romanian HMM-based Text-to-Speech Synthesis with Interactive Intonation Optimisation 5. A language-independent F0 modelling technique based on the discrete cosine transform This contribution solves the F0 modelling as a part of the language-independence issue for textto-speech systems. The method adheres to the superpositional principle of pitch by modelling the syllable and phrase level contours, and uses a discrete cosine transform parametrisation. Only the textual features available in the HTS labels, without any additional linguistic information, and the DCT coefficients of the F0 contour are used for pitch modelling and prediction. F0 prediction was performed using independently trained classification and regression trees, for each of the DCT coefficients. The results revealed an average error of 15Hz per utterance, which is similar to other modelling techniques. Also, the listening test showed that the users did not consider the differences between the HTS generated F0 contour, and the DCT predicted one as perceivable. This contribution is supported by the following additional analysis: Statistic evaluation of the of the DCT coefficients within the rnd1 subset of the RSS database; Evaluation of the DCT coefficient prediction results using 3 CART algorithms: M5 rules, Linear Regression and Additive Regression; Objective and subjective evaluation of the F0 contour estimation from the tree-based prediction of the DCT coefficients. 6. A method for the application of interactive CMA-ES in intonation optimisation for speech synthesis The interactive intonation optimisation method solves a complex problem related to the expressivity enhancement of the synthesised speech, according to a non-expert listener s subjectivity. The originality of the method consists in using no prosodic annotations of the text, no deterministic rules and no predefined speaking styles. CMA-ES is applied in an interactive manner to the DCT coefficients of the phrase level F0 contour generated by the Romanian HTS system. The main parameters of the interactive CMA-ES are evaluated and include: initial standard deviation of the population used to control the naturalness of the speech output, by limiting the domain of the F0 values; population size used to minimise user fatigue while maintaining a sufficient number of different speech samples the user can opt for; dynamic expansion of pitch over a number of generations to determine the evolution of the pitch contour according to the user s choices. These parameters are also evaluated in the interactive intonation optimisation prototype system. To the best of the author s knowledge, this is also the first application of an interactive CMA-ES algorithm. 7. A prototype interactive intonation optimisation system using CMA-ES and DCT parametrisation The proposed interactive intonation optimisation method has been implemented in a prototype system. The system is language-independent and uses the developed Romanian HTS system and the interactive CMA-ES parameters determined before. Given the output of the baseline speech synthesiser, the user can opt for further enhancements of the intonation for the synthesised speech. Four new different speech samples derived from the original F0 contour are presented to the listener in a tournament like comparison method. Starting from the overall winner of one generation, the next 4 individuals are generated. 7
9 Adriana Cornelia STAN The results of the prototype system have been evaluated in a listening test comprising naturalness and expressivity sections. The individuals naturalness was evaluated with an average MOS score of 3.1, and all of the newly generated speech samples were considered to be more expressive than the original one. Thus proving that the prototype system is able to maintain a natural output speech, while enhancing its expressivity. The contributions can be included in the general processing scheme of an HMM-based speech synthesis system according to Fig. 1. TEXT INPUT Text processing 1 65,000 word lexicon HTS parameter configuration 3 Parameter estimation and generation HMM models 2 RSS corpus Speech synthesis TEXT TO SPEECH SYSTEM 4 Romanian HTS system SPEECH OUTPUT Synthetic speech enhancements Interactive intonation optimisation ENHANCED SPEECH OUTPUT Figure 1: The application of the thesis contributions within the general processing scheme of an HMM-based speech synthesis system (marked with numbers from 1 to 7). 8
10 Romanian HMM-based Text-to-Speech Synthesis with Interactive Intonation Optimisation 5 List of publications Journals 1. Adriana STAN, Junichi YAMAGISHI, Simon KING, Matthew AYLETT, The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate, Speech Communication, vol 53, pg , Conference Proceedings 1. Adriana STAN, Florin-Claudiu POP, Marcel CREMENE, Mircea GIURGIU, Denis PALLEZ, Interactive Intonation Optimisation Using CMA-ES and DCT Parametrisation of the F0 Contour for Speech Synthesis, In Proceedings of the 5 th Workshop on Nature Inspired Cooperative Strategies for Optimisation, in series Studies in Computational Intelligence, vol. 387, Springer, Adriana STAN, Mircea GIURGIU, A Superpositional Model Applied to F0 Parametrisation using DCT for Text-to-Speech Synthesis, In Proceedings of the 6 th Conference on Speech Technology and Human-Computer Dialogue, doi: /SPED , Braşov, România, May 2011, 3. Adriana STAN, Mircea GIURGIU, Romanian language statistics and resources for text-tospeech systems, In Proceedings of the 9 th Edition of the International Symposium on Electronics and Telecommunications, pg , Timişoara, România, November Adriana STAN, Linear Interpolation of Spectrotemporal Excitation Pattern Representations for Automatic Speech Recognition in the Presence of Noise, In Proceedings of the 5 th Conference on Speech Technology and Human-Computer Dialogue, pg , Constanţa, România, June Scientific reports 1. Adriana STAN, Raport de cercetare ştiinţifică 1: Elaborarea şi dezvoltarea unui sistem de sinteză text-vorbire în limba românăbazat pe modele Markov, independent de elementele de prozodie aferente textului, May, Adriana STAN, Raport de cercetare ştiinţifică 2: Elaborarea şi dezvoltarea unor metode deterministe de analiză şi control a prozodiei în limba română, January, Adriana STAN, Raport de cercetare ştiinţifică 3: Elaborarea şi dezvoltarea unor metode probabilistice de analiză şi control a prozodiei în limba română, April, 2011 Acknowledgment First of all I would like to thank my PhD advisor, prof.dr.eng. Mircea Giurgiu, for his continuous support and guidance. Thanks to the Centre for Speech Technology Research of the University of Edinburgh, for a very welcoming and fruitful research visit, especially to dr. Simon King, dr. Junichi Yamagishi and Oliver Watts. Also to the fellow research assistants and PhD students with whom I ve shared and discussed some ideas. Also I would like to thank the Cereproc team, Matthew Aylett, Chris Pidcock and Graham Leary for the help with the text processor. Thanks to Florin Pop, Marcel Cremene and Denis Pallez for the insight offered in the evolution strategies and the interactive intonation optimisation. To Cătălin Francu, the author of the online DEX database and to the authors of the Romanian POS Tagger, dr. Doina Tătar, Ovidiu Sabou and Paul V. Borza. *Adriana Stan was funded by the European Social Fund, project POSDRU/6/1.5/S/5. *Parts of this work made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF The ECDF is partially supported by the edikt initiative ( 9
Efficient diphone database creation for MBROLA, a multilingual speech synthesiser
Efficient diphone database creation for, a multilingual speech synthesiser Institute of Linguistics Adam Mickiewicz University Poznań OWD 2010 Wisła-Kopydło, Poland Why? useful for testing speech models
Thirukkural - A Text-to-Speech Synthesis System
Thirukkural - A Text-to-Speech Synthesis System G. L. Jayavardhana Rama, A. G. Ramakrishnan, M Vijay Venkatesh, R. Murali Shankar Department of Electrical Engg, Indian Institute of Science, Bangalore 560012,
Text-To-Speech Technologies for Mobile Telephony Services
Text-To-Speech Technologies for Mobile Telephony Services Paulseph-John Farrugia Department of Computer Science and AI, University of Malta Abstract. Text-To-Speech (TTS) systems aim to transform arbitrary
An Arabic Text-To-Speech System Based on Artificial Neural Networks
Journal of Computer Science 5 (3): 207-213, 2009 ISSN 1549-3636 2009 Science Publications An Arabic Text-To-Speech System Based on Artificial Neural Networks Ghadeer Al-Said and Moussa Abdallah Department
Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System
Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Oana NICOLAE Faculty of Mathematics and Computer Science, Department of Computer Science, University of Craiova, Romania [email protected]
Experiments with Signal-Driven Symbolic Prosody for Statistical Parametric Speech Synthesis
Experiments with Signal-Driven Symbolic Prosody for Statistical Parametric Speech Synthesis Fabio Tesser, Giacomo Sommavilla, Giulio Paci, Piero Cosi Institute of Cognitive Sciences and Technologies, National
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS PIERRE LANCHANTIN, ANDREW C. MORRIS, XAVIER RODET, CHRISTOPHE VEAUX Very high quality text-to-speech synthesis can be achieved by unit selection
Robust Methods for Automatic Transcription and Alignment of Speech Signals
Robust Methods for Automatic Transcription and Alignment of Speech Signals Leif Grönqvist ([email protected]) Course in Speech Recognition January 2. 2004 Contents Contents 1 1 Introduction 2 2 Background
Hardware Implementation of Probabilistic State Machine for Word Recognition
IJECT Vo l. 4, Is s u e Sp l - 5, Ju l y - Se p t 2013 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2
Turkish Radiology Dictation System
Turkish Radiology Dictation System Ebru Arısoy, Levent M. Arslan Boaziçi University, Electrical and Electronic Engineering Department, 34342, Bebek, stanbul, Turkey [email protected], [email protected]
Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN
PAGE 30 Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN Sung-Joon Park, Kyung-Ae Jang, Jae-In Kim, Myoung-Wan Koo, Chu-Shik Jhon Service Development Laboratory, KT,
Creating voices for the Festival speech synthesis system.
M. Hood Supervised by A. Lobb and S. Bangay G01H0708 Creating voices for the Festival speech synthesis system. Abstract This project focuses primarily on the process of creating a voice for a concatenative
SASSC: A Standard Arabic Single Speaker Corpus
SASSC: A Standard Arabic Single Speaker Corpus Ibrahim Almosallam, Atheer AlKhalifa, Mansour Alghamdi, Mohamed Alkanhal, Ashraf Alkhairy The Computer Research Institute King Abdulaziz City for Science
Emotion Detection from Speech
Emotion Detection from Speech 1. Introduction Although emotion detection from speech is a relatively new field of research, it has many potential applications. In human-computer or human-human interaction
Myanmar Continuous Speech Recognition System Based on DTW and HMM
Myanmar Continuous Speech Recognition System Based on DTW and HMM Ingyin Khaing Department of Information and Technology University of Technology (Yatanarpon Cyber City),near Pyin Oo Lwin, Myanmar Abstract-
Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg
Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg March 1, 2007 The catalogue is organized into sections of (1) obligatory modules ( Basismodule ) that
DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.
DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,
Carla Simões, [email protected]. Speech Analysis and Transcription Software
Carla Simões, [email protected] Speech Analysis and Transcription Software 1 Overview Methods for Speech Acoustic Analysis Why Speech Acoustic Analysis? Annotation Segmentation Alignment Speech Analysis
Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations
Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations C. Wright, L. Ballard, S. Coull, F. Monrose, G. Masson Talk held by Goran Doychev Selected Topics in Information Security and
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
Unlocking Value from. Patanjali V, Lead Data Scientist, Tiger Analytics Anand B, Director Analytics Consulting,Tiger Analytics
Unlocking Value from Patanjali V, Lead Data Scientist, Anand B, Director Analytics Consulting, EXECUTIVE SUMMARY Today a lot of unstructured data is being generated in the form of text, images, videos
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION P. Vanroose Katholieke Universiteit Leuven, div. ESAT/PSI Kasteelpark Arenberg 10, B 3001 Heverlee, Belgium [email protected]
Speech: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction
: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction Urmila Shrawankar Dept. of Information Technology Govt. Polytechnic, Nagpur Institute Sadar, Nagpur 440001 (INDIA)
SWING: A tool for modelling intonational varieties of Swedish Beskow, Jonas; Bruce, Gösta; Enflo, Laura; Granström, Björn; Schötz, Susanne
SWING: A tool for modelling intonational varieties of Swedish Beskow, Jonas; Bruce, Gösta; Enflo, Laura; Granström, Björn; Schötz, Susanne Published in: Proceedings of Fonetik 2008 Published: 2008-01-01
NATURAL SOUNDING TEXT-TO-SPEECH SYNTHESIS BASED ON SYLLABLE-LIKE UNITS SAMUEL THOMAS MASTER OF SCIENCE
NATURAL SOUNDING TEXT-TO-SPEECH SYNTHESIS BASED ON SYLLABLE-LIKE UNITS A THESIS submitted by SAMUEL THOMAS for the award of the degree of MASTER OF SCIENCE (by Research) DEPARTMENT OF COMPUTER SCIENCE
Automatic Transcription: An Enabling Technology for Music Analysis
Automatic Transcription: An Enabling Technology for Music Analysis Simon Dixon [email protected] Centre for Digital Music School of Electronic Engineering and Computer Science Queen Mary University
2014/02/13 Sphinx Lunch
2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue
School Class Monitoring System Based on Audio Signal Processing
C. R. Rashmi 1,,C.P.Shantala 2 andt.r.yashavanth 3 1 Department of CSE, PG Student, CIT, Gubbi, Tumkur, Karnataka, India. 2 Department of CSE, Vice Principal & HOD, CIT, Gubbi, Tumkur, Karnataka, India.
Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification
Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification Raphael Ullmann 1,2, Ramya Rasipuram 1, Mathew Magimai.-Doss 1, and Hervé Bourlard 1,2 1 Idiap Research Institute,
Automatic slide assignation for language model adaptation
Automatic slide assignation for language model adaptation Applications of Computational Linguistics Adrià Agustí Martínez Villaronga May 23, 2013 1 Introduction Online multimedia repositories are rapidly
USABILITY OF A FILIPINO LANGUAGE TOOLS WEBSITE
USABILITY OF A FILIPINO LANGUAGE TOOLS WEBSITE Ria A. Sagum, MCS Department of Computer Science, College of Computer and Information Sciences Polytechnic University of the Philippines, Manila, Philippines
Speech Signal Processing: An Overview
Speech Signal Processing: An Overview S. R. M. Prasanna Department of Electronics and Electrical Engineering Indian Institute of Technology Guwahati December, 2012 Prasanna (EMST Lab, EEE, IITG) Speech
TOOLS FOR RESEARCH AND EDUCATION IN SPEECH SCIENCE
TOOLS FOR RESEARCH AND EDUCATION IN SPEECH SCIENCE Ronald A. Cole Center for Spoken Language Understanding, Univ. of Colorado, Boulder ABSTRACT The Center for Spoken Language Understanding (CSLU) provides
Ericsson T18s Voice Dialing Simulator
Ericsson T18s Voice Dialing Simulator Mauricio Aracena Kovacevic, Anna Dehlbom, Jakob Ekeberg, Guillaume Gariazzo, Eric Lästh and Vanessa Troncoso Dept. of Signals Sensors and Systems Royal Institute of
31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
MUSICAL INSTRUMENT FAMILY CLASSIFICATION
MUSICAL INSTRUMENT FAMILY CLASSIFICATION Ricardo A. Garcia Media Lab, Massachusetts Institute of Technology 0 Ames Street Room E5-40, Cambridge, MA 039 USA PH: 67-53-0 FAX: 67-58-664 e-mail: rago @ media.
Lecture 12: An Overview of Speech Recognition
Lecture : An Overview of peech Recognition. Introduction We can classify speech recognition tasks and systems along a set of dimensions that produce various tradeoffs in applicability and robustness. Isolated
A Comparison of Speech Coding Algorithms ADPCM vs CELP. Shannon Wichman
A Comparison of Speech Coding Algorithms ADPCM vs CELP Shannon Wichman Department of Electrical Engineering The University of Texas at Dallas Fall 1999 December 8, 1999 1 Abstract Factors serving as constraints
Tagging with Hidden Markov Models
Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,
Subjective SNR measure for quality assessment of. speech coders \A cross language study
Subjective SNR measure for quality assessment of speech coders \A cross language study Mamoru Nakatsui and Hideki Noda Communications Research Laboratory, Ministry of Posts and Telecommunications, 4-2-1,
The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications
Forensic Science International 146S (2004) S95 S99 www.elsevier.com/locate/forsciint The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications A.
IT services for analyses of various data samples
IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical
DIXI A Generic Text-to-Speech System for European Portuguese
DIXI A Generic Text-to-Speech System for European Portuguese Sérgio Paulo, Luís C. Oliveira, Carlos Mendes, Luís Figueira, Renato Cassaca, Céu Viana 1 and Helena Moniz 1,2 L 2 F INESC-ID/IST, 1 CLUL/FLUL,
Develop Software that Speaks and Listens
Develop Software that Speaks and Listens Copyright 2011 Chant Inc. All rights reserved. Chant, SpeechKit, Getting the World Talking with Technology, talking man, and headset are trademarks or registered
The First Online 3D Epigraphic Library: The University of Florida Digital Epigraphy and Archaeology Project
Seminar on Dec 19 th Abstracts & speaker information The First Online 3D Epigraphic Library: The University of Florida Digital Epigraphy and Archaeology Project Eleni Bozia (USA) Angelos Barmpoutis (USA)
Analysis and Synthesis of Hypo and Hyperarticulated Speech
Analysis and Synthesis of and articulated Speech Benjamin Picart, Thomas Drugman, Thierry Dutoit TCTS Lab, Faculté Polytechnique (FPMs), University of Mons (UMons), Belgium {benjamin.picart,thomas.drugman,thierry.dutoit}@umons.ac.be
Telecommunication (120 ЕCTS)
Study program Faculty Cycle Software Engineering and Telecommunication (120 ЕCTS) Contemporary Sciences and Technologies Postgraduate ECTS 120 Offered in Tetovo Description of the program This master study
Things to remember when transcribing speech
Notes and discussion Things to remember when transcribing speech David Crystal University of Reading Until the day comes when this journal is available in an audio or video format, we shall have to rely
Prosodic Phrasing: Machine and Human Evaluation
Prosodic Phrasing: Machine and Human Evaluation M. Céu Viana*, Luís C. Oliveira**, Ana I. Mata***, *CLUL, **INESC-ID/IST, ***FLUL/CLUL Rua Alves Redol 9, 1000 Lisboa, Portugal [email protected], [email protected],
Automatic Evaluation Software for Contact Centre Agents voice Handling Performance
International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 Automatic Evaluation Software for Contact Centre Agents voice Handling Performance K.K.A. Nipuni N. Perera,
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
BCS HIGHER EDUCATION QUALIFICATIONS Level 6 Professional Graduate Diploma in IT. March 2013 EXAMINERS REPORT. Knowledge Based Systems
BCS HIGHER EDUCATION QUALIFICATIONS Level 6 Professional Graduate Diploma in IT March 2013 EXAMINERS REPORT Knowledge Based Systems Overall Comments Compared to last year, the pass rate is significantly
MEng, BSc Computer Science with Artificial Intelligence
School of Computing FACULTY OF ENGINEERING MEng, BSc Computer Science with Artificial Intelligence Year 1 COMP1212 Computer Processor Effective programming depends on understanding not only how to give
A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks
A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks Text Analytics World, Boston, 2013 Lars Hard, CTO Agenda Difficult text analytics tasks Feature extraction Bio-inspired
Incorporating Window-Based Passage-Level Evidence in Document Retrieval
Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological
Master s Program in Information Systems
The University of Jordan King Abdullah II School for Information Technology Department of Information Systems Master s Program in Information Systems 2006/2007 Study Plan Master Degree in Information Systems
Discourse Markers in English Writing
Discourse Markers in English Writing Li FENG Abstract Many devices, such as reference, substitution, ellipsis, and discourse marker, contribute to a discourse s cohesion and coherence. This paper focuses
Voice Driven Animation System
Voice Driven Animation System Zhijin Wang Department of Computer Science University of British Columbia Abstract The goal of this term project is to develop a voice driven animation system that could take
Workshop Perceptual Effects of Filtering and Masking Introduction to Filtering and Masking
Workshop Perceptual Effects of Filtering and Masking Introduction to Filtering and Masking The perception and correct identification of speech sounds as phonemes depends on the listener extracting various
TEXT TO SPEECH SYSTEM FOR KONKANI ( GOAN ) LANGUAGE
TEXT TO SPEECH SYSTEM FOR KONKANI ( GOAN ) LANGUAGE Sangam P. Borkar M.E. (Electronics)Dissertation Guided by Prof. S. P. Patil Head of Electronics Department Rajarambapu Institute of Technology Sakharale,
ETS Automated Scoring and NLP Technologies
ETS Automated Scoring and NLP Technologies Using natural language processing (NLP) and psychometric methods to develop innovative scoring technologies The growth of the use of constructed-response tasks
An Implementation of Active Data Technology
White Paper by: Mario Morfin, PhD Terri Chu, MEng Stephen Chen, PhD Robby Burko, PhD Riad Hartani, PhD An Implementation of Active Data Technology October 2015 In this paper, we build the rationale for
Master of Science in Computer Science
Master of Science in Computer Science Background/Rationale The MSCS program aims to provide both breadth and depth of knowledge in the concepts and techniques related to the theory, design, implementation,
Study Plan for Master of Arts in Applied Linguistics
Study Plan for Master of Arts in Applied Linguistics Master of Arts in Applied Linguistics is awarded by the Faculty of Graduate Studies at Jordan University of Science and Technology (JUST) upon the fulfillment
A Sound Analysis and Synthesis System for Generating an Instrumental Piri Song
, pp.347-354 http://dx.doi.org/10.14257/ijmue.2014.9.8.32 A Sound Analysis and Synthesis System for Generating an Instrumental Piri Song Myeongsu Kang and Jong-Myon Kim School of Electrical Engineering,
Weblogs Content Classification Tools: performance evaluation
Weblogs Content Classification Tools: performance evaluation Jesús Tramullas a,, and Piedad Garrido a a Universidad de Zaragoza, Dept. of Library and Information Science. Pedro Cerbuna 12, 50009 [email protected]
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus
Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus Yousef Ajami Alotaibi 1, Mansour Alghamdi 2, and Fahad Alotaiby 3 1 Computer Engineering Department, King Saud University,
Classifying Manipulation Primitives from Visual Data
Classifying Manipulation Primitives from Visual Data Sandy Huang and Dylan Hadfield-Menell Abstract One approach to learning from demonstrations in robotics is to make use of a classifier to predict if
Term extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
Word Completion and Prediction in Hebrew
Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology
ENGLISH LANGUAGE. A Guide to co-teaching The OCR A and AS level English Language Specifications. A LEVEL Teacher Guide. www.ocr.org.
Qualification Accredited Oxford Cambridge and RSA A LEVEL Teacher Guide ENGLISH LANGUAGE H470 For first teaching in 2015 A Guide to co-teaching The OCR A and AS level English Language Specifications Version
This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.
This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. Title Transcription of polyphonic signals using fast filter bank( Accepted version ) Author(s) Foo, Say Wei;
MPEG Unified Speech and Audio Coding Enabling Efficient Coding of both Speech and Music
ISO/IEC MPEG USAC Unified Speech and Audio Coding MPEG Unified Speech and Audio Coding Enabling Efficient Coding of both Speech and Music The standardization of MPEG USAC in ISO/IEC is now in its final
Establishing the Uniqueness of the Human Voice for Security Applications
Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 Establishing the Uniqueness of the Human Voice for Security Applications Naresh P. Trilok, Sung-Hyuk Cha, and Charles C.
Automated Multilingual Text Analysis in the Europe Media Monitor (EMM) Ralf Steinberger. European Commission Joint Research Centre (JRC)
Automated Multilingual Text Analysis in the Europe Media Monitor (EMM) Ralf Steinberger European Commission Joint Research Centre (JRC) https://ec.europa.eu/jrc/en/research-topic/internet-surveillance-systems
MEng, BSc Applied Computer Science
School of Computing FACULTY OF ENGINEERING MEng, BSc Applied Computer Science Year 1 COMP1212 Computer Processor Effective programming depends on understanding not only how to give a machine instructions
Machine Learning with MATLAB David Willingham Application Engineer
Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the
Proceedings of Meetings on Acoustics
Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Speech Communication Session 2aSC: Linking Perception and Production
Data Mining Solutions for the Business Environment
Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania [email protected] Over
CallAn: A Tool to Analyze Call Center Conversations
CallAn: A Tool to Analyze Call Center Conversations Balamurali AR, Frédéric Béchet And Benoit Favre Abstract Agent Quality Monitoring (QM) of customer calls is critical for call center companies. We present
Program curriculum for graduate studies in Speech and Music Communication
Program curriculum for graduate studies in Speech and Music Communication School of Computer Science and Communication, KTH (Translated version, November 2009) Common guidelines for graduate-level studies
Transcription bottleneck of speech corpus exploitation
Transcription bottleneck of speech corpus exploitation Caren Brinckmann Institut für Deutsche Sprache, Mannheim, Germany Lesser Used Languages and Computer Linguistics (LULCL) II Nov 13/14, 2008 Bozen
Author Gender Identification of English Novels
Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
