PhD Thesis: Statistical Parametric Speech Synthesis Based on the Degree of Articulation
University of Mons
Doctoral School MUSICS - Signal Processing

PhD Thesis
to obtain the title of PhD in Applied Sciences of the University of Mons
Specialty: Speech Processing

Defended by Benjamin Picart

Statistical Parametric Speech Synthesis Based on the Degree of Articulation

Thesis Advisor: Thierry Dutoit
Thesis Co-Advisor: Thomas Drugman

Prepared at the University of Mons, Faculté Polytechnique, TCTS Lab
Defended on October 29, 2013

Jury:
Prof. Marc Pirlot - University of Mons (UMONS)
Prof. Thierry Dutoit - University of Mons (UMONS)
Prof. Francis Grenez - Université Libre de Bruxelles (ULB)
Prof. Simon King - University of Edinburgh (Scotland)
Dr. Thomas Drugman - University of Mons (UMONS)
Dr. Vincent Pagel - Acapela Group S.A. (Mons)
Dr. Raphael Sebbe - Creaceed S.P.R.L. (Mons)
To my grandfather Marcel. To my grandmother Marcelle, my parents Annie and Pascal, my sister Justine and my girlfriend Virginie.
When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
Arthur C. Clarke (16th of December 1917 - 19th of March 2008)

For certain you have to be lost to find a place as can't be found. Elseways, everyone would know where it was.
Geoffrey Rush, alias Hector Barbossa, in Pirates of the Caribbean: At World's End
Abstract

Nowadays, speech synthesis is part of various daily life applications. The ultimate goal of such technologies consists in extending the possibilities of interaction with the machine, in order to get closer to human-like communications. However, current state-of-the-art systems often lack realism: although high-quality speech synthesis can be produced by many researchers and companies around the world, synthetic voices are generally perceived as hyperarticulated. In any case, their degree of articulation is fixed once and for all.

The present thesis falls within the more general quest for enriching expressivity in speech synthesis. The main idea consists in improving statistical parametric speech synthesis, whose most famous example is Hidden Markov Model (HMM) based speech synthesis, by introducing a control of the articulation degree, so as to enable synthesizers to automatically adapt their way of speaking to the contextual situation, like humans do.

The degree of articulation, which is probably the least studied prosodic parameter, is characterized by modifications of phonetic context, of speech rate and of spectral dynamics (vocal tract rate of change). It depends upon the surrounding environment and the communication context, and provides information on the relationship between the speaker and the listener(s). According to Lindblom's H and H theory, speakers are expected to vary their output along a continuum of hypo and hyperarticulated speech. Compared to the neutral case, hyperarticulated speech tends to maximize the clarity of the speech signal by increasing the articulation efforts needed to produce it, while hypoarticulated speech is produced with minimal articulation efforts.

The work presented in this PhD thesis provides a thorough and detailed study of the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis. This framework is very convenient for creating a synthesizer whose speaker characteristics and speaking styles can be easily modified. In order to achieve this goal, a new French database consisting of three distinct and parallel sets (one for each articulation degree to be studied, i.e. neutral, hypoarticulated and hyperarticulated speech) was recorded. This database allows: i) the study of both acoustic and phonetic modifications due to articulatory effort changes; ii) the design of a high-quality speech synthesizer integrating a continuous control of the articulation degree. This first requires addressing the issue of speaking style adaptation, so as to derive hypo and hyperarticulated speech from the neutral synthesizer. Once this is done, an interpolation and extrapolation of the resulting models makes it possible to finely tune the voice so that it is generated with the desired articulatory efforts. Secondly, we perform a perceptual study of speech with a variable articulation degree, specifically focusing on: i) the internal mechanisms leading to the perception of the degree of articulation by listeners (i.e. cepstrum, prosody, phonetic transcription adaptation and the complete adaptation); ii) how intelligibility and various other voice dimensions are affected. Based on the ensuing conclusions, we finally implement an automatic modification of the degree of articulation in an existing standard neutral voice for which no hypo or hyperarticulated recordings are available.
Keywords: HMM-based Speech Synthesis, Speech Analysis, Expressive Speech, Degree of Articulation, Speaking Style Adaptation, Speaking Style Transposition, Voice Quality, Speech Intelligibility
Acknowledgements

The present thesis has been fulfilled within the Circuit Theory and Signal Processing (TCTS) lab of the Faculté Polytechnique (FPMs) at the University of Mons (UMONS), and was made possible by the support of the Fonds pour la Formation à la Recherche dans l'Industrie et dans l'Agriculture (FRIA).

I would like to express my deepest gratitude to my supervisor, Prof. Thierry Dutoit, and to my co-supervisor, Dr. Thomas Drugman, for their kindness, their insightful guidance, their availability and their support throughout this thesis.

I am also thankful to Acapela Group S.A. for the fruitful collaboration and for providing me with their linguistic front-end. In particular, I am grateful to Mr. Geoffrey Wilfart, Mr. Fabrice Malfrère, Dr. Vincent Pagel and Mr. Olivier Deroo for their judicious advice, their time and the industrial partnership. I would also like to thank Hui Liang and Lakshmi Saheer, from the Idiap Research Institute, for their help and advice when I started my thesis.

I would like to thank all the people working in TCTS for their friendship and help throughout my thesis, and particularly: Alexis, Maria, Onur, Jérôme and Joëlle, Thomas, Jean-Marc, Hüseyin, Sandrine, Nicolas R., Stéphane, Loïc, Thierry R., Matéi, Radwan, Matthieu, Thierry C., Stéphanie, Nicolas D., Johan, William, Nathalie, Bernard, Joël, etc. A special thanks to the card player team, for all the good games: Thomas, Justine, Zacharie, Amaury, Vasiliki and Christophe. Another special thanks to Anderson, who introduced me to the game of Go and got me addicted to it. I am also grateful to Caroline, Hatice and Véronique, from the Secrétariat des Etudes, for their kindness since I came to FPMs as a student, almost 9 years ago. This is also the End of an Era. I thank Nicolas Linze, alias Reggie, for all the good times we spent together during our 7-year parallel academic path (which was so close that even our FRIA project defenses started only minutes apart!), and also all the nice people I have met across the world.

For their availability and their judicious comments, I am thankful to all my thesis proofreaders: Thierry Dutoit, Thomas Drugman, Alexis Moinet, Jérôme Urbain and Joëlle Tilmanne, Maria Astrinaki, Onur Babacan and Sandrine Brognaux.

I would finally like to express my deepest gratitude to my grandfather for his kind thoughts, his help and advice when needed, and to my grandmother, my parents, my sister and Virginie for their kind support all along this journey.
Contents

1 General Introduction
  1.1 Introduction
    1.1.1 Unit Selection Speech Synthesis
    1.1.2 HMM-based Speech Synthesis
    1.1.3 The Degree of Articulation
  1.2 Contributions and Structure of the Thesis

2 Background
  2.1 Introduction
  2.2 Markov Model Theory
    2.2.1 Discrete-Time Markov Process
    2.2.2 Hidden Markov Model
  2.3 Overview of HMM-based Speech Synthesis
  2.4 Training Step in HMM-based Speech Synthesis
    2.4.1 Spectral Parameters
    2.4.2 F0 Modeling
    2.4.3 State Duration
    2.4.4 Clustering
  2.5 Synthesis Step in HMM-based Speech Synthesis
    2.5.1 Maximizing P(q | W, λ)
    2.5.2 Maximizing P(O | q, λ)
  2.6 Voice Adaptation Techniques
    2.6.1 Maximum Likelihood Linear Regression (MLLR)
    2.6.2 Maximum A Posteriori (MAP) Adaptation

3 Creation of a Database with various Degrees of Articulation
  3.1 Introduction
  3.2 Database Specifications
  3.3 Recording Hardware
    3.3.1 Audio Acquisition System - Motu 8pre
    3.3.2 Microphone - AKG C3000B
    3.3.3 XLR Connections
    3.3.4 Digital Effects - Behringer Virtualizer DSP
    3.3.5 Amplifier - Behringer Powerplay Pro-8 HA
  3.4 Conclusions

4 Analysis of Hypo and Hyperarticulated Speech
  4.1 Introduction
    4.1.1 Increase in the Articulation Effort
    4.1.2 Decrease in the Articulation Effort
    4.1.3 Contributions and Structure of the Chapter
  4.2 Acoustic Analysis
    4.2.1 Vocal Tract-based Modifications
    4.2.2 Glottal-based Modifications
  4.3 Phonetic Analysis
    4.3.1 Glottal Stops
    4.3.2 Phone Variations
    4.3.3 Phone Durations
    4.3.4 Speech Rate
  4.4 Conclusions

5 HMM-based Synthesis of Hypo and Hyperarticulated Speech
  5.1 Introduction
    5.1.1 Reactive Speech Synthesis
    5.1.2 Knowledge Integration in Speech Synthesis
    5.1.3 Contributions and Structure of the Chapter
  5.2 Method
  5.3 Acoustic Analysis
  5.4 Objective Evaluation
  5.5 Subjective Evaluation
  5.6 Conclusions

6 Continuous Control of the Degree of Articulation
  6.1 Introduction
    6.1.1 From Source toward Target Speakers
    6.1.2 Voice Interpolation and Extrapolation between Statistical Models
    6.1.3 Contributions and Structure of the Chapter
  6.2 Speaking Style Adaptation
    6.2.1 Method
    6.2.2 Objective Evaluation
    6.2.3 Subjective Evaluation
  6.3 Interpolation and Extrapolation of the Degree of Articulation
    6.3.1 Method
    6.3.2 Perception of the Degree of Articulation
    6.3.3 Segmental Quality of the Interpolation and Extrapolation
  6.4 Conclusions

7 Subjective Assessment of Hypo and Hyperarticulated Speech
  7.1 Introduction
    7.1.1 Speech Intelligibility Estimation
    7.1.2 Speech Intelligibility Enhancement
    7.1.3 Contributions and Structure of the Chapter
  7.2 Effects Influencing the Perceived Degree of Articulation
    7.2.1 Method
    7.2.2 Experiments
  7.3 Intelligibility and Quality Assessments of Hypo and Hyperarticulated Speech
    7.3.1 Method
    7.3.2 Semantically Unpredictable Sentences Test
    7.3.3 Absolute Category Rating Test
  7.4 Conclusions

8 Varying the Degree of Articulation of Any Voice within HMM-based Speech Synthesis
  8.1 Introduction
    8.1.1 Creating a Target Style Model without any Target Style Speech Data
    8.1.2 Contributions and Structure of the Chapter
  8.2 Creation of the Articulation Model
  8.3 Techniques for the Transposition of the Articulation Model to a New Speaker
  8.4 Prosody Transposition
    8.4.1 Experimental Framework
    8.4.2 Speech Quality of the Prosody Model Transposition
    8.4.3 Perception of the Degree of Articulation
  8.5 Filter Transposition
    8.5.1 Experimental Framework
    8.5.2 Speech Quality of the Filter Model Transposition
    8.5.3 Perception of the Degree of Articulation
    8.5.4 Identity Preservation Assessment
    8.5.5 Conclusions on Filter Transposition
  8.6 Generalization to Other Voices
    8.6.1 Experimental Framework
    8.6.2 Speech Quality of the Prosody and Filter Models Transposition
    8.6.3 Perception of the Degree of Articulation
    8.6.4 Identity Preservation Assessment
  8.7 Conclusions

9 General Conclusion and Future Works
  9.1 Conclusions
    9.1.1 Creation of a Database with various Degrees of Articulation
    9.1.2 Analysis of Hypo and Hyperarticulated Speech
    9.1.3 Continuous Control of the Degree of Articulation
    9.1.4 Subjective Assessment of Hypo and Hyperarticulated Speech
    9.1.5 Varying the Degree of Articulation of Any Voice within HMM-based Speech Synthesis
  9.2 Thesis Contributions
  9.3 Perspectives
    9.3.1 In Direct Continuity
    9.3.2 Average-Voice-based Speech Synthesis integrating the Degree of Articulation
    9.3.3 Generalization to other types of Data and Languages

Bibliography

A Publications
  A.1 Journals
  A.2 Conference Proceedings
  A.3 Scientific Reports
List of Figures

2.1 Schematic representation of (a) a 3-state ergodic HMM and (b) a 4-state left-to-right HMM, together with emission B = {b_j(o)} and transition A = {a_ij} probabilities associated with each state
2.2 Output distributions: (a) Gaussian PDF, (b) Gaussian mixture PDF, (c) Multi-stream PDF. Adapted from [Yamagishi 2006]
2.3 Overview of the HMM-based Speech Synthesis System ("H-Triple-S" - HTS), from [Zen et al. 2009]
2.4 F0 pattern modeling, from [Masuko 2002]
2.5 Multi-Space probability Distribution (MSD) and observations, from [Masuko 2002]
2.6 An HMM based on Multi-Space probability Distribution (MSD), from [Masuko 2002]
2.7 Multi-Space probability Distribution Hidden Markov Model (MSD-HMM) for F0 modeling, from [Tokuda & Zen 2009]
2.8 HMM duration PDFs modeled either by their state self-transition probabilities (decreasing exponential blue curve) or by a Gaussian distribution (Gaussian red curve), from [Tokuda & Zen 2009]
2.9 Decision tree context clustering, from [Tokuda et al. 2002b]
2.10 Duration synthesis, from [Yamagishi 2006]
2.11 Generated speech parameter trajectory, from [Tokuda & Zen 2009]
2.12 Maximum Likelihood Linear Regression (MLLR) and its related algorithms, adapted from [Yamagishi et al. 2009a]
2.13 Combined algorithm of the (C)MLLR and MAP adaptation, adapted from [Yamagishi et al. 2009a]
2.14 Relationship between the MAP and the ML estimates, adapted from [Yamagishi 2006]
3.1 Sound-proof room equipped in order to record natural-sounding NEU, HPO and HPR speech
3.2 Schematic illustration of the standard recording protocol designed in this work to induce the speaker's (a) HPO ("amplification effect") and (b) HPR ("cathedral effect") speech
4.1 Vocalic triangle estimated on the original recordings for each DoA, together with dispersion ellipses
4.2 Pitch histograms for each DoA
4.3 Averaged magnitude spectrum of the glottal source for each DoA (in the top right corner, a zoom on the glottal formant frequency)
4.4 Histograms of the maximum voiced frequency for each DoA
4.5 Number of glottal stops for each vowel and for each DoA
4.6 Phone duration histograms. (a) Front, central, back & nasal vowels. (b) Plosive & fricative consonants. (c) Pauses
4.7 Phone duration histograms. (a) Semi-vowels. (b) Trill consonants
5.1 Standard training of the NEU, HPO and HPR full data models, from the database containing 1220 training sentences for each DoA
5.2 Vocalic triangle estimated on the generated recordings for each DoA, together with dispersion ellipses
5.3 Subjective evaluation of the overall speech quality of the full data models (mean score with its 95% CI)
6.1 Standard training of the NEU, HPO and HPR full data models (Chapter 5), from the database containing 1220 training sentences for each DoA. Adaptation of the NEU full data model using CMLLR transforms with HPO and HPR speech data to produce HPO and HPR adapted models (Section 6.2). Implementation of a tuner, manually adjustable by the user, for a continuous control of the DoA (Section 6.3)
6.2 Objective evaluation - Average MCD [dB] computed between the adapted and the full data models. Black dots indicate actual measures
6.3 Objective evaluation - RMSE of log F0 [cent] computed between the adapted and the full data models. Black dots indicate actual measures
6.4 Objective evaluation - RMSE of vowel durations [number of frames] (frame shift = 5 ms) computed between the adapted and the full data models. Black dots indicate actual measures
6.5 Subjective evaluation of the overall speech quality of the adapted models - Effect of the number of adaptation sentences on CCR scores (mean scores with their 95% confidence intervals)
6.6 Subjective evaluation of the adapted models - Perceived interpolation and extrapolation ratio as a function of the actual interpolation and extrapolation ratio, together with its 95% confidence interval
7.1 Subjective evaluation of the perception of the DoA - Mean PDA scores with their 95% confidence intervals (CI) for each DoA
7.2 Subjective evaluation of the perception of the DoA - ACR test
7.3 Subjective intelligibility evaluation of the DoA (SUS Test) - Mean word (top) and sentence (bottom) recognition accuracies [%], together with their 95% CI
7.4 Subjective quality evaluation of the DoA (ACR Test) - Mean scores together with their 95% CI
8.1 Vocalic triangles estimated on the original NEU recordings for Voices A, B, M and F, together with dispersion ellipses
8.2 Creation of the articulation model on Voice A. Transforms are computed in two alternative ways, using LS or CMLLR adaptation
8.3 Comparison of mean vector µ adaptation in CMLLR and model-space LS
8.4 Prosody and filter adaptation transforms computed on Voice A are applied to an existing standard NEU Voice B with no HPO or HPR recordings available, generating Voice B HPO and HPR adapted models. The most successful method (selected through various evaluations) is then used for automatically modifying the DoA of two other speakers (Voices M and F)
8.5 Transposition of the articulation model learned on Voice A to Voice B. Leaf node mapping is performed in two alternative ways, using phonetic (based on decision trees) or acoustic (based on KL divergence) mapping
8.6 CMOS test for prosody transposition - Mean CMOS score for each method and each DoA, together with their 95% confidence intervals (CI)
8.7 CMOS test for prosody transposition - Detailed preference scores (expressed in [%]), averaged over all the participants and utterances used in the test, for each method compared to the baseline, for HPR speech (left) and HPO speech (right)
8.8 CPDA test for prosody transposition - Mean score of the perceived DoA using the four methods or the baseline (1 being the reference DoA, defined on Voice A), together with their 95% CI
8.9 MOS test for the second pruning step - Overall speech quality (with its 95% CI) of the sentences synthesized by the HPO and HPR transposed models of Voice B
8.10 CMOS test for filter transposition - Mean CMOS score for each method and each DoA, together with their 95% CI
8.11 CPDA test for filter transposition - Mean score of the perceived DoA using the four methods or the baseline (1 being the reference DoA, defined on Voice A), together with their 95% CI
8.12 ID test for filter transposition - Mean score for each method and each DoA (Voice A = 0, Voice B = 1), together with their 95% CI
8.13 Vocalic triangles estimated on the synthesized HPR, NEU and HPO speech for Voices A, B, M and F, together with dispersion ellipses
8.14 CMOS test for the generalization of the prosody and filter transposition - Mean CMOS scores for each DoA using the LSP_LS_Phn method, together with their 95% CI
8.15 CPDA test for the generalization of the prosody and filter transposition - Mean score of the perceived DoA using the LSP_LS_Phn method (1 being the reference DoA, defined on Voice A), together with their 95% CI
8.16 ID test for the generalization of the prosody and filter transposition - Mean score for each DoA using the LSP_LS_Phn method (Voice A = 0, Voice M or F = 1), together with their 95% CI
List of Tables

2.1 Mel-Generalized Cepstral (MGC) analysis
4.1 Vocalic space (in kHz²) for the three DoA for the original sentences
4.2 Deleted and inserted phone percentages in HPO and HPR speech respectively, compared to the NEU style, and their repartition inside the words: total (first row), beginning (second row), middle (third row), end (fourth row)
4.3 Speech rates and related time information for NEU, HPO & HPR speech, together with the positive or negative variation from the NEU style (in [%])
5.1 Vocalic space (in kHz²) for the three DoA for the synthesized sentences
5.2 Objective evaluation of the overall speech quality of the full data models: average MCD [dB], RMSE_lf0 [cent] and RMSE_dur [number of frames] (frame shift = 5 ms) with their 95% confidence intervals (CI) for each DoA
6.1 Grades in the CCR scale
6.2 Grades in the CMOS scale
6.3 Subjective evaluation of the adapted models (CMOS test) - Perceived synthesis quality of the test sentence X vs. the NEU sentence B (CMOS scores with their 95% confidence intervals)
7.1 Four different synthesizers, so as to analyze the internal mechanisms leading to the perception of the DoA by listeners
7.2 Answering the questions by comparing the synthesizers' performance
7.3 Question list asked to listeners during the ACR test, together with their corresponding extreme category responses [de Mareüil et al. 2006]
7.4 Question list (complement to Table 7.3) asked to listeners during the ACR test, together with their corresponding extreme category responses [de Mareüil et al. 2006]
8.1 Speech rates, mean and standard deviation of F0 values for Voice A NEU, HPO and HPR recordings and for Voice B, M and F NEU recordings
8.2 Methods for applying the prosody and filter transposition transforms from Voice A to Voice B
8.3 Selected methods after the first pruning step and after the second one. Observed artefacts on the rejected methods are also indicated (u: filter instability; g: occurrence of glitches; i: complete target speaker identity loss)
Acronyms

ACR: Absolute Category Rating
AI: Articulation Index
ANN: Artificial Neural Network
ANOVA: ANalysis Of VAriance
ASR: Automatic Speech Recognition
CCD: Complex Cepstrum-based Decomposition
CCR: Comparison Category Rating
CI: Confidence Interval
CMOS: Comparative Mean Opinion Score
CMLLR: Constrained Maximum Likelihood Linear Regression
CSMAPLR: Constrained Structural Maximum A Posteriori Linear Regression
dB: Decibel
DoA: Degree of Articulation
DSM of residual signal: Deterministic plus Stochastic Model of residual signal
DSP: Digital Signal Processor
EM: Expectation-Maximization
FWS: Frequency Weighted Segmental
F0: Fundamental frequency
Fx: Formant x (x = formant id)
GCI: Glottal Closure Instant
GMM: Gaussian Mixture Model
GP: Glimpse Proportion
HMM: Hidden Markov Model
HNM: Harmonic plus Noise Model
HPO: HyPOarticulation or HyPOarticulated
HPR: HyPeRarticulation or HyPeRarticulated
HSM: Harmonic/Stochastic Model
HSMM: Hidden Semi-Markov Model
HTS: HMM-based Speech Synthesis System ("H-Triple-S")
Hz: Hertz
LAR: Log Area Ratio
LF model: Liljencrants-Fant model
LPC: Linear Predictive Coding
LS: Linear Scaling
LSF: Line Spectral Frequency
LSP: Line Spectral Pairs
MAP: Maximum A Posteriori
MCD: Mel-Cepstral Distortion
MELP: Mixed Excitation Linear Prediction
MFA: Mixtures of Factor Analyzers
MFCC: Mel-Frequency Cepstrum Coefficient
MGC coefficients: Mel-Generalized Cepstral coefficients
ML: Maximum Likelihood
MLLR: Maximum Likelihood Linear Regression
MLSA: Mel Log Spectrum Approximation
MOS: Mean Opinion Score
MRGV: Multiple-Regression Global Variance
MSD-HSMM: Multi-Space probability Distribution Hidden Semi-Markov Model
NEU: NEUtral
NLP: Natural Language Processor
PARCOR coefficients: PARtial CORrelation coefficients
PDA: Perceived Degree of Articulation
PDF: Probability Density Function
PWI: Prototype Waveform Interpolation
RIR: Room Impulse Response
RMSE: Root-Mean-Square Error
RMSE_dur: Root-Mean-Square Error of vowel durations
RMSE_lf0: Root-Mean-Square Error of log F0
SAT: Speaker-Adaptive Training
SEDREAMS: Speech Event Detection using the REsidual And Mean-based Signals
SII: Speech Intelligibility Index
SNR: Signal to Noise Ratio
STI: Speech Transmission Index
STOI: Short-Time Objective Intelligibility
STRAIGHT: Speech Transformation and Representation using Adaptive Interpolation of weighted spectrum
SUS: Semantically Unpredictable Sentences
TCGPP: Template Constrained Generalized Posterior Probability
TTS: Text-To-Speech
VC: Voice Conversion
VTLN: Vocal Tract Length Normalization
WSS: Weighted Spectral Slope
Chapter 1

General Introduction

1.1 Introduction

Nowadays, the speech synthesis market is expanding. In addition to the numerous daily life multimedia applications, the speech synthesis domain is most of the time associated with the search for extending the interaction possibilities between the human and the machine, in order to get closer to human-like communications. On the one hand, speech quality is characterized by its naturalness, its intelligibility and its expressivity. On the other hand, speech synthesizer efficiency is characterized by the amount of resources required for synthesis (i.e. the amount of data collected during the database recording, and the amount of time needed for collecting and processing it) and by the number of languages available (if possible with the same voice).

Two main techniques govern speech synthesis: unit selection [Hunt & Black 1996] and statistical parametric speech synthesis [Zen et al. 2009]. Unit selection speech synthesis, in which appropriate subword units are automatically selected from a natural speech database, allows the generation of high-quality human-like sounding speech, but requires a huge amount of resources. The basic idea relies on two cost functions: the target cost, which represents how well the selected unit matches the target, and the concatenation cost, representing how well two selected units combine. The target cost function can also be calculated in advance using tree-based clustering [Donovan & Woodland 1995] [Black & Taylor 1997]. At synthesis time, the goal thus consists in the minimization of the overall cost of a label sequence to be produced, which is equal to the sum of the target and concatenation cost functions. Many works have focused on this kind of speech synthesis, and more information can be found in [Taylor 2009].

In direct contrast with this selection of actual unmodified instances of speech from a database, statistical parametric speech synthesis might be most simply described as generating the average of some sets of similarly sounding speech segments [Zen et al. 2009]. This produces good quality speech synthesis, albeit slightly degraded by its buzziness compared to what is generated by unit selection speech synthesis. It has the significant advantage of greatly reducing the memory footprint (as only the statistical models have to be stored), although the runtime computation may be much higher. As explained further in the present thesis, the speech parameters, i.e. spectrum, fundamental frequency (F0) and phone duration, allowing the reconstruction of any speech unit, are statistically modeled and generated by Hidden Markov Models (HMMs) or Hidden Semi-Markov Models (HSMMs). This is the reason why the most famous example of statistical parametric speech synthesis is often called HMM-based speech synthesis.
1.1.1 Unit Selection Speech Synthesis

ATR ν-talk was the first system to demonstrate the effectiveness of the automatic selection of appropriate units [Sagisaka et al. 1992], based on minimizing acoustic distortions between selected units and the target spectrum. Then CHATR generalized these techniques to multiple languages and an automatic scheme [Hunt & Black 1996], by taking into account both the prosodic and phonetic appropriateness of units. Synthetic speech is directly linked to the database. Indeed, the more carefully the database is recorded (i.e. high quality recordings), the higher the generated speech quality. The quality is also directly linked with the size of the database, as a larger database implies a better unit coverage (although never a perfect one [Möbius 2003]). Most current commercial systems use this synthesis technique. Its main characteristics are:

+ high-quality speech synthesis, as speech units are directly selected from a database of actual human speech (there is no underlying statistical process);

- it is not very portable to embedded devices, which often have limited memory resources, as a large database is required (typically around 400 MB) in order to cover most of the phonetic and prosodic contexts. However, it should be noted that the runtime computation may be lower compared to HMM-based speech synthesis;

- it is not very flexible, as speech units cannot be easily and straightforwardly modified (e.g. changes in spectrum, fundamental frequency and phone duration). If expressive speech synthesis is required, for any arbitrary style, a database containing this kind of expressive human speech is necessary. Recording a database is time-consuming and therefore expensive;

- Automatic Speech Recognition (ASR) techniques are hardly applicable (e.g. speaker adaptation methods).

Although this is a successful method, high quality speech synthesis cannot be guaranteed in all cases. The synthesized speech quality can be dramatically degraded if the input text requires phonetic and prosodic contexts that are under-represented in the database. Even if it is not common, a single bad concatenation in an utterance can dramatically affect the resulting subjective appreciation. Due to the prohibitive number of possible combinations between units, it is impossible to ensure that no bad joins or inadequate unit selections will occur, except in the special case of limited domain synthesizers [Black & Lenzo 2000], where the database is designed for specific applications.

As selected units cannot be (easily) modified, synthetic speech is limited to the same style as the one in the original recordings. As a consequence, larger speech databases containing various speaking styles are required in order to limit this effect and have more control over the synthesis (like IBM's stylistic synthesis [Eide et al. 2004]). Unfortunately, recording large databases with variations is very difficult and costly [Black 2003]. The time needed to record a normal database varies from 8h to 40h, depending on the language and on the desired synthetic quality. Moreover, this data has to be processed afterwards (i.e. annotations, segmentation, etc.), which can last several months.
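To make the overall-cost minimization described above concrete, the following is a minimal sketch (in Python, and not the implementation of any system cited in this chapter) of unit selection as a shortest-path search over a lattice of candidate units. The callables `target_cost` and `concat_cost` are hypothetical stand-ins for the cost functions of a real system.

```python
# A minimal sketch of unit selection by dynamic programming: for each slot,
# keep the cheapest path reaching every candidate, so the globally optimal
# unit sequence is found without enumerating all combinations.
def select_units(targets, candidates, target_cost, concat_cost):
    n = len(targets)
    # best[i][k]: lowest cost of any path ending in candidate k at slot i
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            # best predecessor = min over previous candidates of path + join cost
            costs = [best[i - 1][k] + concat_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + tc)
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # backtrack the lowest-cost unit sequence
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for i in range(n - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)]

# Toy usage with scalar "units": cost = squared distance to target + jump penalty
units = select_units(
    targets=[1.0, 2.0, 3.0],
    candidates=[[0.8, 1.5], [1.9, 2.6], [2.7, 3.3]],
    target_cost=lambda t, c: (t - c) ** 2,
    concat_cost=lambda p, c: 0.1 * abs(c - p),
)
print(units)
```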
1.1.2 HMM-based Speech Synthesis

A new method for speech synthesis emerged around ten years ago: statistical parametric speech synthesis. This technique consists of two parts: the training and the synthesis steps. During the training step, a natural speech database is analyzed, i.e. for each analysis frame, spectral (filter contribution) and excitation (glottal source contribution) parameters are extracted. These parameters are modeled through context-dependent (e.g. phonetic, prosodic, etc.) statistical models. During the synthesis step, the input text is first converted into such a context-dependent label sequence. The idea is that realistic parameters should be generated by the models, by maximizing the likelihood of the sequence given the model. The speech signal is eventually reconstructed from some parametric representation of speech. The main characteristics of this approach are:

+ higher portability for embedded devices, which often have limited memory resources (as only the statistical models have to be stored). However, it should be noted that the runtime computation may be higher compared to unit selection speech synthesis;

+ higher flexibility, including the possibility of using voice conversion and techniques developed for ASR like speaker adaptation methods, potentially leading to more expressive speech synthesis;

+ smaller memory footprint, typically within one MB, as only the statistical models have to be stored;

- lower synthetic speech quality, often termed buzziness.

The latter characteristic is the main drawback of HMM-based speech synthesis. This is mainly due to the fact that this synthesis technique is based on a parametric representation of the speech signal: the excitation signal, consisting of either a pulse train for voiced speech or white noise for unvoiced speech, is far too simplistic. Several studies have focused, and are still focusing, on this issue, in order to improve the output speech quality of such systems: among others, the Speech Transformation and Representation using Adaptive Interpolation of weighted spectrum (STRAIGHT) [Kawahara et al. 1999] [Kawahara & Morise 2011], the glottal-flow-derivative model [Cabral et al. 2007] [Cabral et al. 2008] and the Deterministic plus Stochastic Model (DSM) of the residual signal [Drugman et al. 2009b] [Drugman & Dutoit 2012].

1.1.3 The Degree of Articulation

Current state-of-the-art systems often lack realism: synthetic voices are most of the time perceived as hyperarticulated, and in any case, their degree of articulation is fixed once and for all. The expressivity of synthetic voices can be improved by modifying various prosodic parameters, including the fifth dimension of prosody: the degree of articulation [Pfitzinger 2006].
According to Lindblom's H and H theory [Lindblom 1983], speakers are expected to vary their output along a continuum of hypo and hyperarticulated speech. Compared to the neutral case, hyperarticulated speech tends to maximize the clarity of the speech signal by increasing the articulation efforts needed to produce it, while hypoarticulated speech is produced with minimal articulation efforts. Therefore the degree of articulation (DoA) provides information on the relationship between the speaker and the listeners, as well as on the speaker's introversion and extroversion in real-life situations [Beller 2009]. This status can be induced by contextual factors (like the listener's emotional state) or simply by the speaker's own expressivity. Indeed, when talkers speak, they also listen to each other [Cooke et al. 2012]. Speakers can adopt a speaking style allowing them to be more easily understood in difficult communication situations.

In this work, hyperarticulated speech (HPR) refers to the situation of a person talking in a reverberant environment, e.g. a teacher or a speaker talking in front of a large audience (important articulation efforts have to be made to be understood by everybody). Hypoarticulated speech (HPO) refers to the situation of a person talking in a quiet environment (e.g. in a library) or very close to someone (few articulation efforts have to be made to be understood). Neutral speech (NEU) refers to the daily life situation of a person reading a text aloud without emotion (e.g. no happiness, no anger, no excitement, etc.) and without any specific articulation efforts to produce the speech, keeping only the sentence intonation: rising intonation for questions, flat intonation for affirmative or negative sentences, etc. It is worth noting that these three modes of expressivity are emotionless, but can vary amongst speakers, as reported in [Beller 2009]. The influence of emotion on the DoA has been studied in [Beller 2007] [Beller et al. 2008] and is out of the scope of this work.

The DoA is characterized by modifications of the phonetic stream, of the fundamental frequency, of the speech rate and of the spectral dynamics (vocal tract rate of change). A common measure of the DoA consists in defining formant targets for each phone, taking coarticulation into account, and studying differences between real observations and targets vs. the speech rate [Wouters & Macon 2001]. Since defining formant targets is not an easy task, Beller proposed in [Beller 2009] a statistical measure of the DoA by studying the joint evolution of the vocalic triangle (i.e. the shape formed by the vowels /a/, /i/ and /u/ in the F1-F2 space) area and the speech rate. A recent study presented a computational model of human speech production to provide a continuous adjustment according to environmental conditions [Nicolao et al. 2012].

In direct connection with HPR speech, the Lombard effect [Lombard 1911] refers to the speech changes due to the immersion of the speaker in a noisy environment. It is indeed known that a speaker tends to increase his vocal efforts to be more easily understood while talking in background noise [Summers et al. 1988]. Various aspects of the Lombard effect have already been studied, including acoustic and articulatory characteristics [Garnier et al. 2006b] [Garnier et al. 2006a], features extracted from the glottal flow [Drugman & Dutoit 2010a], and changes of F0 and of the spectral tilt [Lu & Cooke 2009].
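As a minimal illustration of the vocalic-triangle component of Beller's statistical measure mentioned above (a sketch only; the formant values below are purely illustrative, not measurements from the thesis database), the area spanned by the mean (F1, F2) positions of /a/, /i/ and /u/ can be computed with the shoelace formula:

```python
# Vocalic triangle area in the F1-F2 plane (kHz^2), via the shoelace formula.
def vocalic_triangle_area(f_a, f_i, f_u):
    (x1, y1), (x2, y2), (x3, y3) = f_a, f_i, f_u  # (F1, F2) in kHz
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

# Hypothetical mean formants (kHz): hyperarticulation stretches the triangle,
# hypoarticulation shrinks it, so the area tracks the articulation effort.
area = vocalic_triangle_area(f_a=(0.70, 1.30), f_i=(0.30, 2.20), f_u=(0.35, 0.90))
print(f"vocalic space: {area:.3f} kHz^2")
```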
Some work has been done in the framework of concatenative speech synthesis to enhance speech intelligibility by means of a kind of Lombard or HPR speech. For example, speech intelligibility improvement has been performed for a limited domain task
in [Langner & Black 2005], based on voice conversion techniques. For this, they recorded the CMU_SIN database [Langner & Black 2004], containing two parallel corpora obtained respectively under clean and noisy conditions. Another example is the Loudmouth synthesizer [Patel et al. 2006], which emulates human modifications (both acoustic and linguistic) to speech in noise by manipulating word duration, fundamental frequency and intensity. In [Bonardo & Zovato 2007], it is proposed to tune dynamic range controllers (e.g. compressors and limiters) and some user controls (e.g. speaking rate and loudness) to improve the intelligibility of synthesized speech. Various methods allowing automatic modification of speech in order to achieve the same goal are investigated in [Anumanchipalli et al. 2010] (e.g. boosting the signal amplitude in important frequency bands, modification of prosodic and spectral properties, etc.). Another work [Cernak 2006] introduced an additional measure evaluating intelligibility for the unit cost, so as to bias the synthesis towards choosing more intelligible units from the speech database. A new method for extracting or modifying mel-cepstral coefficients based on an intelligibility measure for speech in noise, the Glimpse proportion measure, has been proposed in [Valentini-Botinhao et al. 2012a] [Valentini-Botinhao et al. 2012b]. Lombard speech synthesis in HMM-based speech synthesis [Zen et al. 2009] has also been performed in [Raitio et al. 2011a].

Nonetheless, the Lombard effect is a reflex produced unconsciously due to the noisy surrounding environment [Junqua 1993] [Pick et al. 1989], while HPR speech is defined as the voice produced with increased articulatory efforts compared to the NEU style. From a general point of view, these latter efforts might therefore also result from a voluntary decision to enhance speech intelligibility and facilitate the listener's comprehension (as in the case of teaching). A similar case happens when people hyperarticulate in front of interactive systems, hoping to correct their recognition errors [Oviatt et al. 1998].

1.2 Contributions and Structure of the Thesis

The present thesis provides a detailed and complete study of the analysis and the integration of a variable DoA in HMM-based speech synthesis: NEU speech, HPO (or casual) and HPR (or clear) speech. HPO and HPR speech are of interest in many daily life applications: expressive voice conversion (e.g. for embedded systems and video games); reading speed control for visually impaired people (i.e. fast speech synthesizers, more easily produced using HPO speech, as synthetic speech at very high speaking rates is frequently used by blind users to increase the amount of presented information [Pucher et al. 2010a] [Moos & Trouvain 2007] [Stent et al. 2011]); improving intelligibility performance in adverse environments (e.g. perceiving a GPS voice inside a moving car, understanding train or flight information in stations or halls); adapting the difficulty level to the student's progress when learning foreign languages (i.e. from HPR to HPO speech); etc. Note also that the ultimate goal of our research is to be able to continuously control the DoA of an existing standard NEU voice for which no HPO and HPR recordings are available.

The present thesis is divided into chapters and structured as follows. Personal contributions are indicated in italic. Audio examples for each DoA are available online at picart.
Chapter 2 explains the theoretical background related to Markov model theory and to the HMM-based speech synthesis system.

Chapter 3 describes the creation, the recording protocol and the specifications of a specific database used throughout all the following chapters. This database is unique, in the sense that: i) it contains three parallel sets, each one containing 1359 sentences pronounced with a different DoA (i.e. NEU, HPO and HPR speech), allowing a thorough analysis of the effects caused and induced by the DoA; ii) it is made of high-quality recordings (i.e. recorded in a sound-proof room, which is noise and perturbation-free), in order to generate high-quality HMM-based speech synthesis with a varying DoA.

Chapter 4 details the analysis of the acoustic and phonetic characteristics of HPO and HPR speech, compared to the NEU case. Acoustic and phonetic analyses are performed on the previously recorded database. It is shown that a variable DoA is reflected by considerable changes of both vocal tract and glottal characteristics, as well as of speech rate, phone durations, phone variations and the presence of glottal stops.

Chapter 5 focuses on the synthesis of NEU, HPO and HPR speech in the framework of HMM-based speech synthesis. This first synthesis experiment is conducted by training a specific synthesizer for each DoA, using the entire training set of the corresponding database. Both objective and subjective evaluations aiming to assess the generated speech quality are performed, and it is shown that synthesized HPO speech seems to be less naturally rendered than NEU speech, and that the latter style seems to be less naturally rendered than HPR speech.

Chapter 6 investigates the implementation of a continuous control of the DoA in the framework of HMM-based speech synthesis. By means of inter-speaker voice adaptation techniques, applied here to intra-speaker voice adaptation, we study in a first step the adaptation of a NEU speech synthesizer to directly generate HPO and HPR speech using a limited amount of HPO and HPR speech data. We show that around 7 (for HPO) and 13 (for HPR) minutes of speech are needed to adapt cepstra with a good quality, while only half of that is sufficient to adapt F0 and phone duration correctly. The implementation of a continuous control of the DoA is then proposed in a second step. We prove that good quality NEU, HPO and HPR speech, as well as any intermediate, interpolated or extrapolated DoA, can be obtained from a HMM-based speech synthesizer.

Chapter 7 focuses on the understanding of the internal mechanisms leading to high-quality HMM-based speech synthesis with various DoAs, as well as on how intelligibility and other voice dimensions are affected when the synthesizer is embedded in adverse environments. In a first step, the process of adapting a NEU speech synthesizer to directly generate HPO and HPR speech is broken down into four factors: cepstrum, prosody, phonetic transcription and the complete adaptation. The impact of these factors on the perceived DoA is studied, and the importance of prosody and cepstrum adaptation, as well as of the use of a Natural Language Processor able to generate realistic HPO and HPR phonetic transcriptions, is quantified. Moreover, HPO and HPR speech is assessed through various dimensions: comprehension, non-monotony, fluidity and pronunciation.
In a second step, we focus on the assessment of both the intelligibility and the quality of speech when the HMM-based speech synthesizer integrating a variable DoA is working in adverse conditions. Simulated noisy and reverberant conditions are applied to the speech produced by the latter synthesizers, and we quantify how the possibility of varying the DoA improves the intelligibility of synthetic speech in various adverse conditions. Again, HPO and HPR speech is assessed through a subjective multi-dimensional evaluation.

Chapter 8 implements the ultimate goal of our research, i.e. the automatic modification of the DoA of an existing standard NEU voice for which no HPO or HPR recordings are available, in the framework of HMM-based speech synthesis. The idea consists in finding new methods to transpose, to a target voice, the DoA model estimated on a source voice. Starting from a source speaker for which NEU, HPO and HPR speech data is available, statistical transformations are computed during the adaptation of the NEU speech synthesizer. These transformations are then applied to a new target speaker for which no HPO or HPR recordings are available. Four statistical methods are investigated. They differ in the speaking style adaptation technique (model-space Linear Scaling LS vs. CMLLR) and in the speaking style transposition approach (phonetic vs. acoustic correspondence) they use. The methods are model-independent, in the sense that they can be applied to the prosody (pitch and phone duration) and filter models independently. Moreover, we investigate various parametric spaces for representing the spectral envelope, in order to find out the most appropriate space for our purpose.
Chapter 2

Background

Contents
2.1 Introduction
2.2 Markov Model Theory
  2.2.1 Discrete-Time Markov Process
  2.2.2 Hidden Markov Model
2.3 Overview of HMM-based Speech Synthesis
2.4 Training Step in HMM-based Speech Synthesis
  2.4.1 Spectral Parameters
  2.4.2 F0 Modeling
  2.4.3 State Duration
  2.4.4 Clustering
2.5 Synthesis Step in HMM-based Speech Synthesis
  2.5.1 Maximizing P(q | W, λ)
  2.5.2 Maximizing P(O | q, λ)
2.6 Voice Adaptation Techniques
  2.6.1 Maximum Likelihood Linear Regression (MLLR)
  2.6.2 Maximum A Posteriori (MAP) Adaptation

2.1 Introduction

This chapter is devoted to explaining the theoretical background required to understand the techniques used throughout this work. Most of those methods are based on the widely used and well-known Hidden Markov Model (HMM), which is a statistical way of modeling systems in various domains. Speech signals, for instance, can be well characterized as a parametric random process, and the parameters of the stochastic process can be determined in a precise and well-defined way [Rabiner & Juang 1993].

The mathematical basics of the HMM-based modeling process are described in Section 2.2. Those models are then used in a concrete application: HMM-based speech synthesis. An overview of the HMM-based Speech Synthesis System ("H-Triple-S" - HTS) [Zen et al. 2009] is detailed in Section 2.3. This system is made of two main parts: the training step and the synthesis step. We first describe the training procedure of those HMM models
in Section 2.4. After that, speech synthesis is described in Section 2.5. Finally, relying on the inherent flexibility of HMM-based speech synthesis due to the statistical modeling process, voice adaptation techniques are detailed in Section 2.6; they make it possible to modify a source speaker's voice to sound as if it were pronounced by a target speaker.

2.2 Markov Model Theory

Hidden Markov Models (HMMs) are statistical models used to characterize observed time series. They were and still are widely used in Automatic Speech Recognition (ASR). More recently, they have proven to be also useful for speech synthesis. Despite their intrinsic simplicity, HMMs are able to model complex systems. In order to understand their mechanisms, we first describe discrete-time Markov processes in Section 2.2.1. HMMs are then detailed in Section 2.2.2.

2.2.1 Discrete-Time Markov Process

A discrete-time Markov process is a stochastic finite state machine which, at any time, can be in one amongst N distinct states. Transitions between states occur on a discrete time basis, according to a set of state transition probabilities

$$P(q_t = j \mid q_{t-1} = i, q_{t-2} = k, \ldots) \qquad (2.1)$$

denoting the probability of being in state $j$ at time $t$, given state $i$ at time $t-1$, state $k$ at time $t-2$, etc. In the case of a first-order Markov process, the transition probability associated with state $j$ at time $t$ depends only on the state $i$ at time $t-1$, i.e.,

$$P(q_t = j \mid q_{t-1} = i, q_{t-2} = k, \ldots) = P(q_t = j \mid q_{t-1} = i) \qquad (2.2)$$

and in the case of time-independent transition probabilities, the first-order Markov process can be described by the following parameters:

$$a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N \qquad (2.3)$$

where $a_{ij}$ represents the probability of the state changing from $i$ to $j$, under the following constraints:

$$a_{ij} \ge 0 \quad \forall i, j \qquad (2.4a)$$

$$\sum_{j=1}^{N} a_{ij} = 1 \quad \forall i \qquad (2.4b)$$

Such a process can be considered as an observable Markov model, because each state corresponds to an observable physical state of the system.
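As a small illustrative sketch of these definitions (not part of the thesis itself), the following samples a state sequence from a first-order, time-independent Markov process described by π and A:

```python
import numpy as np

# Sampling q_1, ..., q_T from a first-order Markov process: q_1 ~ pi, and
# q_t is drawn from row q_{t-1} of the transition matrix A (rows sum to 1).
def sample_markov_chain(pi, A, T, rng=np.random.default_rng(0)):
    states = [rng.choice(len(pi), p=pi)]
    for _ in range(T - 1):
        states.append(rng.choice(A.shape[1], p=A[states[-1]]))
    return states

pi = np.array([1.0, 0.0, 0.0])          # always start in state 0
A = np.array([[0.8, 0.2, 0.0],          # a left-to-right-like topology
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
print(sample_markov_chain(pi, A, T=15))
```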
2.2.2 Hidden Markov Model

A Hidden Markov Model (HMM) is a finite state machine which generates a sequence of observations. But, in this case, the states cannot be directly observed (they are hidden) [Boite et al. 1999] [Rabiner & Juang 1993]. A HMM is a doubly embedded stochastic process in which the state changes at each time unit according to the state transition probabilities of a Markov process, and then generates the observational data through the output probability function associated with the current state.

Figure 2.1: Schematic representation of (a) a 3-state ergodic HMM and (b) a 4-state left-to-right HMM, together with emission B = {b_j(o)} and transition A = {a_ij} probabilities associated with each state.

Figure 2.1 provides some examples of typical HMM topologies. Figure 2.1a represents a 3-state completely interconnected model in which each state can be reached from every other state in a single transition. A model in which each state can be reached from any other state in a finite number of transitions is called ergodic. Figure 2.1b represents a 4-state left-to-right model, also called a Bakis model, in which successive state indices are greater than or equal to the preceding ones. Left-to-right models with no skip are widely used to model speech units.

In the following, $o$ is a $d$-dimensional observation vector ($o_t$ representing a particular observation at time $t$) and $q_t = i$ denotes the fact of being in state $i$ at time $t$.
An N-state HMM is defined by its model parameter set

$$\lambda = \{A, B, \pi\} \qquad (2.5)$$

including:

- the initial state probabilities $\pi = \{\pi_i\}_{i=1}^{N}$:

$$\pi_i = P(q_1 = i), \quad 1 \le i \le N \qquad (2.6)$$

- the matrix of state transition probabilities $A = \{a_{ij}\}_{i,j=1}^{N}$, where

$$a_{ij} = P(q_{t+1} = j \mid q_t = i), \quad 1 \le i, j \le N \qquad (2.7)$$

is the probability of changing from state $i$ to state $j$, under the common hypotheses of an underlying first-order Markov process (i.e. the transition probability depends only on the current state, not on the previous ones) and of time-independent transition probabilities. In the case of a fully connected HMM, i.e. when each state can reach all the other ones in one step, we have $a_{ij} > 0$. In other cases, we could have $a_{ij} = 0$ for one or more $(i, j)$ state pairs;

- a matrix of emission probabilities $B = \{b_j(o_t)\}_{j=1}^{N}$, where

$$b_j(o_t) = P(o_t \mid q_t = j), \quad 1 \le j \le N \qquad (2.8)$$

is the probability of generating the observation $o_t$ given state $j$ at time $t$:

  - in a discrete distribution HMM, $o_t \in V = \{v_1, v_2, \ldots, v_K\}$ ($K$ being the number of distinct observation symbols per state) and

$$b_j(o_t) = b_j(k) = P(o_t = v_k \mid q_t = j), \quad 1 \le k \le K \qquad (2.9)$$

defines the probability of observing the output $o_t = v_k$ while being in state $j$, $j = 1, 2, \ldots, N$;

  - in a Continuous Distribution HMM (CD-HMM), $o_t \in \mathbb{R}^d$ and the emission probability distribution is generally modeled by a multivariate Gaussian mixture distribution as follows:

$$b_j(o_t) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(o_t; \mu_{jm}, \Sigma_{jm}), \quad 1 \le j \le N \qquad (2.10)$$

where:

  - $M$ is the number of Gaussian components in the mixture;

  - $c_{jm}$ is the weight of mixture component $m$ in state $j$, respecting the following constraints:

$$c_{jm} \ge 0, \quad 1 \le j \le N, \; 1 \le m \le M \qquad (2.11a)$$

$$\sum_{m=1}^{M} c_{jm} = 1, \quad 1 \le j \le N \qquad (2.11b)$$
  - $\mathcal{N}(o_t; \mu_{jm}, \Sigma_{jm})$ corresponds to the $m$-th Gaussian mixture component in state $j$ (with mean vector $\mu_{jm}$ and covariance matrix $\Sigma_{jm}$). Note that the Gaussian assumption is made without any loss of generality [Rabiner & Juang 1993].

In the general case, the above-mentioned multivariate Gaussian PDF is expressed as follows:

$$\mathcal{N}(o; \mu_{jm}, \Sigma_{jm}) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_{jm}|^{1/2}} \exp\left( -\frac{1}{2} (o - \mu_{jm})^\top \Sigma_{jm}^{-1} (o - \mu_{jm}) \right) \qquad (2.12)$$

In the case of a diagonal covariance matrix (i.e. when the coefficients of the feature vector are not correlated with each other), the latter equation becomes:

$$\mathcal{N}(o; \mu_{jm}, \Sigma_{jm}) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi \Sigma_{jmi}^2}} \exp\left( -\frac{1}{2} \left( \frac{o_i - \mu_{jmi}}{\Sigma_{jmi}} \right)^2 \right) \qquad (2.13)$$

where $\mu_{jmi}$ represents the $i$-th component of $\mu_{jm}$ and $\Sigma_{jmi}^2$ are the diagonal elements of the covariance matrix $\Sigma_{jm}$.

Figure 2.2: Output distributions: (a) Gaussian PDF, (b) Gaussian mixture PDF, (c) Multi-stream PDF. Adapted from [Yamagishi 2006].

When the observation vector $o_t$ is divided into $S$ stochastically independent data streams, i.e. $o = [o_1^\top, o_2^\top, \ldots, o_S^\top]^\top$ as illustrated in Figure 2.2, $b_j(o)$ can be formulated as a product of Gaussian mixture densities [Yamagishi 2006]:

$$b_j(o) = \prod_{s=1}^{S} b_{js}(o_s) \qquad (2.14a)$$

$$= \prod_{s=1}^{S} \sum_{m=1}^{M_s} c_{jsm} \, \mathcal{N}(o_s; \mu_{jsm}, \Sigma_{jsm}), \quad 1 \le j \le N \qquad (2.14b)$$

where $M_s$ is the number of components in stream $s$, and $c_{jsm}$, $\mu_{jsm}$ and $\Sigma_{jsm}$ are respectively the weight, mean vector and covariance matrix of the $m$-th mixture component of state $j$ in stream $s$.
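The following minimal sketch evaluates Equations 2.10 and 2.13 numerically, under the diagonal-covariance assumption (an illustration, not code from the thesis); the weights, means and standard deviations are arbitrary example values:

```python
import numpy as np

# Emission probability b_j(o) of a diagonal-covariance Gaussian mixture.
# Shapes: weights c (M,), means mu (M, d), standard deviations sigma (M, d).
def gmm_emission(o, c, mu, sigma):
    z = (o - mu) / sigma                                   # (M, d)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * sigma**2), axis=1)
    log_comp = log_norm - 0.5 * np.sum(z**2, axis=1)       # log N(o; mu_m, Sigma_m), Eq. 2.13
    return float(np.sum(c * np.exp(log_comp)))             # Eq. 2.10

o = np.array([0.2, -0.1])
c = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0], [1.0, -1.0]])
sigma = np.array([[1.0, 1.0], [0.5, 0.5]])
print(gmm_emission(o, c, mu, sigma))
```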
Modeling real-world processes by HMMs requires solving the three following basic problems [Ferguson 1980b], whose formal efficient mathematical solutions are detailed for instance in [Boite et al. 1999] [Rabiner & Juang 1993]:

- problem #1 ($P(O \mid \lambda)$ Evaluation): given a HMM model $\lambda = \{A, B, \pi\}$, how to compute efficiently the probability $P(O \mid \lambda)$ of the observation sequence $O = (o_1, \ldots, o_t, \ldots, o_T)$?

- problem #2 (Optimal State Sequence): given a HMM model $\lambda = \{A, B, \pi\}$, how to determine the state sequence $q = (q_1, \ldots, q_t, \ldots, q_T)$ that best explains the observation sequence $O = (o_1, \ldots, o_t, \ldots, o_T)$?

- problem #3 (Parameter Estimation): given the observation sequence $O = (o_1, \ldots, o_t, \ldots, o_T)$, how to adjust the model parameters $\lambda = \{A, B, \pi\}$ in order to maximize $P(O \mid \lambda)$?

Solution to Problem #1: P(O | λ) Evaluation

The probability $P(O \mid \lambda)$ of the observation sequence $O = (o_1, \ldots, o_t, \ldots, o_T)$ given the model $\lambda$ can be efficiently computed by the Forward-Backward procedure. This procedure is based on the forward and backward probabilities defined as:

- $\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = i \mid \lambda)$, the probability of the partial observation sequence $o_1, o_2, \ldots, o_t$ until time $t$ and state $i$ at time $t$, given the model $\lambda$;

- $\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda)$, the probability of the partial observation sequence from $t+1$ to the end, given state $i$ at time $t$ and the model $\lambda$.

$\alpha_t(i)$ and $\beta_t(i)$ can be calculated recursively as follows:

1. Initialization

$$\alpha_1(i) = \pi_i \, b_i(o_1), \quad 1 \le i \le N \qquad (2.15a)$$

$$\beta_T(i) = 1, \quad 1 \le i \le N \qquad (2.15b)$$

2. Recursion

$$\alpha_{t+1}(i) = \left[ \sum_{j=1}^{N} \alpha_t(j) \, a_{ji} \right] b_i(o_{t+1}), \quad 1 \le i \le N, \; 1 \le t \le T-1 \qquad (2.16a)$$

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j), \quad 1 \le i \le N, \; T-1 \ge t \ge 1 \qquad (2.16b)$$

Finally, the probability $P(O \mid \lambda)$ is given by

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i) \, \beta_t(i), \quad \forall t \in [1, T] \qquad (2.17)$$
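A minimal sketch of the forward pass (Equations 2.15a, 2.16a and 2.17) for a discrete-observation HMM follows; note that a practical implementation would scale the probabilities or work in the log domain to avoid underflow on long sequences:

```python
import numpy as np

# Forward pass: alpha_1(i) = pi_i b_i(o_1), then Eq. 2.16a; B[j, k] = b_j(v_k).
def forward(pi, A, B, obs):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(len(obs) - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha, alpha[-1].sum()   # P(O | lambda): Eq. 2.17 at t = T (beta_T = 1)

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # two observation symbols per state
alpha, p = forward(pi, A, B, obs=[0, 0, 1, 1])
print(p)
```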
Solution to Problem #2: Optimal State Sequence

The difficulty here lies in the definition of the optimal state sequence, that is, there are several possible optimality criteria. However, the most widely used criterion is to find the single best state sequence (path), i.e. to maximize $P(q \mid O, \lambda)$, which is equivalent to maximizing $P(q, O \mid \lambda)$. This can be achieved with dynamic programming techniques, using the Viterbi algorithm [Viterbi 1967] [Forney 1973]. Let $\delta_t(i)$ be the highest probability along a single path which accounts for the first $t$ observations and ends in state $i$:

$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_{t-1}, q_t = i, o_1, \ldots, o_t \mid \lambda) \qquad (2.18)$$

The Viterbi algorithm can be written as follows:

1. Initialization

$$\delta_1(i) = \pi_i \, b_i(o_1), \quad 1 \le i \le N \qquad (2.19a)$$

$$\psi_1(i) = 0, \quad 1 \le i \le N \qquad (2.19b)$$

2. Recursion

$$\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t), \quad 1 \le j \le N, \; 2 \le t \le T \qquad (2.20a)$$

$$\psi_t(j) = \arg\max_{1 \le i \le N} \left[ \delta_{t-1}(i) \, a_{ij} \right], \quad 1 \le j \le N, \; 2 \le t \le T \qquad (2.20b)$$

In order to later retrieve the followed state sequence, it is necessary to keep track of the argument that maximized Equation 2.20a for each $t$ and each $j$. This is the role of the array $\psi_t(j)$.

3. Termination

$$p^* = P(O, q^* \mid \lambda) = \max_{1 \le i \le N} \left[ \delta_T(i) \right] \qquad (2.21a)$$

$$q_T^* = \arg\max_{1 \le i \le N} \left[ \delta_T(i) \right] \qquad (2.21b)$$

4. Path (state sequence) backtracking

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1 \qquad (2.22)$$
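A minimal sketch of the Viterbi recursion (Equations 2.19-2.22), written in the log domain to avoid the numerical underflow of multiplying many probabilities:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    N, T = len(pi), len(obs)
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = np.zeros((T, N)); psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]                     # Eq. 2.19a
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A               # delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)                       # Eq. 2.20b
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]     # Eq. 2.20a
    q = [int(delta[-1].argmax())]                            # Eq. 2.21b
    for t in range(T - 1, 0, -1):
        q.append(int(psi[t][q[-1]]))                         # Eq. 2.22
    return q[::-1], float(delta[-1].max())                   # best path, log p*

# Tiny probabilities stand in for zeros so that np.log stays finite.
pi = np.array([1.0, 1e-12]); A = np.array([[0.7, 0.3], [1e-12, 1.0]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, obs=[0, 0, 1, 1]))
```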
Solution to Problem #3: Parameter Estimation

There is no known analytical solution for finding, in closed form, the model parameter set $\lambda = \{A, B, \pi\}$ which globally maximizes the probability $P(O \mid \lambda)$ of a given observation sequence $O$:

$$\bar{\lambda} = \arg\max_{\lambda} P(O \mid \lambda) = \arg\max_{\lambda} \sum_{q} P(O, q \mid \lambda) \qquad (2.23)$$

However, a parameter set $\bar{\lambda}$ which locally maximizes the likelihood $P(O \mid \lambda)$ can be obtained using an iterative procedure such as the Baum-Welch algorithm (also called the forward-backward algorithm) [Dempster et al. 1977] or the Viterbi algorithm, depending on whether the likelihood is estimated by considering all possible paths in the model or only the best one, respectively. These algorithms are variants of the Expectation-Maximization (EM) procedure, a general technique for finding maximum likelihood estimators in models including hidden (also called latent or missing) variables, such as the states of an HMM model. In the following, the Baum-Welch algorithm for the CD-HMM with Gaussian mixture distributions is briefly described. Corresponding formulae for single Gaussian or discrete output distributions can be derived straightforwardly.

In the Expectation step, the current model parameter set $\lambda'$ is used to compute the posterior probabilities of the HMM hidden variables as follows:

- the transition posterior probability $\xi_t(i, j)$ is the probability of being in state $i$ at time $t$ and in state $j$ at time $t+1$, given the model $\lambda'$ and the observation sequence $O$, i.e.,

$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda') \qquad (2.24)$$

We have:

$$\xi_t(i, j) = \frac{P(q_t = i, q_{t+1} = j, O \mid \lambda')}{P(O \mid \lambda')} = \frac{P(q_t = i, q_{t+1} = j, O \mid \lambda')}{\sum_{i,j=1}^{N} P(q_t = i, q_{t+1} = j, O \mid \lambda')} \qquad (2.25)$$

and from the definition of the forward and backward probabilities $\alpha_t(i)$ and $\beta_t(i)$, it follows that

$$\xi_t(i, j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{P(O \mid \lambda')} = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i,j=1}^{N} \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)} \qquad (2.26)$$

- the state posterior probability $\gamma_t(i)$ is the probability of being in state $i$ at time $t$, given the model $\lambda'$ and the observation sequence $O$, i.e.,

$$\gamma_t(i) = P(q_t = i \mid O, \lambda') \qquad (2.27)$$

We have:

$$\gamma_t(i) = \frac{P(O, q_t = i \mid \lambda')}{P(O \mid \lambda')} = \frac{P(O, q_t = i \mid \lambda')}{\sum_{j=1}^{N} P(O, q_t = j \mid \lambda')} \qquad (2.28)$$

and from the definition of the forward and backward probabilities $\alpha_t(i)$ and $\beta_t(i)$, it follows that

$$\gamma_t(i) = \frac{\alpha_t(i) \, \beta_t(i)}{P(O \mid \lambda')} = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j)} \qquad (2.29)$$

The probability $\gamma_t(i, m)$ of being in state $i$ at time $t$ given the model $\lambda'$ and the observation sequence $O$, taking into account only the $m$-th component of the considered state's Gaussian mixture output distribution, is given by

$$\gamma_t(i, m) = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j)} \cdot \frac{c_{im} \, \mathcal{N}(o_t; \mu_{im}, \Sigma_{im})}{\sum_{n=1}^{M} c_{in} \, \mathcal{N}(o_t; \mu_{in}, \Sigma_{in})} \qquad (2.30)$$
The probability $\gamma_t(i,m)$ of being in state $i$ at time $t$ given the model $\lambda'$ and the observation sequence $O$, taking into account only the $m$th component of the considered state Gaussian mixture output distribution, is given by

$$\gamma_t(i,m) = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j)} \cdot \frac{c_{im} \, \mathcal{N}(o_t; \mu_{im}, \Sigma_{im})}{\sum_{n=1}^{M} c_{in} \, \mathcal{N}(o_t; \mu_{in}, \Sigma_{in})} \qquad (2.30)$$

In the Maximization step, the current model parameter set $\lambda'$ is replaced by a new parameter set $\lambda$ which maximizes the auxiliary function

$$Q(\lambda, \lambda') = \sum_{q} P(q \mid O, \lambda') \ln P(O, q \mid \lambda) \qquad (2.31)$$

taking into account the posterior probabilities of the HMM hidden variables computed in the Expectation step. Applied iteratively, this procedure can be proved to increase the model likelihood $P(O \mid \lambda)$ monotonically and to converge to a critical point. The maximization of the auxiliary function $Q(\lambda, \lambda')$ over $\lambda$, subject to the constraints $\sum_{j=1}^{N} \pi_j = 1$, $\sum_{j=1}^{N} a_{ij} = 1$ and $\sum_{m=1}^{M} c_{im} = 1$ ($1 \leq i \leq N$), leads to the following reestimation formulae:

$$\bar{\pi}_i = \gamma_1(i) \qquad (2.32)$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad (2.33)$$

$$\bar{c}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i,m)}{\sum_{t=1}^{T} \sum_{n=1}^{M} \gamma_t(i,n)} \qquad (2.34)$$

$$\bar{\mu}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i,m) \, o_t}{\sum_{t=1}^{T} \gamma_t(i,m)} \qquad (2.35)$$

$$\bar{\Sigma}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i,m) \, (o_t - \mu_{im})(o_t - \mu_{im})^{\top}}{\sum_{t=1}^{T} \gamma_t(i,m)} \qquad (2.36)$$
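As an illustration, the sketch below implements one EM iteration for the simpler case of a discrete-output HMM; the forward and backward probabilities alpha and beta are assumed to have been computed by the usual recursions, and the Gaussian-mixture case of Equations 2.30-2.36 follows the same pattern with the additional per-component posterior.

```python
# Sketch of one Baum-Welch iteration (Equations 2.26, 2.29, 2.32-2.33 and the
# discrete analogue of 2.35). alpha and beta are (T, N) arrays from the
# standard forward and backward recursions; obs is a (T,) NumPy int array.
import numpy as np

def e_step(A, B, obs, alpha, beta):
    """Compute the posteriors gamma (2.29) and xi (2.26)."""
    T, N = alpha.shape
    p_obs = alpha[-1].sum()                      # P(O | lambda')
    gamma = alpha * beta / p_obs                 # state posteriors
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):                       # transition posteriors
        xi[t] = (alpha[t][:, None] * A
                 * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]) / p_obs
    return gamma, xi

def m_step(gamma, xi, obs, K):
    """Reestimate pi, A and the discrete output probabilities B."""
    pi_new = gamma[0]                                        # (2.32)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None] # (2.33)
    B_new = np.stack([gamma[obs == k].sum(axis=0) for k in range(K)], axis=1)
    return pi_new, A_new, B_new / gamma.sum(axis=0)[:, None]
```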
2.3 Overview of HMM-based Speech Synthesis

In such a statistical parametric speech synthesis system, speech parameters are extracted from a database of natural speech signals. Combined with their associated labels, generative models are trained to model those parameters. The synthesized speech waveform is eventually built from the learned parametric representations of speech by HMMs. Statistical parametric speech synthesis is called HMM-based speech synthesis when HMMs are used as generative models, although any other generative model could be used instead. An overview of the HMM-based speech synthesis system is displayed in Figure 2.3. This system is made of two main parts: the training and the synthesis steps.

Figure 2.3: Overview of the HMM-based Speech Synthesis System ("H-Triple-S" - HTS), from [Zen et al. 2009].

2.4 Training Step in HMM-based Speech Synthesis

During the training step, spectral and excitation parameters are extracted from a database of natural speech signals. Spectral parameters typically consist of the mel-cepstral coefficients together with their first and second derivatives (respectively Δ and Δ², detailed in Section 2.5.2). Excitation parameters are generally the logarithm of the fundamental frequency log(F0), also with its Δ and Δ² coefficients. Associated with their respective labels, HMMs are trained in order to model these speech parameters. As a result, not only
spectrum parameters but also F0 and duration are modeled in a unified framework. The model parameter set is commonly estimated based on the following Maximum Likelihood (ML) criterion, using the EM algorithm described in Section 2.2:

$$\hat{\lambda} = \arg\max_{\lambda} \{P(O \mid W, \lambda)\} \qquad (2.37)$$

where λ is the model parameter set, $O$ is the training data set and $W$ is the label sequence set associated with $O$.

2.4.1 Spectral Parameters

Feature extraction basically relies on the source-filter model, in which speech is described as a source signal, representing the air flow at the vocal folds, passed through a time-varying filter, representing the effect of the vocal tract [Dutoit & Dupont 2010]. It is based on the hypothesis that the glottis and the vocal tract are fully decoupled, leading to the separation of the filter and source parts of the model. The general approach consists in extracting some smooth representation of the signal power spectral density (characteristic of the filter frequency response), usually estimated over analysis frames of typically 25 ms with 5 ms shifts. This takes into account the time-varying nature of both the source and the filter. The main tools used in spectral parameter extraction include:

- the short-time Fourier transform, providing the power and phase spectra of short analysis frames;

- Linear Predictive Coding (LPC), in which the vocal tract is modeled by an all-pole filter, whose transfer function is described as:

$$H(z) = \frac{K}{1 - \sum_{m=1}^{M} c(m) \, z^{-m}} \qquad (2.38)$$

where $K$ and $c(m)$ are respectively the gain of the filter and the $M$th order LP coefficients;

- the cepstrum, computed as the inverse short-time Fourier transform of the logarithm of the power spectrum. In this case:

$$H(z) = \exp \sum_{m=0}^{M} c(m) \, z^{-m} \qquad (2.39)$$

where $c(m)$ are the $M$th order cepstral coefficients. It can be shown that the low order elements of the cepstrum vectors provide a good approximation of the filter part of the model;

- Mel-Frequency Cepstrum Coefficients (MFCCs), which take into account the human auditory system: the cepstral coefficients are computed for a spectrum that has been warped along a nonlinear spectral scale.
In direct continuity, the Mel-Generalized Cepstral (MGC) analysis was proposed in [Kobayashi & Imai 1984]:

$$H(z) = \left( 1 + \gamma \sum_{m=0}^{M} c_{\alpha,\gamma}(m) \, \tilde{z}_{\alpha}^{-m} \right)^{1/\gamma} \qquad \text{if } 0 < |\gamma| \leq 1 \qquad (2.40a)$$

$$H(z) = \exp \sum_{m=0}^{M} c_{\alpha,\gamma}(m) \, \tilde{z}_{\alpha}^{-m} \qquad \text{if } \gamma = 0 \qquad (2.40b)$$

where $c_{\alpha,\gamma}(m)$ are the $M$th order MGC coefficients. The variable $\tilde{z}_{\alpha}^{-1}$ is expressed as the following first order all-pass function:

$$\tilde{z}_{\alpha}^{-1} = \Psi(z) = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \qquad (2.41)$$

modeling the nonlinear frequency transformation performed by the human auditory system. Combined with the frequency warping factor α, γ is a parameter that allows obtaining various standard types of coefficients, as illustrated in Table 2.1. By choosing judicious α values, the mel-scale becomes a good approximation of the human perceptual scale: e.g. α = 0.31 for a sampling rate of 8 kHz, and α = 0.42 for a sampling frequency of 16 kHz.

Table 2.1: Mel-Generalized Cepstral (MGC) analysis.

                 | γ = 0                 | -1 < γ < 0                        | γ = -1
    α = 0        | Cepstral analysis     | Generalized cepstral analysis     | LPC analysis
    0 < α < 1    | Mel-cepstral analysis | Mel-generalized cepstral analysis | Mel-LPC analysis
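As a quick numerical check of Equation 2.41, the sketch below compares the phase response of the first-order all-pass function (i.e. the warped frequency scale) with the mel scale, for α = 0.42 at 16 kHz as mentioned above; the mel formula used here is the common 2595·log10(1 + f/700) approximation, and the closeness of the two normalized curves is what the text refers to.

```python
# Sketch comparing the all-pass frequency warping of Equation 2.41 with the
# mel scale, for alpha = 0.42 at a 16 kHz sampling rate (values from the text).
import numpy as np

fs, alpha = 16000.0, 0.42
f = np.linspace(1.0, fs / 2, 200)            # linear frequency axis (Hz)
omega = 2 * np.pi * f / fs                   # normalized angular frequency

# Warped frequency = phase response of the first-order all-pass function
warped = omega + 2 * np.arctan(alpha * np.sin(omega)
                               / (1 - alpha * np.cos(omega)))

mel = 2595.0 * np.log10(1 + f / 700.0)       # reference mel scale

# Normalize both scales to [0, 1] and compare their shapes: the maximum
# deviation stays small, showing that alpha = 0.42 approximates the mel
# scale well at 16 kHz.
print(np.max(np.abs(warped / warped[-1] - mel / mel[-1])))
```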
2.4.2 F0 Modeling

To model fixed-dimensional parameter sequences, such as spectral parameters, single multivariate Gaussian distributions are typically used as their stream-output distributions. However, it is difficult to apply a discrete or continuous distribution to model variable-dimensional parameter sequences, such as log(F0). Indeed, the values of F0 are not defined in unvoiced regions, i.e. the observation sequence of an F0 pattern is composed of one-dimensional continuous values in voiced regions and a discrete symbol which represents "unvoiced" in unvoiced regions. Considering that the observed F0 value occurs from a one-dimensional space $\Omega_1$ and the unvoiced symbol occurs from a zero-dimensional space $\Omega_2$, this kind of observation sequence can be modeled by a Multi-Space probability Distribution (MSD), as displayed in Figure 2.4. The integration of MSD in the HMM framework is called MSD-HMM [Tokuda et al. 1999] [Tokuda et al. 2002a].

Figure 2.4: F0 pattern modeling, from [Masuko 2002].

2.4.2.1 Multi-Space probability Distribution (MSD)

In general, an MSD is described considering a sample space Ω which consists of G spaces, as illustrated in Figure 2.5:

$$\Omega = \bigcup_{g=1}^{G} \Omega_g \qquad (2.42)$$

where $\Omega_g$ is an $n_g$-dimensional real space $\mathbb{R}^{n_g}$. Each space has its own dimensionality.

Figure 2.5: Multi-Space probability Distribution (MSD) and observations, from [Masuko 2002].
Each space $\Omega_g$ has its probability $w_g$, i.e. $P(\Omega_g) = w_g$, where $\sum_{g=1}^{G} w_g = 1$. If $n_g > 0$, each space has a probability distribution function $\mathcal{N}_g(x)$ with $x \in \mathbb{R}^{n_g}$. If $n_g = 0$, $\Omega_g$ is assumed to contain only one sample point. In an MSD, each observation vector $o$ consists of a set of space indices $X$ and a continuous random variable $x \in \mathbb{R}^n$, that is:

$$o = (X, x) \qquad (2.43)$$

where all spaces specified by $X$ are $n$-dimensional. Note that $X$ does not necessarily include all indices which specify $n$-dimensional spaces. Not only the observation vector $x$ but also the space index set $X$ is a random variable, determined at each observation. The observation probability of $o$ is defined by

$$b(o) = \sum_{g \in S(o)} w_g \, \mathcal{N}_g(V(o)) \qquad (2.44)$$

where

$$S(o) = X \qquad (2.45a)$$
$$V(o) = x \qquad (2.45b)$$

Although $\mathcal{N}_g(x)$ does not exist for $n_g = 0$, since $\Omega_g$ contains only one sample point, for simplicity of notation $\mathcal{N}_g(x) \equiv 1$ is defined for $n_g = 0$. As an example, the observation $o_1$ shown in Figure 2.5 consists of a set of space indices $X_1 = \{1, 2, G\}$ and a three-dimensional vector $x_1 \in \mathbb{R}^3$. Thus the random variable $x$ is drawn from one of the three spaces $\Omega_1, \Omega_2, \Omega_G = \mathbb{R}^3$, and its PDF is given by $w_1 \mathcal{N}_1(x) + w_2 \mathcal{N}_2(x) + w_G \mathcal{N}_G(x)$.
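The following minimal sketch evaluates the MSD observation probability of Equation 2.44 and applies it to the voiced/unvoiced F0 case described later in Section 2.4.2.3; the function and variable names are illustrative only, and the spaces are indexed from 0 here, unlike the 1-based Ω1/Ω2 notation of the text.

```python
# Minimal sketch of the MSD observation probability of Equation 2.44.
# Spaces with n_g = 0 contribute their weight only (N_g(x) := 1).
import numpy as np
from scipy.stats import norm

def msd_obs_prob(o_indices, x, weights, dists):
    """o_indices: space index set S(o); x: continuous part V(o) (or None);
    weights: w_g for each space; dists: per-space PDF, or None if n_g = 0."""
    prob = 0.0
    for g in o_indices:
        if dists[g] is None:          # zero-dimensional space: single point
            prob += weights[g]
        else:
            prob += weights[g] * dists[g].pdf(x)   # w_g * N_g(V(o))
    return prob

# F0 example: space 0 is voiced (one-dimensional log(F0) Gaussian),
# space 1 is the unvoiced symbol (zero-dimensional space).
w = [0.8, 0.2]
dists = [norm(loc=5.1, scale=0.2), None]
print(msd_obs_prob([0], 5.0, w, dists))      # voiced frame with log(F0) = 5.0
print(msd_obs_prob([1], None, w, dists))     # unvoiced frame -> weight 0.2
```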
2.4.2.2 HMMs based on Multi-Space probability Distribution (MSD-HMM)

An N-state MSD-HMM λ is specified by the initial state probability distribution $\pi = \{\pi_i\}_{i=1}^{N}$, the state transition probability distribution $A = \{a_{ij}\}_{i,j=1}^{N}$ and the state output probability distribution $B = \{b_j(o)\}_{j=1}^{N}$ given in Equation 2.44. As shown in Figure 2.6, each state $i$ has $G$ PDFs $\mathcal{N}_{i1}(\cdot), \mathcal{N}_{i2}(\cdot), \ldots, \mathcal{N}_{iG}(\cdot)$ associated with their weights $w_{i1}, w_{i2}, \ldots, w_{iG}$, respecting $\sum_{g=1}^{G} w_{ig} = 1$.

The MSD-HMM parameter estimation procedure is derived from the Baum-Welch algorithm used for conventional HMMs. In the Expectation step, the posterior probabilities of the MSD-HMM hidden variables are computed from the current model parameters as follows:

- the posterior probability of being in state $i$ and space $h$ at time $t$, given the observation sequence and the model,

$$\gamma_t(i, h) = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j)} \cdot \frac{w_{ih} \, \mathcal{N}_{ih}(V(o_t))}{\sum_{g \in S(o_t)} w_{ig} \, \mathcal{N}_{ig}(V(o_t))} \qquad (2.46)$$

- the posterior probability of transitions from state $i$ to state $j$ at time $t+1$, given the observation sequence and the model,

$$\xi_t(i, j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{m=1}^{N} \sum_{k=1}^{N} \alpha_t(m) \, a_{mk} \, b_k(o_{t+1}) \, \beta_{t+1}(k)} \qquad (2.47)$$

where the forward and backward probabilities $\alpha_t(i)$ and $\beta_t(i)$ are calculated using the same inductive procedure as for conventional HMMs (see Section 2.2).

Figure 2.6: An HMM based on Multi-Space probability Distribution (MSD), from [Masuko 2002].

In the Maximization step, the parameters of the MSD-HMM are updated by means of the generalized reestimation formulae:

$$\bar{\pi}_i = \sum_{g \in S(o_1)} \gamma_1(i, g) \qquad (2.48)$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{g \in S(o_t)} \gamma_t(i, g)} \qquad (2.49)$$

$$\bar{w}_{ig} = \frac{\sum_{t \in T(O,g)} \gamma_t(i, g)}{\sum_{h=1}^{G} \sum_{t \in T(O,h)} \gamma_t(i, h)} \qquad (2.50)$$
and when $n_g > 0$:

$$\bar{\mu}_{ig} = \frac{\sum_{t \in T(O,g)} \gamma_t(i, g) \, V(o_t)}{\sum_{t \in T(O,g)} \gamma_t(i, g)} \qquad (2.51)$$

$$\bar{\Sigma}_{ig} = \frac{\sum_{t \in T(O,g)} \gamma_t(i, g) \, (V(o_t) - \mu_{ig})(V(o_t) - \mu_{ig})^{\top}}{\sum_{t \in T(O,g)} \gamma_t(i, g)} \qquad (2.52)$$

where the function $T(O, g) = \{t \mid g \in S(o_t)\}$ returns the set of times $t$ at which the space index set $S(o_t)$ includes the space index $g$.

It is interesting to note that when $n_g = 0$, the MSD-HMM is the same as the discrete HMM. If $n_g = m > 0$ and $S(o) = \{1, 2, \ldots, G\}$, the MSD-HMM is the same as the $m$-dimensional continuous $G$-mixture HMM. Accordingly, the MSD-HMM includes the discrete HMM and the continuous mixture HMM as special cases, and furthermore, it can model sequences of observation vectors with variable dimensionality, including zero-dimensional observations, i.e. discrete symbols. As a result, MSD-HMMs can model F0 patterns without heuristic assumptions.

2.4.2.3 F0 Pattern Modeling using MSD-HMM

Figure 2.7: Multi-Space probability Distribution Hidden Markov Model (MSD-HMM) for F0 modeling, from [Tokuda & Zen 2009].

In the MSD-HMM, the observation sequence of an F0 pattern is viewed as a mixed sequence of outputs from a one-dimensional space $\Omega_1$ and a zero-dimensional space $\Omega_2$, which correspond to voiced and unvoiced regions, respectively [Yamagishi 2006]. This is illustrated in Figure 2.7. Each space has its space weight $w_g$ ($\sum_{g=1}^{2} w_g = 1$). The space $\Omega_1$ has a one-dimensional normal PDF $\mathcal{N}_1(x)$. On the other hand, the space $\Omega_2$ has only one sample point. An F0 observation $o$ consists of a continuous random variable $x$ and a set of space indices $X$, that is
$$o = (X, x) \qquad (2.53)$$

where $X = \{1\}$ for voiced regions and $X = \{2\}$ for unvoiced regions. The observation probability of $o$ is then defined by Equations 2.44, 2.45a and 2.45b. It is noted that, although $\mathcal{N}_2(x)$ does not exist for $\Omega_2$, $\mathcal{N}_2(x) \equiv 1$ is defined for simplicity of notation.

2.4.3 State Duration

In the HMM-based speech synthesis system, the rhythm and tempo of synthesized speech are controlled by a decision tree-clustered state duration model (see Section 2.4.4). The state durations of the generative sentence HMM λ are determined so as to maximize their probabilities:

$$\log P(d \mid \lambda) = \sum_{j=1}^{N} \log p_j(d_j) \qquad (2.54)$$

where $d = \{d_1, d_2, \ldots, d_N\}$ is a set of state durations, $d_j$ is the state duration at the $j$th state, $N$ is the number of states in the sentence HMM λ and $p_j(\cdot)$ denotes the state duration PDF of the $j$th state.

One of the major limitations of the HMM is that it does not adequately represent the temporal structure of speech. This is because the state duration probability density functions of HMMs are implicitly modeled by their state self-transition probabilities. Indeed, the probability of $d$ consecutive observations in the $j$th state is given by the product of the probability of taking the self-loop at the $j$th state $d-1$ times and the probability of leaving this $j$th state:

$$p_j(d) = (a_{jj})^{d-1} (1 - a_{jj}) \qquad (2.55)$$

This equation shows that the state duration probabilities decrease exponentially with time in this case, which does not lead to a realistic model for speech synthesis (see the decreasing exponential blue curve in Figure 2.8). Moreover, the state durations $\hat{d}$ that maximize their probabilities are accordingly determined as

$$\hat{d} = \arg\max_{d} \{\log P(d \mid \lambda)\} \qquad (2.56a)$$
$$= \arg\max_{d_1, d_2, \ldots, d_N} \sum_{j=1}^{N} \{(d_j - 1) \log a_{jj} + \log(1 - a_{jj})\} \qquad (2.56b)$$
$$= \{1, \ldots, 1\} \qquad (2.56c)$$

and all expected state durations become 1, which is not useful for controlling the temporal structure of speech.
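A few lines suffice to verify the geometric behavior of Equation 2.55; note that the mode of this PDF is always d = 1 (hence the argmax of Equations 2.56a-2.56c), while its mean is 1/(1 - a_jj).

```python
# Sketch of the implicit, geometric state-duration PDF of Equation 2.55,
# showing its exponential decay and its expected duration 1 / (1 - a_jj).
import numpy as np

a_jj = 0.7                                   # self-transition probability
d = np.arange(1, 30)
p_d = a_jj ** (d - 1) * (1 - a_jj)           # p_j(d), Equation 2.55
print(p_d[:5])                               # monotonically decreasing: mode is d = 1
print(np.sum(d * p_d), 1 / (1 - a_jj))       # mean duration ~ 1 / (1 - a_jj)
```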
Figure 2.8: HMM duration PDFs modeled either by their state self-transition probabilities (decreasing exponential blue curve) or by a Gaussian distribution (red curve), from [Tokuda & Zen 2009].

To avoid this problem, the state durations can be explicitly modeled by Gaussian distributions¹ (see the red Gaussian curve in Figure 2.8). In [Yoshimura et al. 1998], the set of state durations of each phoneme HMM is modeled by a multi-dimensional Gaussian distribution. The dimension of the state duration density is equal to the number of states in the phoneme HMM, its $n$th dimension corresponding to the duration density of the $n$th state (a left-to-right model with no skip is assumed).

However, in the framework of conventional HMMs, these state duration PDFs are not incorporated into the Expectation step of the forward-backward (Baum-Welch) algorithm. Their parameters are estimated from the statistical variables obtained in the last iteration of the algorithm. Moreover, the spectrum and pitch model parameters are estimated without considering the state duration PDFs. This inconsistency can make the synthesized speech sound less natural. To resolve this inconsistency, a Hidden Semi-Markov Model (HSMM) is introduced in [Zen et al. 2007]. This model can be viewed as an HMM with explicit state duration PDFs, incorporated into both the synthesis and the training parts of the system. The HSMM training is performed according to the generalized forward-backward algorithm (Expectation step) and parameter reestimation formulae (Maximization step) briefly summarized hereafter [Yamagishi & Kobayashi 2007].

Let us consider an N-state left-to-right HSMM λ with no skip path, specified by the state output probability distribution $\{b_i(o)\}_{i=1}^{N}$ and the state duration probability distribution $\{p_i(d)\}_{i=1}^{N}$. We assume that the $i$th output and duration distributions are single Gaussian distributions characterized by a mean vector $\mu_i \in \mathbb{R}^L$ and diagonal covariance matrix $\Sigma_i \in \mathbb{R}^{L \times L}$, and a scalar mean $m_i$ and variance $\sigma_i^2$, respectively, i.e.,

$$b_i(o) = \mathcal{N}(o; \mu_i, \Sigma_i) \qquad (2.57a)$$
$$p_i(d) = \mathcal{N}(d; m_i, \sigma_i^2) \qquad (2.57b)$$

¹ Although other distributions, like the Gamma or log Gaussian distributions, have been applied to state duration modeling in HMM-based speech synthesis, the Gaussian distribution is widely used because it is mathematically easy to use.
where $o \in \mathbb{R}^L$ is an observation vector and $d$ is the duration in state $i$.

In the Expectation step, the forward and backward probabilities $\alpha_t(i)$ and $\beta_t(i)$ are recursively calculated by

$$\alpha_t(i) = \sum_{d=1}^{t} \sum_{\substack{j=1 \\ j \neq i}}^{N} \alpha_{t-d}(j) \, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s), \qquad 2 \leq t \leq T \qquad (2.58a)$$

$$\beta_t(i) = \sum_{d=1}^{T-t} \sum_{\substack{j=1 \\ j \neq i}}^{N} p_j(d) \prod_{s=t+1}^{t+d} b_j(o_s) \, \beta_{t+d}(j), \qquad T-1 \geq t \geq 1 \qquad (2.58b)$$

where $\alpha_0(i) = 1$, $\beta_T(i) = 1$, and $1 \leq i \leq N$.

The probability $P(O \mid \lambda)$ of the observation data $O$ of length $T$, given the model λ, is determined by

$$P(O \mid \lambda) = \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \sum_{d=1}^{t} \alpha_{t-d}(j) \, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s) \, \beta_t(i), \qquad \forall t \in [1, T] \qquad (2.59)$$

The probability $\gamma_t^d(i)$ that the observation sequence $o_{t-d+1}, \ldots, o_t$ be generated at the $i$th state is given by

$$\gamma_t^d(i) = \frac{1}{P(O \mid \lambda)} \sum_{\substack{j=1 \\ j \neq i}}^{N} \alpha_{t-d}(j) \, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s) \, \beta_t(i) \qquad (2.60)$$

In the Maximization step, the parameters of the HSMM are updated by means of the following reestimation formulae:

$$\bar{\mu}_i = \frac{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \, d} \qquad (2.61)$$

$$\bar{\Sigma}_i = \frac{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \sum_{s=t-d+1}^{t} (o_s - \mu_i)(o_s - \mu_i)^{\top}}{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \, d} \qquad (2.62)$$

$$\bar{m}_i = \frac{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \, d}{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i)} \qquad (2.63)$$

$$\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \, (d - m_i)^2}{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i)} \qquad (2.64)$$
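As an illustration, here is a direct (and deliberately unoptimized) transcription of the forward recursion of Equation 2.58a for the left-to-right, no-skip case, where state i can only be entered from state i - 1; out_pdf, dur_pdf and D_max are caller-supplied and hypothetical, and a practical implementation would work in the log domain. The complexity is O(N·T·D_max).

```python
# Minimal sketch of the HSMM forward recursion of Equation 2.58a, for a
# left-to-right model with no skips.
import numpy as np

def hsmm_forward(obs, out_pdf, dur_pdf, N, D_max):
    """obs: length-T observation sequence; out_pdf(i, o): output density
    b_i(o); dur_pdf(i, d): duration density p_i(d); N states; D_max: cap
    on the state duration."""
    T = len(obs)
    alpha = np.zeros((T + 1, N))   # alpha[t, i]; row 0 is the boundary
    for t in range(1, T + 1):
        for i in range(N):
            for d in range(1, min(D_max, t) + 1):
                # probability of emitting o_{t-d+1} ... o_t from state i
                emit = np.prod([out_pdf(i, obs[s]) for s in range(t - d, t)])
                if i == 0:
                    prev = 1.0 if t - d == 0 else 0.0  # model starts in state 1
                else:
                    prev = alpha[t - d, i - 1]         # entered from state i-1
                alpha[t, i] += prev * dur_pdf(i, d) * emit
    return alpha        # P(O | lambda) is alpha[T, N - 1]
```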
2.4.4 Clustering

Many contextual factors affect spectrum, excitation and duration. Context-dependent HMMs are therefore used to model these effects. As an example, the phonetic, linguistic and prosodic contexts used in the HMM-based speech synthesis system are listed in [Zen et al. 2009] [Tokuda et al. 2002b]. It should however be noted that the total number of possible combinations of these factors increases exponentially with the number of factors. In order to estimate the model parameters reliably with a reasonable amount of speech data, a decision-tree based context clustering method [Odell 1995] is implemented individually for spectrum, excitation and duration, as each of them has its own context dependency. This is illustrated in Figure 2.9. This technique also allows generalizing to unseen contexts, as all possible context combinations may not be present in the speech database. A consequence of context clustering is that spectrum, excitation and duration are modeled in the unified framework of HMMs [Yoshimura et al. 1999] [Yoshimura 2002]. Although the decision-tree based context clustering technique was originally derived for HMMs, it can be applied to HSMMs with no modification. It has been extended to MSD-HMMs in [Yoshimura 2002]. A sketch of the question-selection criterion underlying the tree construction is given below.

Figure 2.9: Decision tree context clustering, from [Tokuda et al. 2002b].
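The sketch below illustrates the core of the tree-building step under a simplified single-Gaussian assumption: each candidate yes/no context question splits the data reaching a node, and the question maximizing the log-likelihood gain is selected. The question names and context representation are made up for illustration; actual systems use large standardized question sets.

```python
# Sketch of the ML gain criterion used in decision-tree context clustering:
# each node is modeled by a single diagonal Gaussian, and the question with
# the largest log-likelihood gain is selected.
import numpy as np

def node_loglik(X):
    """Log-likelihood of samples X (n, L) under their own ML-fit Gaussian."""
    n, _ = X.shape
    var = X.var(axis=0) + 1e-8
    return -0.5 * n * np.sum(np.log(2 * np.pi * var) + 1)

def best_question(X, contexts, questions):
    """contexts: per-sample context dicts; questions: dict mapping a
    question name to a yes/no predicate over a context."""
    base = node_loglik(X)
    gains = {}
    for name, pred in questions.items():
        mask = np.array([pred(c) for c in contexts])
        if mask.all() or (~mask).all():
            continue                        # degenerate split, skip
        gains[name] = node_loglik(X[mask]) + node_loglik(X[~mask]) - base
    if not gains:
        return None
    return max(gains, key=gains.get)

# Hypothetical usage: split feature vectors by simple phonetic questions.
questions = {"L-vowel?": lambda c: c["left"] in "aeiou",
             "C-nasal?": lambda c: c["center"] in "mn"}
```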
2.5 Synthesis Step in HMM-based Speech Synthesis

During the synthesis step, the text to be synthesized is first converted into a context-dependent label sequence (Figure 2.3) by a Natural Language Processor (NLP). Then, according to this label sequence, a sentence HMM is built by concatenating the corresponding context-dependent HMMs. In the next step, the state durations of the sentence HMM are computed in order to maximize the state duration probability of the state sequence [Yoshimura et al. 1998]. Taking into account the obtained state durations, a sequence of mel-cepstral coefficients and log(F0) values (including voiced/unvoiced decisions) is then generated, so as to maximize its probability for the given sentence HMM, using Case 1 of the speech parameter generation algorithm proposed in [Tokuda et al. 2000].

The speech parameter sequence $O$ for a given label sequence $W$, corresponding to the text to be synthesized, is generated from the estimated models λ so as to maximize its output probability:

$$\hat{O} = \arg\max_{O} \{P(O \mid W, \lambda)\} \qquad (2.65a)$$
$$= \arg\max_{O} \Big\{ \sum_{q} P(O, q \mid W, \lambda) \Big\} \qquad (2.65b)$$
$$\approx \arg\max_{O} \max_{q} \{P(O, q \mid W, \lambda)\} \qquad (2.65c)$$
$$= \arg\max_{O} \max_{q} \{P(q \mid W, \lambda) \, P(O \mid q, \lambda)\} \qquad (2.65d)$$

where $T$ is the length of the speech parameter sequence $O = (o_1, \ldots, o_T)$ and $q = \{q_1, \ldots, q_T\}$ is a state sequence. The step from Equation 2.65b to 2.65c is an approximation following the same idea as in the Viterbi algorithm, i.e. taking into account only the most likely state sequence, as no known analytical method allows the maximization of $P(O \mid W, \lambda)$ in a closed form. Bayes' theorem is applied to the joint probability of $O$ and $q$ in Equation 2.65c, leading to Equation 2.65d. As a result, maximizing the probability of the speech parameter sequence $O$ given a label sequence $W$ and the estimated model λ can be divided into the following two optimization problems [Yamagishi 2006], solved in Sections 2.5.1 and 2.5.2 respectively:

$$\hat{q} = \arg\max_{q} \{P(q \mid W, \lambda)\} \qquad (2.66a)$$
$$\hat{O} = \arg\max_{O} \{P(O \mid \hat{q}, \lambda)\} \qquad (2.66b)$$

In the final stage, the speech waveform is reconstructed directly from the generated spectral and excitation parameters with the Mel Log Spectrum Approximation (MLSA) filter [Imai et al. 1983]. However, the synthesized speech sounds buzzy, as the HMM-based speech synthesizer uses a mel-cepstral vocoder with a simple periodic pulse-train or white-noise excitation [Yoshimura et al. 1999]. To alleviate this problem, many works developed high-quality vocoders: for example, the Mixed Excitation Linear Prediction (MELP) [Yoshimura et al. 2001] [Gonzalvo et al. 2007], the multi-band excitation [Abdel-Hamid et al. 2006], the Harmonic plus Noise Model (HNM) [Hemptinne 2006] [Kim & Hahn 2007], the flexible pitch-synchronous Harmonic/Stochastic Model (HSM) [Banos et al. 2008], the use of a codebook of pitch-synchronous excitation frames [Drugman et al. 2009c], the Speech Transformation and Representation using Adaptive Interpolation of weighted spectrum (STRAIGHT) [Kawahara et al. 1999] [Kawahara & Morise 2011], the glottal-flow-derivative model [Cabral et al. 2007] [Cabral et al. 2008], the glottal waveform [Raitio et al. 2008] or the Deterministic plus Stochastic Model (DSM) of the residual signal [Drugman & Dutoit 2012].
2.5.1 Maximizing P(q | W, λ)

Assuming a left-to-right HMM λ with no skip, as commonly used to model speech, and considering that the state durations are modeled by explicit distributions as explained in Section 2.4.3, the probability of the state sequence $q = \{q_1, \ldots, q_T\}$ can be written as [Yoshimura et al. 1998]

$$P(q \mid W, \lambda) = \prod_{k=1}^{K} p_k(d_k) \qquad (2.67)$$

where $p_k(d_k)$ is the probability of being in state $k$ during $d_k$ frames, $K$ is the number of states in the HMM λ, and

$$\sum_{k=1}^{K} d_k = T \qquad (2.68)$$

In other words, the state sequence $\hat{q}$ which maximizes $P(q \mid W, \lambda)$ is determined by the state durations $\{d_k\}_{k=1}^{K}$ which maximize $\prod_{k=1}^{K} p_k(d_k)$.

Figure 2.10: Duration synthesis, from [Yamagishi 2006].

When the state duration density is modeled by a single Gaussian PDF

$$p_k(d_k) = \mathcal{N}(d_k; \mu_k, \sigma_k^2) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left( -\frac{(d_k - \mu_k)^2}{2\sigma_k^2} \right), \qquad (2.69)$$

the state durations $\{d_k\}_{k=1}^{K}$ which maximize Equation 2.67 under the constraint 2.68, obtained using Lagrange multipliers, are given by:

$$d_k = \mu_k + \rho \, \sigma_k^2, \qquad 1 \leq k \leq K \qquad (2.70a)$$

$$\rho = \left( T - \sum_{k=1}^{K} \mu_k \right) \Big/ \sum_{k=1}^{K} \sigma_k^2 \qquad (2.70b)$$

where $\mu_k$ and $\sigma_k^2$ are respectively the mean and variance of the duration density of state $k$ (see Figure 2.10).
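A minimal sketch of Equations 2.70a and 2.70b follows; the rounding to an integer number of frames (with a floor of one frame) is a practical detail not part of the equations.

```python
# Sketch of the explicit duration synthesis of Equations 2.70a-2.70b: state
# durations are the Gaussian means, shifted in proportion to each state's
# variance to match a total target length T (or a given speaking-rate rho).
import numpy as np

def synthesize_durations(mu, var, rho=None, T=None):
    """mu, var: (K,) duration means/variances; give either rho directly
    (speaking-rate control) or a total length T in frames."""
    if rho is None:
        rho = (T - mu.sum()) / var.sum()          # Equation 2.70b
    d = mu + rho * var                            # Equation 2.70a
    return np.maximum(np.rint(d).astype(int), 1)  # at least one frame

mu = np.array([4.0, 7.5, 6.0]); var = np.array([1.0, 4.0, 2.2])
print(synthesize_durations(mu, var, rho=0.0))     # average speaking rate
print(synthesize_durations(mu, var, rho=-0.5))    # faster speech
```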
Note that, since ρ is associated with $T$ in Equation 2.70b, the speaking rate can be controlled by ρ instead of $T$. Equation 2.70a shows that, to synthesize speech at the average speaking rate, ρ should be set to 0, that is, $T = \sum_{k=1}^{K} \mu_k$, and the speaking rate becomes faster or slower when ρ is set respectively to a negative or a positive value. $\sigma_k^2$ represents the elasticity of the $k$th state duration.

2.5.2 Maximizing P(O | q, λ)

To simplify the notation here, we assume that each state output distribution is a single-stream, single multivariate Gaussian distribution [Zen et al. 2009]

$$b_k(o_t) = \mathcal{N}(o_t; \mu_k, \Sigma_k) \qquad (2.71)$$

where $\mu_k$ and $\Sigma_k$ are the $k$th state output distribution mean vector and covariance matrix, respectively. In this section, the speech parameter sequence $O$ is rewritten in a vector form as

$$o = (o_1^{\top}, \ldots, o_T^{\top})^{\top} \qquad (2.72)$$

that is, $o$ is a super-vector made from all of the speech parameter vectors in the sequence. Using these notations, we have

$$\hat{o} = \arg\max_{o} \{P(o \mid \hat{q}, \lambda)\} = \arg\max_{o} \{\mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})\} \qquad (2.73)$$

where $\mu_{\hat{q}} = (\mu_{q_1}^{\top}, \ldots, \mu_{q_T}^{\top})^{\top}$ and $\Sigma_{\hat{q}} = \mathrm{diag}(\Sigma_{q_1}, \ldots, \Sigma_{q_T})$ are respectively the mean vector and the covariance matrix for the state sequence $\hat{q}$.

If the parameter vector at time $t$ is determined independently of the preceding and succeeding parameter vectors, the speech parameter sequence $o$ maximizing $P(o \mid \hat{q}, \lambda)$ corresponds to the sequence of mean vectors of the states composing the optimal state sequence $\hat{q}$, because of the Gaussian distribution assumption. In addition to being a poor representation of the speech transitions, which are smoother in actual speech, this degrades the synthesized speech quality, as discontinuities (also called "clicks") occur during state transitions [Yamagishi 2006]. This is illustrated by the horizontal red lines (i.e. steps) in Figure 2.11, representing the mean vectors of each state.

To generate a more realistic speech parameter trajectory, relationships between static and dynamic features are introduced as constraints for the maximization problem, by considering that the speech parameter vector $o_t$ consists of the $M$-dimensional standard static feature vector $c_t$ together with the $M$-dimensional first and second order dynamic feature vectors $\Delta c_t$ and $\Delta^2 c_t$, i.e.,

$$o_t = (c_t^{\top}, \Delta c_t^{\top}, \Delta^2 c_t^{\top})^{\top} \qquad (2.74)$$

The dynamic feature vectors are computed by linear combination of the static feature vectors of several frames around the current frame, typically [Tokuda & Zen 2009]:

$$\Delta c_t = 0.5 \, (c_{t+1} - c_{t-1}) \qquad (2.75a)$$
$$\Delta^2 c_t = c_{t+1} - 2 c_t + c_{t-1} \qquad (2.75b)$$
Figure 2.11: Generated speech parameter trajectory, from [Tokuda & Zen 2009].

Denoting by $c = (c_1^{\top}, \ldots, c_T^{\top})^{\top}$ the static feature-vector sequence, the relationship between $o$ and $c$ can be arranged in the following matrix form:

$$o = Wc \qquad (2.76)$$

where $W$ is a matrix that appends dynamic features to $c$. Therefore, maximizing $\mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})$ with respect to $o$ is equivalent to maximizing it with respect to $c$:

$$\hat{c} = \arg\max_{c} \{\mathcal{N}(Wc; \mu_{\hat{q}}, \Sigma_{\hat{q}})\} \qquad (2.77)$$

The key point is the following: by setting

$$\frac{\partial}{\partial c} \ln \mathcal{N}(Wc; \mu_{\hat{q}}, \Sigma_{\hat{q}}) = 0 \qquad (2.78)$$

we can obtain a set of linear equations to determine $\hat{c}$ as

$$W^{\top} \Sigma_{\hat{q}}^{-1} W \hat{c} = W^{\top} \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}} \qquad (2.79)$$

which can be solved very efficiently thanks to the special structure of $W^{\top} \Sigma_{\hat{q}}^{-1} W$. The trajectory of $\hat{c}$ will no longer be piece-wise stationary, since the associated dynamic features also contribute to the likelihood. As illustrated by the smooth and continuous blue lines in Figure 2.11, the inclusion of the dynamic features in the standard feature vector constrains the synthesis to be more realistic.
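To make Equation 2.79 concrete, the following one-dimensional sketch builds the window matrix W from the delta definitions of Equations 2.75a and 2.75b and solves the resulting linear system; real implementations exploit the banded structure of the system matrix with a band Cholesky decomposition rather than a dense solve.

```python
# Minimal 1-D sketch of the parameter generation of Equation 2.79: build the
# window matrix W (statics + deltas of Equations 2.75a-2.75b) and solve
# W' Sigma^-1 W c = W' Sigma^-1 mu for the smooth trajectory c.
import numpy as np

def mlpg_1d(mu, prec):
    """mu: (T, 3) means of (c, delta c, delta2 c); prec: (T, 3) diagonal
    precisions (inverse variances) taken from the chosen state sequence."""
    T = mu.shape[0]
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                                    # static window
        if 0 < t < T - 1:
            W[3 * t + 1, t - 1], W[3 * t + 1, t + 1] = -0.5, 0.5  # (2.75a)
            W[3 * t + 2, t - 1: t + 2] = [1.0, -2.0, 1.0]         # (2.75b)
    P = np.diag(prec.ravel())        # Sigma^-1 (diagonal covariance assumed)
    A = W.T @ P @ W                  # banded, symmetric positive definite
    b = W.T @ P @ mu.ravel()
    return np.linalg.solve(A, b)     # in practice: Cholesky on the band
```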
2.6 Voice Adaptation Techniques

The main advantage of statistical parametric speech synthesis is its flexibility in changing voice characteristics, speaking styles, and emotions. This can be easily performed by transforming its model parameters. In this section, we summarize the major basic techniques used in voice adaptation, in which an existing initial voice model is adapted using (a possibly small amount of) speech adaptation data from a target voice: Maximum Likelihood Linear Regression (MLLR) [Leggetter & Woodland 1995] and Maximum A Posteriori (MAP) estimation [Gauvain & Lee 1994].

2.6.1 Maximum Likelihood Linear Regression (MLLR)

In voice adaptation based on Maximum Likelihood Linear Regression (MLLR), a set of linear transforms is used to map an existing model to a target model such that the likelihood of the adaptation data is maximized. In order to simplify the notation, we consider here single-stream (and single-space), single Gaussian distributions. In MLLR, the state output distributions of the adapted model are obtained by linearly transforming the mean vectors $\mu_i$ and covariance matrices $\Sigma_i$ of the existing model distributions as follows:

$$b_i(o) = \mathcal{N}(o; \zeta_i \mu_i + \varepsilon_i, H_i \Sigma_i H_i^{\top}) = \mathcal{N}(o; W_i \xi_i, H_i \Sigma_i H_i^{\top}) \qquad (2.80)$$

where $W_i = [\zeta_i, \varepsilon_i] \in \mathbb{R}^{L \times (L+1)}$ and $H_i \in \mathbb{R}^{L \times L}$ are the transformation matrices which transform respectively the extended mean vector $\xi_i = [\mu_i^{\top}, 1]^{\top} \in \mathbb{R}^{L+1}$ and the covariance matrix $\Sigma_i$ of the $i$th state output distribution. $\zeta_i$ is the transformation matrix of the mean vector, and $\varepsilon_i$ is the mean vector bias.

There are two main variants of MLLR. When different transforms ζ and H are used to adapt the mean vector and the covariance matrix of a distribution, this is called unconstrained (or model-space) MLLR; otherwise, it is called constrained (or feature-space) MLLR [Gales 1998].

Conventional speech adaptation systems usually used a particular case of unconstrained MLLR, restricted to the adaptation of the mean vectors of the PDFs:

$$b_i(o) = \mathcal{N}(o; \zeta_i \mu_i + \varepsilon_i, \Sigma_i) = \mathcal{N}(o; W_i \xi_i, \Sigma_i) \qquad (2.81)$$

However, the covariance is also an important factor affecting the characteristics of the synthetic speech. An efficient way to adapt both the mean vectors and covariance matrices is thus desirable. In constrained MLLR (CMLLR) adaptation, the mean vector and the covariance matrix of a distribution are transformed simultaneously using the same transformation matrix $\zeta'_i \in \mathbb{R}^{L \times L}$. This model-space transform is equivalent to the following affine transform of the feature space:

$$b_i(o) = \mathcal{N}(o; \zeta'_i \mu_i - \varepsilon'_i, \zeta'_i \Sigma_i \zeta'^{\top}_i) \qquad (2.82a)$$
$$= |\zeta_i| \, \mathcal{N}(\zeta_i o + \varepsilon_i; \mu_i, \Sigma_i) \qquad (2.82b)$$
$$= |\zeta_i| \, \mathcal{N}(W_i \omega; \mu_i, \Sigma_i) \qquad (2.82c)$$

where $\varepsilon_i \in \mathbb{R}^L$ is the bias term of the transform, $\zeta_i = \zeta'^{-1}_i$, $\varepsilon_i = \zeta'^{-1}_i \varepsilon'_i$, and $W_i = [\zeta_i, \varepsilon_i] \in \mathbb{R}^{L \times (L+1)}$ is the transformation matrix which transforms the extended feature vector $\omega = [o^{\top}, 1]^{\top} \in \mathbb{R}^{L+1}$.
The MLLR transformation matrices are estimated, so that the likelihood of the adaptation data is maximized, by means of the Baum-Welch algorithm. However, it is not always possible to estimate these transformation matrices for every distribution, because the amount of adaptation data is generally limited. Therefore, the distributions are usually clustered by a regression-class tree, and transformation matrices are shared among distributions belonging to the same regression class. By changing the size of the regression-class tree according to the amount of adaptation data, it is possible to control the complexity and generalization abilities of the adaptation [Zen et al. 2009]. If $W_k$ is the transformation matrix shared by the set of distributions belonging to the same regression class $k$, Equations 2.81 and 2.82c are rewritten as follows:

$$b_i(o) = \mathcal{N}(o; \zeta_k \mu_i + \varepsilon_k, \Sigma_i) = \mathcal{N}(o; W_k \xi_i, \Sigma_i) \qquad \text{(MLLR)} \qquad (2.83a)$$
$$b_i(o) = \mathcal{N}(o; \zeta'_k \mu_i - \varepsilon'_k, \zeta'_k \Sigma_i \zeta'^{\top}_k) = |\zeta_k| \, \mathcal{N}(W_k \omega; \mu_i, \Sigma_i) \qquad \text{(CMLLR)} \qquad (2.83b)$$

In these equations, $1 \leq k \leq K$, where $K$ is the number of regression classes. The principle of HMM-based unconstrained and constrained MLLR adaptation is illustrated in Figure 2.12.

Figure 2.12: Maximum Likelihood Linear Regression (MLLR) and its related algorithms, adapted from [Yamagishi et al. 2009a].

The MLLR adaptation algorithms are derived in [Leggetter & Woodland 1995], [Tamura et al. 2001], and [Yamagishi & Kobayashi 2007], for the HMM, MSD-HMM and HSMM speech models respectively. A description of the CMLLR adaptation algorithm can be found in [Yamagishi et al. 2009a].
Let us here illustrate the HSMM-based MLLR adaptation algorithm, where the state output and duration distributions are transformed simultaneously as follows:

$$b_i(o) = \mathcal{N}(o; \zeta \mu_i + \varepsilon, \Sigma_i) = \mathcal{N}(o; W \xi_i, \Sigma_i) \qquad (2.84a)$$
$$p_i(d) = \mathcal{N}(d; \chi m_i + \nu, \sigma_i^2) = \mathcal{N}(d; X \phi_i, \sigma_i^2) \qquad (2.84b)$$

where $X = [\chi, \nu] \in \mathbb{R}^{1 \times 2}$ is the transformation matrix for the extended mean vectors $\phi_i = [m_i, 1]^{\top} \in \mathbb{R}^2$ of the state duration distributions belonging to the same regression class. In the same way, $W = [\zeta, \varepsilon] \in \mathbb{R}^{L \times (L+1)}$ is the transformation matrix for the extended mean vectors $\xi_i = [\mu_i^{\top}, 1]^{\top} \in \mathbb{R}^{L+1}$ of the state output distributions belonging to the same regression class. Note that the subscripts corresponding to the regression class indices are now omitted, in order to simplify the notation of the transformation matrices.

The HSMM-based MLLR adaptation estimates the set of transformation matrices $\Lambda = (W, X)$ so as to maximize the likelihood of the adaptation data $O$ of length $T$:

$$\hat{\Lambda} = (\hat{W}, \hat{X}) = \arg\max_{\Lambda} \{P(O \mid \lambda, \Lambda)\} \qquad (2.85)$$

where λ is the parameter set of the existing initial HSMM. The reestimation formulae of the transformation matrices Λ, based on the Baum-Welch algorithm, are derived as follows:

$$\bar{w}_l = y_l \, G_l^{-1} \qquad (2.86a)$$
$$\bar{X} = z^{\top} K^{-1} \qquad (2.86b)$$

where $\bar{w}_l \in \mathbb{R}^{L+1}$ is the $l$th row vector of $\bar{W}$. In these equations, $y_l \in \mathbb{R}^{L+1}$, $G_l \in \mathbb{R}^{(L+1) \times (L+1)}$, $z \in \mathbb{R}^2$ and $K \in \mathbb{R}^{2 \times 2}$ are given by

$$y_l = \sum_{r \in R_b} \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(r) \, \frac{1}{\Sigma_r(l)} \sum_{s=t-d+1}^{t} o_s(l) \, \xi_r^{\top} \qquad (2.87)$$

$$G_l = \sum_{r \in R_b} \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(r) \, d \, \frac{1}{\Sigma_r(l)} \, \xi_r \xi_r^{\top} \qquad (2.88)$$

$$z = \sum_{r \in R_p} \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(r) \, \frac{1}{\sigma_r^2} \, d \, \phi_r \qquad (2.89)$$

$$K = \sum_{r \in R_p} \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(r) \, \frac{1}{\sigma_r^2} \, \phi_r \phi_r^{\top} \qquad (2.90)$$

where $\Sigma_r(l)$ is the $l$th diagonal element of the diagonal covariance matrix $\Sigma_r$, and $o_s(l)$ is the $l$th element of the observation vector $o_s$. $\gamma_t^d(r)$ is the probability of being in state $r$ during the time period from $t-d+1$ to $t$, given the adaptation data $O$, as defined by Equation 2.60. $R_b$ (respectively $R_p$) is the set of indices of the state output (respectively state duration) distributions belonging to the considered regression class, across which the transformation matrix $W$ (respectively $X$) is tied.
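The following sketch simply applies an already-estimated mean transform, as in Equation 2.84a, to all output distributions of one regression class; the toy transform stands in for the matrices that would be obtained from Equations 2.86a-2.90.

```python
# Sketch of applying an estimated MLLR mean transform (Equation 2.84a) to
# the output distributions of one regression class (mean-only MLLR: the
# covariances are left unchanged).
import numpy as np

def mllr_adapt_means(means, W):
    """means: (n, L) mean vectors of one regression class;
    W: (L, L+1) transform [zeta, epsilon] estimated on adaptation data."""
    xi = np.hstack([means, np.ones((len(means), 1))])   # extended vectors
    return xi @ W.T                                     # W xi_i for each i

L = 4
means = np.random.randn(10, L)
W = np.hstack([np.eye(L) * 1.05, 0.1 * np.ones((L, 1))])  # toy transform
adapted = mllr_adapt_means(means, W)
```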
2.6.2 Maximum A Posteriori (MAP) Adaptation

MAP estimation of a model parameter set λ involves the use of prior knowledge about the distributions of the considered parameters:

$$\hat{\lambda} = \arg\max_{\lambda} \{P(\lambda \mid O, W)\} = \arg\max_{\lambda} \{P(O \mid W, \lambda) \, P(\lambda)\} \qquad (2.91)$$

where $P(\lambda)$ is the prior distribution of λ. A major drawback of MAP estimation is that every distribution in the model is individually updated. If the adaptation data is sparse, many of the model parameters will not be updated. This causes the synthesized speech to often switch between the initial and the target voice characteristics within an utterance. For cases where the amount of adaptation data is limited, MLLR is currently a more effective form of adaptation than MAP estimation [Zen et al. 2009].

On the other hand, MLLR relies on the rough assumption that the target voice model can be expressed by a piecewise linear regression of the existing initial voice model. When this assumption is not appropriate, the estimation accuracy provided by MLLR adaptation could be lower than that of a direct estimate of the target model parameters using a sufficient amount of speech data. By additionally applying a MAP adaptation to the model transformed by (C)MLLR, it is possible to appropriately modify and upgrade the estimation for the distributions having a sufficient amount of adaptation data [Digalakis & Neumeyer 1995] [Chien et al. 1997] [Yamagishi et al. 2009a], as illustrated in Figure 2.13.

Figure 2.13: Combined algorithm of the (C)MLLR and MAP adaptation, adapted from [Yamagishi et al. 2009a].

Let us here illustrate this approach for the means of the distributions of an HSMM-based voice model [Ogata et al. 2006] [Yamagishi 2006]. In a first step, the means $\bar{\mu}_i$ and $\bar{m}_i$ of the target voice model output and duration distributions are obtained after (C)MLLR adaptation of the source model, using the adaptation data. In a second step, a MAP estimation is applied to the adapted model using
the same adaptation data, as follows:

$$\mu_i^{\mathrm{MAP}} = \frac{\tau_b \, \bar{\mu}_i + \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\tau_b + \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \, d} \qquad (2.92a)$$

$$m_i^{\mathrm{MAP}} = \frac{\tau_p \, \bar{m}_i + \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \, d}{\tau_p + \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i)} \qquad (2.92b)$$

where $\tau_b$ and $\tau_p$ are positive hyperparameters of the MAP estimation for the state output and duration distributions, respectively. $\gamma_t^d(i)$ is the probability that the adaptation sequence $o_{t-d+1}, \ldots, o_t$ be generated by the $i$th distribution.

Figure 2.14: Relationship between the MAP and the ML estimates, adapted from [Yamagishi 2006].

If the means of the distributions are directly estimated by the EM algorithm, using the same adaptation data, we obtain their ML estimates as follows:

$$\mu_i^{\mathrm{ML}} = \frac{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \, d} \qquad (2.93a)$$

$$m_i^{\mathrm{ML}} = \frac{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \, d}{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i)} \qquad (2.93b)$$

and the MAP estimates can be rewritten as:

$$\mu_i^{\mathrm{MAP}} = \frac{\tau_b}{\tau_b + \Gamma_b(i)} \, \bar{\mu}_i + \frac{\Gamma_b(i)}{\tau_b + \Gamma_b(i)} \, \mu_i^{\mathrm{ML}} \qquad (2.94a)$$

$$m_i^{\mathrm{MAP}} = \frac{\tau_p}{\tau_p + \Gamma_p(i)} \, \bar{m}_i + \frac{\Gamma_p(i)}{\tau_p + \Gamma_p(i)} \, m_i^{\mathrm{ML}} \qquad (2.94b)$$
where

$$\Gamma_b(i) = \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \, d \qquad (2.95a)$$

$$\Gamma_p(i) = \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \qquad (2.95b)$$

As a result, the MAP estimate $\mu_i^{\mathrm{MAP}}$ can be viewed as a weighted average of the (C)MLLR-adapted mean vector $\bar{\mu}_i$ and the ML-estimated mean vector $\mu_i^{\mathrm{ML}}$. This is illustrated in Figure 2.14. Similarly, $m_i^{\mathrm{MAP}}$ can be viewed as a weighted average of $\bar{m}_i$ and $m_i^{\mathrm{ML}}$. When $\Gamma_b(i)$ and $\Gamma_p(i)$ are equal to zero, i.e., no adaptation data is available for the $i$th distribution, the MAP estimates are simply the prior means $\bar{\mu}_i$ and $\bar{m}_i$. On the other hand, when a large amount of adaptation data is available, i.e., $\Gamma_b(i) \to \infty$ or $\Gamma_p(i) \to \infty$, the MAP estimates asymptotically approach the ML estimates $\mu_i^{\mathrm{ML}}$ and $m_i^{\mathrm{ML}}$.
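The interpolation behavior of Equation 2.94a is easy to demonstrate; in the sketch below, the hyperparameter value is arbitrary, and the two limiting cases discussed above (no adaptation data versus abundant adaptation data) are printed.

```python
# Sketch of the MAP interpolation of Equation 2.94a: the adapted mean is a
# weighted average of the (C)MLLR prior mean and the ML estimate, with the
# occupancy Gamma_b(i) controlling the balance.
import numpy as np

def map_mean(mu_prior, mu_ml, gamma_occ, tau=50.0):
    """mu_prior: (L,) (C)MLLR-adapted mean; mu_ml: (L,) ML estimate from
    the adaptation data; gamma_occ: occupancy Gamma_b(i); tau: positive
    hyperparameter (the value here is arbitrary)."""
    w = tau / (tau + gamma_occ)
    return w * mu_prior + (1.0 - w) * mu_ml

mu_prior, mu_ml = np.zeros(3), np.ones(3)
print(map_mean(mu_prior, mu_ml, gamma_occ=0.0))    # no data -> prior mean
print(map_mean(mu_prior, mu_ml, gamma_occ=1e6))    # much data -> ML estimate
```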
Chapter 3 Creation of a Database with various Degrees of Articulation

Contents
- 3.1 Introduction
- 3.2 Database Specifications
- 3.3 Recording Hardware
  - 3.3.1 Audio Acquisition System - Motu 8pre
  - 3.3.2 Microphone - AKG C3000B
  - 3.3.3 XLR Connections
  - 3.3.4 Digital Effects - Behringer Virtualizer DSP1000
  - 3.3.5 Amplifier - Behringer Powerplay Pro-8 HA8000
- 3.4 Conclusions

3.1 Introduction

State-of-the-art techniques in voice technologies, e.g. Automatic Speech Recognition (ASR) or Text-To-Speech (TTS), nowadays mainly rely on statistical models. These models have to be trained on large corpora [van Santen & Buchsbaum 1997] in order to perform a specific task (e.g. speech recognition or synthesis). Such corpora are sometimes recorded by different speakers spread over various recording sessions, which is important for ASR but does not provide the consistency needed for TTS. This problem can be alleviated by optimizing the database content to make it suitable for a specific task [Krstulović et al. 2006] [Nagórski et al. 2002].

The speech synthesis framework uses databases containing clean speech signals, recorded in quiet or, even better, sound-proof rooms, so as to control background noise, interferences, reverberation, etc. For example, the CMU_ARTIC speech databases were carefully created for this purpose [Kominek & Black 2004] [Kominek & Black 2003]. The goal is to cover every phone variation in most of the phonetic contexts, requiring very large corpora of speech data [van Santen 1997]. This process can be lightened by automatically selecting a reasonable number of sentences covering the most frequent phones (in context) of the database [Black & Lenzo 2003] [Falaschi 1989] [François & Boëffard 2002].
Only a few databases designed for speech synthesis are freely available online (e.g. the Festvox project¹, which includes the already mentioned CMU_ARTIC database, or Emo-DB [Burkhardt et al. 2005]). Various others are commercially available (e.g. from the European Language Resources Association - ELRA²). Unfortunately, these databases are not suited for a deep study of the effects caused and induced by the Degree of Articulation (DoA) of speech. In order to reach this goal, we therefore need to record a dedicated speech database, fulfilling the following specifications:

- for a thorough acoustic and phonetic analysis: three parallel corpora, each one pronounced with a different DoA: neutral (NEU), hypo (HPO) and hyperarticulated (HPR);
- for reliable model estimation and high-quality speech synthesis: high dynamics and clean (i.e. noise- and reverberation-free) speech signals.

Recording such a database first requires finding an actor, preferably professional, able to produce specific speech styles on demand. The problem is that different speakers will react differently to explicit instructions on the nature of HPO and HPR speech. Moreover, there is no proof that the speech produced in this way corresponds to real HPO and HPR speech, as no clear definitions of these speaking styles exist. A method for eliciting specific speaking styles while controlling the speech material is therefore required.

Some studies have focused on such elicitation methods in the past few years. For example, Brink [Brink et al. 1998] designed a novel method for recording speech produced under various DoA. In his work, HPO speech was elicited by a distractor task carried out while simultaneously pronouncing the sentences. The distractor task involved remembering a sequence of four to seven digits presented immediately prior to the sentence to be pronounced. NEU speech was collected by simply reading prompted sentences on a screen, without any other task. HPR speech was elicited by frequently asking the speaker to repeat the last sentence more clearly. As already mentioned, this requires clarifying the definition of the desired speaking style target. A similar recording protocol was implemented in [Harnsberger et al. 2008] for NEU and HPR speech, except that the digit span of the distractor task was individually calibrated, contrary to [Brink et al. 1998].

Similarly to [Langner & Black 2004] or [Chen 1980], we propose in this chapter an objective recording protocol inducing the speaker's HPO or HPR speech automatically, without the speaker being aware of it. In [Langner & Black 2004], the authors created a database of speech in noise (i.e. Lombard speech) by recording a voice talent pronouncing the utterances while listening to babble noise through a headset, thus keeping the perturbation out of the recorded signal. In [Chen 1980], a clear speech database was recorded. Clear speech was elicited through an intermediate person listening to masking noise and to the speaker's speech. Each time the speaker read a sentence including a Consonant-Vowel
(CV), the listener responded with the embedded CV which he heard. The speaker repeated the sentence until the listener perceived the correct CV. However, neither of these studies recorded HPO speech.

The database specifications and the elicitation protocol are detailed in Section 3.2. More information about the recording hardware is given in Section 3.3. Finally, Section 3.4 concludes the chapter.

3.2 Database Specifications

For the purpose of our research, a new French database was recorded. It consists of utterances produced by a single male speaker, aged 25, a native speaker of French (Belgium). The database contains three separate sets, each set corresponding to one DoA (NEU, HPO and HPR). For each set, the speaker was asked to pronounce the same 1359 phonetically balanced sentences (around 75, 50 and 100 minutes of NEU, HPO and HPR speech respectively), as emotionless as possible. These sentences were kindly provided by Acapela Group S.A., and were chosen so as to cover the most frequent phones (in context). The speaker was placed inside a sound-proof room, equipped with a screen displaying the sentences to be pronounced, and with an AKG C3000B microphone, as illustrated in Figure 3.1. The Motu 8pre audio acquisition system was used outside this room, with a sampling rate of 44.1 kHz. Finally, as illustrated in Figure 3.2, a headset was provided to the speaker for both the HPO and HPR recordings, in order to induce him to speak naturally while modifying his DoA, similarly to [Langner & Black 2004] and [Chen 1980].

Figure 3.1: Sound-proof room equipped in order to record natural-sounding NEU, HPO and HPR speech.
While recording HPR speech, the speaker was listening to a version of his voice modified by a "Cathedral" effect. This effect generates long and dense reverberation (simulating the acoustic environment that would actually be experienced in real churches or cathedrals), forcing the speaker to talk more slowly and as clearly as possible (more articulatory effort to produce speech). We thus expect this kind of speech to be more robust to reverberant environments, and this will be checked in Chapter 7. The Cathedral environment was generated by the Behringer Virtualizer DSP1000 digital multi-effects processor.

On the other hand, while recording HPO speech, the speaker was listening to an amplified version of his own voice. This effect produces the impression of talking very close to someone in a quiet environment (e.g. in a library), allowing the speaker to talk faster and less clearly (less articulatory effort to produce speech). The amplification effect was created using the Powerplay Pro-8 HA8000 amplifier. Proceeding this way allowed us to create a standard recording protocol, so as to obtain repeatable conditions if required in the future.

Figure 3.2: Schematic illustration of the standard recording protocol designed in this work to induce the speaker's (a) HPO ("amplification" effect) and (b) HPR ("cathedral" effect) speech.

The recordings were then resampled to 16 kHz, mainly because most of the spectral information is spread over the [0-8] kHz frequency band, and the utterances were finally segmented such that they start and finish with a silence of about 200 ms.

3.3 Recording Hardware

The speaker was placed inside a sound-proof room, equipped with a screen displaying the sentences to be pronounced, and with the AKG C3000B microphone recording speech. Outside this room, a Motu 8pre was capturing audio from the microphone using XLR connections on one side, and was sending the results to a computer connected through FireWire. On the other side, the Motu 8pre was connected to a special audio effects device, the Behringer Virtualizer DSP1000 (used to induce HPR speech). Finally, this Virtualizer was connected to the Powerplay Pro-8 amplifier (used to induce HPO speech). This is illustrated in Figure 3.1.
3.3.1 Audio Acquisition System - Motu 8pre

The Motu 8pre [Motu8pre 2006] is a powerful 24-bit, 96 kHz digital audio acquisition system. The back panel is composed of 8 analog XLR/TRS inputs and other interfaces (FireWire, MIDI, ADAT optical, etc.). The front panel is composed of 8 input level adjustment tuners (one for each mic or instrument input), a headphone connector, main output and headphone volume controls, five-segment level meters displaying the audio levels for each input, and LEDs indicating the clock frequency.

3.3.2 Microphone - AKG C3000B

The AKG C3000B [AKGC3000B 1999] is a professional large-diaphragm condenser microphone optimized for on-stage and studio applications. It has a wide dynamic range of 140 dB.

3.3.3 XLR Connections

Amongst all the advantages XLR connections present, the most important one is noise/interference reduction and cancelation. Indeed, the XLR connector is composed of three pins: the ground, the positive polarity terminal (hot pin) and the return terminal (cold pin). The useful signal is sent through the hot pin with the actual polarity and through the cold pin with the inverted polarity. Noise and interference potentially induced by the transport of the signal through the cable will be reflected identically on both the hot and cold pins. Therefore, since the receiver computes the difference between the hot pin and the cold pin, noise and interference can be reduced or canceled.

3.3.4 Digital Effects - Behringer Virtualizer DSP1000

The Behringer Virtualizer DSP1000 [VirtualizerDSP] is a digital multi-effects processor powered by a 24-bit high-speed Digital Signal Processor (DSP). It allows applying various effects to the audio signal, and in particular high-quality reverb effects (virtual acoustics reverb algorithms calculated from precise mathematical models of real rooms). This Virtualizer was the key to inducing the speaker's HPR speech. We chose in this work the first effect, "Cathedral", as illustrated in Figure 3.2 (b).

3.3.5 Amplifier - Behringer Powerplay Pro-8 HA8000

The Powerplay Pro-8 HA8000 amplifier [PowerplayPro8 2002] is an 8-channel professional multi-purpose headphones amplifier system for stage and studio applications. It is composed of 8 totally independent stereo high-power amplifier sections in one rack space. This amplifier was the key to inducing the speaker's HPO speech, as illustrated in Figure 3.2 (a).

3.4 Conclusions

In this chapter, we recorded a new French database which will be used throughout this work. The database was recorded following two constraints. First, it should consist of three
distinct and parallel sets (one for each DoA to be studied), so as to analyze the effects caused and induced by the DoA. Second, the recordings had to be of high quality and noise- or perturbation-free, in order to generate high-quality HMM-based speech synthesis with a varying DoA. Moreover, a standard recording protocol was created in order to obtain repeatable conditions if required in the future. Indeed, the speaker was provided with a headset and was listening to either a high level of reverberation (for HPR speech) or a high amplification of his own voice (for HPO speech). We implemented such a recording protocol because defining HPO and HPR speech exactly and precisely is a very difficult task.
Summary of Chapter 3

For the purpose of our research, a new French database was recorded:
- the database was recorded by a single male speaker, aged 25, a native speaker of French (Belgium);
- the database contains 3 separate sets, one for each DoA: NEU, HPO and HPR;
- each set is made of the same 1359 phonetically balanced sentences (around 75, 50 and 100 minutes of NEU, HPO and HPR speech respectively);
- the sentences are pronounced as emotionless as possible.

Creation of a standard recording protocol:
- a headset was provided to the speaker for both the HPO and HPR recordings, in order to induce him to speak naturally while modifying his DoA;
- while recording HPR speech, the speaker was listening to a version of his voice modified by a "Cathedral" effect (producing a lot of reverberation, inducing more effort to produce speech);
- while recording HPO speech, the speaker was listening to an amplified version of his own voice (giving the impression of talking very close to someone in a quiet environment, inducing less effort to produce speech).

To the best of our knowledge, this database is unique in the sense that it allows both a thorough analysis and high-quality speech synthesis of the DoA.
Chapter 4 Analysis of Hypo and Hyperarticulated Speech

Contents
- 4.1 Introduction
  - 4.1.1 Increase in the Articulation Effort
  - 4.1.2 Decrease in the Articulation Effort
  - 4.1.3 Contributions and Structure of the Chapter
- 4.2 Acoustic Analysis
  - 4.2.1 Vocal Tract-based Modifications
  - 4.2.2 Glottal-based Modifications
- 4.3 Phonetic Analysis
  - 4.3.1 Glottal Stops
  - 4.3.2 Phone Variations
  - 4.3.3 Phone Durations
  - 4.3.4 Speech Rate
- 4.4 Conclusions

4.1 Introduction

Human talkers continuously adjust their speech production while talking. Lindblom introduced the H and H theory [Lindblom 1983], in which speakers are expected to vary their output along a continuum of HPO and HPR speech, in order to analyze the evolution of speech production with the speaker's behavior. He affirms, amongst other things, that the modifications occurring in human speech production result in a tradeoff between the transmission of the useful information and the efforts required to achieve this goal [Lindblom 1990]. A recent study [Nicolao et al. 2012] presented a computational model of human speech production providing a continuous adjustment according to environmental conditions. Moreover, due to the high dimensionality of the representation space, the problem of visualizing some aspects of the human speech production system was addressed in [Nicolao & Moore 2012b] by developing two-dimensional computational models. According to Lindblom and Moon [Lindblom 1963] [Moon & Lindblom 1994], the DoA is characterized by modifications of
phonetic context, of fundamental frequency, of speech rate and of spectral dynamics (vocal tract rate of change).

As already pointed out in Chapter 1, HPR speech refers to the situation of a person talking in a reverberant environment, e.g. a teacher or a speaker talking in front of a large audience (important articulation efforts have to be made to be understood by everybody). HPO speech refers to the situation of a person talking in a quiet environment (e.g. in a library) or very close to someone (few articulation efforts have to be made to be understood). NEU speech refers to the daily life situation of a person reading a text aloud emotionlessly (e.g. no happiness, no anger, no excitement) and without any specific articulation efforts to produce the speech, keeping only the modality-related intonation: pitch rise for questions, flat pitch for affirmative or negative sentences, etc.

The DoA of speech is considered to be the fifth dimension of prosody [Pfitzinger 2006], which is composed of intensity, intonation, timing, voice quality [d'Alessandro 2006] and DoA. The DoA is usually defined [Wouters & Macon 2001] as the joint evolution of the speech rate and the difference existing between real formant observations and formant targets for each phone (the latter being defined taking coarticulation into account). Because the definition of formant targets is not straightforward, Beller proposed in [Beller 2009] [Beller et al. 2008] [Beller 2007] a statistical measure of the DoA by studying the joint evolution of the vocalic triangle area (i.e. the shape formed by the vowels /a/, /i/ and /u/ in the F1-F2 space) and the speech rate. From the analysis point of view, [Wouters 1996] studied the DoA in the context of unit-selection speech synthesis. The latter work focused, amongst other things, on the spectral rate-of-change of acoustic units, which is related to the speed of the articulatory movements in the vocal tract.

In direct connection with the DoA, a phonetic classification system for describing various voice qualities was created by [Laver 1980] [Laver 1994]. The author defines voice quality as the characteristic auditory coloring of a speaker's voice. The voice production apparatus is modeled by different laryngeal and supralaryngeal settings, and these physical patterns are used to characterize the voice quality. Supralaryngeal settings are divided into longitudinal, latitudinal, and velopharyngeal settings. Laryngeal settings include modal voice, falsetto, whisper, creak, harshness, and breathiness [Laver 1980]. General tensing or laxing of the entire vocal tract musculature is also defined. A summary of Laver's articulatory and acoustic voice quality scheme can be found in [Keller 2005]. Voice quality gathers different aspects of the implicit communication channel (speaker's intentions, mood, health status, etc.) and has been the focus of many studies: for example, [Airas 2008] analyzed it, studying the lax-tense axis of laryngeal voice quality more deeply.

HPR, NEU, HPO, clear, and casual speech have been the focus of many studies from the acoustic and/or phonetic points of view. An overview of such studies is presented in Sections 4.1.1 and 4.1.2, covering respectively an increase and a decrease in the articulation effort.

4.1.1 Increase in the Articulation Effort

Lombard speech and clear speech are directly related to HPR speech.
The Lombard effect [Lombard 1911] refers to the speech changes due to the immersion of the speaker in a noisy environment. It is indeed known that a speaker tends to increase his vocal efforts to be more easily understood while talking in background noise [Summers et al. 1988]. Moreover, it was shown in [Lu & Cooke 2008] that speakers modify their productions in N-talker noise backgrounds. Various aspects of the Lombard effect have already been studied, including acoustic and articulatory characteristics [Garnier et al. 2006b] [Garnier et al. 2006a], features extracted from the glottal flow [Drugman & Dutoit 2010a], or changes of F0 and of the spectral tilt [Lu & Cooke 2009]. Amongst other changes, the Lombard effect is typically associated with an increase in pitch, a slower speaking rate and a decrease in spectral tilt [Summers et al. 1988] [Lu & Cooke 2008] [Junqua 1993]. In [Hazan & Baker 2011], the authors investigated whether speech production is guided by the interlocutors' communicative needs, and showed that talkers use their control of speech production to ensure effective and efficient communication to listeners.

Clear speech is often elicited through formal instructions asking the speaker to talk as clearly as possible, as if intended for hearing-impaired listeners, listeners in noisy environments or second language learners. In direct connection, HPR speech is elicited in a more natural way, by the need for effective communication while minimizing articulatory effort. Clear speech thus induces the talker to produce a relatively consistent degree of clarification across the corpus, while HPR speech induces a continuous compromise between the need to clarify speech to ensure efficient communication and the minimization of the talker's effort [Hazan & Baker 2011].

A comprehensive analysis of acoustic, prosodic, and phonological adaptations to speech during human-computer error resolution was provided, amongst others, by [Oviatt et al. 1998]. A study on the production of clear American English fricatives [Maniwa et al. 2009] demonstrates that there are systematic acoustic-phonetic modifications in the production of clear fricatives. Similar patterns were also observed in other languages [Smiljanić & Bradlow 2005] [Köster 2001] [Wassink et al. 2007]. Chen [Chen 1980] studied the acoustic properties and the intelligibility differences existing between clear and conversational speech. Conversational speech corresponds to speech produced spontaneously, as opposed to read speech, which corresponds to speech produced while reading prompts. The clear speaking style has been found to induce modifications in vowel space [Picheny et al. 1986] [Ferguson 2002] [Ferguson & Kewley-Port 2002] [Bradlow et al. 2003] [Hazan & Baker 2010], mean fundamental frequency [Bradlow et al. 2003] [Hazan & Baker 2010], phone duration [Picheny et al. 1986], speaking rate [Picheny et al. 1986] [Picheny et al. 1989], number of pauses [Picheny et al. 1986] [Picheny et al. 1989], long term spectra [Picheny et al. 1986] [Krause & Braida 2004] [Hazan & Baker 2010], etc. However, all these modifications may not be adopted by all speakers. Moreover, [Krause & Braida 2002] examined whether an alternative form of clear speech exists at faster rates than the traditional slow one, and found it possible up to a certain cutoff speaking rate. Nonetheless, experiments reported in [Johnson et al. 1993] support the traditional view that phonetic targets are found in HPR speech. This latter work introduced, amongst other things, the hyperspace effect: the representation vowel space is extended compared to the production vowel space.
The production vowel space was obtained by speakers producing vowels. The representation one was obtained by the Method of Adjustment (MoA, [Scholes 1967]). The MoA consists in asking listeners to adjust one or more parameters of a speech synthesizer until it correctly pronounces the desired speech sound.

4.1.2 Decrease in the Articulation Effort

Acoustic and phonetic comparisons of casual and clear speech styles elicited in read and spontaneous speech were conducted in [Hazan & Baker 2010]. An analysis of casual speech within the framework of articulatory phonology was performed in [Browman & Goldstein 1986] [Browman & Goldstein 1990]. In this framework, lexical items are specified as a set of coordinated articulatory gestures representing various linguistically significant vocal tract shapes. The authors suggested that casual speech is produced by increasing the temporal overlap between gestures while reducing their magnitudes spatially and temporally. Albeit not discussed, their model seems to suggest that HPR speech could be produced by decreasing the overlap between gestures while increasing their magnitudes. The probability of words being HPO or HPR is thought to be related to the amount of information they carry [Lindblom 1996] [Jurafsky et al. 2001] [Aylett & Turk 2004] [Baker & Bradlow 2009]. Typically, words should be HPO if they are either predictable from the sentence context or not essential to the comprehension of the sentence, and conversely for HPR speech. Aylett investigated the relationships between language redundancy and care of articulation (i.e. the equivalent of our DoA) in spontaneous speech in [Aylett 2000]. He argued that smooth signal redundancy and a checking signal could be implemented by prosodic prominence and boundaries. Prosodic prominences are shown to increase the care of articulation, to decrease language redundancy and to coincide with unpredictable sections of speech. Prosodic boundaries play the role of a checking signal by either delineating a potentially meaningful section of speech or providing the listener with an occasion to request clarification. HPO and HPR speech was also studied in the visual modality, i.e. the effects brought by an animated head avatar on the perception of HPO and HPR speech [LeGoff & Benoît 1996] [Benoît et al. 1996a]. The authors found that HPO speech was less intelligible than standard articulation at a conversational rate, but the results were less clear regarding HPR speech. This was mainly due to the fact that the facial models were unnatural and still to be improved. However, synthetic faces may be animated to produce highly controlled stimuli which could not be produced by human speakers.

4.1.3 Contributions and Structure of the Chapter

One can obviously expect important changes during the production of HPO and HPR speech, compared to the NEU speech style, both at the laryngeal and supralaryngeal levels. The goal of this chapter is therefore to gain a better understanding of the specific characteristics governing HPO and HPR speech. These modifications can be categorized into two main parts: acoustic (Section 4.2) and phonetic (Section 4.3) variations. The first part is related to speech production through the vocal tract (Section 4.2.1) and the glottal excitation (Section 4.2.2), while the second part focuses on the changes induced on the phonetic transcriptions. The latter section
analyses respectively glottal stops (Section 4.3.1), phone variations (Section 4.3.2), phone durations (Section 4.3.3) and speech rate (Section 4.3.4) for each DoA. Note that all the results reported throughout this chapter were obtained by an analysis of the entire original corpora (as described in Chapter 3). Finally, Section 4.4 concludes the chapter. This chapter is based upon the following publications: [Picart et al. 2010] [Picart et al. 2013a].

4.2 Acoustic Analysis

Acoustic modifications in expressive speech have been extensively studied in the literature [Klatt & Klatt 1990] [Childers & Lee 1991] [Keller 2005]. Important changes related to the response of the vocal tract (also referred to as supralaryngeal structures in articulatory phonetics [Laver 1994]) are expected in this study. Indeed, the articulatory strategy adopted by the speaker may vary dramatically during the production of HPO and HPR speech. Although it is still not clear whether these modifications consist of a reorganization of the articulatory movements, or of a reduction or an amplification of the normal ones, speakers generally tend to change their way of articulating consistently. According to the H and H theory [Lindblom 1983], speakers reduce the dynamics of their articulatory trajectories in HPO speech, resulting in low intelligibility, while the opposite strategy is adopted in HPR speech. As a consequence, vocal tract (or supralaryngeal) configurations may be strongly affected. The resulting changes are studied in Section 4.2.1. In addition, the produced voice quality is also altered. Since voice quality variations are mainly considered to be controlled by the glottal source [Drugman et al. 2012a] [d'Alessandro 2006] [Södersten et al. 1995], Section 4.2.2 focuses on the modifications of the glottal characteristics (also sometimes called laryngeal features [Laver 1994]) with regard to the DoA.

4.2.1 Vocal Tract-based Modifications

Beller analyzed in [Beller 2009] the evolution of the vocalic triangle with the DoA, providing interesting information about the variations of the vocal tract resonances. The vocalic triangle is the shape formed by the vowels /a/, /i/ and /u/ in the space constituted by the first two formant frequencies F1 and F2 (here estimated via the Wavesurfer software [Sjölander & Beskow 2000]). Figure 4.1 displays the evolution of the vocalic triangle with the DoA, averaged over all the vowels /a/, /i/ and /u/ for all the original sentences. Dispersion ellipses are also indicated on this figure for information. It is observed that the dispersion can be high for the vowel /u/ (particularly for F2), while the data is relatively well concentrated for the vowels /a/ and /i/. A significant reduction of the vocalic triangle area is clearly noticed as speech becomes less articulated. These changes of vocalic space are summarized in Table 4.1, which presents the area defined by the average vocalic triangles. As a consequence of this reduction, acoustic targets become less separated in the vocalic space, confirming that articulatory trajectories are less marked during an HPO strategy (illustrated by the formant target undershoot in [Lindblom 1963] [Wouters 1996]). This also partially explains the lower intelligibility of HPO speech. The opposite tendency
is observed for HPR speech, resulting from the increased articulatory efforts produced by the speaker. Indeed, the formant frequencies are found to cluster more tightly (see the dispersion ellipses), indicating that the formants reach their target values more closely. These results are corroborated by other studies (e.g. [Chen 1980] [Bradlow et al. 2003] [Harnsberger et al. 2008] [Hazan & Baker 2010]).

[Figure 4.1: Vocalic triangle estimated on the original recordings for each DoA (F1 versus F2, with the vowels /a/, /i/ and /u/), together with dispersion ellipses.]

Dataset     HPR   NEU   HPO
Original     …     …     …

Table 4.1: Vocalic space (in kHz²) for the three DoA for the original sentences.
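For concreteness, the vocalic space areas reported in Table 4.1 can be obtained with a few lines of code. The sketch below (in Python, with illustrative variable names and formant values; the thesis itself relies on Wavesurfer for the formant estimates) applies the standard shoelace formula to the mean (F1, F2) coordinates of /a/, /i/ and /u/:

    import numpy as np

    def vocalic_space_area(mean_f1, mean_f2):
        """Area (in kHz^2) of the triangle formed by the mean (F1, F2)
        coordinates of /a/, /i/ and /u/, both given in Hz."""
        (x1, y1), (x2, y2), (x3, y3) = np.column_stack((mean_f1, mean_f2)) / 1000.0
        # shoelace formula for the area of a triangle
        return 0.5 * abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))

    # mean formants per vowel [/a/, /i/, /u/] for one DoA (values illustrative)
    print(vocalic_space_area([700.0, 290.0, 320.0], [1300.0, 2300.0, 800.0]))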
4.2.2 Glottal-based Modifications

As the most important perceptual glottal feature, pitch histograms are displayed in Figure 4.2. It is clearly noted that the more speech is articulated, the higher the fundamental frequency on average and the wider its dynamic range.

[Figure 4.2: Pitch histograms for each DoA (probability as a function of the fundamental frequency, in Hz).]

Besides these prosodic modifications, we investigated how some characteristics of the glottal flow are affected. First, the glottal source is estimated by the Complex Cepstrum-based Decomposition algorithm (CCD, [Drugman et al. 2011]), as it was shown in [Drugman et al. 2012a] to provide the best glottal source estimation results. This technique relies on the mixed-phase model of speech [Bozkurt & Dutoit 2003] [Drugman et al. 2009a]. According to this model, speech is composed of both minimum-phase and maximum-phase components, where the latter contribution is only due to the glottal flow. By isolating the maximum-phase component of speech, the CCD method has shown its ability to efficiently estimate the glottal source. This however requires a preliminary determination of the Glottal Closure Instants (GCIs). In this work, GCI positions were estimated using the Speech Event Detection using the REsidual And Mean-based Signals (SEDREAMS) algorithm [Drugman & Dutoit 2009], which was shown in [Drugman et al. 2012b] to provide among the best results. Figure 4.3 displays the averaged magnitude spectrum of the glottal source for each DoA computed using this technique, on the original data contained in our database (Chapter 3). Our conclusive observations are the following:

- the resulting spectra have a strong similarity with those derived from models of the glottal source (such as the LF model [Fant et al. 1985]), which corroborates the validity of these models and of the estimation process;
- a high DoA is characterized by a glottal flow containing more energy in the high frequencies, compared to the NEU case;
- the glottal formant frequency [Bozkurt et al. 2004] increases with the DoA (see the zoom in the top right corner of Figure 4.3), meaning that the glottal open phase is more abrupt in HPR speech.

The glottal formant, together with the spectral tilt, is one of the main components of the amplitude spectrum of the glottal flow. Despite its name, it does not correspond to an acoustic resonance. Moreover, it has been shown in [Doval et al. 2003] that the glottal flow can be modeled as a causal/anticausal linear filter, where the first phase of the glottal flow constitutes the anticausal part (forming the glottal formant) and the spectral tilt constitutes the causal part.
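As an illustration of how a figure such as Figure 4.3 can be derived, the following sketch averages, in the dB domain, the magnitude spectra of glottal-flow frames. It assumes the GCI-synchronous glottal frames (each shorter than the FFT length) have already been estimated by CCD, and it simplifies the framing and normalization details of the actual analysis:

    import numpy as np

    def averaged_glottal_spectrum(glottal_frames, fs, nfft=1024):
        """Average the magnitude spectra (in dB) of estimated glottal-flow
        frames; returns the frequency axis (Hz) and the mean spectrum (dB)."""
        spectra = [20.0 * np.log10(np.abs(np.fft.rfft(f * np.hanning(len(f)), nfft)) + 1e-12)
                   for f in glottal_frames]
        return np.fft.rfftfreq(nfft, d=1.0 / fs), np.mean(spectra, axis=0)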
[Figure 4.3: Averaged magnitude spectrum of the glottal source (in dB, as a function of frequency in Hz) for each DoA; in the top right corner, a zoom on the glottal formant frequency.]

Secondly, the maximum voiced frequency is analyzed. In some approaches, such as the Harmonic plus Noise Model (HNM, [Stylianou 2001]) or the Deterministic plus Stochastic Model of the residual signal (DSM, [Drugman et al. 2009b]), which will be used for synthesis in Chapter 5, the speech signal is considered to contain a non-periodic component beyond a given frequency. This maximum voiced frequency (Fm) demarcates the boundary between two distinct spectral bands, where respectively a harmonic and a stochastic modeling (the latter related to the turbulences of the glottal airflow) are assumed to hold. In this work, Fm was estimated using the algorithm described in [Stylianou 2001]. The corresponding histograms are illustrated in Figure 4.4 for each DoA. It turns out that:

- the more speech is articulated, the higher Fm, the stronger the harmonicity and, consequently, the weaker the presence of noise in speech;
- the average values of Fm are 4215 Hz (HPR), 3950 Hz (NEU) and 3810 Hz (HPO). Note that this confirms the choice of 4 kHz for the synthesis of NEU speech in [Pantazis & Stylianou 2008] or [Drugman et al. 2009b].

[Figure 4.4: Histograms of the maximum voiced frequency for each DoA.]

4.3 Phonetic Analysis

In complement to the acoustic analysis of HPO and HPR speech in Section 4.2, we also investigate their phonetic modifications compared to the NEU style. In the following, glottal stops (Section 4.3.1), phone variations (Section 4.3.2), phone durations (Section 4.3.3) and speech rates (Section 4.3.4) are studied. These results are reported for the whole database described in Chapter 3 and may thus be specific to our particular speaker, as
such phonetic changes are known to have a certain inter-speaker variability [Beller 2009]. Note that the database was segmented using manually-checked HMM forced alignment [Malfrère et al. 2003], with 36 standard French phones and the SAMPA phonetic alphabet.

4.3.1 Glottal Stops

A glottal stop [Gordon & Ladefoged 2001] [Borroff 2007] is a type of consonantal sound (cough-like) released just after the silence produced by the complete closure of the glottis. In French, such a phenomenon happens when the glottis closes completely before a vowel. Based on this observation, we chose to detect glottal stops manually in this study. Note that a method for detecting glottal stops in continuous speech was proposed in [Yegnanarayana et al. 2008]. The number of glottal stops for each vowel is displayed in Figure 4.5, for each DoA. Interestingly, it appears that HPR speech is characterized by a higher number of glottal stops (almost always double) than NEU and HPO speech, between which no sensible difference is noticed. This is an expected characteristic of the DoA. Indeed, HPR speech aims at increasing the intelligibility of a message, compared to the NEU style, requiring more articulatory effort. For example, word emphasis highlighting important information can be produced by the speaker under HPR by inserting a short pause before the emphasized word, producing at the same time a glottal stop. Finally, in HPR speech, the vowel /a/ is associated with the highest number of glottal stops, as illustrated in Figure 4.5. This comes mainly from prepositions and verbs (and, in the latter case, essentially from avoir, which means to have in English).
[Figure 4.5: Number of glottal stops for each vowel (i, e, E, a, O, o, u, y, 2, e~, a~, o~, 9~) and for each DoA.]

4.3.2 Phone Variations

Compared to the NEU speech style, any phonetic insertion, deletion or substitution made by the speaker under HPO and HPR is part of the phone variations. This study was performed both at the phone level, considering the phone position in the word, and at the phone group level, considering groups of phones that were inserted, deleted or substituted. Table 4.2 presents the total proportion of phone deletions in HPO speech and phone insertions in HPR speech (first row) for each phone (the most significant results are highlighted). The positions of these deleted and inserted phones inside the words are also indicated: at the beginning (second row), in the middle (third row) and at the end (fourth row). Note that only the phones with the most relevant differences are shown in this table, for the sake of conciseness. Note also that since there is no significant deletion process in HPR, no significant insertion process in HPO and no significant substitution process in either case, these are not shown in Table 4.2.

[Table 4.2: Deleted and inserted phone percentages in HPO and HPR speech respectively, compared to the NEU style, and their repartition inside the words: total (first row), beginning (second row), middle (third row), end (fourth row). The phones shown include /j/, /H/, /t/, /k/, /z/, /Z/, /l/, /R/, /E/ and /_/.]

The most important variations concern pauses /_/ between two words, and Schwa /@/ deletions in HPO speech and insertions in HPR speech. Moreover, HPO speech also counts other significant phone deletions, i.e. /R/, /l/, /Z/ and /z/. Schwa, also called mute e or unstable e, is very important in French. It is the only vowel that may or may not be pronounced (all other vowels must be clearly pronounced), and several authors have focused on its use in French (e.g. [Browman & Goldstein 1994] or [Adda-Decker et al. 1999]). Besides, it is widely used by French speakers to mark hesitations. These conclusions concerning the Schwa are thus probably specific to French, and an extension of this phenomenon to other languages would therefore require further study. The Schwa also plays a crucial role in English in [Nicolao et al. 2012], where it corresponds to the vowels' low-energy attractor (i.e. the neutral vowel position). This means that all other vowels tend to converge to it when the phonetic contrast among them is perceived to be reduced (i.e. in HPO speech).
The analysis performed at the phone group level revealed similar tendencies. While no significant group insertions in HPR speech were detected, frequent phone group deletions in HPO speech were found: e.g., at the end of words, je suis (which means I am) becoming j'suis or even chui. However, no significant phone group substitutions were observed in either case.
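Deletion and insertion counts of this kind can be obtained by aligning each HPO or HPR transcription against its parallel NEU transcription. The following sketch uses a standard sequence alignment (here Python's difflib; the exact alignment procedure used in this study is not detailed above, so this is only one possible implementation), counting a substitution as one deletion plus one insertion since substitutions turned out to be negligible:

    from collections import Counter
    from difflib import SequenceMatcher

    def phone_variations(neu_phones, other_phones):
        """Count phones deleted from and inserted into `other_phones`
        (HPO or HPR) with respect to the parallel NEU transcription.
        Both arguments are lists of SAMPA phone symbols."""
        deletions, insertions = Counter(), Counter()
        matcher = SequenceMatcher(a=neu_phones, b=other_phones, autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op in ('delete', 'replace'):
                deletions.update(neu_phones[i1:i2])    # present in NEU, absent in the other style
            if op in ('insert', 'replace'):
                insertions.update(other_phones[j1:j2])
        return deletions, insertions

    # e.g. "je suis" reduced to "chui" (SAMPA symbols, illustrative)
    print(phone_variations(['Z', '@', 's', 'H', 'i'], ['S', 'H', 'i']))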
4.3.3 Phone Durations

It is intuitively expected that the DoA influences phone durations, since HPO and HPR speech target different intelligibility goals. This directly affects the speech rate (Section 4.3.4). Several studies support this intuition. Evidence for the Probabilistic Reduction Hypothesis is presented in [Jurafsky et al. 2001]: word forms are reduced when they have a higher probability, which could be interpreted as evidence that probabilistic relations between words are represented in the mind of the speaker. Similarly, Baker [Baker & Bradlow 2009] examined how probability (lexical frequency and previous occurrence), speaking style and prosody affect word duration, and how these factors interact with each other.

In this work, we investigated the phone duration variations between NEU, HPO and HPR speech. Vowels and consonants were grouped according to broad phonetic classes. Figure 4.6 shows the duration histograms of (a) front, central, back and nasal vowels, (b) plosive and fricative consonants, and (c) pauses. Figure 4.7 shows the histograms of (a) semi-vowels and (b) trill consonants.

[Figure 4.6: Phone duration histograms for each DoA. (a) Front, central, back & nasal vowels. (b) Plosive & fricative consonants. (c) Pauses.]

[Figure 4.7: Phone duration histograms for each DoA. (a) Semi-vowels. (b) Trill consonants.]

As expected, the durations of front, central, back and nasal vowels are on average shorter in HPO and longer in HPR speech. The same conclusion holds for plosive and fricative consonants. HPO speech contains shorter and fewer pauses, while HPR speech involves more of them, at least as long as those in NEU speech. We also noted, interestingly, a very high number of short-duration (around 30 ms) semi-vowels and trill consonants in HPO speech. In [Harnsberger et al. 2008], the authors observed a decrease and an increase in word and sentence duration for HPO and HPR speech, respectively.
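Given the forced alignment, duration histograms such as those of Figures 4.6 and 4.7 reduce to a simple grouping of segment durations by phonetic class. A minimal sketch follows; the SAMPA class sets are indicative only and not necessarily the exact grouping used in this study:

    from collections import defaultdict

    VOWELS = set('i e E a O o u y 2 9 @ e~ a~ o~ 9~'.split())
    PLOSIVES_FRICATIVES = set('p t k b d g f s S v z Z'.split())

    def durations_by_class(segments):
        """Group segment durations (in seconds) by broad phonetic class.
        `segments` is a list of (sampa_phone, start_s, end_s) tuples taken
        from the forced alignment; '_' denotes a pause."""
        classes = defaultdict(list)
        for phone, start, end in segments:
            if phone in VOWELS:
                classes['vowels'].append(end - start)
            elif phone in PLOSIVES_FRICATIVES:
                classes['plosives/fricatives'].append(end - start)
            elif phone == '_':
                classes['pauses'].append(end - start)
        return classes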
4.3.4 Speech Rate

The speaking rate has been found to be related to many factors [Yuan et al. 2006]. It is often defined as the average number of syllables uttered per second (pauses excluded) over a whole sentence [Beller et al. 2006] [Roekhaut et al. 2010]. An automatic method for speaking rate estimation, based on an unsupervised segmentation and vowel detection algorithm, was proposed in [Rouas et al. 2004]. However, this technique was not used in the present work. Instead, we exploited the phonetic and syllabic segmentation of the database (as already mentioned, the database was segmented using HMM forced alignment [Malfrère et al. 2003]). Table 4.3 compares the speaking styles corresponding to each DoA, following the previous definition.

Results                     HPR            NEU   HPO
Total speech time [s]       6076 (… %)      …    … (… %)
Total syllable time [s]     5219 (… %)      …    … (… %)
Total pausing time [s]      857 (… %)       …    … (… %)
Total number of syllables   … (+ 7.1 %)     …    … (- 5.7 %)
Total number of pauses      1213 (… %)      …    … (- 7.4 %)
Speech rate [syllable/s]    3.8 (… %)       …    … (… %)
Relative pausing time [%]   14.1 (… %)      …    … (- 8.5 %)

Table 4.3: Speech rates and related time information for NEU, HPO & HPR speech, together with the positive or negative variation from the NEU style (in %).

As expected, HPR speech is characterized by a lower speech rate, a higher number of pauses (and thus a longer pausing time) and more syllables (due to final Schwa insertions in particular), resulting in an increase of the total speech time. Conversely, HPO speech is characterized by a higher speech rate, a lower number of pauses (and thus a shorter pausing time) and fewer syllables (due to final Schwa and other phone group deletions), resulting in a decrease of the total speech time. As in [Keller et al. 1993], an interesting property can be noted: since both the total pausing time and the total speech time vary in about the same proportion (increasing in HPR speech and decreasing in HPO speech), the relative pausing time (and consequently the relative speaking time) is almost independent of the speaking style. It seems that a speaker modifying his DoA unconsciously controls the relative proportions of speech and pausing time.
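Following this definition, the quantities of Table 4.3 can be derived directly from the syllabic segmentation; a minimal sketch (with hypothetical, self-explanatory input structures) is given below:

    def speech_rate_statistics(syllables, pauses):
        """Compute the speech rate (syllables per second, pauses excluded)
        and the relative pausing time, from lists of (start_s, end_s)
        intervals for syllables and pauses."""
        syllable_time = sum(end - start for start, end in syllables)
        pausing_time = sum(end - start for start, end in pauses)
        total_time = syllable_time + pausing_time
        return {'speech_rate': len(syllables) / syllable_time,
                'relative_pausing_time': 100.0 * pausing_time / total_time}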
4.4 Conclusions

In this chapter, we studied the speech modifications occurring when the speaker varies his DoA. At the acoustic level, it was shown that both the vocal tract and glottal contributions are affected. More precisely, an increase of articulation is significantly reflected by an enlargement of the vocalic space in the F1-F2 plane, by higher F0 values, by a stronger harmonicity in speech, by a glottal flow containing more energy in the high frequencies and by an increased glottal formant frequency. At the phonetic level, the main variations concern glottal stops, pauses and the Schwa. An increase of articulation is reflected by a higher number of glottal stops, pauses and syllables, by significant phone variations (especially insertions) and by longer phone durations. Finally, although the speaking rate significantly increases when the DoA decreases, it turns out that the proportion between speech and pausing periods remains almost constant.
Summary of Chapter 4

From the databases recorded in Chapter 3, we observed that HPO and HPR speech induce modifications from both acoustic and phonetic points of view.

HPR speech is characterized (compared to the NEU style) by:

Acoustic
- a larger vocalic space (more effort to produce speech, with maximum clarity);
- a higher fundamental frequency on average;
- a glottal flow containing more energy in the high frequencies and an increased glottal formant frequency.

Phonetic
- a higher number of glottal stops, pauses and syllables;
- significant phone variations (especially insertions);
- longer phone durations;
- a lower speech rate.

HPO speech is characterized (compared to the NEU style) by:

Acoustic
- a smaller vocalic space (less effort to produce speech, resulting in lower intelligibility);
- a lower fundamental frequency on average;
- a glottal flow containing less energy in the high frequencies and a decreased glottal formant frequency.

Phonetic
- around the same number of glottal stops, fewer pauses and syllables;
- significant phone variations (especially deletions);
- shorter phone durations;
- a higher speech rate.
Chapter 5
HMM-based Synthesis of Hypo and Hyperarticulated Speech

Contents
5.1 Introduction
  5.1.1 Reactive Speech Synthesis
  5.1.2 Knowledge Integration in Speech Synthesis
  5.1.3 Contributions and Structure of the Chapter
5.2 Method
5.3 Acoustic Analysis
5.4 Objective Evaluation
5.5 Subjective Evaluation
5.6 Conclusions

5.1 Introduction

Lombard was one of the first to observe that humans adapt their way of speaking according to the listener's environment [Lombard 1911]. According to Lindblom's H and H theory [Lindblom 1983], the speaker continuously adapts his way of speaking in order to be clearly understood by the listener with minimal articulatory effort. According to Levelt's Perceptual Loop theory [Levelt 1990], this adaptation is unconsciously performed by the speaker so as to successfully transmit the information to the listener. The validity of this theory was assessed in [Hartsuiker & Kolk 2001]. However, state-of-the-art speech synthesis systems still exhibit a rather limited range of speaking styles, as well as an inability to adapt to the listening conditions in which they operate.

5.1.1 Reactive Speech Synthesis

The issue of automatically adapting speech synthesis systems to the external environment was addressed in [Moore & Nicolao 2011] through the implementation of a reactive speech synthesizer. This synthesizer is able to modify its internal characteristics, according to the listening conditions in which it is performing, through a feedback loop. The feedback loop takes an error signal as input, computed as the difference between i) the speech generated
by the reactive synthesizer, superimposed with a perturbation, and ii) the listener's perception of the synthesized speech in adverse conditions, estimated by an Automatic Speech Recognition (ASR) module. Recently, a computational model of human speech production providing a continuous adjustment according to environmental conditions was proposed in [Nicolao et al. 2012] [Nicolao & Moore 2012a], based on Levelt's theory and Moore's PRESENCE model [Moore 2007]. This model hypothesizes that there are low energy attractors (for both vowels and consonants) for the human speech production system, and that interpolation and extrapolation along the key dimension of HPO and HPR speech can be obtained by controlling the distance to such attractors. Low energy attractors refer to acoustic configurations towards which human speech production tends to converge during HPO speech. Conversely, HPR speech can be obtained by moving away from these low energy attractors.

5.1.2 Knowledge Integration in Speech Synthesis

While speech intelligibility improvement and the Lombard effect have been extensively studied in the literature (as will be discussed in Chapter 7), only a few studies have been carried out in the context of DoA synthesis. Wouters made the first attempts within the context of concatenative speech synthesis [Wouters 1996], by modifying the spectral shape of acoustic units according to a predictive model of the acoustic-prosodic variations related to the DoA. He also showed in [Wouters & Macon 2001] that the spectral dynamics of accelerated speech can be successfully controlled to correspond to a more relaxed speaking style (i.e. HPO speech). In [Aylett 2005], Aylett estimated the relative importance of two cost functions (duration of units and language redundancy) used to bias unit selection in favor of HPR speech, in order to give an impression of increased articulatory effort. An example of such cost functions could be to return a zero cost for a unit whose duration and/or redundancy is above and/or below a defined value, and a linearly weighted cost otherwise. The duration-of-units cost function helped in improving the perceived prominence, whereas the language redundancy function and the combination of both approaches did not produce significant results. Similarly, in [Černak 2006], the unit selection cost function was extended with an additional measure predicting the speech unit intelligibility, namely the Speech Intelligibility Index (SII) [S ], in order to bias the synthesis towards choosing more intelligible units from the speech database. Another example is the Loudmouth synthesizer [Patel et al. 2006], which emulates the human modifications (both acoustic and linguistic) of speech in noise by manipulating word duration, fundamental frequency and intensity. Lombard synthesis based on Hidden Markov Models (HMMs) has also been performed in [Raitio et al. 2011a], where three different methods were proposed: i) modification of the synthesis vocoder; ii) adaptation of the statistical models; iii) extrapolation of the adapted models. The authors used a modified version of the HMM-based Speech Synthesis System ("H-Triple-S", HTS) [Zen et al. 2009], the GlottHMM [Raitio et al. 2011b] [Suni et al. 2010]. The main principle of HTS has already been explained in Chapter 2. The GlottHMM vocoder, which models the speech production system with a detailed parameterization
of the voice source, incorporates various aspects of Lombard speech: e.g. modifications of the speech rate, pitch, spectral tilt, etc. Voice adaptation, interpolation and extrapolation techniques will be detailed in Chapter 6. In a related domain, Miller trained Artificial Neural Networks (ANNs) using syntactic and prosodic information to automatically model pronunciation variations for statistical parametric speech synthesis [Miller 1998]. The goal was to convert the linguistic representation of speech into an acoustic one through the ANN, and to use this new representation directly through the Mel Log Spectrum Approximation (MLSA) vocoder [Imai 1983]. Audio-visual HPO and HPR speech synthesis was studied in [Benoît et al. 1996a] [LeGoff & Benoît 1996], in order to assess the intelligibility advantage when the visual modality is used.

5.1.3 Contributions and Structure of the Chapter

In this chapter, we focus on the synthesis of the DoA (integrating the findings of Chapter 4) in the context of statistical parametric speech synthesis, using HTS [Zen et al. 2009]. First of all, a specific HMM-based speech synthesizer is built for each DoA (NEU, HPO and HPR) in Section 5.2, using the databases described in Chapter 3. Then an acoustic analysis is performed in Section 5.3 to assess the effectiveness of the HMM-based modeling and generation processes. After that, objective and subjective evaluations are carried out in Sections 5.4 and 5.5, respectively, with the aim of assessing how synthetic speech quality is affected for each DoA. Finally, Section 5.6 concludes the chapter. This chapter is based upon the following publications: [Picart et al. 2010] [Picart et al. 2013a]. Audio examples for each DoA are available online at picart.

5.2 Method

Relying on the implementation of the HMM-based Speech Synthesis System (HTS, version 2.1), a publicly available toolkit, a synthesizer was built for each DoA (NEU, HPO and HPR). Each database set recorded as explained in Chapter 3 was used for training the corresponding synthesizer. A common practice when dealing with a database consists in keeping 90% of the data for training the models and leaving the rest for testing. Therefore, for each DoA, 1220 sentences sampled at 16 kHz were used for training (the training set), leaving around 10% of the database for synthesis (the synthesis set). The speech signal is modeled and vocoded by a source-filter approach. The filter is represented by the Mel Generalized Cepstral coefficients (MGC, [Tokuda et al. 1994] [Fukada et al. 1992]), with α = 0.42, γ = 0 and an MGC analysis order of 24, and the excitation signal is based upon the Deterministic plus Stochastic Model (DSM) of the residual signal proposed in [Drugman et al. 2009b] [Drugman & Dutoit 2012]. This model was shown to significantly increase the naturalness of the produced speech. More precisely, both the deterministic and stochastic components of the DSM were estimated on the training dataset for
each DoA. Note that only 1000 voiced frames are sufficient for a reliable estimation of these components [Drugman & Dutoit 2010b]. The spectral boundary between these two components was fixed as the average value of the maximum voiced frequency described in Section 4.2.2. Note also that our version of HTS used 75-dimensional MGC parameters (including Δ and Δ² coefficients), and that each covariance matrix of the state output and state duration distributions was diagonal. The HPO and HPR phonetic transcriptions were obtained by manually modifying, at the input of the synthesizer, the NEU phonetic transcriptions according to the phonetic analysis results (Section 4.3), and more specifically the phone variation results (Section 4.3.2). Our future natural language processor should eventually provide the actual HPO, NEU and HPR phonetic transcriptions automatically.

[Figure 5.1: Standard training of the NEU, HPO and HPR full data models, from the database containing 1220 training sentences for each DoA. For each style, spectral and excitation parameters are extracted from the speech signals and used, together with the labels, to train the corresponding full data model.]

Since the three synthesizers implemented at this point are trained on the entire training sets, they will be referred to as full data models in the following. Figure 5.1 shows the general architecture of our system. The following evaluations were performed on the synthesis set of the database.
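As an indication of what the spectral analysis step amounts to in practice, the sketch below computes MGC coefficients with the settings given above (α = 0.42, γ = 0, order 24). It relies on the pysptk toolkit and assumes a 25 ms / 5 ms framing, since the exact extraction front-end is not reproduced here:

    import numpy as np
    import pysptk

    FRAME_LENGTH, HOP = 400, 80  # 25 ms window, 5 ms shift at 16 kHz (assumed framing)

    def mgc_analysis(waveform):
        """Frame the signal and extract order-24 MGC coefficients
        (alpha = 0.42, gamma = 0), one vector per 5 ms frame."""
        frames = np.lib.stride_tricks.sliding_window_view(waveform, FRAME_LENGTH)[::HOP]
        window = np.blackman(FRAME_LENGTH)
        return np.array([pysptk.mgcep(frame * window, order=24, alpha=0.42, gamma=0.0)
                         for frame in frames])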
5.3 Acoustic Analysis

The same acoustic analysis as in Section 4.2.1 was performed on the sentences generated by the HMM-based synthesizer. Figure 5.2 and Table 5.1 display the evolution of the vocalic triangle with the DoA, averaged over all the vowels /a/, /i/ and /u/ for all the synthesized sentences. Note the good agreement between the vocalic spaces computed on the original sentences (see Figure 4.1 and Table 4.1) and on the synthesized ones.

[Figure 5.2: Vocalic triangle estimated on the generated recordings for each DoA (F1 versus F2, with the vowels /a/, /i/ and /u/), together with dispersion ellipses.]

Dataset      HPR   NEU   HPO
Synthesis     …     …     …

Table 5.1: Vocalic space (in kHz²) for the three DoA for the synthesized sentences.

The same conclusions as in Section 4.2.1 hold for the synthetic examples. In other words, the essential vocalic characteristics are preserved despite the HMM-based modeling and generation process. However, the dispersion of the formant frequencies is lower after generation, especially for F1. This is mainly due to an over-smoothing of the generated spectra (albeit the Global Variance method [Toda & Tokuda 2007] was used).

5.4 Objective Evaluation

An objective evaluation is first conducted in order to assess the quality of the full data models. Yamagishi proposed in [Yamagishi & Kobayashi 2007] the use of the following three objective measures:

- the average Mel-Cepstral Distortion (MCD, expressed in decibels);
- the Root-Mean-Square Error (RMSE) of log F0 (RMSE_lf0, expressed in cents);
- the RMSE of vowel durations (RMSE_dur, expressed in number of frames).

These measures reflect differences regarding three complementary aspects of speech. The RMSE_lf0 is obviously computed only over regions where both the original recordings and the full data models are voiced, since log F0 is not observed in unvoiced regions. A cent is a logarithmic unit used for musical intervals [Ellis 1885] (100 cents correspond to a semitone; twelve semitones correspond to an octave, i.e. a doubling of the frequency).
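Concretely, the distance in cents between two F0 values f1 and f2 is 1200 · log2(f1/f2), so RMSE_lf0 can be computed as in the following sketch (assuming, as is conventional, that unvoiced frames are coded as zeros in both F0 tracks):

    import numpy as np

    def rmse_lf0_cents(f0_reference, f0_synthesized):
        """RMSE of log F0 in cents, over frames voiced in both tracks
        (1 semitone = 100 cents, 1 octave = 1200 cents)."""
        ref = np.asarray(f0_reference, dtype=float)
        syn = np.asarray(f0_synthesized, dtype=float)
        voiced = (ref > 0) & (syn > 0)  # unvoiced frames assumed coded as 0
        deviation_cents = 1200.0 * np.log2(syn[voiced] / ref[voiced])
        return np.sqrt(np.mean(deviation_cents ** 2))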
The MCD between the target and the estimated mel-cepstral coefficients (noted respectively mc_d^{(t)} and mc_d^{(e)}, and computed from the original and synthesized versions of the same utterance) is expressed as:

    MCD = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{24} \left( mc_d^{(t)} - mc_d^{(e)} \right)^2}    (5.1)

Target and estimated frames should have a one-to-one correspondence in order to compute an objective distance, whether for the cepstrum or for pitch, with dedicated formulae for each of them (Equation 5.1 in the case of the cepstrum). This one-to-one correspondence was achieved by constraining the synthesizers to use the original phone durations before generating the cepstrum and pitch parameters. This constraint was obviously removed when computing the objective distance on phone durations. These objective measures are computed over all the vowels of the synthesis set of the database. The mean MCD, RMSE_lf0 and RMSE_dur, together with their 95% confidence intervals (CI), are shown in Table 5.2 for each DoA.

Results            HPR        NEU        HPO
Mean MCD ± CI      5.9 ± …    … ± …      … ± 0.1
RMSE_lf0 ± CI      … ± …      … ± …      … ± 14.3
RMSE_dur ± CI      9 ± …      … ± …      … ± 0.4

Table 5.2: Objective evaluation of the overall speech quality of the full data models: average MCD [dB], RMSE_lf0 [cents] and RMSE_dur [number of frames] (frame shift = 5 ms), with their 95% confidence intervals (CI) for each DoA.

We observe in this objective evaluation that the MCD increases from HPR to HPO speech, while RMSE_lf0 and RMSE_dur decrease with the DoA. Our conclusions are the following:

- considering that HPR speech is characterized by longer phone durations and HPO speech by shorter ones (Section 4.3.3), modeling the HPR speech cepstrum seems easier, as more speech data is available to reliably estimate the corresponding models compared to the NEU style (which can be seen from the MCD). On the other hand, modeling the HPO speech cepstrum seems more difficult, as less speech data is available. Note that 1 dB is usually accepted as the difference limen for spectral transparency [Paliwal & Atal 1993];
- the increase in RMSE_lf0 with the DoA can be explained by the fact that the HPO pitch is flatter (little intonation variation), leading to smaller modeling errors, compared to the HPR one, which is more diversified (more intonation variations), inducing larger modeling errors. Indeed, the HPO pitch distribution is relatively well concentrated in Figure 4.2, while the HPR one spreads over a large frequency range. A similar tendency was observed in [Fux et al. 2011];
- as phone durations increase with the DoA (as illustrated in Figures 4.6 and 4.7), the modeling errors grow in the same proportion, justifying the increase in RMSE_dur from HPO to HPR speech.

For comparison, similar quantitative results were observed for a speaker-dependent model in [Yamagishi & Kobayashi 2007], despite some differences in the training process and in the language used for training the models (Japanese). The differences between the results reported in the latter work and ours are rather minor, as we get slightly worse values for the MCD, slightly better performance for RMSE_lf0 and comparable values for RMSE_dur.
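Equation 5.1 translates directly into code. The sketch below computes the mean MCD over duration-aligned frames; whether the energy coefficient c0 is excluded is a convention that the text does not spell out, so its exclusion here is an assumption:

    import numpy as np

    def mean_mcd_db(mc_target, mc_estimated):
        """Mean Mel-Cepstral Distortion (dB), as in Equation 5.1.
        Inputs: (n_frames, 24) arrays of aligned mel-cepstra (c0 excluded, assumed)."""
        diff = np.asarray(mc_target) - np.asarray(mc_estimated)
        per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
        return float(np.mean(per_frame))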
5.5 Subjective Evaluation

A subjective evaluation was then performed in order to confirm the results of the objective test. For this evaluation, participants were asked to listen to three sentences: A, the original sentence; B, the copy-synthesis version of the original sentence using the DSM vocoder; and C, the sentence synthesized using DSM, with parameters generated by the statistical models trained with HTS. Participants were given a 9-point discrete scale and asked to score the distance, in terms of overall speech quality, of B with regard to both A and C. In other words, this score could vary from 0 (i.e. B has the same quality as A) to 8 (i.e. B has the same quality as C). The reason for using such a self-designed scale is to ease the listener's task of evaluating the relative speech synthesis quality. Indeed, evaluating the perceptual position of B between the lower boundary A and the upper boundary C is more coherent than estimating its position knowing only one of these two boundaries. The latter case raises the problem of the extent to which the score should be set by the listener, and the problem of inter-listener variability in the evaluations. The shift from A to C accounts for two possible sources of degradation: vocoding (from A to B) and HMM-based statistical processing (from B to C). Since we can assume that the vocoding effect is almost the same for each DoA (an assumption informally verified by some speech experts), the distance of B with regard to A and C is informative about the effectiveness of the statistical process. Indeed, the lower the score, the closer B is to A than to C, and consequently the more dominant the statistical process is among the degradation sources. In conclusion, the higher the score, the better the HMM modeling and generation steps have performed. The test consists of 15 triplets: 5 sentences per DoA randomly chosen from the synthesis set, for 3 DoA, giving a total of 45 sentences. Before starting the test, each listener was provided with some reference sentences (without describing them, i.e. with no good or bad labels) covering most of the variations, to help him familiarize himself with the scale. During the test, he was allowed to listen to each triplet of sentences as many times as desired (participants were nonetheless advised to listen to A and C before listening to B, in order to know the boundaries of the scale). However, they were not allowed to come back to previous sentences after validating their decisions. Twenty-six naive listeners participated in this evaluation. The mean scores for each DoA, on the 9-point scale, are shown in Figure 5.3. It is observed that the more speech is
articulated, the higher the score, and therefore the lower the degradation due to the HMM process.

[Figure 5.3: Subjective evaluation of the overall speech quality of the full data models (mean score with its 95% CI) for HPR, NEU and HPO speech.]

It is worth noting that these results corroborate the conclusions of the objective evaluation. The formant trajectories are enhanced in HPR speech and less marked in HPO speech, compared to the standard NEU style. Due to the intrinsic statistical modeling by HMMs, these trajectories are (over-)smoothed, losing the finest information actually characterizing the DoA. This is particularly true for HPO speech. Moreover, statistical analyses [Howell 2012] were performed using the Statistica software (Statsoft), in order to assess the significance of the results. We first checked that the data were (almost) normally distributed using the Lilliefors test. Mauchly's test of sphericity indicated that the assumption of sphericity had not been violated (χ²(2) = 2.13, p = 0.34). A repeated measures ANalysis Of VAriance (ANOVA) was used to test for preference differences amongst the three speech synthesizers integrating the various DoA (F(2, 54) = 32.44, p = 0, partial η² = 0.55). Tukey HSD post-hoc comparisons of all pairs of the three synthesizers indicate statistically significant differences at α < 0.05 (p = 0.01 for NEU vs. HPR, p = 0 for NEU vs. HPO and for HPR vs. HPO). Moreover, the effect-size Hedge's g measures are medium for NEU vs. HPR (g = 0.69), and large for NEU vs. HPO (g = 1.14) and for HPR vs. HPO (g = 1.69).
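The original analyses were run in Statistica; an equivalent repeated-measures ANOVA can be sketched in Python with statsmodels, as below. The column names are illustrative, and pairwise_tukeyhsd is a plain (not repeated-measures) post-hoc test, so this only approximates the reported procedure:

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    def analyse_preference_scores(scores: pd.DataFrame):
        """`scores`: long-format table with one mean score per (listener, doa)
        pair, doa in {'HPR', 'NEU', 'HPO'}. Runs a repeated-measures ANOVA
        and post-hoc pairwise comparisons."""
        anova = AnovaRM(scores, depvar='score', subject='listener', within=['doa']).fit()
        print(anova)
        print(pairwise_tukeyhsd(scores['score'], scores['doa'], alpha=0.05))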
5.6 Conclusions

This chapter introduced our first attempts towards HMM-based speech synthesis integrating various DoA. NEU, HPO and HPR HMM-based synthesizers were built on full style-specific datasets, relying on HTS. The phonetic transcription was modified at the input of the synthesizer according to the phonetic analysis results, as were the characteristics of the excitation modeling. Objective and subjective evaluations were proposed in order to assess the quality of the generated speech. The first showed that the MCD increases from HPR to HPO speech, while RMSE_lf0 and RMSE_dur decrease with the DoA. These results were corroborated by the second evaluation, which showed that the best speech quality was obtained for HPR speech, followed by NEU and HPO speech. Audio examples for each DoA are available online at picart.
Summary of Chapter 5

Relying on the databases recorded in Chapter 3, a specific HMM-based speech synthesizer was built for each DoA. Such systems are made of two main parts, the training and the synthesis steps, which are detailed in Sections 2.4 and 2.5 respectively. The training stage takes as inputs natural speech signals combined with their associated labels; it outputs generative models of human speech. The synthesis stage takes as inputs those generative models and the text to be synthesized; it outputs the corresponding synthesized speech waveform.

Training
- data: 90% (1220 sentences) of the NEU, HPO and HPR speech database, sampled at 16 kHz;
- filter: traditional Mel Generalized Cepstral (MGC) coefficients (with frequency warping factor α = 0.42, γ = 0 and an MGC analysis order of 24).

Synthesis
- based on the publicly available HMM-based Speech Synthesis System (HTS) toolkit;
- using the last 10% (139 sentences) of the NEU, HPO and HPR speech database, sampled at 16 kHz. These sentences are not part of the training set; their purpose is evaluation only, to score the synthesized speech quality against the original speech;
- excitation: Deterministic plus Stochastic Model (DSM) of the residual signal.

Evaluations
- objective test: the MCD increases from HPR to HPO speech, while RMSE_lf0 and RMSE_dur decrease with the DoA;
- subjective test: HPR speech has a better rendering than NEU speech after synthesis, and NEU speech has a better quality than HPO speech.

Audio examples are available online at picart
Chapter 6
Continuous Control of the Degree of Articulation

Contents
6.1 Introduction
  6.1.1 From Source toward Target Speakers' Voice
  6.1.2 Interpolation and Extrapolation between Statistical Models
  6.1.3 Contributions and Structure of the Chapter
6.2 Speaking Style Adaptation
  6.2.1 Method
  6.2.2 Objective Evaluation
  6.2.3 Subjective Evaluation
6.3 Interpolation and Extrapolation of the Degree of Articulation
  6.3.1 Method
  6.3.2 Perception of the Degree of Articulation
  6.3.3 Segmental Quality of the Interpolation and Extrapolation
6.4 Conclusions

6.1 Introduction

One way to perform Hidden Markov Model (HMM) based speech synthesis is to train a speaker-dependent full data model using a database containing data in a specific style. In Chapter 5, an HMM-based speech synthesizer was built for each DoA (NEU, HPO and HPR), relying on the database presented in Chapter 3. Beyond unit-selection speech synthesis, several other speech synthesis techniques exist that rely on the inherent flexibility of HMM-based speech synthesis due to its statistical modeling process [Zen et al. 2009] [Barra-Chicote et al. 2010]: inter- or intra-speaker voice adaptation, Voice Conversion (VC), eigenvoice conversion and voice morphing techniques (Section 6.1.1), and interpolation and extrapolation between statistical models (Section 6.1.2).
6.1.1 From Source toward Target Speakers' Voice

Voice adaptation techniques [Yamagishi et al. 2009b] can be applied to change the voice characteristics [Tamura et al. 1998] [Masuko et al. 1997] and prosodic features of synthetic speech. In order to simultaneously model and adapt the spectral, excitation and duration parameters, Multi-Space probability Distribution Hidden Semi-Markov Models (MSD-HSMM) [Tokuda et al. 2002a] have been introduced, together with extended Maximum Likelihood Linear Regression (MLLR) adaptation algorithms [Gales 1998] [Tamura et al. 2001] [Yamagishi & Kobayashi 2007]. More recently, the Constrained Structural Maximum A Posteriori Linear Regression (CSMAPLR), a more robust and advanced adaptation technique, has demonstrated its effectiveness in HMM-based speech synthesis [Yamagishi et al. 2009a] [Yamagishi et al. 2009b]. Yamagishi also proposed in [Yamagishi 2006] [Yamagishi & Kobayashi 2007] [Yamagishi et al. 2009a] the adaptation of a specific model, called an average-voice model, to a specific target speaker. The average-voice model is computed once and for all over a database containing many different speakers. However, those speakers exhibit many speaker-dependent characteristics that can influence the quality of the synthetic speech generated from the adapted models. Therefore, a model-space Speaker-Adaptive Training (SAT) algorithm [Anastasakos et al. 1996] was developed in order to eliminate these negative perturbations [Yamagishi et al. 2003b]. This technique is thus robust to non-ideal speech data: recordings made under various conditions (not necessarily inside a sound-proof room), with different microphones, etc. [Yamagishi et al. 2008] [Yamagishi et al. 2009b]. This was confirmed by training more than 1500 Text-To-Speech (TTS) voices on various Automatic Speech Recognition (ASR) corpora [Yamagishi et al. 2010]. The SAT algorithm for the MSD-HSMM was derived in [Yamagishi & Kobayashi 2007]. Training an average-voice model provides a strong prior model for speech generation, with the target adaptation data being used to estimate the speaker-specific characteristics, thus allowing the generation of high-quality synthetic speech using a limited amount of adaptation data [Yamagishi 2006]. Recently, [Bahmaninezhad et al. 2013] proposed an improvement to the average-voice model training algorithm ensuring that each speaker contributes equally to each leaf of the decision tree in context clustering. This was not necessarily the case in [Yamagishi 2006], which could thus bias some leaves of the decision tree towards a specific speaker or gender. Applications to speaking style adaptation [Tachibana et al. 2006], emotional expression adaptation [Qin et al. 2006] and multilingual and polyglot text-to-speech systems [Latorre et al. 2006] have also been reported. Vocal Tract Length Normalization (VTLN) [Uebel & Woodland 1999] is a rapid speaker adaptation technique widely used in ASR and implemented in statistical parametric speech synthesis [Saheer et al. 2010] [Saheer et al. 2012a]. Recently, CSMAPLR and VTLN adaptation proved to be an efficient combination for improving the adaptation [Saheer et al. 2012b]. Similarly, various other techniques exist to transform a source speaker's voice to sound as if it were pronounced by a target speaker. For example, VC enables the conversion of a
specific speaker's voice into another speaker's voice [Kuwabara & Sagisaka 1995]. One of the most popular VC methods is the probabilistic conversion based on a Gaussian Mixture Model (GMM), which models the joint probability density of the source and target acoustic features. It is trained on a parallel corpus of utterances produced by the source and target voices, and it allows the conversion of the source speaker's speech into the target speaker's based on a least mean square error [Stylianou et al. 1998] [Kain & Macon 1998] or maximum likelihood criterion [Toda et al. 2007a]. A non-parallel training method, based on the maximum likelihood constrained adaptation of a GMM trained on an existing parallel data set from a different speaker pair, was proposed in [Mouchtaris et al. 2004]. Other approaches include mapping codebooks [Abe et al. 1988], Artificial Neural Networks (ANN) [Desai et al. 2010], partial least squares regression [Helander et al. 2010], dynamic frequency warping [Godoy et al. 2012], etc. Mixtures of Factor Analyzers (MFA) were proposed in [Uto et al. 2006]; they can be viewed as an intermediate model between GMMs with diagonal covariance matrices (which require a large number of mixture components for accurate spectral estimation) and GMMs with full covariance matrices (which require a large amount of training data for accurate model parameter estimation). In [Song et al. 2013], a VC technique using non-parallel data is proposed. Various recent works have focused on improving the quality of VC systems while preserving the target speaker identity [Benisty & Malah 2011] [Eslami et al. 2011] [Helander et al. 2012] [Percybrooks & Moore 2012]. A speaker-independent HMM-based VC technique incorporating context-dependent prosodic symbols (adaptive quantization of the fundamental frequency F0) has been described in [Nose & Kobayashi 2011]. However, the proposed system requires a phonetic labeling of the input speech. This constraint is alleviated in [Percybrooks et al. 2013]. A real-time voice conversion algorithm based on GMMs was proposed in [Toda et al. 2012]. The method is based on rapid source feature extraction, the diagonalization of full covariance matrices, and a low-delay conversion algorithm considering dynamic features and global variance. Similarly, eigenvoice conversion techniques [Shichiri et al. 2002] [Toda et al. 2006] [Ohtani et al. 2010] [Smit 2010], originally proposed for ASR [Kuhn et al. 2000], carry out HMM-based speaker adaptation using a small amount of adaptation data by reducing the number of free parameters controlling the speaker dependencies of the HMMs. The eigenvoice conversion framework was extended to one-to-many (conversion of a source speaker's voice to an arbitrary target speaker's) and many-to-one (conversion of an arbitrary source speaker's voice to a target speaker's) VC systems in [Toda et al. 2007b], and finally to many-to-many (conversion of an arbitrary source speaker's voice to an arbitrary target speaker's) VC systems in [Ohtani et al. 2009]. Following the same idea as the above-mentioned SAT algorithm, the performance of one-to-many eigenvoice conversion can be improved by introducing an adaptive training method [Ohtani et al. 2010]. Finally, voice morphing is a technique for continuously modifying a source speaker's speech to sound as if pronounced by another speaker [Abe 1996] [Ye & Young 2004] [Kawahara et al. 2009]. Contrary to automatic morphing (e.g. [Slaney et al.
1996]), [Kawahara & Matsui 2003] proposed a morphing technique based on the interpolation and extrapolation of manually placed anchor points in the reference and target speech representations. [Lavner & Porat 2005] proposes a voice morphing system based on the
interpolation between two speakers, in which the residual error signal is modeled as a 3D Prototype Waveform Interpolation (PWI) surface and the vocal tract as a lossless tube area function. The PWI surface incorporates the characteristics of the excitation signal, and the area function reflects an intermediate configuration of the vocal tract between the two speakers.

6.1.2 Interpolation and Extrapolation between Statistical Models

Thanks to both the statistical and parametric representations used in HMM-based speech synthesis, interpolation between speaking styles is possible. In most past and current studies, the term speaking style refers to various emotions, while it refers to the DoA in our case. Three main methods for modeling and interpolating between speaking styles have been proposed: style-dependent modeling and style-mixed modeling [Yamagishi et al. 2003a]; the model interpolation technique [Yoshimura et al. 2000] [Yamagishi et al. 2004]; and the MLLR-based model adaptation technique [Tamura et al. 2001]. [Iwahashi & Sagisaka 1995] describes a speech spectrum transformation method based on interpolating multiple speakers' spectral patterns and a multi-functional representation with radial basis function networks. The idea was also applied to HMM-based speech synthesis, where speaker interpolation is performed [Yoshimura et al. 2000] by interpolating HMM parameters amongst the HMM sets of some representative speakers, resulting in synthesized speech with intermediate voice characteristics. The authors assumed that each HMM state has a single Gaussian output distribution, reducing the problem to an interpolation amongst N Gaussian distributions. Speaking style control is proposed in [Miyanaga et al. 2004] by training style-specific HMM models and interpolating between them using a style vector. In [Tachibana et al. 2005], a model interpolation technique is implemented to obtain intermediate speaking styles in-between some representative speaking style models, and to control the degree of expressivity of a speaking style by interpolating between the latter and the NEU style. In [Hsu & Chen 2012], speaker-dependent model interpolation is achieved by combining the NEU model set of the target speaker and an emotional model set selected from a pool of speakers. Multiple-Regression Hidden Semi-Markov Models (MRHSMM) are used in [Nose et al. 2007] to model multiple emotional expressions and speaking styles. The degree of expressivity of a desired speaking style is controlled by a style vector, whose coordinates represent the various expressions in the style space. The latter method was combined with voice adaptation techniques to control the intensity of the emotional expressions and speaking styles of an arbitrary speaker's synthetic speech, using a small amount of his or her speech data [Nose et al. 2009]. This method has recently been improved with the introduction of subjective style intensities and Multiple-Regression Global Variance (MRGV) models into HMM-based speech synthesis [Nose & Kobayashi 2013]. Dialect interpolation has been performed in [Pucher et al. 2010b] using dialect-dependent and dialect-independent modeling. [Kazumi et al. 2010] suggested factor-analyzed voice models for creating various voice characteristics in HMM-based speech synthesis. [Turk et al. 2005] interpolates the intended vocal effort between three existing (diphone) databases (soft, modal and loud voices) in order to create new databases with intermediate
As already mentioned in Chapter 5, a computational model of human speech production managing phonetic contrast along the H and H continuum [Lindblom 1983] has recently been proposed and implemented in [Nicolao et al. 2012], allowing speaking style modification in HMM-based speech synthesis according to the external acoustic conditions [Moore & Nicolao 2011]. This model hypothesizes that there are low-energy attractors (for both vowels and consonants) for the human speech production system, and that interpolation and extrapolation along the key dimension of HPO and HPR speech can be obtained by controlling the distance to such attractors.

Contributions and Structure of the Chapter

In this chapter, we focus on the implementation of a continuous control of the DoA in the framework of HMM-based speech synthesis. Two main parts explain the methodology followed in order to reach this goal. The first part introduces, in Section 6.2, the adaptation of the NEU full data model trained in Chapter 5 to generate HPO and HPR speech, using a limited amount of speech data. This is done using voice adaptation techniques in the spirit of [Yamagishi & Kobayashi 2007] [Yamagishi et al. 2009b], but applied here to intra-speaker voice adaptation [Yamagishi et al. 2004] [Tachibana et al. 2003] [Nose et al. 2007] [Nose et al. 2009]. In particular, we study the efficiency of speaking style adaptation as a function of the size of the adaptation database. Since the decision trees of the NEU full data model are not modified during the adaptation process, interpolation and extrapolation can be achieved between the NEU, HPO and HPR models. Therefore, the second part addresses, in Section 6.3, the implementation of a continuous control of the DoA, manually adjustable by the user to obtain not only NEU, HPO and HPR speech, but also any intermediate, interpolated or extrapolated DoA in a continuous way. Finally, Section 6.4 concludes the chapter.

This chapter is based upon the following publications: [Picart et al. 2011a] [Picart et al. 2013a]. Audio examples for speaking style adaptation, and for interpolation and extrapolation of the DoA, are available online at picart/.

6.2 Speaking Style Adaptation

In this section, we focus on the adaptation of a specific source speaker (i.e. the NEU full data model trained in Chapter 5) such that the system is able to generate HPO and HPR speech.

6.2.1 Method

The NEU full data HMM-based speech synthesizer, trained in Section 5.2, was adapted using the Constrained Maximum Likelihood Linear Regression (CMLLR) transform [Leggetter & Woodland 1995] [Digalakis et al. 1995] [Gales 1998], in the framework of Hidden Semi-Markov Models (HSMMs) [Ferguson 1980a], with HPO and HPR speech data, in order to produce a HPO and a HPR HMM-based synthesizer respectively. The linearly-transformed models were further optimized using MAP adaptation [Gauvain & Lee 1994] [Yamagishi et al. 2009b].
As spectrum, pitch and state duration are modeled simultaneously in a unified framework [Yoshimura et al. 1999] [Yamagishi 2006], speaker adaptation techniques are applied simultaneously to spectrum, pitch and state duration (Figure 6.1).

Figure 6.1: Standard training of the NEU, HPO and HPR full data models (Chapter 5), from the database containing 1220 training sentences for each DoA. Adaptation of the NEU full data model using the CMLLR transform with HPO and HPR speech data to produce the HPO and HPR adapted models (Section 6.2). Implementation of a tuner, manually adjustable by the user, for a continuous control of the DoA (Section 6.3).

In HSMM-based speech synthesis [Zen et al. 2007], state duration distributions are modeled explicitly, allowing a better representation of the temporal structure of human speech (see Section for more details). The HSMM also has the advantage of incorporating state duration models explicitly in the first step of the Expectation-Maximization (EM) algorithm. Finally, the HSMM is more convenient during the adaptation process for simultaneously transforming both state output and state duration distributions.

MLLR adaptation is the most popular linear regression adaptation technique. The mean vectors and covariance matrices of the state output distributions of the target speaker's model are obtained by linearly transforming the mean vectors and covariance matrices of the state output distributions of the source speaker's model [Yamagishi & Kobayashi 2007].
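To make the model-space vs. feature-space distinction concrete, the following sketch (Python with NumPy; a simplification, not the HTS implementation) applies a single linear transform in both ways for one Gaussian. In practice, transforms are tied across regression classes of the decision trees and estimated by maximum likelihood on the adaptation data.

    import numpy as np

    # One linear transform (A, b), shared here for illustration.
    A = np.array([[0.9, 0.1], [0.0, 1.1]])
    b = np.array([0.05, -0.02])

    def mllr_adapt_model(mu, sigma):
        """Model-space adaptation (MLLR-style): transform the Gaussian itself.
        mu: mean vector; sigma: covariance matrix."""
        mu_new = A @ mu + b
        sigma_new = A @ sigma @ A.T
        return mu_new, sigma_new

    def cmllr_adapt_features(x):
        """Feature-space adaptation (CMLLR-style): transform the observations
        so that the source model becomes more likely to generate them."""
        return x @ A.T + b   # x: (T, D) sequence of feature vectors

    mu = np.array([1.0, 2.0])
    sigma = np.diag([0.2, 0.3])
    print(mllr_adapt_model(mu, sigma))
    print(cmllr_adapt_features(np.random.randn(5, 2)))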
The same idea holds for CMLLR. While MLLR is a model adaptation technique, CMLLR is a feature adaptation technique (although it is most commonly implemented in model space, as detailed in Section 2.6). In a model adaptation technique, a set of linear transformations is estimated to shift the means and alter the covariances of the source speaker's model, so that each state in the HMM system is more likely to generate the adaptation data. In a feature adaptation technique, a set of linear transformations is estimated to modify the feature vectors themselves, so that each state of the source speaker's model is more likely to generate the adaptation data.

The implementation of our synthesizers is summarized in Figure 6.1. Since the two synthesizers implemented in this section are created by adapting the NEU full data model using HPO and HPR data, they will be referred to as adapted models in the following. The efficiency of the adaptation process will now be assessed through both an objective and a subjective evaluation on the synthesis set of the database, composed of sentences which were part neither of the training set nor of the adaptation set.

6.2.2 Objective Evaluation

The goal of this objective evaluation is to assess the quality of the adapted synthesized speech as the number of adaptation sentences increases. For this, we use the measures introduced in Section 5.4, namely the average MCD, the RMSE_lf0 and the RMSE_dur. As an illustration, Figure 6.2 presents the average MCD, computed for all the vowels of the synthesis set, between the adapted and the full data models. The actual amounts of adaptation data on which the MCD is computed are indicated with black dots.

Figure 6.2: Objective evaluation - Average MCD [dB] computed between the adapted and the full data models. Black dots indicate actual measures.
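For reference, the average MCD can be computed from pairs of aligned mel-cepstral vectors with the usual formula; the sketch below (Python, hypothetical variable names) follows the standard definition, excluding the 0-th (energy) coefficient, and is given only as an illustration of the measure used here.

    import numpy as np

    def mel_cepstral_distortion(c_ref, c_test):
        """Average MCD in dB between two aligned mel-cepstral sequences.

        c_ref, c_test: arrays of shape (T, D); coefficient 0 (energy) is
        excluded, as is common practice for spectral distortion measures.
        """
        diff = c_ref[:, 1:] - c_test[:, 1:]
        frame_mcd = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
        return float(np.mean(frame_mcd))

    # Toy usage with random frames standing in for vowel frames.
    ref = np.random.randn(100, 25)
    test = ref + 0.05 * np.random.randn(100, 25)
    print(mel_cepstral_distortion(ref, test))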
Figure 6.2 clearly shows that the MCD decreases when more speech data is used for adaptation. The distance between the HPR full data and adapted models is bigger than the gap between the HPO full data and adapted models, which could be explained by the adaptation process itself. On the one hand, the HPR speech spectrum is richer, more variable, more complex and enhanced compared to the NEU style. On the other hand, the HPO speech spectrum is smoother and flatter than the NEU one. This difference could explain why the HPR spectrum is harder to adapt from the NEU style (leading to a higher MCD) than the HPO spectrum. Note that the results in Figure 6.2 were obtained using up to 1220 adaptation sentences for both HPO and HPR speech. Nonetheless, since the speaking rate in HPR speech is known to be much slower than in HPO speech (almost half of it - see Section 4.3.4), the two curves do not cover the same total adaptation duration.

We also observed in Figures 6.3 and 6.4 a decrease of RMSE_lf0 and RMSE_dur when the amount of speech data available for adaptation increases (again, the actual amounts of adaptation data on which the RMSE_lf0 and RMSE_dur are computed are indicated with black dots). Both were found to be higher for HPR speech than for HPO speech. However, while the MCD decreases continuously when more speech data is used for adaptation, it is observed that RMSE_lf0 and RMSE_dur decrease until around 7 minutes of HPO speech or 13 minutes of HPR speech, and then saturate to specific values when more speech data is used. It can be noted from Figures 6.2, 6.3 and 6.4 that around 7 minutes of HPO speech or 13 minutes of HPR speech are needed to adapt cepstra correctly, while around 3 minutes of HPO speech or 7 minutes of HPR speech are sufficient to adapt F0 and phone duration with a good quality.

Figure 6.3: Objective evaluation - RMSE of log F0 [cent] computed between the adapted and the full data models. Black dots indicate actual measures.
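As a reminder of the units involved, a difference between two F0 values can be expressed in cents (1200 cents per octave). A minimal sketch of an RMSE_lf0-style measure is given below (Python, hypothetical names; the exact implementation used in this work may differ, e.g. in its voicing handling).

    import numpy as np

    def rmse_lf0_cents(f0_ref, f0_test):
        """RMSE between two aligned F0 tracks, in cents.

        f0_ref, f0_test: arrays of shape (T,) in Hz; unvoiced frames are
        marked with 0 and excluded from the measure (an assumption here).
        """
        voiced = (f0_ref > 0) & (f0_test > 0)
        diff_cents = 1200.0 * np.log2(f0_test[voiced] / f0_ref[voiced])
        return float(np.sqrt(np.mean(diff_cents ** 2)))

    f0_a = np.array([120.0, 125.0, 0.0, 130.0])
    f0_b = np.array([118.0, 127.0, 0.0, 128.0])
    print(rmse_lf0_cents(f0_a, f0_b))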
Figure 6.4: Objective evaluation - RMSE of vowel durations [number of frames] (frame shift = 5 ms) computed between the adapted and the full data models. Black dots indicate actual measures.

Figures 6.2, 6.3 and 6.4 also show some imperfections of the HMM-based adaptation process. Indeed, the curves saturate towards non-zero values. Slight differences could be heard between the HPO or HPR full data models and the models adapted from the NEU full data model using the entire HPO or HPR training set. However, informal listening tests showed that these slight differences cannot be said to give worse or better speech synthesis results. As already stated, 1 dB is usually accepted as the difference limen for spectral transparency [Paliwal & Atal 1993]. For comparison purposes, the same kinds of trends were observed for inter-speaker voice adaptation [Yamagishi & Kobayashi 2007], despite some differences in the training process and in the amount of training and adaptation data.

6.2.3 Subjective Evaluation

A Comparison Category Rating (CCR) evaluation is now performed in order to confirm the conclusions of the objective test. For this evaluation, participants were asked to listen to two sentences: A, the sentence synthesized by the full data model; and B, the sentence synthesized by the adapted models using 10, 20, 50, 100 or 1220 sentences (with respectively 347, 545, 1318, 2619 or 8220 CMLLR transforms, associated with decision-tree classes). CCR values range on a gradual scale varying from 1 (meaning that A and B are very dissimilar) to 5 (meaning the opposite), as illustrated in Table 6.1. A score of 3 is given if the two versions are found to be slightly similar. Listeners were asked to score the overall speech quality of B compared to A. The higher the CCR score, the more effective the adaptation process. Unlike in the objective evaluation, there is no need here for a one-to-one correspondence between the target and the estimated frames. Therefore the audio examples used for this evaluation were entirely generated (i.e. cepstrum, F0 and phone duration) by the full data and adapted HMM-based speech synthesizers.
Table 6.1: Grades in the CCR scale.

    Meaning            Score
    Very similar         5
    Similar              4
    Slightly similar     3
    Dissimilar           2
    Very dissimilar      1

The test consists of 30 pairwise comparisons. The same listening protocol as in Section 5.5 was applied. Twenty-six naive listeners participated in this evaluation.

Figure 6.5 displays the mean CCR scores for both DoA. The same kind of tendency as in the objective evaluation can be seen, i.e. HTS is able to produce better adapted HPO speech than adapted HPR speech. As expected, we also see that the speech synthesis quality of the adapted models increases with the number of adaptation sentences, independently of the DoA. Nonetheless, a reasonably high-quality HMM-based speech synthesis can be achieved for both DoA with around 100 HPO or HPR adaptation sentences. It can indeed be seen from Figure 6.5 that this corresponds to CCR scores around 3.5, which means that the adapted voice, compared to the full data model, is perceived to have a quality between slightly similar and similar.

Figure 6.5: Subjective evaluation of the overall speech quality of the adapted models - Effect of the number of adaptation sentences on CCR scores (mean scores with their 95% confidence intervals).
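Since mean scores with 95% confidence intervals are reported throughout this chapter, a minimal sketch of that computation is given below (Python with SciPy; a standard t-based interval, which is an assumption about the exact procedure used here).

    import numpy as np
    from scipy import stats

    def mean_with_ci(scores, confidence=0.95):
        """Mean of listener scores with a t-based confidence interval."""
        scores = np.asarray(scores, dtype=float)
        mean = scores.mean()
        sem = stats.sem(scores)                     # standard error of the mean
        half_width = sem * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
        return mean, half_width

    ccr_scores = [4, 3, 5, 4, 3, 4, 4, 5, 3, 4]
    m, hw = mean_with_ci(ccr_scores)
    print(f"{m:.2f} +/- {hw:.2f}")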
6.3 Interpolation and Extrapolation of the Degree of Articulation

This section is devoted to the implementation and quality assessment of a continuous control of the DoA in HMM-based speech synthesis, in order to continuously and smoothly change the DoA of the NEU voice towards, and possibly beyond, our adapted HPO or HPR voices.

6.3.1 Method

Our implementation of a continuous control of the DoA makes use of 3 models: i) the NEU full data model; ii) the adapted HPO model; iii) the adapted HPR model, as illustrated in Figure 6.1. Both adapted models were obtained using the entire HPO and HPR training sets (1220 sentences), in order to obtain the finest quality for model interpolation and extrapolation, and consequently for the resulting speech synthesis. Because the decision trees of the NEU full data model are not modified during the adaptation process, there is a one-to-one correspondence between the probability density functions (i.e. the leaf nodes of the decision trees) of the NEU full data model and the adapted HPO or HPR models. Therefore the continuous control of the DoA is achieved by linearly interpolating or extrapolating the means and the diagonal covariance matrices of each state output and state duration probability density function (mel-cepstrum, log F0 and duration distributions). Apart from the MAP adaptation step, and as discussed in Section 8.2, this interpolation or extrapolation method provides slightly different results compared to the technique consisting in applying scaled transforms.

Since no reference speech data is available to evaluate the quality of interpolation and extrapolation objectively, only two subjective tests are conducted. The way listeners perceive the interpolation and extrapolation of the DoA is first assessed in Section 6.3.2. This evaluation is then complemented with a Comparative Mean Opinion Score (CMOS) test in Section 6.3.3, to assess the quality of this interpolation and extrapolation.
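A minimal sketch of this per-leaf linear interpolation and extrapolation is given below (Python; the data layout is hypothetical). An interpolation ratio alpha of 0 returns the NEU distribution, 1 the adapted HPO or HPR distribution, and values outside [0, 1] extrapolate; with diagonal covariances, extrapolated variances must be kept positive, which the sketch enforces with a floor (an implementation assumption).

    import numpy as np

    VAR_FLOOR = 1e-6  # assumption: keep extrapolated variances positive

    def blend_pdf(neu, target, alpha):
        """Linearly interpolate (0 <= alpha <= 1) or extrapolate (alpha < 0 or
        alpha > 1) one Gaussian pdf between the NEU model and an adapted model.

        neu, target: dicts with 'mean' and 'var' (diagonal covariance) arrays,
        one per decision-tree leaf; the one-to-one leaf correspondence holds
        because adaptation does not modify the NEU decision trees.
        """
        mean = (1.0 - alpha) * neu["mean"] + alpha * target["mean"]
        var = (1.0 - alpha) * neu["var"] + alpha * target["var"]
        return {"mean": mean, "var": np.maximum(var, VAR_FLOOR)}

    neu_leaf = {"mean": np.array([1.0, 2.0]), "var": np.array([0.2, 0.3])}
    hpr_leaf = {"mean": np.array([1.5, 2.4]), "var": np.array([0.35, 0.4])}
    half_hpr = blend_pdf(neu_leaf, hpr_leaf, 0.5)    # intermediate DoA
    beyond_hpr = blend_pdf(neu_leaf, hpr_leaf, 1.25) # extrapolated DoA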
6.3.2 Perception of the Degree of Articulation

For the evaluation of the perception of the DoA, listeners were asked to listen to four sentences: the three reference sentences A (HPO), B (NEU) and C (HPR); and the test sentence X, which could be either interpolated between A and B or between B and C, or extrapolated beyond A or C. Sentences A and C were synthesized by the models adapted from the NEU full data model using the entire HPO and HPR training sets. Sentence B was synthesized by the full data model. Listeners were then given a discrete scale ranging from -1.5 to 1.5 in steps of 0.25, on which A, B and C were placed at -1, 0 and 1 respectively. Finally, participants were asked to tell where X should be located on that scale, X being different from A, B and C. The test consisted of 10 quadruplets. Sentences were randomly chosen amongst the synthesis set of the database. Thirty-four naive listeners participated in this evaluation, under the same listening conditions as in Section 5.5.

Figure 6.6: Subjective evaluation of the adapted models - Perceived interpolation and extrapolation ratio as a function of the actual interpolation and extrapolation ratio, together with its 95% confidence interval.

Figure 6.6 displays the evolution of the average perceived interpolation and extrapolation ratio as a function of the actual ratio which is applied. A good linear correspondence is achieved between the perceived and the reference DoA. As expected, the curve is monotonically increasing, showing that listeners were able to perceive and recognize the continuous control of the DoA. Our interpolation and extrapolation method thus proved effective at providing realistic DoA. However, due to the constraints imposed by the discrete scale, i.e. the user was not allowed to select reference (-1, 0, 1) or extreme (lower than -1.5, higher than 1.5) values, we may have introduced a small bias in the assessment of the perceived DoA. This bias causes the results to suffer from border effects. Indeed, as participants did not know in advance the maximum variability during the test, they tended to naturally keep away from the border values of the scale. Extending the scale one point further on both sides, or using a continuous scale also extended beyond the range of usual values, should give more accurate results.

6.3.3 Segmental Quality of the Interpolation and Extrapolation

In a second subjective test, assessing the segmental quality of the interpolation and extrapolation, participants were asked to score the overall speech quality of X versus B (the NEU synthesis), leaving aside the difference in DoA between X and B. For this, we used a CMOS test in order to assess the quality of the interpolated and extrapolated speech synthesis. CMOS values range on a gradual scale varying from -3 (meaning that X is much worse than B) to +3 (meaning the opposite), as illustrated in Table 6.2. A score of 0 is given if the quality of both versions is found to be equivalent.
Table 6.2: Grades in the CMOS scale.

    Meaning           Score
    Much better        +3
    Better             +2
    Slightly better    +1
    About the same      0
    Slightly worse     -1
    Worse              -2
    Much worse         -3

Table 6.3 presents the averaged CMOS scores of the perceived synthesis quality for each DoA. The method proposed in this work provides a high-quality rendering of the DoA. It can be observed that interpolated HPR speech (with a DoA between 0 and 1) seems to have about the same quality as NEU speech, while a slight degradation is observed for all other DoA (on the CMOS scale, a score of -1 means slightly worse). Similarly to Chapter 5, HTS provides a better rendering of HPR speech than of HPO speech. Note also the large size of the 95% confidence intervals for each DoA, mainly when extrapolating. This could be explained by the difficulty of comparing speech quality alone, leaving aside the fact that the DoA of X and B could be different.

Table 6.3: Subjective evaluation of the adapted models (CMOS test) - Perceived synthesis quality of the test sentence X vs. the NEU sentence B (CMOS scores with their 95% confidence intervals).

6.4 Conclusions

This chapter focused on the implementation of a continuous control of the DoA (HPO and HPR speech) in the framework of HMM-based speech synthesis. In a first step, we performed the adaptation of a NEU synthesizer to generate HPO and HPR speech with a limited amount of speech data. An objective evaluation showed that, for intra-speaker adaptation, around 7 (for HPO) and 13 (for HPR) minutes of speech are needed to adapt cepstra with a good quality, while only half of that is sufficient to adapt F0 and phone duration correctly, which is similar to the tendency observed for inter-speaker adaptation.
These results were confirmed by a subjective test. In a second step, the implementation of a continuous control of the DoA was proposed. Subjective evaluation showed that good quality NEU, HPO and HPR speech, but also any intermediate, interpolated or extrapolated DoA, can be obtained from an HMM-based speech synthesizer. Audio examples for speaking style adaptation, and for interpolation and extrapolation of the DoA, are available online at picart/.
Summary of Chapter 6

Starting from the NEU full data model trained in Chapter 5, voice adaptation techniques, applied here to intra-speaker voice adaptation, are implemented to generate HPO and HPR speech directly from the latter synthesizer, using a limited amount of speech data.

- Training of the NEU full data model: same parameters as in Chapter 5.
- Adaptation: in the framework of HSMM, CMLLR adaptation transforms spectrum, pitch and state duration simultaneously; from 5 (i.e. around 14 or 26 seconds of HPO or HPR speech respectively) to 1220 (i.e. around 44 or 93 minutes of HPO or HPR speech respectively) adaptation sentences; further optimized using MAP adaptation.
- Synthesis: same parameters as in Chapter 5.
- Evaluations: efficiency of speaking style adaptation as a function of the size of the adaptation database:
  - objective test: around 200 adaptation sentences (7 minutes of HPO and 13 minutes of HPR speech) allow good quality cepstrum adaptation, while only half of these data is sufficient to adapt F0 and phone duration;
  - subjective test: around 100 adaptation sentences (3 minutes of HPO and 7 minutes of HPR speech) allow reasonably high-quality HMM-based speech synthesis, confirming the objective evaluation results.

Implementation of a continuous control of the DoA, manually adjustable by the user to obtain any interpolated or extrapolated DoA in a continuous way.

- Method:
  - decision trees of the NEU full data model are not modified during the adaptation process, so there is a one-to-one correspondence between the probability density functions (i.e. the leaf nodes of the decision trees) of the NEU full data model and the adapted HPO or HPR models;
  - linear interpolation and extrapolation of the mean and diagonal covariance matrices of each state output and state duration probability density function (mel-cepstrum, log F0 and duration distributions) between the NEU full data, the adapted HPO and the adapted HPR models.
- Evaluations:
  - perception of the DoA: realistic continuous control of the DoA provided by our interpolation and extrapolation method, confirmed by the listeners' perception and recognition;
  - quality of the DoA: interpolated HPR speech (with a DoA between 0 and 1) has about the same quality as NEU speech, while a slight degradation is observed for extrapolated HPR speech (with a DoA between 1 and 1.5) and for HPO speech (with a DoA between 0 and -1.5).

Audio examples are available online at picart.
Chapter 7

Subjective Assessment of Hypo and Hyperarticulated Speech

Contents
7.1 Introduction
    7.1.1 Speech Intelligibility Estimation
    7.1.2 Speech Intelligibility Enhancement
    7.1.3 Contributions and Structure of the Chapter
7.2 Effects Influencing the Perceived Degree of Articulation
    7.2.1 Method
    7.2.2 Experiments
7.3 Intelligibility and Quality Assessments of Hypo and Hyperarticulated Speech
    7.3.1 Method
    7.3.2 Semantically Unpredictable Sentences Test
    7.3.3 Absolute Category Rating Test
7.4 Conclusions

7.1 Introduction

Hidden Markov Model (HMM) based speech synthesis is convenient for creating a synthesizer whose speaker characteristics and speaking styles can be easily modified. As already explained in Chapter 6, this can be obtained by adapting a source speaker's model to a target speaker's model, using inter- or intra-speaker voice adaptation techniques. Unlike with human speakers, each parameter of synthetic speech can be carefully controlled, which makes speech synthesis an effective and reliable method for studying the perception of the DoA. Therefore, this chapter focuses on a deeper understanding of the phenomena responsible for the perception of the DoA by listeners, as well as of how intelligibility is affected when the synthesizer is embedded in adverse environments.

The flexibility of HMM-based speech synthesis can be exploited in particular to modify synthetic speech intelligibility when listening conditions degrade. In such adverse environments, two main factors can affect the way speakers talk to their interlocutor: the background perturbation and the visual field. On the one hand, and in direct connection
with HPR speech, the Lombard effect [Lombard 1911] refers to the speech changes due to the immersion of the speaker in a noisy environment. It has been shown to significantly improve the intelligibility of speech in background noise [Summers et al. 1988]. Unsupervised analysis, i.e. without any contextual information, of Normal speech (i.e. speech produced in quiet conditions) and Lombard speech (i.e. speech produced in noisy conditions) was conducted in [Godoy & Stylianou 2012]. They also proposed a spectral envelope transformation applying a correction filter to Normal speech in order to render it more intelligible in noisy environments. On the other hand, Fitzpatrick reported in [Fitzpatrick et al. 2011] that speakers modify their speech production strategies in noise depending on whether their interlocutor can or cannot be seen. Later [Fitzpatrick et al. 2012], they showed that a greater audio-visual intelligibility benefit is obtained from the production of Lombard speech in face-to-face conditions compared to non-visual conditions, and from the production of Lombard speech compared to speech produced in quiet conditions. These findings were also observed in [Kim et al. 2011]. Many past and current studies showed that, in noisy environments, clear speech, HPR speech and Lombard speech are more intelligible than conversational and casual speech, for both normal-hearing and hearing-impaired listeners [Smiljanić & Bradlow 2005] [Krause & Braida 2004] [Hazan & Baker 2011]. Noise-induced speech (i.e. speech produced in noisy conditions) and reverberation-induced speech (i.e. speech produced in a reverberant environment) are reported [Hodoshima et al. 2010] to be more intelligible than speech produced in quiet conditions when heard in an adverse environment. This conclusion holds even if the actual reverberant environment is different from the one used during recording.

Objectively evaluating and enhancing speech intelligibility in adverse environments has been widely studied. An overview is given in Sections 7.1.1 and 7.1.2 respectively.

7.1.1 Speech Intelligibility Estimation

Regarding the evaluation of synthetic speech intelligibility, the work presented in [Valentini-Botinhao et al. 2011] investigated several objective measures. They found that the Dau measure [Christiansen et al. 2010], based on the Dau model [Dau et al. 1996], and the Glimpse Proportion (GP) measure [Cooke 2006] are good intelligibility predictors for modifications taking place in the spectral domain. Various other objective intelligibility measures exist (e.g. [Taal et al. 2010] [Valentini-Botinhao et al. 2011]): amongst others, the Articulation Index (AI) [French & Steinberg 1947] [Kryter 1962] [Mueller & Killion 1990], the Speech Transmission Index (STI) [Steeneken & Houtgast 1980], the Frequency Weighted Segmental SNR (FWS) [Tribolet et al. 1978], the Weighted Spectral Slope metric (WSS) [Klatt 1982], the Speech Intelligibility Index (SII) [S ], the Short-Time Objective Intelligibility (STOI) [Taal et al. 2010], the Template Constrained Generalized Posterior Probability (TCGPP) [Wang et al. 2012], etc. In [Liu et al. 2008], 9 well-known objective quality measures are assessed for their potential in intelligibility estimation. Their results showed that most quality measures correlate poorly with intelligibility, especially when degradations (e.g. additive noises, enhancement schemes) are encountered, with the exception of WSS.
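As an illustration of the Glimpse Proportion idea [Cooke 2006], the sketch below (Python; a simplified version, not Cooke's reference implementation) counts the spectro-temporal regions where the speech level exceeds the noise level by a local SNR threshold (3 dB is a commonly used value) and reports their proportion.

    import numpy as np

    def glimpse_proportion(speech_spec, noise_spec, threshold_db=3.0):
        """Simplified Glimpse Proportion.

        speech_spec, noise_spec: magnitude spectrograms of shape (F, T); the
        reference method uses an auditory filterbank, but any time-frequency
        analysis is accepted here for illustration. Returns the fraction of
        time-frequency cells whose local SNR exceeds threshold_db.
        """
        eps = 1e-12
        local_snr_db = 20.0 * np.log10((speech_spec + eps) / (noise_spec + eps))
        return float(np.mean(local_snr_db > threshold_db))

    # Toy usage with random spectrograms standing in for speech and noise.
    s = np.abs(np.random.randn(64, 200))
    n = 0.5 * np.abs(np.random.randn(64, 200))
    print(glimpse_proportion(s, n))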
The relationship between the prosodic modifications occurring in clear speech, compared to plain speech, and the resulting increased intelligibility found in clear speech is investigated in [Mayo et al. 2012]. They compared five speech styles (plain, infant-, computer- and foreigner-directed, and shouted), and found that the total amount of speech escaping masking was a good intelligibility predictor for objective tests.

7.1.2 Speech Intelligibility Enhancement

According to [Erro et al. 2012], two main approaches can be implemented to render synthetic speech more intelligible in adverse environments: i) recording the database in the desired conditions; and ii) enhancing the synthesizer output with signal processing techniques (parametric and non-parametric approaches). Although the second method may not be as effective as the first one in rendering the desired speech signal or speaker characteristics, it avoids the process of database recording, which is time- and resource-consuming. Therefore, many studies have been conducted on intelligibility improvement with signal processing techniques and expert knowledge. For instance, [Erro et al. 2012] proposed to improve the intelligibility of speech by manipulating the parameters (spectral slope and amplification of low-energy parts of the signal) of a harmonic speech model. Several signal processing methods (post-processing), mainly based on energy reallocation, addressing the improvement of intelligibility in adverse noise conditions have been proposed [Tang & Cooke 2010] [Tang & Cooke 2011] [Skowronski & Harris 2006] [Niederjohn & Grotelueschen 1976] [Hall & Flanagan 2010]. In particular, five energy reallocation strategies to increase speech intelligibility in noisy conditions are compared in [Tang & Cooke 2010]. Yoo [Yoo et al. 2007] suggested that modifications of the transient parts of speech significantly impact intelligibility. While such signal processing techniques have been shown to improve speech intelligibility, they are not able to model the spectral changes occurring when natural speech is produced (e.g. the increase of vocal effort to enhance loudness). To overcome this issue, [Jokinen et al. 2012] proposed a noise-adaptive post-filtering algorithm mimicking the spectral effects observed in natural Lombard speech, to improve the quality and intelligibility of speech in mobile communications. A recent study successfully applied Spectral Shaping and Dynamic Range Compression to modify casual speech in order to reach the intelligibility of clear speech [Koutsogiannaki et al. 2012]. This idea was similarly implemented in [Zorila et al. 2012]. Speech intelligibility improvement has been performed for a limited domain task in [Langner & Black 2005], based on voice conversion techniques. Another example is the Loudmouth synthesizer [Patel et al. 2006], which emulates human modifications (both acoustic and prosodic) to speech in noise by manipulating word duration, fundamental frequency and intensity. In [Bonardo & Zovato 2007], it is proposed to tune dynamic range controllers (e.g. compressors and limiters) and some user controls (e.g. speaking rate and loudness) to improve the intelligibility of synthesized speech. Various methods allowing automatic modification of speech to achieve the same goal are investigated in [Anumanchipalli et al. 2010] (e.g. boosting the signal amplitude in important frequency bands, modification of prosodic and spectral properties, etc.).
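To give a flavor of the energy-reallocation family of methods cited above, the following sketch (Python with SciPy; a generic illustration, not any of the cited algorithms) applies a simple high-frequency pre-emphasis to move energy towards the formant region, then renormalizes the waveform to its original RMS, so that any intelligibility gain does not come from a mere level increase.

    import numpy as np
    from scipy.signal import lfilter

    def reallocate_energy(x, pre_emphasis=0.8):
        """Crude energy reallocation: first-order pre-emphasis followed by
        RMS renormalization, keeping the overall level unchanged."""
        rms_in = np.sqrt(np.mean(x ** 2))
        y = lfilter([1.0, -pre_emphasis], [1.0], x)  # boost high frequencies
        rms_out = np.sqrt(np.mean(y ** 2))
        return y * (rms_in / (rms_out + 1e-12))

    # Toy usage on a synthetic vowel-like signal.
    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 800 * t)
    y = reallocate_energy(x)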
In [Cernak 2006], the unit selection cost function was modified by an additional measure predicting speech unit intelligibility, namely the SII, in order to bias the synthesis towards choosing more intelligible units from the speech database. In [Valentini-Botinhao et al. 2012a] [Valentini-Botinhao et al. 2012b], a new method for extracting or modifying mel-cepstral coefficients based on the GP intelligibility measure for speech in noise was proposed.

As already mentioned in Chapters 5 and 6, a computational model of human speech production managing phonetic contrast along the H and H continuum [Lindblom 1983] has recently been proposed and implemented in [Nicolao et al. 2012], allowing speaking style modification in HMM-based speech synthesis according to external acoustic conditions [Moore & Nicolao 2011]. They showed that such HPO and HPR control can successfully affect the intelligibility of synthetic speech in noise, respectively decreasing and increasing it by around 25% on average. The same idea was also applied to the Italian language in [Nicolao et al. 2013]. In [Syrdal et al. 2012], the speech intelligibility of eight TTS systems, together with a linearly time-compressed human reference speech, was measured as a function of the speech rate (varying from 200 to 450 words per minute). The eight TTS systems consist of four synthesis methods (formant, diphone concatenation, unit selection and HMM-based speech synthesis) for a female and a male American English voice. They found, through a Semantically Unpredictable Sentences (SUS) test [Benoît 1990] [Benoît et al. 1996b], that the HMM-based synthesizer was slightly less intelligible than the unit selection one at low speech rates, and that the intelligibility difference between the two grew with the speech rate.

7.1.3 Contributions and Structure of the Chapter

The current work relies on Lindblom's H and H theory [Lindblom 1983], in which speakers are expected to vary their output along a continuum of HPO and HPR speech. Compared to the NEU case, HPR speech tends to maximize the clarity of the speech signal by increasing the articulation efforts needed to produce it, while HPO speech is produced with minimal articulation efforts. In adverse conditions, speakers generally reach a compromise between the need for effective communication and the minimization of articulation efforts. Note that our work belongs to the first category described in [Erro et al. 2012], i.e. the recording of a database in specific conditions.

This chapter focuses on a deeper understanding of the phenomena induced by, and responsible for, the perception of the DoA by listeners. Indeed, it was shown in Chapter 6 that high-quality adapted HPO and HPR HMM-based models can be obtained using speaker adaptation techniques. We also proved in Chapter 4 that the DoA induces modifications in the cepstrum, pitch, phone duration and phonetic transcriptions. However, the state-of-the-art voice adaptation process used in the present work does not bring any information about the internal mechanisms responsible for the perception of the DoA. Therefore, the first part of this chapter analyzes these induced modifications separately within the complete adaptation process (Section 7.2), in order to quantify the contribution of each of them to the perception of the DoA by listeners. The implementation of our synthesizers is first detailed in Section 7.2.1. Subjective evaluations are then conducted in
Section 7.2.2. We quantify the effects of each factor influencing the DoA using a Perceived Degree of Articulation (PDA) test, which is complemented by an Absolute Category Rating (ACR) test evaluating various aspects of speech, i.e. quality, comprehension, non-monotony, fluidity and pronunciation. This perceptual study is a necessary preliminary step towards performing speaker-independent control of the DoA.

After that, the second part of this chapter is devoted to synthetic speech intelligibility assessment (Section 7.3), when a varying DoA is implemented in the HMM-based speech synthesizer and the latter is embedded in an adverse environment. The intelligibility of the synthesizers described in Section 7.3.1 is assessed through a Semantically Unpredictable Sentences test (SUS - Section 7.3.2), in which the words composing the sentences cannot be predicted by listeners from the sentence meaning. This is then complemented by an Absolute Category Rating test (ACR - Section 7.3.3) in order to evaluate various aspects of speech. As already mentioned in Chapter 1, increasing the speech intelligibility of a synthesizer performing in adverse conditions has many daily life applications: perceiving the GPS voice inside a moving car; understanding train or flight information in stations or halls; adapting the difficulty level when learning foreign languages; etc. Finally, Section 7.4 concludes the chapter.

This chapter is based upon the following publications: [Picart et al. 2011b] [Picart et al. 2012a] [Picart et al. 2013a] [Picart et al. 2013c]. All audio examples used in the experimental evaluations of this study are available online at picart/.

7.2 Effects Influencing the Perceived Degree of Articulation

As already mentioned, we focus in this section on high-quality HMM-based speech synthesis in which the DoA can be modified, and more specifically on the internal mechanisms leading to the perception of the DoA by listeners.

7.2.1 Method

The synthesizers implemented in Chapters 5 (i.e. the full data models) and 6 (i.e. the adapted models) are tested. As a reminder, the entire training sets were used to obtain the full data models (Section 5.2) as well as the adapted models (Section 6.2.1). We showed in Section 6.2 that around 7 or 13 minutes of HPO or HPR speech are needed to adapt cepstra with a good quality, while only half of that is sufficient to adapt F0 and phone duration. On the other hand, the more adaptation data, the better the quality, independently of the DoA. This is why we chose to use the models adapted from the NEU full data model using the entire HPO and HPR training sets, the perceptual effect induced by the amount of adaptation data having already been studied in Section 6.2.

Four main questions can be drawn in order to analyze the internal mechanisms leading to the perception of the DoA by listeners:

Question 1: Does adapting pitch and phone duration by a simple ratio operation (while not adapting the cepstrum) sound like HPO or HPR speech?
Question 2: What is the effect of the cepstrum (NEU vs. HPO or HPR) on the perception of the DoA?

Question 3: What is the effect of the phonetic transcription (NEU vs. HPO or HPR) on the perception of the DoA?

Question 4: Does the complete adaptation improve the perception of the DoA compared to the previous cases?

By combining the full data and adapted models described above, four synthesizers are created as follows (summarized in Table 7.1):

Synthesizer 1: The first synthesizer (Case 1) is our baseline system and corresponds to the NEU full data model, where a straightforward phone-independent constant ratio is applied to decrease or increase pitch and phone durations, so as to sound like HPO or HPR speech respectively. This ratio is computed once and for all over the HPO and HPR databases (the reader is referred to Chapter 3 for more details), by adapting the mean values of the pitch and phone duration from the NEU style. The phonetic transcription is manually adjusted to fit the real HPO and HPR transcriptions (i.e. as actually produced by the speaker in our database; see Section 4.3 for more details).

Synthesizer 2: The second synthesizer (Case 2) is constructed by adapting only the pitch and phone duration distributions from the NEU full data model. The phonetic transcription is the same as in the original HPO and HPR recordings.

Synthesizer 3: The third synthesizer (Case 3) is constructed by adapting the cepstrum, pitch and phone duration probability density functions from the NEU full data model. The phonetic transcription is not manually adjusted to fit the real HPO and HPR transcriptions.

Synthesizer 4: The last synthesizer (Case 4) is built by adapting the cepstrum, pitch and phone duration distributions from the NEU full data model. The phonetic transcription is the same as in the original HPO and HPR recordings.

Table 7.1: Four different synthesizers, so as to analyze the internal mechanisms leading to the perception of the DoA by listeners.

                 Full Data Model (NEU)                 |       Adapted Model (HPO or HPR)
           Cepstrum  Pitch  Duration  Phon.Transcr.    |  Cepstrum  Pitch  Duration  Phon.Transcr.
  Case 1      X      Ratio    Ratio                    |                                  X
  Case 2      X                                        |              X       X           X
  Case 3                                   X           |     X        X       X
  Case 4                                               |     X        X       X           X
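A minimal sketch of the Case 1 ratio operation is given below (Python; hypothetical variable names). It simply rescales the F0 track and the phone durations generated by the NEU synthesizer by constant, phone-independent factors estimated once over the HPO or HPR database, and leaves the spectrum untouched, which is exactly what Questions 1 and 2 probe.

    import numpy as np

    def apply_prosody_ratio(f0, phone_durations, f0_ratio, duration_ratio):
        """Case 1 baseline: phone-independent constant rescaling of prosody.

        f0: F0 track in Hz (0 for unvoiced frames); phone_durations: durations
        in frames. The ratios would be estimated once over the HPO or HPR
        database from the mean values relative to the NEU style.
        """
        f0_scaled = np.where(f0 > 0, f0 * f0_ratio, 0.0)
        durations_scaled = np.maximum(1, np.rint(
            np.asarray(phone_durations) * duration_ratio)).astype(int)
        return f0_scaled, durations_scaled

    # Toy usage: hypothetical ratios for an HPR-like rendering
    # (slower and slightly higher-pitched than NEU).
    f0 = np.array([0.0, 110.0, 115.0, 118.0, 0.0])
    durs = [12, 8, 15]
    print(apply_prosody_ratio(f0, durs, f0_ratio=1.1, duration_ratio=1.4))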
These synthesizers are used in the experimental evaluation described in Section 7.2.2 to answer the above-mentioned questions, by comparing their performance against each other as described in Table 7.2. For instance, Question 1 is answered by comparing the performance of Cases 1 and 2.

Table 7.2: Answering the questions by comparing the synthesizers' performance.

                                               Case 1  Case 2  Case 3  Case 4
  Question 1 (prosody: ratio vs. adaptation)     X       X
  Question 2 (cepstrum)                                  X               X
  Question 3 (phonetic transcription)                            X       X
  Question 4 (complete adaptation)               X       X       X       X

7.2.2 Experiments

In order to assess the performance of our synthesizers, two separate subjective experiments are conducted. The first evaluates the influence of each factor described in Section 7.2.1 on the perception of the DoA. The second complements it by performing an Absolute Category Rating (ACR) test on other perceptual aspects of the synthetic speech.

Evaluation of the Perceived Degree of Articulation

To evaluate the Perceived Degree of Articulation (PDA), listeners were asked to listen to three sentences: the two reference sentences A (NEU) and B (HPO or HPR), synthesized by the full data models; and the test sentence X, synthesized by one of the four synthesizers described in Table 7.1 (randomly chosen, but still balanced), which could be either HPO or HPR depending on the articulation of B. Participants were given a continuous scale on which A and B were placed at 0 and 1 respectively. They were asked to tell where X should be located on that scale. The evaluation was performed on the synthesis set, whose sentences were part neither of the training nor of the adaptation sets. The test consisted of 20 triplets. For each DoA, 10 sentences were randomly chosen from the synthesis set. During the test, listeners were allowed to listen to each triplet of sentences as many times as they wanted, in the order they preferred. However, they were not allowed to come back to previous sentences after validating their decision. Twenty-four naive listeners participated in this evaluation.

The mean PDA scores, together with their 95% confidence intervals (CI), are shown in Figure 7.1. Because the target value 1 corresponds to the speech synthesized using the HPO or HPR full data models, the closer the PDA scores are to 1, the better the synthesizer, as it leads to an effective rendering of the intended DoA. From this figure, we clearly see the advantage of using an HMM to generate prosody (pitch and phone duration), instead of applying a straightforward phone-independent constant ratio to the prosody of the NEU synthesizer, in order to get as close as possible to real HPO or HPR speech (Case 1 vs. Cases 2, 3 and 4).
Figure 7.1: Subjective evaluation of the perception of the DoA - Mean PDA scores with their 95% confidence intervals (CI) for each DoA.

The effects of cepstrum adaptation (Case 2 vs. Case 4) and phonetic adaptation (Case 3 vs. Case 4) are also obvious. It can be noted that adapting the cepstrum has a higher impact on the rendering of the DoA than adapting the phonetic transcription (the gap between Case 2 and Case 4 is bigger than the gap between Case 3 and Case 4). Moreover, this conclusion is particularly true for HPR speech, while the difference is less marked for HPO speech. Case 2 therefore indicates that spectral features have a larger influence for HPR speech. This might be explained by the fact that the spectral changes (compared to the NEU style) induced by an HPR strategy are substantial, and therefore need to be modeled by the HMMs. Although significant spectral modifications are also present for HPO speech, their impact on the listeners' perception seems to be marked to a lesser extent. When analyzing Case 3, it is observed that the lack of an appropriate phonetic transcription is more severe for HPO speech. Indeed, we showed in Chapter 4 that HPO speech is characterized in particular by a high number of phone deletions, an effect more important than that of the phone insertions observed in HPR speech. This effect being stronger for HPO speech, we can easily understand that it leads to a greater degradation of the speech signal perceived by the listeners. Finally, it is noted that a high performance is achieved by the complete adaptation process (Case 4 vs. the ideal value 1, which corresponds to the speech synthesized using the HPO or HPR full data models). This proves the effectiveness of the CMLLR adaptation technique based on HMMs for the DoA.

Absolute Category Rating Test

This Absolute Category Rating (ACR) experiment is based on the framework described in [de Mareüil et al. 2006]. A Mean Opinion Score (MOS) test was complemented with
an evaluation of various aspects of speech: comprehension, non-monotony, fluidity and pronunciation. For this evaluation, 22 listeners were asked to listen to 20 test sentences, synthesized by each of the four synthesizers described in Table 7.1 (randomly chosen, but still balanced), which could be either HPO or HPR. These sentences were randomly chosen amongst the held-out set of the database (used neither for training nor for adaptation). Sentences were played one at a time. For each of them, listeners were asked to rate the 5 aspects cited above. Table 7.3 displays how the listeners were requested to respond. Listeners were given 5 continuous scales (one for each question to answer) ranging from 1 to 5 (these marks are associated with the extreme category answers in Table 7.3). These scales were extended one point further on both sides (thus ranging from 0 to 6) in order to limit border effects within the scale. During the test, listeners were allowed to listen to each sentence as many times as they wanted. However, they were not allowed to come back to previous sentences after validating their decision.

Table 7.3: Question list asked to listeners during the ACR test, together with their corresponding extreme category responses [de Mareüil et al. 2006].

  Test           Question (Extreme Answers)
  MOS            How did you appreciate globally what you just heard? (Very bad - Very good)
  Comprehension  Did you find it difficult to understand the message? (Very difficult - Very easy)
  Non-monotony   How would you characterize the speech intonation? (Very monotonous - Very varied)
  Fluidity       How would you characterize the speech fluidity? (Very jerky - Very fluid)
  Pronunciation  Did you hear some pronunciation problems? (Serious problems - No problem)

Mean scores are shown in Figure 7.2. The MOS test shows an improvement in speech quality from Case 1 to Case 4, for both HPO and HPR speech. This proves again the efficiency of the CMLLR adaptation process for producing high-quality synthetic speech. When analyzing the comprehension test, we clearly see an increase and a decrease in the intelligibility of HPR and HPO speech respectively, from Case 1 to Case 4. These results were expected considering our definition of HPO and HPR speech, and corroborate our earlier findings. The intelligibility of HPR speech is much higher for the complete adaptation process (Case 4) than for the baseline (Case 1). A dramatic increase in monotony is observed for HPO speech (from Case 1 to Case 4), while no significant variation is noticed for HPR speech. Going from Case 1 to Case 4 means getting closer to the target HPO or HPR speech. The sentence-wise intonation and variations, called the suprasegmental features, are reduced to a minimum in HPO speech because of its faster speech rate (see Section 4.3.4), explaining the dramatic increase in
monotony observed for HPO speech (from Case 1 to Case 4). The suprasegmental features could be enhanced in HPR speech because of its slower speech rate. However, no significant differences are observed here, because our speaker did not amplify the suprasegmental features from the NEU style to the HPR one.

Figure 7.2: Subjective evaluation of the perception of the DoA - ACR test.

The fluidity test shows that HPO speech is more fluid than HPR speech. This is due to the fact that HPO speech is characterized by a lower number of pauses and glottal stops, shorter phone durations and a higher speech rate (as proven in Section 4.3). All these effects lead to an impression of fluidity in speech, while the opposite tendency is observed in HPR speech. This also explains the fact that, starting from our baseline (Case 1) and moving towards the target HPO and HPR speaking styles, the speech becomes respectively more or less fluid (albeit no progressive degradation of fluidity across cases is reported for HPR speech). Surprisingly enough, Case 2 obtains the highest scores in the comprehension and pronunciation tests for HPO speech. This means that, in order to decrease the comprehension of a message, the cepstrum needs to be adapted from the NEU style, so as to model the weaker articulatory efforts in HPO speech; in that case, formant targets are marked to a lesser extent. Finally, HPR speech exhibits no significant pronunciation differences amongst the different cases.

7.3 Intelligibility and Quality Assessments of Hypo and Hyperarticulated Speech

The perceptual prevalence of phonetic, prosodic and spectral envelope information was studied in Section 7.2. As a complement, this section focuses on the intelligibility evaluation (Section 7.3.2) of the synthetic speech generated by the synthesizers described in Section 7.3.1, when the latter are performing in adverse environments. Moreover, a multidimensional assessment of both original and synthesized speech is conducted in Section 7.3.3.
7.3.1 Method

Five HMM-based speech synthesizers were implemented following the same procedure as in Section 6.3. As a reminder, the NEU full data model was adapted using the entire HPO and HPR training sets, in order to remove the effect of the number of adaptation sentences from our results. The five synthesizers were created using interpolation ratios ranging from -1 (HPO) to +1 (HPR), including 0 (NEU), with a 0.5 step: -0.5 and +0.5 correspond to models lying halfway between the NEU full data model and, respectively, the adapted HPO model or the adapted HPR model. The intelligibility of these five synthesizers (-1, -0.5, 0, +0.5, +1) is studied in Section 7.3.2, while a general assessment is performed on the three major synthesizers (-1, 0, +1) in Section 7.3.3.

7.3.2 Semantically Unpredictable Sentences Test

In order to evaluate the intelligibility of a voice, a Semantically Unpredictable Sentences (SUS) test was performed on speech degraded alternately by an additive and a convolutive noise, as these two types of adverse conditions are temporally and spectrally different. The advantage of such sentences is that they are unpredictable, meaning that listeners cannot determine a word in the sentence from the meaning of the whole utterance or from the context within the sentence.

Building the SUS Corpus

The same corpus as the one built in [de Mareüil et al. 2006] was used in our experiments. This corpus is part of the ELRA package (ELRA-E0023). Basically, 288 semantically unpredictable sentences were generated following 4 syntactic structures containing 4 target words (nouns, verbs or adjectives, here written with a capital initial letter):

- adverb determiner Noun1 Verb-t-pronoun determiner Noun2 Adjective?
- determiner Noun1 Adjective Verb determiner Noun2.
- determiner Noun1 Verb1 determiner Noun2 qui ("that") Verb2.
- determiner Noun1 Verb preposition determiner Noun2.

Structure 4, as originally proposed by [Benoît 1990], was not kept, because it only contained 3 target words instead of 4. For more details about the generation of this corpus, the reader is referred to [de Mareüil et al. 2006].

Procedure

Nineteen naive listeners participated in this evaluation. They were asked to listen to 40 SUS, randomly chosen from the SUS corpus described in the previous paragraph. The SUS were played one at a time, and each SUS was used only once. For each of them, listeners were asked to write down what they heard. During the test, they were allowed to listen to
each SUS at most two times. They were not allowed to come back to previous sentences after validating their decision.

The SUS were synthesized using the five synthesizers described in Section 7.3.1. Two types of degradation were then applied to the synthesized SUS: additive noise and reverberation. For simulating the noisy environment, car noise was added to the original speech waveform at two Signal-to-Noise Ratios (SNRs): -5 dB and -15 dB. The car noise signal was taken from the Noisex-92 database, and was added so as to control the overall SNR without silence removal. Since the spectral energy of the car noise is mainly concentrated in the low frequencies (<400 Hz), the formant structure of speech was only poorly altered, and voices remained somewhat understandable even for SNR values as low as -15 dB.

When the speech signal s(n) is produced in a reverberant environment, the observation x(n) at the microphone is:

    x(n) = h(n) * s(n),    (7.1)

where * denotes convolution and h(n) is the L-tap Room Impulse Response (RIR) of the acoustic channel between the source and the microphone. RIRs are characterized by the value T60, defined as the time taken by the amplitude of the RIR to decay to -60 dB of its initial value. In order to produce reverberant speech, a room measuring 3x4x5 m with two levels of reverberation (T60 of 100 and 300 ms) was simulated using the source-image method [Allen & Berkley 1979], and the simulated impulse responses were convolved with the original speech signals.

The word-level recognition accuracy was used as the performance metric for the SUS test. In order to cope with orthographic mistakes, this accuracy was computed by manually counting the number of erroneous phonemes for each word written by the listeners, in comparison with the correct word. The same procedure was also applied for the accuracies at the sentence level. Therefore, a sentence could be considered wrong while some of its words could be considered correct. A strong correlation was noted between the recognition accuracies at the sentence and word levels.
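A minimal sketch of these two degradations is given below (Python with NumPy/SciPy; file handling omitted and variable names hypothetical). Mixing at a target SNR scales the noise against the clean signal energy, and reverberation convolves the signal with a simulated RIR, as in Equation (7.1).

    import numpy as np
    from scipy.signal import fftconvolve

    def add_noise_at_snr(speech, noise, snr_db):
        """Mix additive noise at a target overall SNR (no silence removal)."""
        noise = np.resize(noise, speech.shape)          # loop/crop the noise
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
        return speech + gain * noise

    def reverberate(speech, rir):
        """Equation (7.1): convolve the speech with a room impulse response."""
        return fftconvolve(speech, rir)[: len(speech)]

    # Toy usage with white noise and a crude exponentially decaying RIR
    # (the actual experiments used Noisex-92 car noise and source-image RIRs).
    fs = 16000
    speech = np.random.randn(2 * fs)   # stand-in for a synthesized SUS
    noise = np.random.randn(fs)
    noisy = add_noise_at_snr(speech, noise, snr_db=-5.0)
    rir = np.exp(-np.linspace(0, 8, fs // 4)) * np.random.randn(fs // 4)
    reverberant = reverberate(speech, rir)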
Results

The mean recognition accuracies at the word and sentence levels (for each DoA, and for each type and level of perturbation) are shown in Figure 7.3. The higher the score, the better the intelligibility of the synthesizer, as it leads to a higher word recognition. Interestingly, it is observed that accuracy generally increases with the DoA. For example, in the strongest reverberation, the word recognition rate increases from around 48% for HPO speech to 83% for HPR speech (i.e. an absolute gain of 35%). It is also worth noting that, in the presence of car noise, there is no need to over-articulate: using values of 0.5 or 1 for the DoA leads to almost exactly the same intelligibility performance. This conclusion however does not hold in a reverberant environment. Comparing the effects of the perturbations on the understandability of the message, it turns out that the most reverberant condition clearly leads to the highest degradation. In HPR, increasing the level of noise from -5 dB to -15 dB SNR results in a reduction of the word recognition rate of around 7%. Finally, it is noticed that, on average, the weakest reverberation is the least adverse condition, with recognition rates ranging from 75% to 96% when increasing the DoA. These latter results are, curiously, about 9% better than in car noise at -15 dB SNR, whatever the DoA.

Figure 7.3: Subjective intelligibility evaluation of the DoA (SUS Test) - Mean word (top) and sentence (bottom) recognition accuracies [%], together with their 95% CI.

7.3.3 Absolute Category Rating Test

Finally, an Absolute Category Rating (ACR) test was conducted in order to assess several dimensions of the generated speech. As in [de Mareüil et al. 2006], the Mean Opinion Score (MOS) was complemented with six other categories: comprehension, pleasantness, non-monotony, naturalness, fluidity and pronunciation.
Procedure

Seventeen naive listeners participated in this evaluation. They were asked to listen to 18 meaningful sentences, randomly chosen amongst the held-out set of the database (used neither for training nor for adaptation). The sentences were played one at a time. For each of them, listeners were asked to rate the 7 aspects cited above. The detailed question list is displayed in Table 7.3, complemented by Table 7.4. Listeners were given 7 continuous scales (one for each question to answer) ranging from 1 to 5. These scales were extended one point further on both sides (therefore ranging from 0 to 6) in order to limit border effects within the scale. The sentences corresponded either to original speech or to speech synthesized with a variable DoA (NEU, HPO or HPR). We used the same listening protocol as in Section 5.5.

Table 7.4: Question list (complement to Table 7.3) asked to listeners during the ACR test, together with their corresponding extreme category responses [de Mareüil et al. 2006].

    Test          Question (Extreme Answers)
    Pleasantness  How would you describe this voice? (Very unpleasant - Very pleasant)
    Naturalness   How would you characterize the naturalness of this voice? (Very artificial - Very natural)

Results

Results are shown in Figure 7.4. In all cases, original speech is preferred to synthetic speech. The MOS test shows that original NEU speech is preferred to HPO and HPR speech, while synthetic NEU and HPR speech are almost equivalent, leaving synthetic HPO speech slightly below. Note that as the MOS score of the original HPO speech seems to reach a limit, it is not surprising to obtain all the remaining scores of the test in about the same proportion. The comprehension test points out that NEU and HPR speech are clearly more understandable than HPO speech, on both the original and synthetic sides. Interestingly, the differences in comprehension between original and synthesized speech are rather weak. The pleasantness test indicates a preference of the listeners for original NEU speech, followed by HPR and HPO speech, while all types of synthetic speech received similar scores. Despite the HMM modeling, the intonation and dynamics of the voice are well reproduced at synthesis time, as illustrated by the non-monotony test. A major problem with HMM-based speech synthesis is the naturalness of the generated speech compared to the original speech. This is a known problem reported in many studies (e.g. [Cabral et al. 2008] [Yamagishi & King 2010] [Drugman 2011] [Raitio et al. 2011b] [Kawahara & Morise 2011] [Astrinaki et al. 2012]), and it is still an ongoing research topic. The naturalness test underlines this conclusion once more. The fluidity test shows an inverse tendency compared to the other tests: HPO speech indeed has a higher score than the others. This is due to the fact that HPO speech is characterized by a lower
number of pauses and glottal stops, shorter phone durations and a higher speech rate (as shown in Section 4.3). All these effects lead to an impression of fluidity in speech, while the opposite tendency is observed in HPR speech. Finally, the pronunciation test correlates with the comprehension test, in the sense that the more pronunciation problems are found, the harder the message is to understand. Although NEU and HPR speech are perceived as equivalent in this ACR test from the comprehension and pronunciation points of view, the SUS test proved that HPR speech was much more intelligible than NEU speech in adverse environments.

Figure 7.4: Subjective quality evaluation of the DoA (ACR Test) - Mean scores together with their 95% CI.

7.4 Conclusions

This chapter aimed at performing a comprehensive perceptual evaluation of the flexible HMM-based speech synthesizers obtained in Chapters 5 and 6. The goal was twofold: i) analyzing the internal mechanisms of the complete voice adaptation process; ii) analyzing the effects, on synthetic speech intelligibility, of the integration of the DoA in HMM-based speech synthesis when the synthesizer is operating in adverse environments.

The first part aimed at analyzing the adaptation process, and the resulting speech quality, of the NEU speech synthesizer to generate HPO and HPR speech. The goal was to gain a better understanding of the factors leading to high-quality HMM-based speech synthesis with various DoA (NEU, HPO and HPR). This is why the adaptation process was subdivided into four factors: cepstrum, prosody, phonetic transcription adaptation, as well as the complete adaptation. All these factors have their own importance. First, the perceptual impact of these factors was studied through a Perceived Degree of Articulation (PDA) test. It was observed that effective prosody adaptation cannot be achieved by a simple ratio operation. It was also shown that adapting prosody alone, without adapting the cepstrum, highly degrades the rendering of the DoA. The impact of cepstrum adaptation turned out to be more important than the effect of phonetic transcription adaptation. Besides, the importance of having a Natural Language Processor able to create
automatically realistic HPO and HPR transcriptions has been emphasized. This evaluation also highlighted the fact that high-quality HPO and HPR speech synthesis requires the use of an effective statistical adaptation technique such as Constrained Maximum Likelihood Linear Regression (CMLLR). Secondly, an Absolute Category Rating (ACR) test was conducted to complement the PDA evaluation. For HPR speech, it was observed that the more complete the adaptation process (in the sense of the PDA scores), the higher the quality and comprehension of speech. Nonetheless, no significant differences in monotony and pronunciation were found. Regarding HPO speech, the Mean Opinion Score (MOS) results as well as the comprehension, monotony and fluidity results were interestingly in line with the conclusions of the PDA test.

The second part was devoted to the evaluation of the benefits of integrating a variable DoA in an HMM-based speech synthesis system embedded in adverse conditions. First, a Semantically Unpredictable Sentences (SUS) test revealed that playing on the articulation significantly improves the intelligibility of the synthesizer in adverse environments (both noisy and reverberant conditions). In the presence of a perturbation, this evaluation showed that HPR speech enhances the comprehension of synthetic speech. Moreover, a DoA of 0.5 (instead of 1) is sufficient to improve the recognition of the message in car noise. The same conclusion was drawn in reverberant environments, except that a DoA of 1 is necessary in this case. Secondly, an Absolute Category Rating (ACR) test was used to assess the synthesizer through various voice dimensions. Although a loss is noticed between natural and synthesized speech regarding naturalness and segmental quality, several perceptual features like comprehension, non-monotony and pronunciation are relatively well preserved after statistical and parametric modeling.

All audio examples used in the experimental evaluations of this study are available online at picart/.
Summary of Chapter 7

Breaking down the complete voice adaptation process, to quantify the effect of each factor on the perception of the DoA by listeners: cepstrum, prosody, phonetic transcription adaptation, as well as the complete adaptation. It was demonstrated that:

- effective prosody adaptation outperforms a straightforward phone-independent constant ratio applied to the NEU full data model to decrease or increase pitch and phone durations so as to sound like HPO or HPR speech respectively;
- adapting prosody alone (i.e. without the cepstrum) highly degrades the DoA rendering;
- the impact of cepstrum adaptation turned out to be more important than the effect of phonetic transcription adaptation;
- having a Natural Language Processor able to automatically create realistic HPO and HPR transcriptions is an advantage;
- the more complete the adaptation process, the better the quality and comprehension of synthetic speech;
- high-quality HPO and HPR speech synthesis requires the use of an effective statistical adaptation technique such as CMLLR.

Integrating a variable DoA in the HMM-based speech synthesis system when the latter is performing in adverse environments (noise and reverberation) proved to improve the intelligibility of synthetic speech. It turned out that:

- HPR speech enhances the comprehension of the message;
- a DoA of +0.5, i.e. halfway between NEU and HPR speech (instead of 1, i.e. HPR speech), is sufficient to improve the recognition of the generated speech in car noise;
- a DoA of 1, i.e. HPR speech, is necessary in reverberant environments.

Speech synthesizers integrating a variable DoA, this time embedded in clean conditions, were assessed through various voice dimensions: MOS, comprehension, pleasantness, non-monotony, naturalness, fluidity and pronunciation. It turned out that:

- a gap still exists between natural and synthesized speech regarding naturalness and segmental quality;
- the statistical and parametric modeling process based on HMMs preserves relatively well the comprehension, non-monotony and pronunciation of the generated speech.

Audio examples are available online at picart.
Chapter 8

Varying the Degree of Articulation of Any Voice within HMM-based Speech Synthesis

Contents
    8.1 Introduction
        8.1.1 Creating Target Style Model without any Target Style Speech Data
        8.1.2 Contributions and Structure of the Chapter
    8.2 Creation of the Articulation Model
    8.3 Techniques for the Transposition of the Articulation Model to a New Speaker
    8.4 Prosody Transposition
        8.4.1 Experimental Framework
        8.4.2 Speech Quality of the Prosody Model Transposition
        8.4.3 Perception of the Degree of Articulation
    8.5 Filter Transposition
        8.5.1 Experimental Framework
        8.5.2 Speech Quality of the Filter Model Transposition
        8.5.3 Perception of the Degree of Articulation
        8.5.4 Identity Preservation Assessment
        8.5.5 Conclusions on Filter Transposition
    8.6 Generalization to Other Voices
        8.6.1 Experimental Framework
        8.6.2 Speech Quality of the Prosody and Filter Models Transposition
        8.6.3 Perception of the Degree of Articulation
        8.6.4 Identity Preservation Assessment
    8.7 Conclusions
8.1 Introduction

The ultimate goal of this research is to be able to continuously control the DoA of an existing standard NEU voice for which no HPO and HPR recordings are available. Several methods can be implemented to transform a source speaker's voice into a target speaker's voice when speech data from the target speaker are available. For instance, (intra-speaker) voice adaptation techniques [Yamagishi et al. 2009b] [Nose et al. 2009] already proved their effectiveness in Chapter 6 by providing various DoA of speech directly from the NEU style. Similarly, eigenvoice conversion techniques [Toda et al. 2006] [Ohtani et al. 2010] [Smit 2010] carry out Hidden Markov Model (HMM) based speaker adaptation using a small amount of adaptation data, by reducing the number of free parameters controlling the speaker dependencies of the HMMs. Voice Conversion (VC) techniques are also an alternative. One of the most popular VC methods is the probabilistic conversion based on Gaussian Mixture Models (GMMs) [Stylianou et al. 1998] [Toda et al. 2007a]. Finally, voice morphing is also a technique for continuously modifying a source speaker's speech so that it sounds as if pronounced by another speaker [Abe 1996] [Ye & Young 2004] [Kawahara et al. 2009]. A thorough review of various voice adaptation, VC, eigenvoice and voice morphing techniques was given in Chapter 6.

Unfortunately, none of these methods can be applied to our case, because we do not have any target data (i.e. HPO and HPR speech data) for the existing standard NEU voice. Instead we propose: i) to model the DoA transforms on a voice for which NEU, HPO and HPR speech data are available (Voice A), through speaking style adaptation; and ii) to apply these transforms to an existing standard NEU voice (Voice B) with no HPO and HPR recordings, through speaking style transposition.

8.1.1 Creating Target Style Model without any Target Style Speech Data

Similar problems were encountered in the past, in the framework of unit-selection speech synthesis. Indeed, this synthesis technique produces high-quality synthetic speech, but relies heavily on the examples that can be selected within the database. Synthesizing a speaking style which differs from the one contained in the database is not possible without recording a new corpus with the desired speaking style, which is time- and resource-consuming. This is the reason why Langner [Langner & Black 2005] proposed, for a limited domain task, to model speech generated in noise (i.e. Lombard speech) on a voice for which both NEU speech and speech in noise were available. They showed the possibility of improving the intelligibility of existing (NEU) synthetic voices without requiring any extra speech-in-noise recordings. For this, they used: i) a database where both NEU speech and speech in noise were available; ii) voice conversion techniques, applied to style conversion and based on a Gaussian Mixture Model (GMM), to learn a mapping between these two speaking styles and to transpose it to the existing target voice. Their method obtained encouraging results for diphone synthesis, but improvements were required for unit selection synthesis. However, nothing is mentioned about the preservation of the target speaker's identity.
More recently, Hsu [Hsu & Chen 2012] created an emotional model associated with a NEU target speaker's voice for which no such speech data are available, using speaker-dependent model interpolation in the framework of HMM-based speech synthesis. To achieve this, a pool of source speakers' voices was available, each speaker being associated with its NEU and emotional models. The technique consisted first in finding, for the NEU target model C1: i) the closest NEU source model A1 (with its associated emotional source model A2); ii) the closest source emotional model B2 (which may be different from A2). The target emotional model C2 was then obtained by interpolating between C1 and B2 until the minimum distance between C2 and A2 was reached (see the sketch below).

In [Kanagawa et al. 2013], the problem of creating a target style model for a voice for which no such target speech data are available was addressed by combining techniques of speaker or style adaptation [Tachibana et al. 2006] and average voice modeling [Yamagishi 2006]. For this they used multiple speakers' NEU and target style speech data, from a database composed of parallel speech data of five female professional narrators. They trained an average voice model with each speaker's NEU speech data (thus obtaining a NEU style average voice model), and estimated linear transforms by adapting this model to the target style using each speaker's target speech data (thus obtaining a target style average voice model). These transforms were eventually applied to an existing NEU acoustic model for which no target style speech data were available. Their experiments demonstrated the effectiveness of Speaker-Adaptive Training (SAT) normalization in the estimation of the transforms. Moreover, the proposed technique provided good quality target speaker's speech synthesis in terms of naturalness, style reproducibility and speaker similarity, with MOS scores ranging between 3 (meaning "fair") and 4 (meaning "good").

A similar idea is followed in the cross-lingual speaker adaptation domain [Liang et al. 2010], in which a source speaker's speech in one language is used to produce speech in another target language that still sounds like the source speaker's voice, even if the latter does not speak the target language. For this, Wu [Wu et al. 2009] proposed to establish a state mapping between the voice models in the source and target languages using the Kullback-Leibler (KL) divergence and, based on this mapping, to conduct cross-lingual speaker adaptation. The state mapping consists in finding a correspondence between each leaf node of the decision tree of one model and each leaf node of the decision tree of another model, as depicted in Figure 8.5. This can be achieved in two ways: the data and transforms approaches. Both methods start with average voice models trained on the respective source and target speech data, but differ in the way the target voice model is adapted to sound like the source speaker's voice. In the data approach, the source speech adaptation data is attached to the target language model, based on the state mapping between the source and target languages, and intra-lingual speaker adaptation is conducted for the target model regardless of the language of the speech data. The transforms approach adapts the source average voice model using the source speech data, and the resulting transformation matrices are applied to the target average voice model based on the state mapping between the source and target languages.
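Returning to [Hsu & Chen 2012], the interpolation step can be made concrete in a few lines. The sketch below is our own schematic rendering, restricted to mean vectors and a Euclidean distance; all names are hypothetical and it is not the authors' implementation:

    import numpy as np

    def interpolate_emotional_model(mu_c1, mu_b2, mu_a2, steps=101):
        """Schematically, [Hsu & Chen 2012]: move from the target NEU
        model C1 towards the source emotional model B2, and keep the
        interpolated point closest to the source emotional model A2."""
        best_mu, best_dist = None, np.inf
        for w in np.linspace(0.0, 1.0, steps):
            mu_c2 = (1.0 - w) * mu_c1 + w * mu_b2
            dist = np.linalg.norm(mu_c2 - mu_a2)
            if dist < best_dist:
                best_mu, best_dist = mu_c2, dist
        return best_mu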
In [de Franca Oliveira et al. 2012], the mapping between languages is provided by an intermediary space: a language-independent space of perceptual characteristics. This idea is motivated by the fact that some voice characteristics are inherent to specific speakers (e.g. gender, age, etc.) and that these characteristics are usually not affected by the spoken language. This technique relies on two language spaces of speakers' voices, in the source and target languages respectively, in which each speaker is represented by a point (i.e. a supervector). The idea, when a new source speaker's speech appears in the source language space, is to represent it as a weighted linear combination of all the other source speakers. The weights obtained at this step are then used to represent the new source speaker in the space of perceptual characteristics (i.e. to find a new point in the latter space). This new point is language-independent and can thus be used to estimate, in the same way as before, the weights representing the target speaker in the intermediate space. Finally, speaker interpolation is performed in the target language space to obtain the target speaker's voice in that space.

8.1.2 Contributions and Structure of the Chapter

In Chapter 7, we demonstrated that altering the DoA of synthetic speech improves intelligibility when the synthesizer is operating in adverse (noisy or reverberant) environments (the interested reader is referred to Section 7.3 for detailed results). In addition, modifying the DoA of synthetic speech has many daily life applications, as detailed in Chapter 1. This motivates the need for varying the DoA of existing NEU voices without requiring any additional HPO and HPR speech data.

This chapter is therefore devoted to finding new methods to transpose, to other voices, the DoA model estimated on one voice, in the framework of HMM-based speech synthesis [Zen et al. 2009]. These methods should be model-independent, in the sense that they can be applied to both the prosody (pitch and phone duration) and filter models independently. Furthermore, we investigate various parametric spaces for representing the spectral envelope in order to find the most appropriate one for our purpose: Mel-Generalized Cepstral coefficients (MGC [Fukada et al. 1992] [Tokuda et al. 1994]), Line Spectral Pairs coefficients (LSP [Itakura 1975] [Dutoit 1997]), PARtial CORrelation coefficients (PARCOR [Rabiner & Juang 1993] [Boite et al. 1999]) and Log Area Ratio coefficients (LAR [Rabiner & Juang 1993] [Dutoit 1997]). A priori, there is no clue as to which of the proposed methods will perform better than the others, if any single best method exists. However, an interesting study regarding the interpolation properties of Linear Prediction (LP) parametric representations was conducted in [Paliwal 1995]. It showed that the interpolation performance can vary amongst the investigated representations, although each of them provides equivalent information about the LPC spectral envelope. The best interpolation performance was achieved by the Line Spectral Frequency (LSF) representation. Note that LSFs correspond to the angles of the LSP polynomial roots. The stability of the LP model is guaranteed by simple criteria in the LSF domain [Bäckström & Magi 2006], and the LSFs can be derived from the LPC coefficients as sketched below.
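The LSF computation itself is compact. The following sketch (our own, assuming only numpy; it is not the parameter extractor used in this thesis) builds the classical sum and difference polynomials from the LPC coefficients and takes the angles of their roots:

    import numpy as np

    def lpc_to_lsf(a):
        """Convert LPC coefficients a = [1, a1, ..., ap] into Line
        Spectral Frequencies (radians, sorted in (0, pi))."""
        a = np.asarray(a, dtype=float)
        a_ext = np.append(a, 0.0)
        P = a_ext + a_ext[::-1]   # symmetric (sum) polynomial
        Q = a_ext - a_ext[::-1]   # antisymmetric (difference) polynomial
        lsf = []
        for poly in (P, Q):
            ang = np.angle(np.roots(poly))
            # keep one angle per conjugate pair, excluding z = +/-1
            lsf.extend(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])
        return np.sort(np.array(lsf))

For a stable A(z), the LSFs obtained from P and Q interlace on the unit circle, which is precisely the kind of simple stability criterion mentioned above.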
This chapter is structured as follows. Section 8.2 details the creation of the DoA model on Voice A. Section 8.3 presents the different methods investigated for applying the DoA model learned on Voice A to an existing standard NEU voice (Voice B). The efficiency of these techniques is assessed for the transposition of the prosody (Section 8.4) and filter (Section 8.5) coefficients separately. In the latter section, we also investigate which representation of the spectral envelope is the most suitable for this purpose: MGC, LSP, PARCOR or LAR coefficients. As no reference speech data are available, subjective evaluations are performed to assess the speech quality, the effectiveness of the DoA transposition, as well as the preservation of the target speaker's identity after modification of its DoA. The method providing the highest performance for the prosody and filter model transposition is then generalized to two other voices, a male (Voice M) and a female (Voice F) one, in Section 8.6. For this latter step, we hypothesized that speakers B, M and F would adopt a similar articulatory strategy to speaker A. Although this could possibly be fallacious, this hypothesis cannot be verified, since no HPO and HPR speech data are available for Voices B, M and F. Finding the exact articulatory strategy adopted by speakers B, M and F is not the purpose of this work. The goal is to apply an HPO and HPR model, learned on Voice A, to Voices B, M and F, in order to make listeners believe that they are actually hearing HPO and HPR speech from Voices B, M and F. This goal should also be reached by finding (if possible) a single technique achieving the best results independently of the DoA, even if specific methods could have DoA-dependent performance. Based on Lindblom's H and H theory on the one hand, and on the fact that the NEU vocalic triangles of the speakers are similar, in the sense that they can be matched to each other with simple translations and scalings (see Figure 8.1), on the other hand, those vocalic triangles should expand or shrink in a similar way as for speaker A. Finally, Section 8.7 concludes the chapter.

This chapter is based upon the following publications: [Picart et al. 2012b] [Picart et al. 2013b]. Audio examples for each voice (A, B, M and F) and for each DoA are available online at picart/.

Figure 8.1: Vocalic triangles estimated on the original NEU recordings for Voices A, B, M and F, together with dispersion ellipses.
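The vocalic triangles of Figure 8.1 can be approximated from the recordings with standard tools. The rough sketch below is our own illustration (LPC-based formant picking is a common textbook approach, not necessarily the formant tracker used here), assuming numpy and librosa; triangle_area quantifies the articulation-related expansion or shrinkage:

    import numpy as np
    import librosa

    def first_two_formants(frame, sr, order=12):
        """Crude F1/F2 estimation from the angles of the LPC roots
        of a vowel frame."""
        a = librosa.lpc(frame, order=order)
        roots = [r for r in np.roots(a) if np.imag(r) > 0.0]
        freqs = np.sort(np.angle(roots) * sr / (2.0 * np.pi))
        return freqs[0], freqs[1]

    def triangle_area(pt_a, pt_i, pt_u):
        """Shoelace area of the /a/-/i/-/u/ triangle in the (F2, F1)
        plane; a smaller area indicates a less articulated style."""
        (x1, y1), (x2, y2), (x3, y3) = pt_a, pt_i, pt_u
        return 0.5 * abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))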
8.2 Creation of the Articulation Model

For each filter parameterization to be evaluated (MGC, LSP, PARCOR and LAR), three HMM-based speech synthesizers [Zen et al. 2009] were built for Voice A: a NEU, an HPO and an HPR speech synthesizer. Voice A corresponds to the voice recorded in Chapter 3. Models relying on the MGC filter representation were already built for each DoA in Chapter 5. The remaining ones were trained following the same procedure as in Section 5.2, with the corresponding coefficients (LSP, PARCOR and LAR) extracted from the database (with α = 0.42 and an analysis order of 24). Figure 6.1 shows the general architecture of the system for each filter representation type.

In this section, we propose two methods for creating the articulation model on Voice A. The NEU HMM-based speech synthesizer was adapted following the same procedure as in Section 6.2 (i.e. CMLLR + MAP adaptation), using the entire training HPO and HPR databases (8220 CMLLR transforms, associated with decision tree classes). The effect of the amount of adaptation sentences on the quality of the synthesized speech was already studied in Section 6.2. At the end of this step, we obtained: i) the Voice A NEU full data model; ii) the Voice A HPO and HPR adapted models; iii) two sets of CMLLR transforms, for the Voice A HPO and HPR adapted models respectively.

In this chapter, speaking style adaptation from NEU to HPO and HPR speech is performed on Voice A in two alternative ways, as illustrated in Figure 8.2: i) CMLLR adaptation, as described in Section 6.2 (and summarized in the paragraph above); ii) model-space Linear Scaling (LS) adaptation.

Figure 8.2: Creation of the articulation model on Voice A. Transforms are computed in two alternative ways, using LS or CMLLR adaptation.

In constrained MLLR (CMLLR) adaptation, the mean vector µ and the covariance matrix Σ of a distribution are transformed simultaneously using the same transformation matrix ζ ∈ R^(L×L) (see Section 2.6 for more details). This model-space transform is equivalent to an affine transform of the feature space, as shown by the following equations:

    b_Ah(o) = N(o; ζ µ_An − ε, ζ Σ_An ζᵀ)            (8.1a)
            = |ζ̄| N(ζ̄ o + ε̄; µ_An, Σ_An)            (8.1b)
where A_n is the Voice A NEU model, A_h equally represents the Voice A HPO or HPR models, ε ∈ R^L is the bias term of the mean vector transform, ζ̄ = ζ⁻¹ and ε̄ = ζ⁻¹ε.

In model-space Linear Scaling (LS) adaptation, the transformation matrix Z is diagonal, and computed so as to obtain the same adapted mean vector as with CMLLR. Note that this model-space transform is also equivalent to an affine transform of the feature space, as shown by the following equations:

    b_Ah(o) = N(o; Z µ_An, Z Σ_An Zᵀ)                (8.2a)
            = |Z̄| N(Z̄ o; µ_An, Σ_An)                 (8.2b)

where Z̄ = Z⁻¹ and Z is diagonal such that Z µ_An = ζ µ_An − ε.

From the above equations, we see that the differences between the two proposed approaches are the following:

- from the mean vectors point of view (as schematically illustrated in Figure 8.3): µ_Ah is exactly the same whatever the method, whereas µ_Bh, where B_h equally represents the Voice B HPO or HPR models, differs depending on the chosen method;
- from the covariance matrices point of view: Σ_Ah differs depending on the chosen method, because Z ≠ ζ, and Σ_Bh also differs depending on the chosen method, for the same reason.

Figure 8.3: Comparison of mean vector µ adaptation in CMLLR (µ_Bh = ζ µ_Bn − ε) and model-space LS (µ_Bh = Z µ_Bn).
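In matrix form, the two adaptation schemes applied to a single Gaussian are only a few lines of numpy. The sketch below is our illustration of Equations 8.1a and 8.2a (the variable names are ours), and makes the shared mean and differing covariances explicit:

    import numpy as np

    def cmllr_adapt(mu, sigma, zeta, eps):
        """CMLLR (Eq. 8.1a): mean and covariance are transformed with
        the same matrix zeta and bias eps."""
        return zeta @ mu - eps, zeta @ sigma @ zeta.T

    def ls_adapt(mu, sigma, zeta, eps):
        """Model-space LS (Eq. 8.2a): a diagonal Z chosen so that the
        adapted mean coincides with the CMLLR one (assumes no zero
        component in mu)."""
        Z = np.diag((zeta @ mu - eps) / mu)
        return Z @ mu, Z @ sigma @ Z.T

Applied to the same Gaussian, both functions return identical means but different covariances, which is exactly the difference exploited when the transforms are later transposed to Voice B.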
8.3 Techniques for the Transposition of the Articulation Model to a New Speaker

In this section, the prosody (pitch and phone duration) and filter adaptation transforms computed on Voice A (see Section 8.2) are applied to an existing standard NEU voice (Voice B), with no HPO or HPR recordings available, to automatically control its DoA. This is illustrated in Figure 8.4. In order to find the most effective technique for varying the DoA, we first used the recordings of a native French speaker as Voice B. This Text-to-Speech (TTS) voice was kindly provided by Acapela Group S.A. Voice B was trained using 2400 NEU sentences sampled at 16 kHz, following the same procedure as for the NEU full data model of Voice A (see Figure 5.1): same settings for the filter (type of coefficients and parameters) and for the excitation. Figure 8.1 displays the vocalic triangle for the original Voice B NEU sentences, and Table 8.1 provides the speech rate as well as the mean and standard deviation of the F0 values for the Voice B NEU recordings. The vocalic triangle, as well as related information for Voice A, is given for comparison purposes (see Chapter 4 for more details). HTS was forced to use the same decision trees as those of Voice A in order to have a one-to-one mapping between the Probability Density Functions (PDFs) of Voices A and B. Imposing the decision trees has a potential impact on the quality of the generated speech, as the training process is no longer allowed to construct the best trees for the actual data, thus leading to a non-optimal clustering of the observations. However, informal listening tests did not show any significant degradation in output speech quality.

Table 8.1: Speech rates [syllable/s], mean and standard deviation of F0 values [Hz] for the Voice A NEU, HPO and HPR recordings and for the Voice B, M and F NEU recordings.

Speaking style transposition is performed on Voice B by applying the adaptation transforms learned on Voice A during the speaking style adaptation step. Since we know the mapping between each PDF and each transformation matrix on Voice A, only the mapping information between each PDF of Voice A and each PDF of Voice B is missing. Here again, and as illustrated in Figure 8.5, two techniques are investigated in this work: phonetic mapping and acoustic mapping (see Figure 8.4). Phonetic mapping implements the mapping between each PDF of Voice A and Voice B using decision trees only (as each PDF is associated with a full context label). Acoustic mapping is inspired by the cross-lingual speaker adaptation domain [Liang et al. 2010]. Here, the matching between each
PDF of Voice A and Voice B is computed by finding a leaf-to-leaf correspondence using the Kullback-Leibler (KL) divergence between the two distributions.

Figure 8.4: Prosody and filter adaptation transforms computed on Voice A are applied to an existing standard NEU Voice B, with no HPO or HPR recordings available, for generating the Voice B HPO and HPR adapted models. The most successful method (selected through various evaluations) is then used for automatically modifying the DoA of two other speakers (Voices M and F).

Figure 8.5: Transposition of the articulation model learned on Voice A to Voice B. The leaf node mapping is performed in two alternative ways, using phonetic (based on decision trees) or acoustic (based on KL divergence) mapping.

By combining the two speaking style adaptation techniques (detailed in Section 8.2) and the two speaking style transposition techniques, four methods are defined (and summarized in Table 8.2) to apply the prosody and filter transposition transforms, learned on Voice A, to Voice B. This further holds for the four filter parameterizations considered in this work.
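For diagonal-covariance Gaussians, as commonly used for HMM state PDFs, the KL divergence has a closed form, so the acoustic mapping reduces to a nearest-neighbour search. A minimal sketch (ours; it ignores mixture weights and stream structure, and the data layout is an assumption) is:

    import numpy as np

    def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
        """KL(p || q) between two diagonal-covariance Gaussians."""
        return 0.5 * np.sum(var_p / var_q
                            + (mu_q - mu_p) ** 2 / var_q
                            - 1.0 + np.log(var_q / var_p))

    def acoustic_mapping(leaves_b, leaves_a):
        """For each leaf PDF of Voice B, return the index of the
        closest leaf PDF of Voice A in the KL sense. Leaves are
        (mean, variance) pairs."""
        mapping = {}
        for j, (mu_b, var_b) in enumerate(leaves_b):
            dists = [kl_diag_gauss(mu_b, var_b, mu_a, var_a)
                     for (mu_a, var_a) in leaves_a]
            mapping[j] = int(np.argmin(dists))
        return mapping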
Table 8.2: Methods for applying the prosody and filter transposition transforms from Voice A to Voice B.

                                Decision tree mapping
    Transforms              Phonetic        Kullback-Leibler
    Linear Scaling (LS)     LS_Phn          LS_KL
    CMLLR                   CMLLR_Phn       CMLLR_KL

The DoA model transposition described above can be applied to the prosody and filter coefficients independently. As illustrated in Figure 8.4, these two contributions are studied separately in the following, as they have different properties and should affect the perception of the synthesis differently. Section 8.4 focuses on the prosody model transposition from Voice A to Voice B (the top part of the workflow, leading to Evaluation #1). The resulting conclusions are then used for the filter model transposition in Section 8.5 (the middle part of the workflow, leading to Evaluation #2). As no target data (HPO and HPR) are available for Voice B, objective measurements cannot be used, and our approach consequently relies solely on subjective evaluations. Section 8.6 finally confirms the efficiency of the best technique, resulting from the two previous steps, on two other voices: a male (Voice M) and a female (Voice F) French speaker (the bottom part of the workflow, leading to Evaluation #3).

8.4 Prosody Transposition

The first step we analyze is the contribution of the prosodic model (pitch and phone duration) transposition from Voice A to Voice B (see Figure 8.4). The experimental framework is detailed in Section 8.4.1, and a subjective assessment is performed: a Comparative Mean Opinion Score (CMOS) test, to evaluate the segmental quality after prosody transposition (Section 8.4.2). The CMOS evaluation is then complemented with a Comparative Perception of the DoA (CPDA) test, to quantify the (positive or negative) effects of prosody transposition on the perceived DoA (Section 8.4.3). These two tests related to the prosody model transposition are referred to as Evaluation #1 on the right-hand side of Figure 8.4.

8.4.1 Experimental Framework

In order to study the contribution of the prosody model transposition alone, we fixed the filter parameterization to the standard MGC coefficients for: i) the NEU full data model and the HPO and HPR adapted models of Voice A (see Figure 6.1); ii) the NEU full data model of Voice B. The phonetic transcription was manually adjusted to fit the real HPO and HPR transcriptions (see Section 4.3 for more details about phonetic insertions and deletions in HPR and HPO speech respectively). For each DoA, the four methods described in Table 8.2 were applied to adapt and transpose the prosody model from Voice A to Voice B.
The baseline system was chosen to be the NEU full data model of Voice B, to which a straightforward and carefully tuned phone-independent constant ratio is applied to decrease or increase pitch and phone durations, so as to sound like HPO or HPR speech respectively. This ratio is computed once and for all over the Voice A HPO and HPR databases (see Chapter 3), by adapting the mean values of pitch and phone duration from the NEU style. The phonetic transcription was also manually adjusted to fit the real HPO and HPR transcriptions.

It is important to note that the baseline is here assumed to be a reference with a high segmental quality, as it is based on a full data model without any statistical post-processing. Although it makes sense to apply this baseline technique to prosody features, such a straightforward ratio approach is obviously not generalizable to filter coefficients. This motivates the need for finding the most appropriate statistical adaptation method amongst the techniques presented in Section 8.3.
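The baseline is thus fully specified by two scalars per target style. A sketch of its computation and application (our illustration; the array names are hypothetical) could be:

    import numpy as np

    def style_ratios(f0_neu, f0_style, dur_neu, dur_style):
        """Phone-independent constant ratios, estimated once over the
        Voice A corpora from the mean F0 and mean phone duration."""
        return (np.mean(f0_style) / np.mean(f0_neu),
                np.mean(dur_style) / np.mean(dur_neu))

    def apply_baseline(f0_contour, durations, f0_ratio, dur_ratio):
        """Scale the Voice B NEU pitch contour and phone durations to
        mimic HPO (ratios < 1) or HPR (ratios > 1) speech."""
        return f0_contour * f0_ratio, durations * dur_ratio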
8.4.2 Speech Quality of the Prosody Model Transposition

For the evaluation of the speech quality of the prosody model transposition, listeners were asked to compare pairs of sentences (X, Y) from the overall speech quality point of view: i) the sentence generated by the Voice B HPO or HPR speech synthesizer, whose prosody has been transposed using one of the four methods described in Table 8.2; ii) the corresponding sentence synthesized by the baseline system. These two sentences were randomly presented as either X or Y throughout the test. CMOS values range on a 7-point gradual scale from 3 (meaning that X is much better than Y) to -3 (meaning the opposite). A score of 0 is given if the quality of both versions is found to be similar. Each listener was presented 24 pairs, randomly chosen from the synthesis set, including samples from each DoA and each method. During the test, listeners were allowed to listen to each pair of sentences as many times as they wanted, in the order they preferred. However, they were not allowed to come back to previous sentences after validating their decision.

Twenty-nine naive listeners participated in this evaluation. The mean CMOS score, for each method and each DoA, is displayed in Figure 8.6. A positive or negative score means that the considered method gives respectively better or worse results than the baseline. A score of around 0 means that the considered method and the baseline are found to provide equivalent quality.

Figure 8.6: CMOS test for prosody transposition - Mean CMOS score for each method and each DoA, together with their 95% confidence intervals (CI).

Figure 8.6 shows that the LS_Phn method interestingly achieves the best performance, for both HPO and HPR speech. It performs on a par with the baseline, which was considered as a golden reference in terms of segmental quality. For HPR speech, the use of the Kullback-Leibler (KL) divergence (instead of phonetic mapping) for speaking style transposition (LS_KL method vs. LS_Phn method) leads to a dramatic drop in performance (the mean value of the distributions drops to -1.77). This implies that the knowledge of the phonetic environment is essential for estimating the transforms, and that the acoustic information is not sufficient on its own. The major problem perceived by listeners mainly comes from pitch contour issues: indeed, the pitch contour is most of the time increasing on schwa phones located at the end of words, which is totally unrealistic. Compared to LS adaptation (LS_Phn and LS_KL methods), CMLLR adaptation (CMLLR_Phn and CMLLR_KL methods) gives intermediate results, further from the baseline than the LS_Phn method but closer than the LS_KL method. Again, using the KL divergence (CMLLR_KL method vs. CMLLR_Phn method) leads to a decrease in performance, but a slighter one in this case: the mean CMOS value decreases from the CMLLR_Phn method to the CMLLR_KL method. For HPO speech, the overall performance of LS adaptation (LS_Phn and LS_KL methods, with a mean value of 0.02 for the former) is higher than that of CMLLR adaptation (CMLLR_Phn and CMLLR_KL methods). The same conclusion as for HPR speech can be drawn for HPO speech, i.e. the use of the KL divergence (instead of phonetic mapping) for speaking style transposition leads to a slight reduction in performance. Even with the slight confidence interval (CI) overlap observed in Figure 8.6, the LS_Phn method is the only technique achieving the best performance for both HPO and HPR speech. For comparison purposes, and as already mentioned in Section 8.1.1, the technique proposed in [Kanagawa et al. 2013] achieved good quality target speaker's speech synthesis in terms of naturalness, style reproducibility and speaker similarity, with MOS scores ranging between 3 (meaning "fair") and 4 (meaning "good").

Figure 8.7 displays the detailed preference scores, averaged over all the participants and utterances used in the test, for each method compared to the baseline and each DoA. For example: for HPR speech, the LS_Phn method was preferred in 26%, disliked in 23% and rated as equivalent in 51% of the cases, compared to the baseline system; for HPO speech, the LS_Phn method was preferred in 22%, disliked in 22% and rated as
equivalent in 56% of the cases, compared to the baseline system.

Figure 8.7: CMOS test for prosody transposition - Detailed preference scores (expressed in [%]), averaged over all the participants and utterances used in the test, for each method compared to the baseline, for HPR speech (left) and HPO speech (right).

It is again noticed that the LS_Phn method is the best technique, leading to preference scores equivalent to the baseline, which was considered to be a golden reference at this level.

8.4.3 Perception of the Degree of Articulation

The CMOS test performed in Section 8.4.2 provided useful information about the synthetic speech quality that can be obtained when applying the transposition transforms. However, it does not bring any information about the effective production of the desired DoA. This is why we complement the results of the CMOS evaluation with a Comparative Perception of the DoA (CPDA) test here. Listeners were given two pairs of sentences. The first pair was composed of: i) the NEU sentence synthesized by the Voice A full data model; ii) the HPO or HPR sentence (randomly shuffled) synthesized by the Voice A adapted HPO or HPR model. The second pair was composed of: i) the NEU sentence synthesized by the Voice B full data model; ii) the HPO or HPR sentence (same DoA as the second sentence of the first pair) synthesized by one of the four methods or the baseline investigated in this work (as explained in Section 8.4.1). Listeners were also given a continuous scale ranging from 0 to 1. This scale was extended further on both sides (thus ranging from -0.2 to 1.2) in order to limit border effects within the scale. The NEU sentence synthesized by the Voice A full data model was set to 0, while the Voice A HPO or HPR sentence was set to 1. Given the distance between the sentences composing the first pair, listeners were asked to perceptually estimate the distance between the two sentences of the second pair. For this, the NEU sentence synthesized by the Voice B full data model was placed at 0, and listeners had to estimate the position of the HPO or HPR sentence synthesized using one of the four methods or the baseline. This should reflect the extent to which the DoA is effectively reproduced on Voice B, a value of 1 being the target. The test consisted of 10 quadruplets. For each DoA and for each method to be tested, 30 sentences were randomly chosen from the synthesis set. The same listening protocol as in
Section 8.4.2 was implemented. Twenty-four naive listeners participated in this evaluation. Figure 8.8 shows the resulting mean scores. As a reminder, the closer these scores are to 1, the better the rendering of the generated DoA. Moreover, as in Section 5.5, statistical analyses [Howell 2012] were performed in order to assess the significance of the results. We first checked that the data were (almost) normally distributed using the Lilliefors test.

Figure 8.8: CPDA test for prosody transposition - Mean score of the perceived DoA using the four methods or the baseline (1 being the reference DoA, defined on Voice A), together with their 95% CI.

It can be observed from Figure 8.8 that all methods outperform the baseline regarding the reproduction of HPR speech. Indeed, a repeated measures ANalysis Of VAriance (ANOVA) was used to test for preference differences amongst the four methods (LS_Phn, LS_KL, CMLLR_Phn and CMLLR_KL) and the baseline system (F(4, 92) = 3.79, p = 0.007, partial η² = 0.14). Dunnett post-hoc pairwise comparisons of each of the four methods vs. the baseline indicate that the LS_Phn, LS_KL, CMLLR_Phn and CMLLR_KL methods achieve significantly higher scores than the baseline system (all p < 0.05; p = 0.003, p = 0.024 and p = 0.016 for LS_Phn, LS_KL and CMLLR_KL respectively). On the contrary, a slight advantage is noticed in favor of the baseline for the rendering of HPO speech, although no statistically significant differences were observed. For this latter speaking style, the LS_Phn, LS_KL and CMLLR_KL methods turn out to provide equivalent results, while the CMLLR_Phn method provides slightly worse results. As a complement, following a similar procedure as for HPR speech, a repeated measures ANOVA was used; the comparisons between our four methods and the baseline system were not statistically significant at α = 0.05.
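Such analyses are straightforward to reproduce with standard statistical packages. The sketch below is ours, not the thesis's analysis script; it assumes a pandas DataFrame with one score per listener and per system, statsmodels, and SciPy >= 1.11 for dunnett (which moreover treats the samples as independent rather than repeated, so it only approximates the post-hoc test used here):

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM
    from scipy.stats import dunnett

    def analyse_cpda(df):
        """df columns: 'listener', 'system', 'score'."""
        # Repeated measures ANOVA across the five systems.
        print(AnovaRM(df, depvar='score', subject='listener',
                      within=['system']).fit())
        # Comparison of each method against the baseline.
        control = df.loc[df.system == 'Baseline', 'score'].to_numpy()
        methods = [df.loc[df.system == m, 'score'].to_numpy()
                   for m in ('LS_Phn', 'LS_KL', 'CMLLR_Phn', 'CMLLR_KL')]
        print(dunnett(*methods, control=control))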
The width of the 95% CIs in this test is rather large, particularly for HPO speech. This could be explained by the intrinsic difficulty of this evaluation: it is indeed not easy to compare the two perceptual distances between the two sentence pairs, as the mean pitch of Voice A and Voice B is different. However, this test should be taken as an indication that, overall, the prosody modification is well reproduced after applying to Voice B the adaptation transforms learned on Voice A. The fact that the proposed methods are observed to outperform the baseline for the production of HPR speech is of interest for several applications aiming at increasing the intelligibility of synthetic voices (while keeping an equivalent naturalness). For example, HPR speech was shown in Section 7.3 to enhance the comprehension of synthetic speech in degraded environments (car noise and reverberant conditions).

8.5 Filter Transposition

After the modeling of prosody in HPR and HPO speech, the second step targets the transposition to Voice B of the filter model learned on Voice A (see Figure 8.4). As already mentioned in Section 8.3, only subjective evaluations can be performed here, since no reference is available for an objective assessment. After the description of the experimental framework (Section 8.5.1), we follow the same evaluation protocol as in Section 8.4, i.e. the assessment of: i) the quality of speech after filter model transposition, through a Comparative Mean Opinion Score (CMOS) test (Section 8.5.2); ii) the quantification of the (positive or negative) effects of the filter model transposition on the perception of the DoA, using a Comparative Perception of the DoA (CPDA) test (Section 8.5.3). These two evaluations are then complemented by a third one, to verify that the identity of Voice B is preserved after modifying its DoA via filter model transposition (Section 8.5.4). Finally, Section 8.5.5 summarizes the conclusions drawn from these three subjective tests. All the evaluations related to the filter model transposition are referred to as Evaluation #2 on the right-hand side of Figure 8.4.

8.5.1 Experimental Framework

As already mentioned, the baseline system of Section 8.4.1 (a straightforward ratio applied to pitch and phone durations so as to sound like HPO and HPR speech) cannot be generalized to filter coefficients. The baseline system is thus here chosen to be the NEU full data model of Voice B, whose prosody was modified using the LS_Phn technique (since this method provided the best results in Section 8.4). For each DoA and each type of filter coefficients (MGC, LSP, PARCOR and LAR), the four methods described in Table 8.2 are applied to adapt and transpose the filter model. The prosody of these models is also modified using the LS_Phn method. All possible combinations lead to a total of 16 models to assess (4 types of filter coefficients x 4 methods), which would require a prohibitively large subjective evaluation. Therefore, two preliminary pruning steps are used in order to select the four best models (filter coefficients and transposition methods) achieving the highest segmental quality.

In the first pruning step, we identified through an informal test the 8 models providing the lowest quality. We observed three sources of artefacts in the generated models: filter instability (u); occurrence of glitches (g); complete loss of the target speaker identity (i). The
presence of these artefacts amongst the 16 models is summarized in Table 8.3, independently of the DoA, as no discrepancies were observed between the HPO and HPR speech results. We observed that most of the rejected methods induce filter instability, partially (i.e. the PARCOR_LS_KL, LAR_LS_KL, LSP_CMLLR_Phn and PARCOR_CMLLR_Phn methods) or totally (i.e. the LAR_CMLLR_Phn and LSP_CMLLR_KL methods). The only rejected methods providing a rather stable filter (i.e. the PARCOR_CMLLR_KL and LAR_CMLLR_KL methods) introduce glitches and a loss of the target speaker identity. These 8 models are therefore discarded in the following.

Table 8.3: Selected methods after the first pruning step (✓ and ✓✓) and after the second one (✓✓). Observed artefacts of the rejected methods are also indicated (u: filter instability; g: occurrence of glitches; i: complete loss of the target speaker identity).

    Methods      MGC    LSP     PARCOR   LAR
    LS_Phn       ✓✓     ✓✓      ✓        ✓
    LS_KL        ✓✓     ✓✓      u-g-i    u-g-i
    CMLLR_Phn    ✓      u-g-i   u-g-i    u
    CMLLR_KL     ✓      u       g-i      g-i

The second pruning step aims at determining the 4 models achieving the highest quality amongst the 8 models selected after the first pruning step. For this, a Mean Opinion Score (MOS) test was performed by 10 naive listeners. They were asked to listen to the test sentence (synthesized using one of the 8 best models) and score the overall speech quality on a MOS scale, ranging from 1 (meaning "bad quality") to 5 (meaning "excellent quality"). As no discrepancies were observed between the HPO and HPR speech results, Figure 8.9 displays MOS scores averaged independently of the DoA. This figure clearly shows that half of the methods (✓✓ in Table 8.3) provide a better overall speech quality than the other half (✓ in Table 8.3).

Statistical analyses were here again performed in order to assess the significance of the results. We first checked that the data were (almost) normally distributed using the Lilliefors test. Mauchly's test of sphericity indicated that the assumption of sphericity had not been violated (χ²(5) = 7.80, p = 0.17). Then, a repeated measures ANOVA was used to test for preference differences amongst the four methods (LS_Phn, LS_KL, CMLLR_Phn and CMLLR_KL) with MGC coefficients (F(3, 27) = 3.69, p = 0.02, partial η² = 0.29). Tukey HSD post-hoc comparisons of the four methods indicate that the LS_Phn method received significantly higher preference ratings (p = 0.03) than the CMLLR_KL method. Comparisons between all other pairs of methods were not statistically significant at α = 0.05. After that, a repeated measures ANOVA was used to test for preference differences amongst the four filter coefficient representations (MGC, LSP, PARCOR and LAR) for the LS_Phn method (F(3, 27) = 4.98, p = 0.007, partial η² = 0.36). Tukey HSD post-hoc comparisons of the four filter coefficient representations indicate that the LSP coefficients
gave significantly higher preference ratings than the PARCOR and LAR coefficients (p = 0.02 for both). Comparisons between the MGC, PARCOR and LAR coefficients on the one hand, and between the PARCOR and LAR coefficients on the other hand, were not statistically significant at α = 0.05. Finally, a repeated measures ANOVA was used to test for preference differences amongst two methods (LS_Phn and LS_KL) and two filter coefficient representations (MGC and LSP) (F(coefficients) = 1.98, p = 0.19, partial η² = 0.18 and F(methods) = 3.8, p = 0.08, partial η² = 0.29). No post-hoc test was required, as the ANOVA did not show significant differences.

Figure 8.9: MOS test for the second pruning step - Overall speech quality (with its 95% CI) of the sentences synthesized by the HPO and HPR transposed models of Voice B.

Based on the above observations, it can be concluded that the MGC and LSP coefficients are more appropriate for our purpose, and that the PARCOR and LAR parameters can be discarded. It also turns out that the 4 best models all involve the use of a model adaptation technique (LS), which is therefore better suited than a feature adaptation approach (CMLLR). These 4 methods will be referred to as MGC_LS_Phn, MGC_LS_KL, LSP_LS_Phn and LSP_LS_KL, and will be subject to deeper analysis in the next subsections.

8.5.2 Speech Quality of the Filter Model Transposition

For this CMOS evaluation of the filter model transposition, listeners were asked to compare 24 pairs of sentences from the overall speech quality point of view. These sentences were randomly chosen from the synthesis set and were synthesized by the Voice B HPO or HPR speech synthesizers using the four methods selected in Section 8.5.1, taken two by two. This difference aside, the same listening protocol as in Section 8.4.2 was applied.
Twenty-six naive listeners participated in this evaluation. The mean CMOS score is displayed in Figure 8.10 for each method and for each DoA. A positive score means that the considered method gives better results than the other ones; a negative score means the opposite.

Figure 8.10: CMOS test for filter transposition - Mean CMOS score for each method and each DoA, together with their 95% CI.

Figure 8.10 shows that, for both HPO and HPR speech, the best technique turns out to be LSP_LS_Phn, while the worst is MGC_LS_KL. As in Section 8.4.2, the use of the KL divergence leads to a degradation of the results for both HPO and HPR speech (CMOS differences of -0.65 and -0.37 respectively for the MGC coefficients), a degradation which is stronger for the LSP coefficients (and particularly for HPO speech). From a speech quality perspective, LSPs are observed to outperform MGCs in all cases. A similar conclusion holds for phonetic clustering over the use of the KL distance. As a consequence, the LSP_LS_Phn method gives the highest segmental quality. For comparison purposes, and as already mentioned in Section 8.1.1, the technique proposed in [Kanagawa et al. 2013] achieved good quality target speaker's speech synthesis in terms of naturalness, style reproducibility and speaker similarity, with MOS scores ranging between 3 (meaning "fair") and 4 (meaning "good").

8.5.3 Perception of the Degree of Articulation

As already stated in Section 8.4.3, the CPDA test complements the CMOS evaluation regarding the effective production of the desired DoA. The same experimental protocol as in Section 8.4.3 was applied. Listeners were given two pairs of sentences, randomly chosen from the synthesis set. The first pair was composed of: i) the NEU sentence synthesized by the Voice A full data model; ii) the HPO or HPR sentence
(randomly shuffled) synthesized by the Voice A adapted HPO or HPR model. The second pair was composed of: i) the NEU sentence synthesized by the Voice B full data model; ii) the HPO or HPR sentence (same DoA as the second sentence of the first pair) synthesized by one of the four methods or the baseline described in Section 8.5.1. Twenty-one naive listeners participated in this evaluation. Figure 8.11 shows the mean score, corresponding to the perceived DoA using the four methods or the baseline, 1 being the target value corresponding to the reference DoA (defined on Voice A).

Figure 8.11: CPDA test for filter transposition - Mean score of the perceived DoA using the four methods or the baseline (1 being the reference DoA, defined on Voice A), together with their 95% CI.

After checking that the data were (almost) normally distributed using the Lilliefors test, a repeated measures ANOVA was used to test for preference differences amongst the four methods (MGC_LS_Phn, MGC_LS_KL, LSP_LS_Phn and LSP_LS_KL) and the baseline system. The differences were unfortunately not statistically significant at α = 0.05. Although no statistical significance is observed across the methods for HPO speech in Figure 8.11, a slight advantage is noticed in favor of the LSP-based methods compared to the three other techniques, including the baseline. This advantage is even slightly stronger for LSP_LS_Phn. Nonetheless, an opposite tendency is observed for HPR speech: the methods based on MGC seem to perform better than the methods relying on LSP, but still not significantly. HPR speech therefore seems to be less properly rendered than HPO speech, whose results are much closer to the target value.

8.5.4 Identity Preservation Assessment

In the previous sections, we modified the DoA of Voice B, and assessed the quality of the synthesized speech as well as the proper modification of the DoA. These tests do not
provide any information about the preservation or loss of the speaker identity. Indeed, since our methods modify the filter transmittance (and consequently, to a great extent, the vocal tract response), we have to ensure that they do not convey any residual information about the speaker identity of Voice A. The last evaluation is thus designed to assess the preservation of the identity of the Voice B speaker after modification of its DoA. Listeners were given three sentences: i) the NEU sentence synthesized by the Voice A full data model; ii) the NEU sentence synthesized by the Voice B full data model; iii) the HPO or HPR sentence (randomly shuffled) generated by the Voice B transposed HPO or HPR model, using one of the four methods or the baseline described in Section 8.5.1. Listeners were also given a continuous scale ranging from 0 to 1. The NEU sentence synthesized by the Voice A full data model was set to 0, while the NEU sentence synthesized by the Voice B full data model was set to 1. Listeners were then asked to guess who was speaking in the test sentence (the sentence synthesized by the Voice B transposed HPO or HPR model). The continuous scale represents the decision confidence: the closer the score to the scale extremities, the more confident the decision. A score of 0.5 implies that listeners were not able to determine who was speaking. The test consisted of 30 triplets. For each DoA and for each method to be tested, 50 sentences were randomly chosen from the synthesis set. Twenty-one naive listeners participated in this evaluation. The same listening protocol as in Section 8.4.2 was implemented. Figure 8.12 shows the mean score, corresponding to the Voice B speaker identity, for each method and each DoA. A score of 1 means that the considered method perfectly preserves the identity of the Voice B speaker, while a score of 0 means that the identity of the Voice B speaker is totally lost.

Figure 8.12: ID test for filter transposition - Mean score for each method and each DoA (Voice A = 0, Voice B = 1), together with their 95% CI.
In Figure 8.12, it can be seen that the baseline system provides better identity preservation than all other methods. We first checked that the data were (at least approximately) normally distributed using the Lilliefors test. A repeated measures ANOVA was used to test for preference differences amongst the four methods (MGC_LS_Phn, MGC_LS_KL, LSP_LS_Phn and LSP_LS_KL) and the baseline system, for HPO speech (F(4, 80) = 2.73, p = 0.04, partial η² = 0.12) and for HPR speech (F(4, 80) = 7.41, p = 0, partial η² = 0.27). Dunnett post-hoc pairwise comparisons of each of the four methods vs. the baseline indicate that, for HPO speech, the MGC_LS_KL and LSP_LS_Phn methods achieve significantly lower scores (with p = 0.023 for LSP_LS_Phn) compared to the baseline system, and that, for HPR speech, the MGC_LS_Phn and MGC_LS_KL methods achieve significantly lower scores (with p = 0 for both) w.r.t. the baseline system. This result was obviously expected since the baseline makes use of the original Voice B filter (Voice B NEU full data model), on which only prosody was modified. Despite the large 95% CIs we obtained, this figure shows better identity preservation when the filter is represented using LSP instead of MGC. This is particularly true for HPR speech. Similarly to the CPDA test, very good overall results are noticed for HPO speech while, although still positive, they are more moderate for HPR speech.

Conclusions on Filter Transposition

Our experiments showed that LSP and MGC are the most suited filter representations, the PARCOR and LAR coefficients having been discarded after the pruning tests. Compared to the MGC coefficients, the use of LSPs:
- significantly improves the overall generated speech quality for both HPO and HPR speech;
- slightly degrades the perception of the desired DoA for HPR speech, while providing a modest advantage in its perception for HPO speech;
- better preserves the target speaker identity for both HPO and HPR speech, particularly in the latter case.

Moreover, the use of phonetic mapping instead of the KL divergence for the speaking style transposition technique:
- significantly increases the overall generated speech quality for both HPO and HPR speech, a gain which is particularly clear when using LSP coefficients;
- provides similarly high-quality results in the perception of the desired DoA, independently of the DoA;
- preserves the target speaker identity in the same manner for HPO and HPR speech.

We therefore select the LSP coefficients for the filter representation, combined with the phonetic mapping method for the speaking style transposition (i.e. the LSP_LS_Phn method), as the best method for transposing to Voice B the filter model learned on Voice A. As a proof
of concept, Figure 8.13 displays the vocalic triangles for Voices A and B (with dispersion ellipses), computed on the synthesized HPR, NEU and HPO sentences. We can indeed see a reduction of the vocalic triangle area as speech becomes less articulated.

Figure 8.13: Vocalic triangles (F1-F2 plane) estimated on the synthesized HPR, NEU and HPO speech for Voices A, B, M and F, together with dispersion ellipses.

8.6 Generalization to Other Voices

At this stage, methods for the automatic modification of the DoA have been thoroughly studied in Sections 8.4 and 8.5. This study was conducted by learning a transformation on Voice A and applying it to Voice B. It turned out that the best technique for both prosody and filter transformation makes use of a model-based (LS) adaptation with a phonetic mapping, and that LSP coefficients are the most suited. The goal of this section is to confirm the efficiency of the resulting techniques on two other voices: a male (Voice M) and a female (Voice F) French speaker. As in Sections 8.4 and 8.5, only subjective evaluations can be performed here since no reference is available for an objective assessment. Section 8.6.1 details the experimental framework. The same evaluation protocol as in Section 8.5 is followed, i.e. the assessment of: i) the quality of speech after prosody and filter models transposition, through a Comparative Mean Opinion Score (CMOS) test (Section 8.6.2); ii) the quantification of the (positive or negative) effects of prosody and filter models transposition on the perception of the DoA, using a Comparative Perception of the DoA (CPDA) test (Section 8.6.3); iii) the verification of the speaker identity preservation after modifying its DoA (Section 8.6.4).
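As a rough quantitative companion to Figure 8.13, the area of the vocalic triangle spanned by the mean (F1, F2) targets of /a/, /i/ and /u/ can be computed with the shoelace formula. The snippet below is a minimal sketch with hypothetical formant values, not measurements from the thesis corpus:

```python
# Sketch of the vocalic-triangle area used as a proxy for the DoA: the mean
# (F1, F2) positions of /a/, /i/ and /u/ define a triangle whose area shrinks
# as speech becomes less articulated. Formant values below are illustrative.
def triangle_area(a, i, u):
    """Shoelace formula for the triangle spanned by three (F1, F2) points."""
    (x1, y1), (x2, y2), (x3, y3) = a, i, u
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

# Hypothetical male vowel targets in Hz.
hpr = {"a": (750, 1300), "i": (280, 2300), "u": (300, 750)}   # hyperarticulated
hpo = {"a": (650, 1350), "i": (350, 2000), "u": (350, 900)}   # hypoarticulated

for name, v in [("HPR", hpr), ("HPO", hpo)]:
    print(name, triangle_area(v["a"], v["i"], v["u"]), "Hz^2")
```

With these toy values the HPR area is roughly twice the HPO area, mirroring the shrinkage visible in the figure.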
All the evaluations related to the generalization to other voices are referred to as Evaluation #3 on the right-hand side of the evaluation overview figure.

8.6.1 Experimental Framework

Two new French databases were recorded (four new French databases were in fact recorded in the framework of the Re: Walden artistic project; the project is based on Henry David Thoreau's book Walden, or Life in the Woods, and was directed by Jean-François Peyret; the actors are Clara Chabalier, Jos Houben, Victor Lenoble and Lyn Thibault), under the same conditions as for the NEU Voice A database (see Chapter 3). They consist of utterances produced respectively by a male (Voice M) and a female (Voice F) speaker, both native French speakers. Voices M and F were trained using 1220 NEU sentences (the same set of sentences as for Voice A) sampled at 16 kHz, following the same procedure as for the NEU full data model of Voice A (see Figure 5.1). The only difference concerns the filter parameterization, which was chosen to be LSP coefficients, as they were shown to achieve the best performance in Section 8.5. For the original Voices M and F NEU recordings, Figure 8.1 displays the vocalic triangle, and Table 8.1 provides the speech rate and the mean and standard deviation of the F0 values. HTS was forced to use the same decision trees as those of Voice A in order to have a one-to-one mapping between the probability density functions of Voices A, M and F (see Section 8.3 for a discussion of the potential impact on the generated speech quality). Informal listening tests did not reveal any significant degradation in speech quality due to this process.

The prosody and filter of Voices M and F were transformed using the LS_Phn method, in order to produce HPO and HPR speech. Unfortunately, our first attempts at the filter model transposition on the female Voice F did not lead to convincing results for HPR speech. This could be explained by the fact that the filter transformations learned on Voice A were too strong, i.e. the Voice F HPO and HPR vocalic spaces were respectively shrunk and expanded in a way that would lead to impossible vocal tract configurations for the considered speaker. This resulted in filter instability when the transformations were applied to a female voice. We therefore chose to linearly interpolate the filter model transforms with an empirical ratio of 0.6, which was informally verified to provide a proper rendering of the DoA with good quality. As no specific problems were found in the prosody model transposition, we applied the complete transformation as in Sections 8.4 and 8.5. As an illustration, Figure 8.13 displays the vocalic triangles for Voices M and F (with dispersion ellipses), computed on the synthesized HPR, NEU and HPO sentences. A clear reduction of the vocalic triangle area is noticed as speech becomes less articulated.

8.6.2 Speech Quality of the Prosody and Filter Models Transposition

For this CMOS evaluation of the speech quality of the prosody and filter models transposition, the same experimental protocol as in Section 8.5 was applied. Listeners were asked to compare 20 pairs of sentences (X, Y) from the overall speech quality point of view. These sentences were randomly chosen amongst the synthesis set. The sentences were generated either by the Voice B NEU, HPO or HPR speech synthesizer using the
LSP_LS_Phn method, or by the Voice M or F NEU, HPO or HPR speech synthesizer (constraining the same DoA in X and Y for a given pair of test sentences). Nineteen naive listeners participated in this evaluation. The mean CMOS score is displayed in Figure 8.14 for each DoA. A positive score means that the synthesis for Voices M and F gives better results than what was achieved for Voice B; a negative score implies the opposite.

Figure 8.14: CMOS test for the generalization of the prosody and filter transposition - Mean CMOS scores for each DoA using the LSP_LS_Phn method, together with their 95% CI.

A slight degradation is observed in Figure 8.14 for the NEU speech of both Voices M and F compared to Voice B. This may be explained by the fact that the recording conditions during the acquisition of Voice B and of Voices M and F differ: Voice B was recorded in an anechoic room while Voices M and F were acquired in a soundproof booth; besides, the microphones used differ. Moreover, Voice B was trained on a corpus which is twice the size of the Voices M and F corpora, leading to a higher generated speech quality. Another explanation could simply be the higher pleasantness intrinsic to Voice B compared to Voices M and F. Based on this observation, we clearly see that the HPO and HPR speech of Voice M are well rendered, as their CMOS scores are even better than for the NEU voice. A similar conclusion is drawn for the HPO speech of Voice F (mean score of -0.26), while a dramatic performance decrease is noted for the HPR speech of Voice F (mean score of -1.21). As a reminder, a score of -1 on a CMOS scale stands for "Slightly Worse".

8.6.3 Perception of the Degree of Articulation

The same experimental protocol as in Section 8.5 was applied for the evaluation of the DoA perception. Listeners were given two pairs of sentences. The first pair was composed
of: i) the NEU sentence synthesized by the Voice A full data model; ii) the HPO or HPR sentence (randomly shuffled) synthesized by the Voice A adapted HPO or HPR model. The second pair was composed of: i) the NEU sentence synthesized by the Voice M or F full data model; ii) the HPO or HPR sentence (same DoA as the second sentence of the first pair) synthesized by Voice M or F, whose DoA was transformed using the LS_Phn method. Eighteen naive listeners participated in this evaluation. Figure 8.15 shows the mean score of the perceived DoA using the LS_Phn method, 1 being the target value corresponding to the reference DoA (defined on Voice A).

Figure 8.15: CPDA test for the generalization of the prosody and filter transposition - Mean score of the perceived DoA using the LSP_LS_Phn method (1 being the reference DoA, defined on Voice A), together with their 95% CI.

Figure 8.15 confirms the efficiency of the LS_Phn method in the rendering of the desired DoA for both Voices M and F. As explained in Section 8.6.1, the filter model transforms were linearly interpolated before their transposition on Voice F. Here we obtained better results than expected: although the interpolation ratio was equal to 0.6, the mean CPDA scores for Voice F clearly exceed this value. An explanation could be that the prosody model transforms, which were transposed unmodified, counterbalance the linear interpolation applied to the filter model transforms. As in Section 7.2, this figure proves again that the transposition of the filter impacts the DoA rendering. Compared to Figure 8.11, we obtained a better rendering of the desired DoA for both Voices M and F, which is particularly interesting. This is especially true for Voice M, whose scores almost reach the target value of 1. Note that these two figures are of course not directly comparable, as the participants behind the results differ. However, this should be taken as an indication that the proposed LSP_LS_Phn method seems to be effective for rendering the desired DoA and can be applied to various new NEU voices, so as to automatically modify their DoA.
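All the listening-test figures in this chapter report mean scores with 95% confidence intervals. As a minimal sketch, assuming a t-based interval over the per-listener scores (the exact CI construction is not restated here), such values can be computed as follows:

```python
# Mean score and 95% confidence interval for one method/DoA condition.
# The t-based interval is an assumption; listener scores are simulated.
import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    """Return (mean, half-width) of a t-based confidence interval."""
    scores = np.asarray(scores, dtype=float)
    sem = stats.sem(scores)                      # standard error of the mean
    half = sem * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return scores.mean(), half

# Example: 18 listener scores for one condition of the CPDA test.
rng = np.random.default_rng(0)
m, h = mean_ci(rng.uniform(0.5, 1.1, size=18))
print(f"mean = {m:.2f} +/- {h:.2f}")
```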
8.6.4 Identity Preservation Assessment

The ID evaluation complements the CMOS and CPDA tests by assessing the preservation of the identity of Voices M and F after modification of their DoA. The same listening and experimental protocols as for the ID test of Section 8.5 were applied. Listeners were given three sentences: i) the NEU sentence synthesized by the Voice A full data model; ii) the NEU sentence synthesized by the Voice M or F full data model; iii) the HPO or HPR sentence (randomly shuffled) synthesized by the Voice M or F transposed HPO or HPR model, using the LSP_LS_Phn method. The test consisted of 20 triplets. Seventeen naive listeners participated in this evaluation. Figure 8.16 shows the mean score, corresponding to the Voice M and F identity, for each DoA. A score of 1 means that the identity of the Voice M or F speaker is well preserved. A score of 0 means that the identity of the Voice M or F speaker is completely lost.

Figure 8.16: ID test for the generalization of the prosody and filter transposition - Mean score for each DoA using the LSP_LS_Phn method (Voice A = 0, Voice M or F = 1), together with their 95% CI.

Figure 8.16 clearly shows that the speaker identity of Voice F is perfectly preserved. This result was expected, as Voice F is a female voice, which is clearly distinguished from the male Voice A, even after modification of its DoA. Nonetheless, results are not so clear-cut for Voice M: the mean values are 0.56 for HPR speech and 0.64 for HPO speech, only slightly higher than 0.5, i.e. the limit under which the speaker identity starts being lost. These results can be explained by the fact that Voices A and M are acoustically very similar to each other (see Figure 8.1). Indeed, we noticed that, even for NEU sentences synthesized for these two speakers, the risk of confusion was rather high. It is therefore not surprising that these ambiguities remain (and are possibly even more pronounced) after the DoA modification.
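For reference, the statistical analysis used throughout these evaluations (Lilliefors normality check, repeated measures ANOVA, and Dunnett post-hoc comparisons against the baseline) can be reproduced with standard tooling. The sketch below is an illustrative Python version under stated assumptions: it expects a long-format table of per-listener scores, requires statsmodels and SciPy >= 1.11 (for scipy.stats.dunnett), and the scores are simulated, not thesis data.

```python
import numpy as np
import pandas as pd
from scipy.stats import dunnett
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.diagnostic import lilliefors

methods = ["Baseline", "MGC_LS_Phn", "MGC_LS_KL", "LSP_LS_Phn", "LSP_LS_KL"]
rng = np.random.default_rng(0)

# Long-format table: one row per (listener, method) with that listener's
# mean score for the condition. Values here are simulated for illustration.
rows = [{"listener": l, "method": m, "score": rng.normal(0.7, 0.1)}
        for l in range(21) for m in methods]
data = pd.DataFrame(rows)

# 1) Lilliefors test: small p-values reject the normality assumption.
_, p_norm = lilliefors(data["score"], dist="norm")
print(f"Lilliefors p = {p_norm:.3f}")

# 2) Repeated measures ANOVA with 'method' as the within-subject factor.
anova = AnovaRM(data, depvar="score", subject="listener",
                within=["method"]).fit()
print(anova)

# 3) Dunnett's many-to-one comparisons of each method vs. the baseline.
groups = [data.loc[data["method"] == m, "score"].to_numpy()
          for m in methods[1:]]
control = data.loc[data["method"] == "Baseline", "score"].to_numpy()
res = dunnett(*groups, control=control)
print(dict(zip(methods[1:], np.round(res.pvalue, 3))))
```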
8.7 Conclusions

This chapter focused on the automatic modification of the DoA of an existing standard NEU voice, without requiring any additional HPO and HPR speech data, in the framework of HMM-based speech synthesis. For this, the chapter was divided into four main parts.

In the first one, we proposed four statistical methods for the creation of the DoA model on a voice for which NEU, HPO and HPR speech data are available (Voice A) and for the transposition of this model onto another voice with no HPO or HPR speech data (Voice B). These statistical methods differ in the speaking style adaptation technique (LS vs. CMLLR) and in the speaking style transposition approach (phonetic vs. acoustic correspondence) they use. These methods are stream-independent, in the sense that they can be applied to both the prosody (pitch and phone duration) and the filter model independently.

The second and third parts of the chapter aimed at finding, on Voice B, the best technique amongst the speaking style adaptation and transposition approaches. The second part focused on transposing to Voice B the prosody model learned on Voice A. The baseline system was chosen to be the NEU full data model of Voice B, where a straightforward and carefully tuned phone-independent constant ratio is applied to decrease or increase pitch and phone durations so as to sound respectively like HPO or HPR speech. We found out that: i) the method combining LS adaptation and phonetic mapping (the LS_Phn method) achieves the best segmental quality for both HPO and HPR speech, providing results similar to the golden reference baseline; ii) the use of the Kullback-Leibler (KL) divergence (instead of phonetic mapping) for speaking style transposition leads to a drop in performance, independently of the speaking style adaptation technique and the DoA; iii) the overall perception of the DoA is well reproduced after transposition. The LS_Phn method was therefore chosen for the prosody model transposition.

The third part focused on transposing to Voice B the filter model learned on Voice A. We concluded that: i) the LS_Phn method combined with LSP filter coefficients outperformed all other methods, providing the highest speech synthesis quality for both HPO and HPR speech; ii) as for the prosody model transposition, the use of the KL divergence degrades the performance; iii) a slight advantage is observed in favor of LSP coefficients for the reproduction of HPO speech compared to the other methods and to the baseline, while a slight advantage is noticed with MGC coefficients for the reproduction of HPR speech; iv) the Voice B speaker identity is better preserved with the LSP filter coefficients for both HPO and HPR speech, and particularly in the latter case. The LS_Phn method using LSPs as filter representation was therefore chosen for the filter model transposition.

The fourth and last part was devoted to the generalization of the prosody and filter models transposition learned on Voice A to other voices: a male (Voice M) and a female (Voice F) speaker. Following the conclusions of the second and third parts of the chapter, we applied the LS_Phn method for transposing the prosody and filter models (with LSP coefficients) of the NEU synthesizer of both Voices M and F to generate HPO or HPR models. We observed that the filter transformations learned on Voice A were too strong and resulted in filter instability when applied to the female voice.
This meant, on the one hand, that the Voice F HPO and HPR vocalic spaces were respectively shrunk and
expanded in a way that would lead to impossible vocal tract configurations for the considered speaker. A linear filter model interpolation was thus computed for Voice F in order to apply 60% of the complete filter transposition (while using 100% of the prosody model transformation). This also meant, on the other hand, that the filter transformations could be gender dependent; however, hiring a new professional female speaker and repeating the same recordings, analysis and synthesis as for Voice A is tedious and does not guarantee better performance than our empirical ratio. The results showed that: i) high-quality HPO or HPR speech synthesis can be obtained for male (Voice M) and female (Voice F) voices, although a slight degradation is observed for the Voice F HPR speech; ii) an excellent rendering of the desired DoA can be reached; iii) the speaker identity results are satisfying.

Starting from an existing standard NEU voice with no HPO or HPR recordings available, the automatic modification of its DoA is therefore possible by combining the LSP coefficients as filter parameterization with the LS_Phn prosody and filter models transposition method. Audio examples for each voice (A, B, M and F) and for each DoA are available online at picart/.
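As a minimal sketch of the linear filter-model interpolation just described, assuming each learned transform acts as an affine map (A, b) on the LSP mean vectors (the actual transform structure in the thesis is per-stream and per-PDF; this simplification is for illustration only), pulling the transform toward the identity with ratio 0.6 yields the attenuated filter transposition used for Voice F:

```python
# Interpolating a learned affine transform toward the identity:
# alpha = 1 keeps the full transform (Voice M), alpha = 0.6 attenuates it
# (Voice F), alpha = 0 would leave the voice untouched.
import numpy as np

def interpolate_transform(A, b, alpha=0.6):
    """Blend an affine transform (A, b) with the identity transform."""
    n = A.shape[0]
    A_i = (1.0 - alpha) * np.eye(n) + alpha * A
    b_i = alpha * b
    return A_i, b_i

def apply_transform(mu, A, b):
    return A @ mu + b  # transformed LSP mean vector

# Toy 4-dimensional example (values are illustrative, not learned transforms).
rng = np.random.default_rng(2)
A = np.eye(4) + 0.2 * rng.standard_normal((4, 4))
b = 0.05 * rng.standard_normal(4)
A06, b06 = interpolate_transform(A, b, alpha=0.6)
print(apply_transform(rng.uniform(0.1, 3.0, 4), A06, b06))
```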
Summary of Chapter 8

Automatic modification of the DoA of an existing standard NEU voice, with no HPO and HPR recordings available, in the framework of HMM-based speech synthesis.

Creation of the articulation model on an existing speaker (Voice A)

Training
- one HMM-based speech synthesizer for each DoA and each filter parameterization to be evaluated: MGC, LSP, PARCOR and LAR;
- models relying on the MGC as filter representation were already built for each DoA in Chapter 5; the same procedure as in that chapter was applied for the three remaining filter representations.

Adaptation
- the NEU HMM-based speech synthesizer was adapted following the same procedure as in Chapter 6, for each filter parameterization type (MGC, LSP, PARCOR and LAR);
- using the entire training HPO and HPR databases (the effect of the amount of adaptation sentences on the synthesized speech quality was already studied);
- but this time, speaking style adaptation was performed in two alternative ways: LS and CMLLR adaptation transforms;
- articulation model = the resulting adaptation transforms combined with the mapping between each PDF and each transformation matrix.

Transposition of the articulation model learned on Voice A to a new speaker (Voice B)

Training
- data: 90% (2400 sentences) of the NEU speech database, resampled at 16 kHz;
- filter: same procedure as for the Voice A NEU full data model, for each filter parameterization type (MGC, LSP, PARCOR and LAR).

Automatic modification of its DoA
- only the mapping information between each PDF of Voice A and each PDF of Voice B is missing, since the mapping between each PDF and each transformation matrix on Voice A is included in the articulation model;
- two speaking style transposition techniques are investigated: phonetic mapping (using decision trees only) and acoustic mapping (KL divergence between two distributions);
- these mappings are stream-independent, so they can be applied to the prosody and filter models independently.

Prosody transposition
- the best segmental quality for both HPO and HPR speech is achieved by LS adaptation + phonetic mapping (the LS_Phn method);
- it nonetheless obtains results similar to the golden reference baseline, i.e. the NEU full data model of Voice B, where a straightforward and carefully tuned phone-independent constant ratio is applied to decrease or increase pitch and phone durations so as to sound respectively like HPO or HPR speech;
- the use of the KL divergence (instead of phonetic mapping) for speaking style transposition leads to a drop in performance, independently of the speaking style adaptation technique and the DoA;
- the LS_Phn method was therefore chosen for the prosody model transposition, as the DoA is well reproduced after transposition.

Filter transposition
- the highest speech synthesis quality for both HPO and HPR speech is achieved by the LS_Phn method + LSP filter coefficients;
- as for the prosody model transposition, the use of the KL divergence degrades the performance;
- a slight advantage is observed in favor of LSP and MGC coefficients for the reproduction of HPO and HPR speech, respectively;
- the Voice B speaker identity is better preserved with the LSP filter coefficients for both HPO and HPR speech, and particularly in the latter case;
- the LS_Phn method using LSPs as filter representation was therefore chosen for the filter model transposition.

Generalization to Other Voices (Voices M and F)

Training
- data: 90% (1220 sentences) of the NEU speech database, for both Voice M (male) and Voice F (female), resampled at 16 kHz;
- filter: LSP coefficients, as they proved to be the most effective for the filter model transposition on Voice B;
- same procedure as for the Voice A NEU full data model (Chapter 5).

Automatic modification of their DoA
- prosody and filter models modified by LS_Phn, as it was shown to provide the best results on Voice B;
- problem: the filter transformations learned on Voice A seem too strong and result in filter instability when applied to a female voice, especially for HPR speech;
- solution: linear interpolation of the filter model transforms for Voice F (ratio of 0.6, instead of 1 for Voice M), informally verified to provide a proper rendering of the DoA with good quality.

Evaluations
- high-quality HPO or HPR speech synthesis can be obtained for male (Voice M) and female (Voice F) voices, although a slight degradation is observed for the Voice F HPR speech;
- an excellent rendering of the desired DoA can be reached;
- the speaker identity results are satisfying.

Audio examples are available online at picart.
Chapter 9

General Conclusion and Future Works

Contents
9.1 Conclusions
  9.1.1 Creation of a Database with various Degrees of Articulation
  9.1.2 Analysis of Hypo and Hyperarticulated Speech
  9.1.3 Continuous Control of the Degree of Articulation
  9.1.4 Subjective Assessment of Hypo and Hyperarticulated Speech
  9.1.5 Varying the Degree of Articulation of Any Voice within HMM-based Speech Synthesis
9.2 Thesis Contributions
9.3 Perspectives
  9.3.1 In Direct Continuity
  9.3.2 Average-Voice-based Speech Synthesis integrating the Degree of Articulation
  9.3.3 Generalization to other types of Data and Languages

9.1 Conclusions

The present PhD thesis focused on the analysis and synthesis of hypo (HPO) and hyperarticulated (HPR) speech, compared to neutral (NEU) speech. Integrating a continuously variable Degree of Articulation (DoA) within HMM-based speech synthesis is of interest for several applications: expressive voice conversion in embedded systems or for video games, reading speed control for visually impaired people, improving intelligibility in adverse environments (e.g. perceiving the GPS voice inside a moving car, or understanding train or flight information in stations or halls), adapting the difficulty level to the student's progress when learning foreign languages (i.e. moving from HPR to HPO speech), etc. It is also necessary in order to more accurately mimic humans, who constantly adapt their speaking style to the communication context. Indeed, when talkers speak, they also listen to each other [Cooke et al. 2012]. According to Lindblom's H and H theory [Lindblom 1983], speakers are expected to vary their output along a continuum of HPO and HPR speech. Compared to the NEU
case, HPR speech tends to maximize the clarity of the speech signal by increasing the articulation efforts needed to produce it, while HPO speech is produced with minimal articulation efforts. The DoA therefore provides information on the relationship between the speaker and the listeners, as well as on the speaker's introversion or extroversion in real-life situations [Beller 2009]. This status can be induced by contextual factors (like the listener's emotional state) or simply by the speaker's own expressivity. Our research work was divided into six main parts.

9.1.1 Creation of a Database with various Degrees of Articulation

In the first one, we recorded a new French database consisting of three distinct and parallel sets. For each set, the speaker was asked to pronounce the same 1359 phonetically balanced sentences (but with a varying DoA, i.e. NEU, HPO and HPR speech), as emotionlessly as possible. In order to obtain the most reliable scientific results, two constraints were imposed before the database construction: i) its particular structure had to allow a thorough analysis of the effects caused and induced by the DoA; ii) its recordings had to be of high quality and free of noise or perturbation, in order to generate high-quality HMM-based speech synthesis with a varying DoA. Moreover, a standard recording protocol was created in order to obtain repeatable conditions if required in the future. The speaker was provided with a headset and was listening to either a high level of reverberation (for HPR speech) or a high voice amplification (for HPO speech). We implemented such a recording protocol because defining HPO and HPR speech exactly and precisely is a very difficult task.

9.1.2 Analysis of Hypo and Hyperarticulated Speech

In the second part, we conducted a study of the speech modifications occurring when the speaker varies his DoA. At the acoustic level, it was shown how both the vocal tract and glottal contributions are affected. More precisely, an increase of articulation is significantly reflected by an augmentation of the vocalic space in the F1-F2 plane, by higher F0 values, by a stronger harmonicity in speech and by a glottal flow containing more energy in the high frequencies. At the phonetic level, the main variations concern glottal stops, pauses and the schwa /@/. Finally, although the speaking rate significantly increases when the DoA decreases, it turns out that the proportion between speech and pausing periods remains almost constant.

9.1.3 Continuous Control of the Degree of Articulation

The third and fourth parts of the work aimed to develop a HMM-based speech synthesis system incorporating a continuous control of the DoA. This goal was subdivided into three tasks: i) building a HMM-based synthesizer for each DoA using the full specific datasets; ii) for HPO and HPR speech, being able to create a HMM-based synthesizer by adaptation of the NEU synthesizer using a limited amount of data; iii) being able to continuously control the DoA by interpolating and extrapolating the existing models. Both objective and
subjective tests were conducted to validate each of these three tasks. Our conclusions showed that: i) HPR speech is synthesized with a better quality; ii) about 7 minutes of HPO or 13 minutes of HPR speech are required to correctly adapt the cepstral features, while only half of that amount suffices for pitch and phone duration adaptation; iii) the continuous modification of the articulatory efforts is correctly perceived by listeners, while keeping an overall quality comparable to what is produced by the NEU synthesizer.

9.1.4 Subjective Assessment of Hypo and Hyperarticulated Speech

In the fifth part, we performed comprehensive perceptual evaluations of the resulting flexible speech synthesizer. On the one hand, the effects of cepstrum, prosody and phonetic transcription adaptation, as well as of the complete adaptation process, leading to high-quality HMM-based speech synthesis with various DoA (NEU, HPO and HPR), were analyzed. It turns out that: i) effective prosody adaptation cannot be achieved by a simple ratio operation; ii) adapting prosody alone, without adapting the cepstrum, highly degrades the rendering of the DoA; iii) the impact of cepstrum adaptation is more important than that of phonetic transcription adaptation; iv) nonetheless, it is also important to have a Natural Language Processor able to automatically create realistic HPO and HPR transcriptions; v) high-quality HPO and HPR speech synthesis requires the use of an efficient statistical adaptation technique such as Constrained Maximum Likelihood Linear Regression (CMLLR). On the other hand, the investigation of the intelligibility of the resulting flexible speech synthesizer when performing in adverse conditions, as well as of the generated speech quality, revealed that: i) through a Semantically Unpredictable Sentences (SUS) test, increasing the articulation efforts significantly improves the intelligibility of the synthesizer in adverse environments (both noisy and reverberant conditions); ii) a DoA of 0.5 (instead of 1) is sufficient to improve the recognition of the generated speech in car noise, while a DoA of 1 is necessary in reverberant environments; iii) the traditional gap between natural and synthesized speech regarding naturalness and segmental quality is still present; iv) several perceptual features like comprehension, non-monotony and pronunciation are relatively well preserved after statistical and parametric modeling.

9.1.5 Varying the Degree of Articulation of Any Voice within HMM-based Speech Synthesis

The sixth and last part focused on the automatic modification of the DoA of an existing standard NEU voice in the framework of HMM-based speech synthesis. In order to achieve the ultimate goal of our research, four statistical methods were proposed to modify the DoA of a target voice (Voice B) with no HPO or HPR speech data available. These methods were learnt on a source speaker (Voice A) who recorded HPO or HPR data. They differ in the adaptation technique they use (Linear Scaling LS vs. CMLLR) and in the way the model is transposed (phonetic vs. acoustic correspondence). These methods are stream-independent, in the sense that they can be applied to the prosody (pitch and phone duration) and filter models independently. It turned out that the method
combining LS adaptation, phonetic mapping (the LS_Phn method) and LSP filter coefficients achieved: i) a better segmental quality, compared to what was obtained using the Kullback-Leibler (KL) divergence; ii) a good reproduction of the perceived DoA (particularly for HPO speech); iii) a satisfactory preservation of the Voice B speaker identity. Based on the above-mentioned conclusions, we performed the generalization of the prosody and filter model transpositions learnt on Voice A to a male (Voice M) and a female (Voice F) speaker. We therefore applied the LS_Phn method for transposing the prosody and filter models (with LSP coefficients) of the NEU synthesizer of both Voices M and F to generate HPO or HPR models. The results showed that: i) high-quality HPO or HPR speech synthesis can be obtained for male (Voice M) and female (Voice F) voices, although a slight degradation is observed for the Voice F HPR speech; ii) an excellent rendering of the desired DoA can be reached; iii) the speaker identity results are satisfying. Starting from an existing standard NEU voice with no HPO or HPR recordings available, the automatic modification of its DoA is therefore possible by combining the LSP coefficients as filter parameterization with the LS_Phn prosody and filter models transposition method.

9.2 Thesis Contributions

With regard to the state of the art, the main contributions of the present PhD thesis can be summarized as follows:
- the creation, the recording protocol and the specifications of a specific database;
- the analysis of the specific characteristics governing HPO and HPR speech;
- the synthesis of NEU, HPO and HPR speech in the framework of HMM-based speech synthesis;
- the implementation of a continuous control of the DoA in the framework of HMM-based speech synthesis;
- the understanding of the internal mechanisms leading to high-quality HMM-based speech synthesis with various DoA, as well as of how intelligibility is affected when the synthesizer is embedded in adverse environments;
- the automatic modification of the DoA of an existing standard NEU voice for which no HPO or HPR recordings are available, in the framework of HMM-based speech synthesis;
- and the objective and subjective assessment of the tested methods.

Audio examples used throughout this PhD thesis are available online at picart/.
9.3 Perspectives

9.3.1 In Direct Continuity

An obvious piece of further work following this thesis is the correction of the filter instabilities occurring when transposing the articulation model, learned on Voice A, to the female Voice F (as explained in Chapter 8). Two ways could be considered to reach this goal:
- using filter stabilization techniques. LSPs are computed as the roots of P(z) and Q(z), which are respectively the reciprocal and anti-reciprocal parts of the vocal tract inverse filter A(z). One of the properties of these roots is to be located on the unit circle of the z-plane and to alternate (i.e. one root of P(z) followed by one root of Q(z), etc.). If for some reason the roots break one of these properties, the resulting filter becomes unstable. The idea would be to impose constraints directly at the output of the speech synthesizer, in order to force the filter to be stable (a sketch of this idea is given at the end of this chapter);
- imposing constraints directly on the transformations which are applied to the filter, in order to keep the generated spectral parameters in an acceptable range at synthesis time for female voices.

9.3.2 Average-Voice-based Speech Synthesis integrating the Degree of Articulation

Average-voice-based speech synthesis has been proved to provide a strong prior model for speech generation, with the target adaptation data being used to estimate speaker-specific characteristics, thus allowing the generation of high-quality speech synthesis using a limited amount of adaptation data. This average-voice model should be trained on a large corpus of speech signals. The number of speakers taken into account to train the average-voice model, the number of sentences used to adapt each average-voice model, as well as the way of selecting those speakers and adaptation sentences, could be optimized so as to maximize the synthesized speech quality of NEU, HPO and HPR speech.

9.3.3 Generalization to other types of Data and Languages

The study described in this work focused on the analysis and synthesis of the DoA variations produced by a French male speaker (Voice A). Nonetheless, the approach we adopted and the methods we have developed can be transposed to:
- other types of expressivity in speech (e.g. emotional speech with happy and sad data);
- other languages;
- modalities other than speech (e.g. expressive walk or singing voice synthesis).

The first point is straightforward as long as the French language is considered. Indeed, as this data still concerns the speech modality, no modifications are required in the
speech parameter extraction, training and synthesis of the HMM-based speech synthesizer, or in the adaptation and transposition methods developed in this work.

The second point will probably raise some phonetic issues. As the phoneme set varies from one language to another, the transposition step of the articulation model learned on Voice A should be adapted accordingly. As a reminder, two methods were investigated in the present thesis: phonetic mapping (based on decision trees only) and acoustic mapping (based on the KL divergence). The acoustic mapping technique can be applied straightforwardly, as it does not involve any phonetic information. The phonetic mapping approach could be an issue, as a leaf-to-leaf correspondence between the decision trees of the two languages has to be computed. The problem lies in the fact that the phoneme sets of the source (i.e. French) and the target languages are different. A solution could be to manually create a phonetic correspondence between those phoneme sets, i.e. to find, for each phone of the source language, the acoustically closest phone in the target language.

The third and last point requires completely adapting the modeling environment (e.g. finding a suitable data representation and determining a new label set), modeling the new data with HMMs and computing the adaptation transforms to be transposed. The transposition method based on the KL divergence developed in this work could finally be applied.
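As a minimal sketch of the first stabilization idea mentioned in Section 9.3.1, assuming LSPs expressed as angular frequencies in (0, π), stability can be enforced at the synthesizer output by restoring the ordering property of the roots; the margin value below is an illustrative choice, not taken from the thesis:

```python
# Post-hoc LSP stabilization: a filter reconstructed from LSPs is stable when
# all roots lie strictly inside (0, pi) and are strictly increasing
# (equivalent to the P(z)/Q(z) interlacing property). This sketch repairs a
# generated frame that violates those conditions.
import numpy as np

def stabilize_lsp(lsp, margin=1e-3):
    """Clamp LSPs into (0, pi) and force strictly increasing values."""
    lsp = np.clip(np.sort(np.asarray(lsp, dtype=float)),
                  margin, np.pi - margin)
    for k in range(1, len(lsp)):
        if lsp[k] - lsp[k - 1] < margin:      # interlacing violated
            lsp[k] = lsp[k - 1] + margin      # push the root upwards
    # Final clamp; pathological frames with many collapsed roots near pi
    # would need a smarter redistribution than this sketch provides.
    return np.minimum(lsp, np.pi - margin)

# Example: a generated frame whose two middle roots collapsed.
frame = np.array([0.35, 0.80, 0.80, 1.90, 2.60])
print(stabilize_lsp(frame))
```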
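For the acoustic mapping, the KL divergence between two Gaussian PDFs has a closed form. The sketch below assumes diagonal covariances and uses a symmetrized divergence (one common choice; the exact variant used in the thesis is not restated here) to pair each Voice A leaf PDF with its acoustically closest Voice B leaf PDF:

```python
# Acoustic mapping via the KL divergence between Gaussian leaf PDFs.
import numpy as np

def kl_gauss_diag(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for diagonal-covariance Gaussians, in closed form."""
    mu_p, var_p = np.asarray(mu_p), np.asarray(var_p)
    mu_q, var_q = np.asarray(mu_q), np.asarray(var_q)
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q
                        - 1.0)

def closest_pdf(pdf_a, pdfs_b):
    """Index of the Voice B PDF closest to a Voice A PDF (symmetrized KL)."""
    sym = lambda p, q: kl_gauss_diag(*p, *q) + kl_gauss_diag(*q, *p)
    return min(range(len(pdfs_b)), key=lambda j: sym(pdf_a, pdfs_b[j]))

# Toy example: two candidate Voice B PDFs, each a (mean, variance) pair.
pdfs_b = [(np.zeros(3), np.ones(3)), (np.ones(3), np.ones(3))]
print(closest_pdf((np.full(3, 0.9), np.ones(3)), pdfs_b))  # -> 1
```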
Bibliography

[Abdel-Hamid et al. 2006] Ossama Abdel-Hamid, Sherif Mahdy Abdou and Mohsen Rashwan. Improving Arabic HMM Based Speech Synthesis Quality. In Proceedings of Interspeech, Pittsburgh, Pennsylvania, USA, September 2006.

[Abe et al. 1988] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano and Hisao Kuwabara. Voice conversion through vector quantization. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, USA, April 1988.

[Abe 1996] Masanobu Abe. Speech morphing by gradually changing spectrum parameter and fundamental frequency. In Proceedings of the International Conference on Spoken Language Processing, 1996.

[Adda-Decker et al. 1999] Martine Adda-Decker, Philippe Boula de Mareüil and Lori Lamel. Pronunciation variants in French: schwa & liaison. In Proceedings of the 14th International Conference on Phonetic Science, San Francisco, 1999.

[Airas 2008] Matti Airas. Methods and Studies of Laryngeal Voice Quality Analysis in Speech Production. PhD thesis, Helsinki University of Technology, Espoo, Finland, 2008.

[AKGC3000B 1999] AKGC3000B. [Online], 1999.

[Allen & Berkley 1979] Jont B. Allen and David A. Berkley. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, vol. 65, no. 4, April 1979.

[Anastasakos et al. 1996] Tasos Anastasakos, John McDonough, Richard Schwartz and John Makhoul. A Compact Model for Speaker-Adaptive Training. In Proceedings of the International Conference on Spoken Language Processing, Philadelphia, Pennsylvania, USA, 1996.

[Anumanchipalli et al. 2010] Gopala Krishna Anumanchipalli, Prasanna Kumar Muthukumar, Udhyakumar Nallasamy, Alok Parlikar, Alan W. Black and Brian Langner. Improving Speech Synthesis for Noisy Environments. In Proceedings of the Speech Synthesis Workshop 7, Kyoto, Japan, September 2010.

[Astrinaki et al. 2012] Maria Astrinaki, Nicolas d'Alessandro, Benjamin Picart, Thomas Drugman and Thierry Dutoit. Reactive and Continuous Control of HMM-based Speech Synthesis. In Proceedings of the IEEE Workshop on Spoken Language Technology, Miami, Florida, USA, December 2012.
[Aylett & Turk 2004] Matthew P. Aylett and Alice Turk. The Smooth Signal Redundancy Hypothesis: A Functional Explanation for Relationships between Redundancy, Prosodic Prominence, and Duration in Spontaneous Speech. Language and Speech, vol. 47, no. 1, pages 31-56, 2004.

[Aylett 2000] Matthew P. Aylett. Stochastic Suprasegmentals: Relationships between Redundancy, Prosodic Structure and Care of Articulation in Spontaneous Speech. PhD thesis, University of Edinburgh, Scotland, 2000.

[Aylett 2005] Matthew P. Aylett. Synthesising Hyperarticulation in Unit Selection TTS. In Proceedings of Interspeech, Lisbon, Portugal, September 2005.

[Bäckström & Magi 2006] Tom Bäckström and Carlo Magi. Properties of line spectrum pair polynomials - A review. Signal Processing, vol. 86, no. 11, 2006.

[Bahmaninezhad et al. 2013] Fahimeh Bahmaninezhad, Soheil Khorram and Hossein Sameti. Average Voice Modeling Based on Unbiased Decision Trees. In Proceedings of the Non-Linear Speech Processing Workshop, T. Drugman and T. Dutoit (Eds.), Springer-Verlag, 2013.

[Baker & Bradlow 2009] Rachel E. Baker and Ann R. Bradlow. Variability in Word Duration as a Function of Probability, Speech Style, and Prosody. Language and Speech, vol. 52, no. 4, December 2009.

[Banos et al. 2008] Eleftherios Banos, Daniel Erro, Antonio Bonafonte and Asuncion Moreno. Flexible harmonic/stochastic modeling for HMM-based speech synthesis. In Proceedings of the 5th Jornadas en Tecnología del Habla, Bilbao, Basque Country, November 2008.

[Barra-Chicote et al. 2010] Roberto Barra-Chicote, Junichi Yamagishi, Simon King, Juan Manuel Montero and Javier Macias-Guarasa. Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech. Speech Communication, vol. 52, no. 5, 2010.

[Beller et al. 2006] Grégory Beller, Thomas Hueber, Diemo Schwarz and Xavier Rodet. Speech Rates in French Expressive Speech. In Proceedings of the Third International Conference on Speech Prosody, Dresden, Germany, May 2006.

[Beller et al. 2008] Grégory Beller, Nicolas Obin and Xavier Rodet. Articulation Degree as a Prosodic Dimension of Expressive Speech. In Proceedings of the Fourth International Conference on Speech Prosody, Campinas, Brazil, May 2008.

[Beller 2007] Grégory Beller. Influence de l'expressivité sur le degré d'articulation. In Rencontres Jeunes Chercheurs de la Parole, France, July 2007.
[Beller 2009] Grégory Beller. Analyse et Modèle Génératif de l'Expressivité - Application à la Parole et à l'Interprétation Musicale. PhD thesis, Université Paris VI - Pierre et Marie Curie, IRCAM, Paris, France, 2009.

[Benisty & Malah 2011] Hadas Benisty and David Malah. Voice Conversion Using GMM with Enhanced Global Variance. In Proceedings of Interspeech, Florence, Italy, August 2011.

[Benoît et al. 1996a] Christian Benoît, A. Fuster-Duran and Bertrand LeGoff. An Investigation of Hypo- and Hyper-Speech in the Visual Modality. In Proceedings of the 1st ESCA Tutorial and Research Workshop on Speech Production Modeling: From Control Strategies to Acoustics, 1996.

[Benoît et al. 1996b] Christian Benoît, Martine Grice and Valérie Hazan. The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences. Speech Communication, vol. 18, no. 4, 1996.

[Benoît 1990] Christian Benoît. An intelligibility test using semantically unpredictable sentences: towards the quantification of linguistic complexity. Speech Communication, vol. 9, no. 4, August 1990.

[Black & Lenzo 2000] Alan W. Black and Kevin A. Lenzo. Limited Domain Synthesis. In Proceedings of the International Conference on Spoken Language Processing, Beijing, China, October 2000.

[Black & Lenzo 2003] Alan W. Black and Kevin Lenzo. Optimal Utterance Selection for Unit Selection Speech Synthesis Databases. International Journal of Speech Technology, vol. 6, no. 4, October 2003.

[Black & Taylor 1997] Alan W. Black and Paul Taylor. Automatically Clustering Similar Units for Unit Selection in Speech Synthesis. In Proceedings of Eurospeech, Rhodes, Greece, September 1997.

[Black 2003] Alan W. Black. Unit Selection and Emotional Speech. In Proceedings of Eurospeech, Geneva, Switzerland, September 2003.

[Boite et al. 1999] René Boite, Hervé Bourlard, Thierry Dutoit, Joël Hancq and Henri Leich. Traitement de la parole. Presses Polytechniques et Universitaires Romandes, 1999.

[Bonardo & Zovato 2007] Davide Bonardo and Enrico Zovato. Speech synthesis enhancement in noisy environments. In Proceedings of Interspeech, Antwerp, Belgium, August 2007.

[Borroff 2007] Marianne L. Borroff. A landmark underspecification account of the patterning of glottal stop. PhD thesis, Stony Brook University, New York, May 2007.
[Bozkurt & Dutoit 2003] Baris Bozkurt and Thierry Dutoit. Mixed-phase speech modeling and formant estimation, using differential phase spectrums. In Proceedings of the Voice Quality: Functions, Analysis and Synthesis, pages 21-24, Geneva, Switzerland, August 2003.

[Bozkurt et al. 2004] Baris Bozkurt, Thierry Dutoit, Boris Doval and Christophe d'Alessandro. A Method for Glottal Formant Frequency Estimation. In Proceedings of the International Conference on Spoken Language Processing, Jeju Island, Korea, October 2004.

[Bradlow et al. 2003] Ann R. Bradlow, Nina Kraus and Erin Hayes. Speaking Clearly for Children With Learning Disabilities: Sentence Perception in Noise. Journal of Speech, Language, and Hearing Research, vol. 46, pages 80-97, February 2003.

[Brink et al. 1998] James Brink, Richard Wright and David B. Pisoni. Eliciting Speech Reduction in the Laboratory: Assessment of a New Experimental Method. Technical report, Speech Research Laboratory, Department of Psychology, Indiana University, Bloomington, Indiana, 1998.

[Browman & Goldstein 1986] Catherine P. Browman and Louis Goldstein. Towards an articulatory phonology. Phonology Yearbook, vol. 3, 1986.

[Browman & Goldstein 1990] Catherine P. Browman and Louis Goldstein. Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics, vol. 18, 1990.

[Browman & Goldstein 1994] Catherine P. Browman and Louis Goldstein. Targetless schwa: an articulatory analysis. Laboratory Phonology II: Gesture, Segment, Prosody, vol. 4, no. 956, 1994.

[Burkhardt et al. 2005] Felix Burkhardt, A. Paeschke, M. Rolfes, Walter F. Sendlmeier and Benjamin Weiss. A Database of German Emotional Speech. In Proceedings of Interspeech, Lisbon, Portugal, September 2005.

[Cabral et al. 2007] Joao P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi. Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis. In Proceedings of the Speech Synthesis Workshop 6, Bonn, Germany, August 2007.

[Cabral et al. 2008] Joao P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi. Glottal Spectral Separation for Parametric Speech Synthesis. In Proceedings of Interspeech, Brisbane, Australia, September 2008.

[Cerňak 2006] Miloš Cerňak. Unit Selection Speech Synthesis in Noise. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, May 2006.
[Chen 1980] Francine Robina Chen. Acoustic characteristics and intelligibility of clear and conversational speech at the segmental level. Master's thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, USA, 1980.

[Chien et al. 1997] Jen-Tzung Chien, Chin-Hui Lee and Hsiao-Chuan Wang. Improved Bayesian learning of hidden Markov models for speaker adaptation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, April 1997.

[Childers & Lee 1991] D. Childers and C. Lee. Vocal Quality Factors: Analysis, Synthesis, and Perception. Journal of the Acoustical Society of America, vol. 90, no. 5, November 1991.

[Christiansen et al. 2010] Claus Christiansen, Michael Syskind Pedersen and Torsten Dau. Prediction of speech intelligibility based on an auditory preprocessing model. Speech Communication, vol. 52, no. 7-8, 2010.

[Cooke et al. 2012] Martin Cooke, Simon King, Bastiaan Kleijn and Yannis Stylianou. The Listening Talker - An interdisciplinary workshop on natural and synthetic modification of speech in response to listening conditions. Edinburgh, Scotland, May 2012.

[Cooke 2006] Martin Cooke. A glimpsing model of speech perception in noise. Journal of the Acoustical Society of America, vol. 119, no. 3, 2006.

[d'Alessandro 2006] Christophe d'Alessandro. Voice Source Parameters and Prosodic Analysis. In Methods in Empirical Prosody Research, edited by Stefan Sudhoff, Walter de Gruyter, 2006.

[Dau et al. 1996] Torsten Dau, Dirk Püschel and Armin Kohlrausch. A quantitative model of the effective signal processing in the auditory system. I. Model structure. Journal of the Acoustical Society of America, vol. 99, no. 6, 1996.

[de Franca Oliveira et al. 2012] Viviane de Franca Oliveira, Sayaka Shiota, Yoshihiko Nankaku and Keiichi Tokuda. Cross-lingual Speaker Adaptation for HMM-based Speech Synthesis based on Perceptual Characteristics and Speaker Interpolation. In Proceedings of Interspeech, Portland, Oregon, USA, September 2012.

[de Mareüil et al. 2006] Philippe Boula de Mareüil, Christophe d'Alessandro, Alexander Raake, Gérard Bailly, Marie-Neige Garcia and Michel Morel. A joint intelligibility evaluation of French text-to-speech synthesis systems: the EvaSy SUS/ACR campaign. In Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy, 2006.
[Dempster et al. 1977] Arthur P. Dempster, Nan M. Laird and Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, vol. 39, no. 1, pages 1-38, 1977.

[Desai et al. 2010] Srinivas Desai, Alan W. Black, B. Yegnanarayana and Kishore Prahallad. Spectral Mapping Using Artificial Neural Networks for Voice Conversion. IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, 2010.

[Digalakis & Neumeyer 1995] Vassilios Digalakis and Leonardo Neumeyer. Speaker adaptation using combined transformation and Bayesian methods. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Detroit, Michigan, USA, May 1995.

[Digalakis et al. 1995] Vassilios Digalakis, D. Rtischev and Leonardo Neumeyer. Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Transactions on Speech and Audio Processing, vol. 3, no. 5, September 1995.

[Donovan & Woodland 1995] R. E. Donovan and Philip C. Woodland. Improvements in an HMM-Based Speech Synthesiser. In Proceedings of Eurospeech, Madrid, Spain, September 1995.

[Doval et al. 2003] Boris Doval, Christophe d'Alessandro and Nathalie Henrich. The voice source as a causal/anticausal linear filter. In Proceedings of the Voice Quality: Functions, Analysis and Synthesis, pages 15-20, Geneva, Switzerland, August 2003.

[Drugman & Dutoit 2009] Thomas Drugman and Thierry Dutoit. Glottal Closure and Opening Instant Detection from Speech Signals. In Proceedings of Interspeech, Brighton, United Kingdom, September 2009.

[Drugman & Dutoit 2010a] Thomas Drugman and Thierry Dutoit. Glottal-based Analysis of the Lombard Effect. In Proceedings of Interspeech, Makuhari, Chiba, Japan, September 2010.

[Drugman & Dutoit 2010b] Thomas Drugman and Thierry Dutoit. On the Potential of Glottal Signatures for Speaker Recognition. In Proceedings of Interspeech, Makuhari, Chiba, Japan, September 2010.

[Drugman & Dutoit 2012] Thomas Drugman and Thierry Dutoit. The Deterministic plus Stochastic Model of the Residual Signal and its Applications. IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, March 2012.

[Drugman et al. 2009a] Thomas Drugman, Baris Bozkurt and Thierry Dutoit. Chirp Decomposition of Speech Signals for Glottal Source Estimation. In Proceedings of the Non-Linear Speech Processing Workshop, Vic, Spain, June 2009.
[Drugman et al. 2009b] Thomas Drugman, Geoffrey Wilfart and Thierry Dutoit. A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis. In Proceedings of Interspeech, Brighton, United Kingdom, September 2009.

[Drugman et al. 2009c] Thomas Drugman, Geoffrey Wilfart, Alexis Moinet and Thierry Dutoit. Using a Pitch-Synchronous Residual Codebook for Hybrid HMM/frame Selection Speech Synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Taipei, Taiwan, April 2009.

[Drugman et al. 2011] Thomas Drugman, Baris Bozkurt and Thierry Dutoit. Causal-anticausal Decomposition of Speech using Complex Cepstrum for Glottal Source Estimation. Speech Communication, vol. 53, no. 6, July 2011.

[Drugman et al. 2012a] Thomas Drugman, Baris Bozkurt and Thierry Dutoit. A Comparative Study of Glottal Source Estimation Techniques. Computer Speech & Language, Elsevier, vol. 26, no. 1, pages 20-34, January 2012.

[Drugman et al. 2012b] Thomas Drugman, Mark Thomas, Jon Gudnason, Patrick Naylor and Thierry Dutoit. Detection of Glottal Closure Instants from Speech Signals: a Quantitative Review. IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, March 2012.

[Drugman 2011] Thomas Drugman. Advances in Glottal Analysis and its Applications. PhD thesis, University of Mons, Faculté Polytechnique, Belgium, 2011.

[Dutoit & Dupont 2010] Thierry Dutoit and Stéphane Dupont. Multimodal signal processing, chapter 3: Speech Processing. Academic Press, Elsevier, 2010.

[Dutoit 1997] Thierry Dutoit. An introduction to text-to-speech synthesis. Kluwer Academic Publishers, 1997.

[Eide et al. 2004] E. Eide, A. Aaron, R. Bakis, W. Hamza, M. Picheny and J. Pitrelli. A corpus-based approach to expressive speech synthesis. In Proceedings of the Speech Synthesis Workshop 5, pages 79-84, Pittsburgh, Pennsylvania, USA, June 2004.

[Ellis 1885] Alexander Ellis. On the Musical Scales of Various Nations. Journal of the Society of Arts, vol. 33, no. 1688, 1885.

[Erro et al. 2012] Daniel Erro, Yannis Stylianou, Eva Navas and Inma Hernaez. Implementation of Simple Spectral Techniques to Enhance the Intelligibility of Speech using a Harmonic Model. In Proceedings of Interspeech, Portland, Oregon, USA, September 2012.
[Eslami et al. 2011] Mahdi Eslami, Hamid Sheikhzadeh and Abolghasem Sayadiyan. Quality Improvement of Voice Conversion Systems Based on Trellis Structured Vector Quantization. In Proceedings of Interspeech, Florence, Italy, August 2011.

[Falaschi 1989] Alessandro Falaschi. An automated procedure for minimum size phonetically balanced phrases selection. In Proceedings of the ESCA Tutorial and Research Workshop on Speech Input/Output Assessment and Speech Databases, volume 2, Noordwijkerhout, The Netherlands, September 1989.

[Fant et al. 1985] Gunnar Fant, Johan Liljencrants and Qi-Guang Lin. A Four Parameter Model of Glottal Flow. In Speech Transmission Laboratory, Quarterly Progress and Status Report, volume 26, pages 1-13, French-Swedish Symposium, Grenoble, France, April 1985.

[Ferguson & Kewley-Port 2002] Sarah Hargus Ferguson and D. Kewley-Port. Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. Journal of the Acoustical Society of America, vol. 112, no. 1, 2002.

[Ferguson 1980a] J. Ferguson. Variable Duration Models for Speech. In Proceedings of the Symposium on the Application of Hidden Markov Models to Text and Speech, Princeton, New Jersey, USA, 1980.

[Ferguson 1980b] J. D. Ferguson. Hidden Markov Analysis: An Introduction. Hidden Markov Models for Speech, 1980.

[Ferguson 2002] Sarah Hargus Ferguson. Vowels in clear and conversational speech: Talker differences in acoustic features and intelligibility for normal-hearing listeners. PhD thesis, Indiana University, Bloomington, Indiana, USA, 2002.

[Fitzpatrick et al. 2011] Michael Fitzpatrick, Jeesun Kim and Chris Davis. The effect of seeing the interlocutor on auditory and visual speech production in noise. In Proceedings of the International Conference on Auditory-Visual Speech Processing, pages 31-35, Volterra, Italy, August-September 2011.

[Fitzpatrick et al. 2012] Michael Fitzpatrick, Jeesun Kim and Chris Davis. The Intelligibility of Lombard Speech: Communicative setting matters. In Proceedings of Interspeech, Portland, Oregon, USA, September 2012.

[Forney 1973] G. D. Forney. The Viterbi Algorithm. Proceedings of the IEEE, vol. 61, no. 3, 1973.

[François & Boëffard 2002] Hélène François and Olivier Boëffard. The Greedy Algorithm and its Application to the Construction of a Continuous Speech Database. In Proceedings of the International Conference on Language Resources and Evaluation, Las Palmas, Gran Canaria, Spain, May 2002.
[French & Steinberg 1947] N. R. French and J. C. Steinberg. Factors Governing the Intelligibility of Speech Sounds. Journal of the Acoustical Society of America, vol. 19, no. 1, pages ,
[Fukada et al. 1992] Toshiaki Fukada, Keiichi Tokuda, Takao Kobayashi and Satoshi Imai. An adaptive algorithm for mel-cepstral analysis of speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages , San Francisco, California, USA, March , 110
[Fux et al. 2011] Thibaut Fux, Gang Feng and Véronique Zimpfer. Talker-to-listener distance effects on the variations of the intensity and the fundamental frequency of speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Prague, Czech Republic, May
[Gales 1998] M. Gales. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, vol. 12, no. 2, pages 75–98, , 74, 77
[Garnier et al. 2006a] Maëva Garnier, Lucie Bailly, Marion Dohen, Pauline Welby and Hélène Loevenbruck. An acoustic and articulatory study of Lombard speech. Global effects at utterance level. In Proceedings of the International Conference on Spoken Language Processing, pages , Pittsburgh, Pennsylvania, USA, September , 49
[Garnier et al. 2006b] Maëva Garnier, Lucie Bailly, Marion Dohen, Pauline Welby and Hélène Loevenbruck. The Lombard effect: a physiological reflex or a controlled intelligibility enhancement? In Proceedings of the International Seminar on Speech Production, pages , Ubatuba, Brazil, December , 49
[Gauvain & Lee 1994] Jean-Luc Gauvain and Chin-Hui Lee. Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains. IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pages , , 78
[Godoy & Stylianou 2012] Elizabeth Godoy and Yannis Stylianou. Unsupervised Acoustic Analyses of Normal and Lombard Speech, with Spectral Envelope Transformation to Improve Intelligibility. In Proceedings of Interspeech, pages , Portland, Oregon, USA, September
[Godoy et al. 2012] Elizabeth Godoy, Olivier Rosec and Thierry Chonavel. Voice Conversion Using Dynamic Frequency Warping With Amplitude Scaling, for Parallel or Nonparallel Corpora. IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pages ,
[Gonzalvo et al. 2007] Xavier Gonzalvo, Joan Claudi Socoró, Ignasi Iriondo, Carlos Monzo and Elisa Martínez. Linguistic and Mixed Excitation Improvements on a HMM-based speech synthesis for Castilian Spanish. In Proceedings of the Speech Synthesis Workshop 6, pages , Bonn, Germany, August
[Gordon & Ladefoged 2001] Matthew Gordon and Peter Ladefoged. Phonation types: a cross-linguistic overview. Journal of Phonetics, vol. 29, no. 4, pages , October
[Hall & Flanagan 2010] Joseph L. Hall and James L. Flanagan. Intelligibility and listener preference of telephone speech in the presence of babble noise. Journal of the Acoustical Society of America, vol. 127, no. 1, pages ,
[Harnsberger et al. 2008] James D. Harnsberger, Richard Wright and David B. Pisoni. A new method for eliciting three speaking styles in the laboratory. Speech Communication, vol. 50, no. 4, pages , , 52, 59
[Hartsuiker & Kolk 2001] Robert J. Hartsuiker and Herman H. J. Kolk. Error Monitoring in Speech Production: A Computational Test of the Perceptual Loop Theory. Cognitive Psychology, vol. 42, no. 2, pages ,
[Hazan & Baker 2010] Valérie Hazan and Rachel Baker. Does reading clearly produce the same acoustic-phonetic modifications as spontaneous speech in a clear speaking style? In Proceedings of the joint 5th Workshop on Disfluency in Spontaneous Speech and the 2nd International Symposium on Linguistic Patterns in Spontaneous Speech, pages 7–10, Tokyo, Japan, September , 50, 52
[Hazan & Baker 2011] Valérie Hazan and Rachel Baker. Acoustic-phonetic characteristics of speech produced with communicative intent to counter adverse listening conditions. Journal of the Acoustical Society of America, vol. 130, no. 4, pages , , 90
[Helander et al. 2010] Elina Helander, Tuomas Virtanen, Jani Nurminen and Moncef Gabbouj. Voice Conversion Using Partial Least Squares Regression. IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pages ,
[Helander et al. 2012] Elina Helander, Hanna Silén, Tuomas Virtanen and Moncef Gabbouj. Voice Conversion Using Dynamic Kernel Partial Least Squares Regression. IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pages ,
[Hemptinne 2006] Coralie Hemptinne. Integration of the harmonic plus noise model into the hidden Markov model-based speech synthesis system. Master's thesis, IDIAP Research Institute, Martigny, Switzerland,
[Hodoshima et al. 2010] Nao Hodoshima, Takayuki Arai and Kiyohiro Kurisu. Intelligibility of speech spoken in noise and reverberation. In Proceedings of the 20th International Congress on Acoustics, pages 1–4, Sydney, Australia, August
[Howell 2012] David C. Howell. Statistical methods for psychology. Wadsworth Publishing, , 120
[Hsu & Chen 2012] Chih-Yu Hsu and Chia-Ping Chen. Speaker-dependent model interpolation for statistical emotional speech synthesis. EURASIP Journal on Audio, Speech, and Music Processing, no. 21, August , 109
[Hunt & Black 1996] Andrew J. Hunt and Alan W. Black. Unit Selection in a Concatenative Speech Synthesis System using a Large Speech Database. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Atlanta, Georgia, USA, May , 2
[Imai et al. 1983] Satoshi Imai, Kazuo Sumita and Chieko Furuichi. Mel Log Spectrum Approximation (MLSA) filter for speech synthesis. Electronics and Communications in Japan, vol. 66, no. 2, pages 10–18,
[Imai 1983] Satoshi Imai. Cepstral analysis synthesis on the mel frequency scale. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 8, pages 93–96, Boston, Massachusetts, USA, April
[Itakura 1975] Fumitada Itakura. Line spectrum representation of linear predictor coefficients of speech signals. Journal of the Acoustical Society of America, vol. 57, page S35,
[Iwahashi & Sagisaka 1995] Naoto Iwahashi and Yoshinori Sagisaka. Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks. Speech Communication, vol. 16, no. 2, pages ,
[Johnson et al. 1993] Keith Johnson, Edward Flemming and Richard Wright. The hyperspace effect: Phonetic targets are hyperarticulated. Language, vol. 69, no. 3, pages ,
[Jokinen et al. 2012] Emma Jokinen, Paavo Alku and Martti Vainio. Utilization of the Lombard effect in post-filtering for intelligibility enhancement of telephone speech. In Proceedings of Interspeech, pages , Portland, Oregon, USA, September
[Junqua 1993] Jean-Claude Junqua. The Lombard Reflex and its Role on Human Listeners and Automatic Speech Recognizers. Journal of the Acoustical Society of America, vol. 93, no. 1, pages , January , 49
[Jurafsky et al. 2001] Daniel Jurafsky, Alan Bell, Michelle Gregory and William D. Raymond. Frequency and the emergence of linguistic structure, chapter Probabilistic Relations between Words: Evidence from Reduction in Lexical Production, pages . J. Bybee and P. Hopper, John Benjamins, Amsterdam, , 57
[Kain & Macon 1998] Alexander Kain and Michael W. Macon. Spectral Voice Conversion for Text-to-Speech Synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Seattle, Washington, USA, May
[Kanagawa et al. 2013] Hiroki Kanagawa, Takashi Nose and Takao Kobayashi. Speaker-Independent Style Conversion for HMM-based Expressive Speech Synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Vancouver, Canada, May , 118, 124
[Kawahara & Matsui 2003] Hideki Kawahara and Hisami Matsui. Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Hong Kong, China, April
[Kawahara & Morise 2011] Hideki Kawahara and Masanori Morise. Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework. SADHANA - Academy Proceedings in Engineering Sciences, vol. 36, no. 5, pages , , 29, 102
[Kawahara et al. 1999] Hideki Kawahara, Ikuyo Masuda-Katsuse and Alain de Cheveigné. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, vol. 27, no. 3-4, pages , , 29
[Kawahara et al. 2009] Hideki Kawahara, R. Nisimura, T. Irino, M. Morise, T. Takahashi and Hideki Banno. Temporally variable multi-aspect auditory morphing enabling extrapolation without objective and perceptual breakdown. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Taipei, Taiwan, April , 108
[Kazumi et al. 2010] K. Kazumi, Y. Nankaku and K. Tokuda. Factor analyzed voice models for HMM-based speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Dallas, Texas, USA, March
[Keller et al. 1993] Eric Keller, Brigitte Zellner, Stefan Werner and Nicole Blanchoud. The Prediction of Prosodic Timing: Rules for Final Syllable Lengthening in French. In Proceedings of the ESCA Workshop on Prosody, pages , Lund, Sweden, September
[Keller 2005] Eric Keller. The Analysis of Voice Quality in Speech Processing. Lecture Notes in Computer Science, vol. 3445, pages 54–73, , 51
[Kim & Hahn 2007] Sang-Jin Kim and Minsoo Hahn. Two-Band Excitation for HMM-Based Speech Synthesis. IEICE Transactions on Information and Systems, vol. 90, no. 1, pages ,
[Kim et al. 2011] Jeesun Kim, Amanda Sironic and Chris Davis. Hearing speech in noise: Seeing a loud talker is better. Perception, vol. 40, no. 7, pages ,
[Klatt & Klatt 1990] Dennis H. Klatt and Laura C. Klatt. Analysis, Synthesis, and Perception of Voice Quality Variations among Female and Male Talkers. Journal of the Acoustical Society of America, vol. 87, no. 2, pages , February
[Klatt 1982] Dennis H. Klatt. Prediction of perceived phonetic distance from critical-band spectra: A first step. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Paris, France, May
[Kobayashi & Imai 1984] Takao Kobayashi and Satoshi Imai. Spectral analysis using generalized cepstrum. IEEE Transactions on Audio, Speech, and Language Processing, vol. 32, no. 5, pages ,
[Kominek & Black 2003] John Kominek and Alan W. Black. CMU ARCTIC databases for speech synthesis. Technical report, CMU Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA,
[Kominek & Black 2004] John Kominek and Alan W. Black. The CMU Arctic speech databases. In Proceedings of the Speech Synthesis Workshop 5, pages , Pittsburgh, Pennsylvania, USA, June
[Köster 2001] Stefanie Köster. Acoustic-phonetic characteristics of hyperarticulated speech for different speaking styles. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Salt Lake City, Utah, USA, May
[Koutsogiannaki et al. 2012] Maria Koutsogiannaki, Michelle Pettinato, Cassie Mayo, Varvara Kandia and Yannis Stylianou. Can modified casual speech reach the intelligibility of clear speech? In Proceedings of Interspeech, pages , Portland, Oregon, USA, September
[Krause & Braida 2002] Jean C. Krause and Louis D. Braida. Investigating alternative forms of clear speech: The effects of speaking rate and speaking mode on intelligibility. Journal of the Acoustical Society of America, vol. 112, no. 5, pages ,
[Krause & Braida 2004] Jean C. Krause and Louis D. Braida. Acoustic properties of naturally produced clear speech at normal speaking rates. Journal of the Acoustical Society of America, vol. 115, no. 1, pages , , 90
[Krstulović et al. 2006] Sacha Krstulović, Frédéric Bimbot, Olivier Boëffard, Delphine Charlet, Dominique Fohr and Odile Mella. Optimizing the coverage of a speech database through a selection of representative speaker recordings. Speech Communication, vol. 48, no. 10, pages , October
[Kryter 1962] Karl D. Kryter. Methods for the Calculation and Use of the Articulation Index. Journal of the Acoustical Society of America, vol. 34, no. 11, pages ,
[Kuhn et al. 2000] Roland Kuhn, Jean-Claude Junqua, Patrick Nguyen and Nancy Niedzielski. Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pages ,
[Kuwabara & Sagisaka 1995] Hisao Kuwabara and Yoshinori Sagisaka. Acoustic characteristics of speaker individuality: Control and conversion. Speech Communication, vol. 16, no. 2, pages ,
[Langner & Black 2004] Brian Langner and Alan W. Black. Creating a Database of Speech In Noise For Unit Selection Synthesis. In Proceedings of the Speech Synthesis Workshop 5, pages , Pittsburgh, Pennsylvania, USA, June , 40, 41
[Langner & Black 2005] Brian Langner and Alan W. Black. Improving the Understandability of Speech Synthesis by Modeling Speech In Noise. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Philadelphia, Pennsylvania, USA, March , 91, 108
[Latorre et al. 2006] Javier Latorre, Koji Iwano and Sadaoki Furui. New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer. Speech Communication, vol. 48, no. 10, pages ,
[Laver 1980] John Laver. The phonetic description of voice quality. Cambridge University Press, Cambridge, United Kingdom,
[Laver 1994] John Laver. Principles of phonetics. Cambridge Textbooks in Linguistics, Cambridge University Press, Cambridge, United Kingdom, , 51
[Lavner & Porat 2005] Yizhar Lavner and Gidon Porat. Voice Morphing using 3D Waveform Interpolation Surfaces and Lossless Tube Area Functions. EURASIP Journal on Advances in Signal Processing, no. 8, pages ,
[Leggetter & Woodland 1995] C. J. Leggetter and Philip C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech & Language, vol. 9, no. 2, pages , , 34, 77
[LeGoff & Benoît 1996] Bertrand LeGoff and Christian Benoît. A Text-To-Audiovisual-Speech Synthesizer For French. In Proceedings of the International Conference on Spoken Language Processing, pages , Philadelphia, Pennsylvania, USA, October , 65
[Levelt 1990] Willem J. M. Levelt. Speaking: From intention to articulation. MIT Press, Cambridge, Massachusetts, USA,
[Liang et al. 2010] Hui Liang, John Dines and Lakshmi Saheer. A Comparison of Supervised and Unsupervised Cross-Lingual Speaker Adaptation Approaches for HMM-Based Speech Synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Dallas, Texas, USA, March , 114
[Lindblom 1963] Björn Lindblom. Spectrographic Study of Vowel Reduction. Journal of the Acoustical Society of America, vol. 35, no. 11, pages , , 51
[Lindblom 1983] Björn Lindblom. The production of speech, chapter Economy of Speech Gestures, pages . Peter F. McNeilage, Springer-Verlag, New York, , 47, 51, 63, 77, 92, 139
[Lindblom 1990] Björn Lindblom. Speech production and speech modelling, volume 55, chapter Explaining Phonetic Variation: A Sketch of the H&H Theory, pages . William J. Hardcastle and Alain Marchal (eds.), Kluwer Academic Publishers,
[Lindblom 1996] Björn Lindblom. Role of articulation in speech perception: Clues from production. Journal of the Acoustical Society of America, vol. 99, no. 3, pages ,
[Liu et al. 2008] W. M. Liu, K. A. Jellyman, N. W. D. Evans and John S. D. Mason. Assessment of Objective Quality Measures for Speech Intelligibility. In Proceedings of Interspeech, pages , Brisbane, Australia, September
[Lombard 1911] Étienne Lombard. Le signe de l'élévation de la voix. Annales des Maladies de l'Oreille et du Larynx, vol. 37, no. 2, pages , , 48, 63, 90
[Lu & Cooke 2008] Youyi Lu and Martin Cooke. Speech production modifications produced by competing talkers, babble, and stationary noise. Journal of the Acoustical Society of America, vol. 124, no. 5, pages ,
[Lu & Cooke 2009] Youyi Lu and Martin Cooke. The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise. Speech Communication, vol. 51, no. 12, pages , December , 49
[Malfrère et al. 2003] Fabrice Malfrère, Olivier Deroo, Thierry Dutoit and Christophe Ris. Phonetic alignment: speech-synthesis-based versus Viterbi-based. Speech Communication, vol. 40, no. 4, pages , June , 59
[Maniwa et al. 2009] Kazumi Maniwa, Allard Jongman and Travis Wade. Acoustic characteristics of clearly spoken English fricatives. Journal of the Acoustical Society of America, vol. 125, no. 6, pages ,
[Masuko et al. 1997] Takashi Masuko, Keiichi Tokuda, Takao Kobayashi and Satoshi Imai. Voice Characteristics Conversion For HMM-Based Speech Synthesis System. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Munich, Bavaria, Germany, April
[Masuko 2002] Takashi Masuko. HMM-Based Speech Synthesis and Its Applications. PhD thesis, Tokyo Institute of Technology, Japan, xiii, 21, 23
[Mayo et al. 2012] Catherine Mayo, Vincent Aubanel and Martin Cooke. Effect of prosodic changes on speech intelligibility. In Proceedings of Interspeech, pages , Portland, Oregon, USA, September
[Miller 1998] Corey Andrew Miller. Pronunciation Modeling In Speech Synthesis. PhD thesis, Institute for Research in Cognitive Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA,
[Miyanaga et al. 2004] Keisuke Miyanaga, Takashi Masuko and Takao Kobayashi. A style control technique for HMM-based speech synthesis. In Proceedings of the International Conference on Spoken Language Processing, pages , Jeju Island, Korea, October
[Möbius 2003] Bernd Möbius. Rare Events and Closed Domains: Two Delicate Concepts in Speech Synthesis. International Journal of Speech Technology, vol. 6, no. 1, pages 57–71,
[Moon & Lindblom 1994] Seung-Jae Moon and Björn Lindblom. Interaction between duration, context, and speaking style in English stressed vowels. Journal of the Acoustical Society of America, vol. 96, no. 1, pages 40–55,
[Moore & Nicolao 2011] Roger K. Moore and Mauro Nicolao. Reactive speech synthesis: actively managing phonetic contrast along an H&H continuum. In Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong, China, August , 77, 92
[Moore 2007] Roger K. Moore. PRESENCE: A Human-Inspired Architecture for Speech-Based Human-Machine Interaction. IEEE Transactions on Computers, vol. 56, no. 9, pages ,
[Moos & Trouvain 2007] Anja Moos and Jürgen Trouvain. Comprehension of Ultra-Fast Speech - Blind vs. Normally Hearing Persons. In Proceedings of the International Congress of Phonetic Sciences, pages , Saarbrücken, Germany, August
[Motu8pre 2006] Motu8pre. [Online],
[Mouchtaris et al. 2004] Athanasios Mouchtaris, Jan Van der Spiegel and Paul Mueller. Non-parallel training for voice conversion by maximum likelihood constrained adaptation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 1–4, Montreal, Quebec, Canada, May
[Mueller & Killion 1990] H. Gustav Mueller and Mead C. Killion. An Easy Method for Calculating the Articulation Index. The Hearing Journal, vol. 43, no. 9, pages 14–17,
[Nagórski et al. 2002] Arkadiusz Nagórski, Lou Boves and Herman Steeneken. Optimal Selection Of Speech Data For Automatic Speech Recognition Systems. In Proceedings of the International Conference on Spoken Language Processing, pages , Denver, Colorado, USA, September
[Nicolao & Moore 2012a] Mauro Nicolao and Roger K. Moore. Consonant production control in a computational model of hyper & hypo theory (C2H). In Proceedings of The Listening Talker workshop, Edinburgh, United Kingdom,
[Nicolao & Moore 2012b] Mauro Nicolao and Roger K. Moore. Establishing some principles of human speech production through bi-dimensional computational models. In Proceedings of the Statistical And Perceptual Audition workshop and Speech Communication with Adaptive Learning consortium, Portland, Oregon, USA, September
[Nicolao et al. 2012] Mauro Nicolao, Javier Latorre and Roger K. Moore. C2H: A Computational Model of H&H-based Phonetic Contrast in Synthetic Speech. In Proceedings of Interspeech, Portland, Oregon, USA, September , 47, 56, 64, 77, 92
[Nicolao et al. 2013] Mauro Nicolao, Fabio Tesser and Roger K. Moore. A phonetic-contrast motivated adaptation to control the degree-of-articulation on Italian HMM-based synthetic voices. In Proceedings of the Speech Synthesis Workshop 8, pages , Barcelona, Spain, August 31 – September
[Niederjohn & Grotelueschen 1976] Russell J. Niederjohn and James H. Grotelueschen. The enhancement of speech intelligibility in high noise levels by high-pass filtering followed by rapid amplitude compression. IEEE Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pages ,
[Nose & Kobayashi 2011] Takashi Nose and Takao Kobayashi. Speaker-independent HMM-based voice conversion using adaptive quantization of the fundamental frequency. Speech Communication, vol. 53, no. 7, pages ,
[Nose & Kobayashi 2013] Takashi Nose and Takao Kobayashi. An intuitive style control technique in HMM-based expressive speech synthesis using subjective style intensity and multiple-regression global variance model. Speech Communication, vol. 55, no. 2, pages ,
[Nose et al. 2007] Takashi Nose, Junichi Yamagishi, Takashi Masuko and Takao Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Transactions on Information and Systems, vol. 90, no. 9, pages , September , 77
[Nose et al. 2009] Takashi Nose, Makoto Tachibana and Takao Kobayashi. HMM-Based Style Control for Expressive Speech Synthesis with Arbitrary Speaker's Voice Using Model Adaptation. IEICE Transactions on Information and Systems, vol. 92, no. 3, pages , , 77, 108
[Odell 1995] Julian James Odell. The Use of Context in Large Vocabulary Speech Recognition. PhD thesis, University of Cambridge, United Kingdom,
[Ogata et al. 2006] Katsumi Ogata, Makoto Tachibana, Junichi Yamagishi and Takao Kobayashi. Acoustic Model Training Based on Linear Transformation and MAP Modification for HSMM-Based Speech Synthesis. In Proceedings of Interspeech, pages , Pittsburgh, Pennsylvania, USA, September
[Ohtani et al. 2009] Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari and Kiyohiro Shikano. Many-to-many eigenvoice conversion with reference voice. In Proceedings of Interspeech, pages , Brighton, United Kingdom, September
[Ohtani et al. 2010] Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari and Kiyohiro Shikano. Adaptive Training for Voice Conversion Based on Eigenvoices. IEICE Transactions on Information and Systems, vol. 93, no. 6, pages , June , 108
[Oviatt et al. 1998] Sharon Oviatt, Gina-Anne Levow, Elliott Moreton and Margaret MacEachern. Modeling global and focal hyperarticulation during human-computer error resolution. Journal of the Acoustical Society of America, vol. 104, no. 5, pages , , 49
[Paliwal & Atal 1993] Kuldip K. Paliwal and Bishnu S. Atal. Efficient Vector Quantization of LPC Parameters at 24 Bits/Frame. IEEE Transactions on Speech and Audio Processing, vol. 1, no. 1, pages 3–14, January , 81
[Paliwal 1995] Kuldip K. Paliwal. Interpolation Properties of Linear Prediction Parametric Representations. In Proceedings of Eurospeech, pages , Madrid, Spain, September
[Pantazis & Stylianou 2008] Yannis Pantazis and Yannis Stylianou. Improving the modeling of the noise part in the harmonic plus noise model of speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Las Vegas, Nevada, USA, March 30 – April
[Patel et al. 2006] Rupal Patel, Michael Everett and Eldar Sadikov. Loudmouth: modifying text-to-speech synthesis in noise. In Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility (Assets), pages , Baltimore, Maryland, USA, October , 64, 91
[Percybrooks & Moore 2012] Winston Percybrooks and Elliot Moore. A HMM approach to residual estimation for high resolution voice conversion. In Proceedings of Interspeech, pages 90–93, Portland, Oregon, USA, September
[Percybrooks et al. 2013] Winston Percybrooks, Elliot Moore and Correy McMillan. Phoneme Independent HMM Voice Conversion. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Vancouver, Canada, May
[Pfitzinger 2006] Hartmut R. Pfitzinger. Five dimensions of prosody: Intensity, intonation, timing, voice quality, and degree of reduction. In Proceedings of the Third International Conference on Speech Prosody, pages 6–9, Dresden, Germany, May , 48
[Picart et al. 2010] Benjamin Picart, Thomas Drugman and Thierry Dutoit. Analysis and Synthesis of Hypo and Hyperarticulated Speech. In Proceedings of the Speech Synthesis Workshop 7, pages , Kyoto, Japan, September , 65
[Picart et al. 2011a] Benjamin Picart, Thomas Drugman and Thierry Dutoit. Continuous Control of the Degree of Articulation in HMM-based Speech Synthesis. In Proceedings of Interspeech, pages , Florence, Italy, August
[Picart et al. 2011b] Benjamin Picart, Thomas Drugman and Thierry Dutoit. Perceptual Effects of the Degree of Articulation in HMM-based Speech Synthesis. In Proceedings of the Non-Linear Speech Processing Workshop, pages , Las Palmas, Gran Canaria, November
[Picart et al. 2012a] Benjamin Picart, Thomas Drugman and Thierry Dutoit. Assessing the Intelligibility and Quality of HMM-based Speech Synthesis with a Variable Degree of Articulation. In Proceedings of The Listening Talker workshop, pages 44–47, Edinburgh, Scotland, May
[Picart et al. 2012b] Benjamin Picart, Thomas Drugman and Thierry Dutoit. Statistical Methods for Varying the Degree of Articulation in New HMM-based Voices. In Proceedings of the IEEE Workshop on Spoken Language Technology, pages , Miami, Florida, USA, December
[Picart et al. 2013a] Benjamin Picart, Thomas Drugman and Thierry Dutoit. Analysis and HMM-based Synthesis of Hypo and Hyperarticulated Speech. Computer Speech & Language, Special Issue of The Listening Talker, DOI: /j.csl , 65, 77, 93
[Picart et al. 2013b] Benjamin Picart, Thomas Drugman and Thierry Dutoit. Automatic Variation of the Degree of Articulation in New HMM-based Voices. IEEE Journal of Selected Topics in Signal Processing, Special Issue on Statistical Parametric Speech Synthesis, DOI: /JSTSP
[Picart et al. 2013c] Benjamin Picart, Thomas Drugman and Thierry Dutoit. HMM-based Speech Synthesis with Various Degrees of Articulation: a Perceptual Study. Neurocomputing Journal, Special Issue of NOLISP 2011, DOI: /j.neucom
[Picheny et al. 1986] M. A. Picheny, N. I. Durlach and L. D. Braida. Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing Research, vol. 29, pages ,
[Picheny et al. 1989] M. A. Picheny, N. I. Durlach and L. D. Braida. Speaking clearly for the hard of hearing III: an attempt to determine the contribution of speaking rate to differences in intelligibility between clear and conversational speech. Journal of Speech and Hearing Research, vol. 32, pages ,
[Pick et al. 1989] Herbert L. Pick, Gerald M. Siegel, Paul W. Fox, Sharon R. Garber and Joseph K. Kearney. Inhibiting the Lombard effect. Journal of the Acoustical Society of America, vol. 85, no. 2, pages ,
[PowerplayPro8 2002] PowerplayPro8. [Online] HA8000.aspx,
[Pucher et al. 2010a] Michael Pucher, Dietmar Schabus and Junichi Yamagishi. Synthesis of Fast Speech with Interpolation of Adapted HSMMs and Its Evaluation by Blind and Sighted Listeners. In Proceedings of Interspeech, Makuhari, Chiba, Japan, September
[Pucher et al. 2010b] Michael Pucher, Dietmar Schabus, Junichi Yamagishi, Friedrich Neubarth and Volker Strom. Modeling and interpolation of Austrian German and Viennese dialect in HMM-based speech synthesis. Speech Communication, vol. 52, no. 2, pages , February
[Qin et al. 2006] Long Qin, Zhen-Hua Ling, Yi-Jian Wu, Bu-Fan Zhang and Ren-Hua Wang. HMM-based emotional speech synthesis using average emotion model. In Proceedings of the 5th international conference on Chinese Spoken Language Processing, volume 4274, pages , Singapore, December
[Rabiner & Juang 1993] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of speech recognition. Prentice Hall Signal Processing Series, , 11, 13, 14, 110
[Raitio et al. 2008] Tuomo Raitio, Antti Suni, Hannu Pulakka, Martti Vainio and Paavo Alku. HMM-Based Finnish Text-to-Speech System Utilizing Glottal Inverse Filtering. In Proceedings of Interspeech, pages , Brisbane, Australia, September
[Raitio et al. 2011a] Tuomo Raitio, Antti Suni, Martti Vainio and Paavo Alku. Analysis of HMM-Based Lombard Speech Synthesis. In Proceedings of Interspeech, pages , Florence, Italy, August , 64
[Raitio et al. 2011b] Tuomo Raitio, Antti Suni, Junichi Yamagishi, Hannu Pulakka, Jani Nurminen, Martti Vainio and Paavo Alku. HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering. IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pages , , 102
[Roekhaut et al. 2010] Sophie Roekhaut, Jean-Philippe Goldman and Anne Catherine Simon. A Model for Varying Speaking Style in TTS systems. In Proceedings of the Fifth International Conference on Speech Prosody, pages 1–4, Chicago, Illinois, USA, May
[Rouas et al. 2004] Jean-Luc Rouas, Jérôme Farinas and François Pellegrino. Evaluation automatique du débit de la parole sur des données multilingues spontanées. In Proceedings of the XXVe Journées d'Etude sur la Parole, pages , Fès, Morocco, April
[S3.5 1997] ANSI S3.5. Methods for the calculation of the speech intelligibility index. American National Standards, , 90
[Sagisaka et al. 1992] Yoshinori Sagisaka, Nobuyoshi Kaiki, Naoto Iwahashi and Katsuhiko Mimura. ATR µ-talk Speech Synthesis System. In Proceedings of the International Conference on Spoken Language Processing, pages , Banff, Alberta, Canada, October
[Saheer et al. 2010] Lakshmi Saheer, John Dines, Philip N. Garner and Hui Liang. Implementation of VTLN for Statistical Speech Synthesis. In Proceedings of the Speech Synthesis Workshop 7, Kyoto, Japan, September
[Saheer et al. 2012a] Lakshmi Saheer, John Dines and Philip N. Garner. Vocal Tract Length Normalization for Statistical Parametric Speech Synthesis. IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 7, pages ,
[Saheer et al. 2012b] Lakshmi Saheer, Junichi Yamagishi, Philip N. Garner and John Dines. Combining vocal tract length normalization with hierarchical linear transformations. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Kyoto, Japan, March
[Scholes 1967] Robert J. Scholes. Phoneme Categorization of Synthetic Vocalic Stimuli By Speakers of Japanese, Spanish, Persian, and American English. Language and Speech, vol. 10, no. 1, pages 46–68,
[Shichiri et al. 2002] Kengo Shichiri, Atsushi Sawabe, Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi and Tadashi Kitamura. Eigenvoices for HMM-Based Speech Synthesis. In Proceedings of the International Conference on Spoken Language Processing, pages , Denver, Colorado, USA, September
[Sjölander & Beskow 2000] Kåre Sjölander and Jonas Beskow. Wavesurfer - an open source speech tool. In Proceedings of the International Conference on Spoken Language Processing, volume 4, pages , Beijing, China, October
[Skowronski & Harris 2006] Mark D. Skowronski and John G. Harris. Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environments. Speech Communication, vol. 48, no. 5, pages ,
[Slaney et al. 1996] Malcolm Slaney, Michele Covell and Bud Lassiter. Automatic Audio Morphing. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Atlanta, Georgia, USA, May
[Smiljanić & Bradlow 2005] Rajka Smiljanić and Ann R. Bradlow. Production and perception of clear speech in Croatian and English. Journal of the Acoustical Society of America, vol. 118, no. 3, pages , , 90
[Smit 2010] Peter Smit. A Review Of Eigenvoice Adaptation. Technical report, Aalto University, November , 108
[Södersten et al. 1995] Maria Södersten, Stellan Hertegård and Britta Hammarberg. Glottal closure, transglottal airflow, and voice quality in healthy middle-aged women. Journal of Voice, vol. 9, no. 2, pages , June
[Song et al. 2013] Peng Song, Wenming Zheng and Li Zhao. Non-Parallel Training for Voice Conversion based on Adaptation Method. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Vancouver, Canada, May
[Steeneken & Houtgast 1980] H. J. M. Steeneken and T. Houtgast. A physical method for measuring speech-transmission quality. Journal of the Acoustical Society of America, vol. 67, no. 1, pages ,
[Stent et al. 2011] Amanda Stent, Ann Syrdal and Taniya Mishra. On the Intelligibility of Fast Synthesized Speech for Individuals with Early-Onset Blindness. In Proceedings of the ACM SIGACCESS Conference on Computers and Accessibility, pages , Dundee, Scotland, UK, October
[Stylianou et al. 1998] Yannis Stylianou, Olivier Cappé and Eric Moulines. Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pages , , 108
[Stylianou 2001] Yannis Stylianou. Applying the Harmonic plus Noise Model in Concatenative Speech Synthesis. IEEE Transactions on Speech and Audio Processing, vol. 9, no. 1, pages 21–29, January
[Summers et al. 1988] W. Van Summers, David B. Pisoni, Robert H. Bernacki, Robert I. Pedlow and Michael A. Stokes. Effects of noise on speech production: acoustic and perceptual analyses. Journal of the Acoustical Society of America, vol. 84, no. 3, pages , September , 49, 90
[Suni et al. 2010] Antti Suni, Tuomo Raitio, Martti Vainio and Paavo Alku. The GlottHMM speech synthesis entry for Blizzard Challenge 2010. In Proceedings of the Blizzard Challenge 2010 workshop, online, Kyoto, Japan,
[Syrdal et al. 2012] Ann K. Syrdal, H. Timothy Bunnell, Susan R. Hertz, Taniya Mishra, Murray Spiegel, Corine Bickley, Deborah Rekart and Matthew J. Makashay. Text-To-Speech Intelligibility across Speech Rates. In Proceedings of Interspeech, pages , Portland, Oregon, USA, September
[Taal et al. 2010] Cees H. Taal, Richard C. Hendriks, Richard Heusdens and Jesper Jensen. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Dallas, Texas, USA, March
[Tachibana et al. 2003] Makoto Tachibana, Junichi Yamagishi, Koji Onishi, Takashi Masuko and Takao Kobayashi. HMM-based speech synthesis with various speaking styles using model interpolation and adaptation. Technical report, vol. 103, no. 264, pp. , IEICE Technical Report,
[Tachibana et al. 2005] Makoto Tachibana, Junichi Yamagishi, Takashi Masuko and Takao Kobayashi. Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing. IEICE Transactions on Information and Systems, vol. 88, no. 11, pages ,
[Tachibana et al. 2006] Makoto Tachibana, Junichi Yamagishi, Takashi Masuko and Takao Kobayashi. A Style Adaptation Technique for Speech Synthesis Using HSMM and Suprasegmental Features. IEICE Transactions on Information and Systems, vol. 89, no. 3, pages , , 109
[Tamura et al. 1998] Masatsune Tamura, Takashi Masuko, Keiichi Tokuda and Takao Kobayashi. Speaker adaptation for HMM-based speech synthesis system using MLLR. In Proceedings of the Speech Synthesis Workshop 3, pages , Jenolan Caves House, Blue Mountains, New South Wales, Australia, November
[Tamura et al. 2001] Masatsune Tamura, Takashi Masuko, Keiichi Tokuda and Takao Kobayashi. Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages , Salt Lake City, Utah, USA, May , 74, 76
[Tang & Cooke 2010] Yan Tang and Martin Cooke. Energy reallocation strategies for speech enhancement in known noise conditions. In Proceedings of Interspeech, pages , Makuhari, Chiba, Japan, September
[Tang & Cooke 2011] Yan Tang and Martin Cooke. Subjective and objective evaluation of speech intelligibility enhancement under constant energy and duration constraints. In Proceedings of Interspeech, pages , Florence, Italy, August
[Taylor 2009] Paul Taylor. Text-to-speech synthesis. Cambridge University Press,
[Toda & Tokuda 2007] Tomoki Toda and Keiichi Tokuda. A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis. IEICE Transactions on Information and Systems, vol. 90, no. 5, pages , May
[Toda et al. 2006] Tomoki Toda, Yamato Ohtani and Kiyohiro Shikano. Eigenvoice conversion based on Gaussian mixture model. In Proceedings of Interspeech, pages , Pittsburgh, Pennsylvania, USA, September , 108
[Toda et al. 2007a] Tomoki Toda, Alan W. Black and Keiichi Tokuda. Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory. IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pages , , 108
[Toda et al. 2007b] Tomoki Toda, Yamato Ohtani and Kiyohiro Shikano. One-to-Many and Many-to-One Voice Conversion Based on Eigenvoices. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Honolulu, Hawaii, USA, April
[Toda et al. 2012] Tomoki Toda, Takashi Muramatsu and Hideki Banno. Implementation of Computationally Efficient Real-Time Voice Conversion. In Proceedings of Interspeech, pages 94–97, Portland, Oregon, USA, September
[Tokuda & Zen 2009] Keiichi Tokuda and Heiga Zen. Fundamentals and recent advances in HMM-based speech synthesis. In Tutorial of Interspeech 2009, Brighton, United Kingdom, September xiii, 24, 26, 31, 32
[Tokuda et al. 1994] Keiichi Tokuda, Takao Kobayashi, Takashi Masuko and Satoshi Imai. Mel-generalized cepstral analysis - a unified approach to speech spectral estimation. In Proceedings of the International Conference on Spoken Language Processing, volume 3, pages , Yokohama, Japan, September , 110
[Tokuda et al. 1999] Keiichi Tokuda, Takashi Masuko, Noboru Miyazaki and Takao Kobayashi. Hidden Markov Models Based on Multi-Space Probability Distribution for Pitch Pattern Modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages , Phoenix, Arizona, USA, May
[Tokuda et al. 2000] Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi and Tadashi Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages , Istanbul, Turkey, June
[Tokuda et al. 2002a] Keiichi Tokuda, Takashi Masuko, Noboru Miyazaki and Takao Kobayashi. Multi-Space Probability Distribution HMM. IEICE Transactions on Information and Systems, vol. 85, no. 3, pages , , 74
[Tokuda et al. 2002b] Keiichi Tokuda, Heiga Zen and Alan W. Black. An HMM-based speech synthesis system applied to English. In Proceedings of the IEEE Workshop on Speech Synthesis, pages , Santa Monica, California, USA, September xiii, 28
[Tribolet et al. 1978] J. M. Tribolet, P. Noll, B. McDermott and R. E. Crochiere. A study of complexity and quality of speech waveform coders. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages , Tulsa, Oklahoma, USA, April
[Turk et al. 2005] Oytun Turk, Marc Schröder, Baris Bozkurt and Levent M. Arslan. Voice Quality Interpolation for Emotional Text-to-Speech Synthesis. In Proceedings of Interspeech, pages , Lisbon, Portugal, September
[Uebel & Woodland 1999] Luis Felipe Uebel and Philip C. Woodland. An Investigation into Vocal Tract Length Normalisation. In Proceedings of Eurospeech, pages , Budapest, Hungary, September
[Uto et al. 2006] Yosuke Uto, Yoshihiko Nankaku, Tomoki Toda, Akinobu Lee and Keiichi Tokuda. Voice Conversion Based on Mixtures of Factor Analyzers. In Proceedings of Interspeech, pages , Pittsburgh, Pennsylvania, USA, September
[Valentini-Botinhao et al. 2011] Cassia Valentini-Botinhao, Junichi Yamagishi and Simon King. Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise? In Proceedings of Interspeech, Florence, Italy, August
[Valentini-Botinhao et al. 2012a] Cassia Valentini-Botinhao, Ranniery Maia, Junichi Yamagishi, Simon King and Heiga Zen. Cepstral analysis based on the Glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Kyoto, Japan, March , 92
[Valentini-Botinhao et al. 2012b] Cassia Valentini-Botinhao, Junichi Yamagishi and Simon King. Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. In Proceedings of Interspeech, Portland, Oregon, USA, September , 92
[van Santen & Buchsbaum 1997] Jan P. H. van Santen and Adam L. Buchsbaum. Methods for Optimal Text Selection. In Proceedings of Eurospeech, pages , Rhodes, Greece, September
[van Santen 1997] Jan P. H. van Santen. Combinatorial Issues in Text-To-Speech Synthesis. In Proceedings of Eurospeech, pages , Rhodes, Greece, September
[VirtualizerDSP ] VirtualizerDSP1000. EN/Products/DSP1000P.aspx, [Online]
[Viterbi 1967] Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, vol. 13, no. 2, pages ,
[Wang et al. 2012] Linfang Wang, Lijuan Wang, Yan Teng, Zhe Geng and Frank K. Soong. Objective Intelligibility Assessment of Text-to-Speech System using Template Constrained Generalized Posterior Probability. In Proceedings of Interspeech, pages , Portland, Oregon, USA, September
[Wassink et al. 2007] Alicia Beckford Wassink, Richard A. Wright and Amber D. Franklin. Intraspeaker variability in vowel production: An investigation of motherese, hyperspeech, and Lombard speech in Jamaican speakers. Journal of Phonetics, vol. 35, pages ,
[Wouters & Macon 2001] Johan Wouters and Michael W. Macon. Control of spectral dynamics in concatenative speech synthesis. IEEE Transactions on Speech and Audio Processing, vol. 9, no. 1, pages 30–38, , 48, 64
[Wouters 1996] Johan Wouters. Analysis and Synthesis of Degree of Articulation. PhD thesis, Katholieke Universiteit Leuven, Belgium, , 51, 64
[Wu et al. 2009] Yi-Jian Wu, Yoshihiko Nankaku and Keiichi Tokuda. State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis. In Proceedings of Interspeech, pages , Brighton, United Kingdom, September
[Yamagishi & King 2010] Junichi Yamagishi and Simon King. Simple methods for improving speaker-similarity of HMM-based speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages , Dallas, Texas, USA, March
[Yamagishi & Kobayashi 2007] Junichi Yamagishi and Takao Kobayashi. Average-Voice-based Speech Synthesis using HSMM-based Speaker Adaptation and Adaptive Training. IEICE Transactions on Information and Systems, vol. 90, no. 2, pages , , 34, 67, 68, 74, 77, 78, 81
[Yamagishi et al. 2003a] Junichi Yamagishi, Koji Onishi, Takashi Masuko and Takao Kobayashi. Modeling of various speaking styles and emotions for HMM-based speech synthesis. In Proceedings of Eurospeech, pages , Geneva, Switzerland, September
[Yamagishi et al. 2003b] Junichi Yamagishi, Masatsune Tamura, Takashi Masuko, Keiichi Tokuda and Takao Kobayashi. A Training Method of Average Voice Model for HMM-Based Speech Synthesis. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 86, no. 8, pages ,
[Yamagishi et al. 2004] Junichi Yamagishi, Takashi Masuko and Takao Kobayashi. HMM-based expressive speech synthesis - Towards TTS with arbitrary speaking styles and emotions. In Proceedings of the Special Workshop in Maui, Hawaii, January , 77
[Yamagishi et al. 2008] Junichi Yamagishi, Zhen-Hua Ling and Simon King. Robustness of HMM-Based Speech Synthesis. In Proceedings of Interspeech, pages , Brisbane, Australia, September
[Yamagishi et al. 2009a] Junichi Yamagishi, Takao Kobayashi, Yuji Nakano, Katsumi Ogata and Juri Isogai. Analysis of Speaker Adaptation Algorithms for HMM-Based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm. IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 1, pages 66–83, January xiii, 34, 36, 74
[Yamagishi et al. 2009b] Junichi Yamagishi, Takashi Nose, Heiga Zen, Zhen-Hua Ling, Tomoki Toda, Keiichi Tokuda, Simon King and Steve Renals. Robust Speaker-Adaptive HMM-based Text-to-Speech Synthesis. IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pages , August , 77, 78, 108
[Yamagishi et al. 2010] Junichi Yamagishi, Bela Usabaev, Simon King, Oliver Watts, John Dines, Jilei Tian, Yong Guan, Rile Hu, Keiichiro Oura, Yi-Jian Wu, Keiichi Tokuda, Reima Karhila and Mikko Kurimo. Thousands of Voices for HMM-Based Speech Synthesis - Analysis and Application of TTS Systems Built on Various ASR Corpora. IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pages ,
[Yamagishi 2006] Junichi Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, Japan, March xiii, 13, 24, 29, 30, 31, 36, 37, 74, 78, 109
[Ye & Young 2004] Hui Ye and Steve Young. High quality voice morphing. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Quebec, Canada, May , 108
[Yegnanarayana et al. 2008] B. Yegnanarayana, S. Rajendran, Hussien Seid Worku and N. Dhananjaya. Analysis of Glottal Stops in Speech Signals. In Proceedings of Interspeech, pages , Brisbane, Australia, September
[Yoo et al. 2007] Sungyub D. Yoo, J. Robert Boston, Amro El-Jaroudi, Ching-Chung Li, John D. Durrant, Kristie Kovacyk and Susan Shaiman. Speech signal modification to increase intelligibility in noisy environments. Journal of the Acoustical Society of America, vol. 122, no. 2, pages ,
[Yoshimura et al. 1998] Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi and Tadashi Kitamura. Duration Modeling For HMM-Based Speech Synthesis. In Proceedings of the International Conference on Spoken Language Processing, Sydney, Australia, November 30 – December , 29, 30
[Yoshimura et al. 1999] Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi and Tadashi Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of Eurospeech, pages , Budapest, Hungary, September , 29, 78
[Yoshimura et al. 2000] Takayoshi Yoshimura, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi and Tadashi Kitamura. Speaker interpolation for HMM-based speech synthesis system. Journal of the Acoustical Society of Japan, vol. 21, no. 4, pages ,
[Yoshimura et al. 2001] Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi and Tadashi Kitamura. Mixed Excitation for HMM-based Speech Synthesis. In Proceedings of Eurospeech, pages , Aalborg, Denmark, September
[Yoshimura 2002] Takayoshi Yoshimura. Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems. PhD thesis, Department of Electrical and Computer Engineering, Nagoya Institute of Technology, Japan,
[Yuan et al. 2006] Jiahong Yuan, Mark Liberman and Christopher Cieri. Towards an Integrated Understanding of Speaking Rate in Conversation. In Proceedings of Interspeech, pages , Pittsburgh, Pennsylvania, USA, September
[Zen et al. 2007] Heiga Zen, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi and Tadashi Kitamura. A Hidden Semi-Markov Model-Based Speech Synthesis System. IEICE Transactions on Information and Systems, vol. 90, no. 5, pages , May , 78
[Zen et al. 2009] Heiga Zen, Keiichi Tokuda and Alan W. Black. Statistical parametric speech synthesis. Speech Communication, vol. 51, no. 11, pages , November xiii, 1, 5, 9, 18, 28, 31, 34, 36, 64, 65, 73, 110, 112
[Zorila et al. 2012] Tudor-Catalin Zorila, Varvara Kandia and Yannis Stylianou. Speech-in-noise intelligibility improvement based on spectral shaping and dynamic range compression. In Proceedings of Interspeech, pages , Portland, Oregon, USA, September
Appendix A

Publications

A.1 Journals

B. Picart, T. Drugman, T. Dutoit, Automatic Variation of the Degree of Articulation in New HMM-based Voices, IEEE Journal of Selected Topics in Signal Processing, Special Issue on Statistical Parametric Speech Synthesis, DOI: /JSTSP , to appear soon
B. Picart, T. Drugman, T. Dutoit, Analysis and HMM-based Synthesis of Hypo and Hyperarticulated Speech, Computer Speech & Language, Special Issue of The Listening Talker, DOI: /j.csl , to appear soon
B. Picart, T. Drugman, T. Dutoit, HMM-based Speech Synthesis with Various Degrees of Articulation: a Perceptual Study, Neurocomputing Journal, Special Issue of NOLISP 2011, DOI: /j.neucom , to appear soon
J. Urbain, R. Niewiadomski, E. Bevacqua, T. Dutoit, A. Moinet, C. Pelachaud, B. Picart, J. Tilmanne, J. Wagner, AVLaughterCycle - Enabling a virtual agent to join in laughing with a conversational partner using a similarity-driven audiovisual laughter animation, Journal on Multimodal User Interfaces (JMUI), Volume 4, Number 1, pages 47-58, 2010, DOI: /s

A.2 Conference Proceedings

B. Picart, S. Brognaux, T. Drugman, HMM-based Speech Synthesis of Live Sports Commentaries: Integration of a Two-Layer Prosody Annotation, Speech Synthesis Workshop 8 (SSW8), pages 19-24, August 31 - September 2, Barcelona, Spain, 2013
S. Brognaux, B. Picart, T. Drugman, A New Prosody Annotation Protocol for Live Sports Commentaries, Proceedings of Interspeech, pages , August 25-29, Lyon, France, 2013
M. Astrinaki, N. d'Alessandro, B. Picart, T. Drugman, T. Dutoit, Reactive and Continuous Control of HMM-based Speech Synthesis, IEEE Workshop on Spoken Language Technology (SLT), pages , December 2-5, Miami, Florida, USA, 2012
B. Picart, T. Drugman, T. Dutoit, Statistical Methods for Varying the Degree of Articulation in New HMM-based Voices, IEEE Workshop on Spoken Language Technology (SLT), pages , December 2-5, Miami, Florida, USA, 2012
B. Picart, T. Drugman, T. Dutoit, Assessing the Intelligibility and Quality of HMM-based Speech Synthesis with a Variable Degree of Articulation, The Listening Talker (LISTA) workshop, pages 44-47, May 2-3, Edinburgh, Scotland, 2012
B. Picart, T. Drugman, T. Dutoit, Perceptual Effects of the Degree of Articulation in HMM-based Speech Synthesis, Proceedings of the Non-Linear Speech Processing International Workshop (NOLISP), pages , November 07-09, Las Palmas, Gran Canaria, 2011
B. Picart, T. Drugman, T. Dutoit, Continuous Control of the Degree of Articulation in HMM-based Speech Synthesis, Proceedings of Interspeech, pages , August 27-31, Florence, Italy, 2011
B. Picart, T. Drugman, T. Dutoit, Analysis and Synthesis of Hypo and Hyperarticulated Speech, Speech Synthesis Workshop 7 (SSW7), pages , September 22-24, Kyoto, Japan, 2010
J. Urbain, E. Bevacqua, T. Dutoit, A. Moinet, R. Niewiadomski, C. Pelachaud, B. Picart, J. Tilmanne, J. Wagner, La base de données AVLaughterCycle, Actes des 28èmes Journées d'Etude sur la Parole (JEP 2010), pages 61-64, May 25-28, Mons, Belgium, 2010
J. Urbain, E. Bevacqua, T. Dutoit, A. Moinet, R. Niewiadomski, C. Pelachaud, B. Picart, J. Tilmanne, J. Wagner, The AVLaughterCycle Database, Proceedings of the 7th conference on International Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), pages , May 19-21, Valletta, Malta, 2010
A. Asaei, B. Picart, H. Bourlard, Analysis of Phone Posterior Feature Space Exploiting Class-Specific Sparsity and MLP-based Similarity Measure, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages , March 14-19, Dallas, Texas, USA, 2010
J. Urbain, E. Bevacqua, T. Dutoit, A. Moinet, R. Niewiadomski, C. Pelachaud, B. Picart, J. Tilmanne, J. Wagner, AVLaughterCycle: An audiovisual laughing machine, Proceedings of the 5th International Summer Workshop on Multimodal Interfaces - eNTERFACE'09, pages 79-87, Genova, Italy, 2009

A.3 Scientific Reports

M. Astrinaki, O. Babacan, N. d'Alessandro, B. Picart, T. Dutoit, pHTS for Max/MSP: A Streaming Architecture for Statistical Parametric Speech Synthesis, Quarterly Progress and Status Report of the Numediart Research Program, Volume 4, Number 1, pp. 7-11, March 2011
J. Urbain, E. Bevacqua, T. Dutoit, A. Moinet, R. Niewiadomski, C. Pelachaud, B. Picart, J. Tilmanne, J. Wagner, AVLaughterCycle: An audiovisual laughing machine, Quarterly Progress and Status Report of the Numediart Research Program, Volume 2, Number 3, pp. , September 2009
International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 Automatic Evaluation Software for Contact Centre Agents voice Handling Performance K.K.A. Nipuni N. Perera,
Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System
Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Oana NICOLAE Faculty of Mathematics and Computer Science, Department of Computer Science, University of Craiova, Romania [email protected]
The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications
Forensic Science International 146S (2004) S95 S99 www.elsevier.com/locate/forsciint The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications A.
Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN
PAGE 30 Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN Sung-Joon Park, Kyung-Ae Jang, Jae-In Kim, Myoung-Wan Koo, Chu-Shik Jhon Service Development Laboratory, KT,
SPEAKER IDENTIFICATION FROM YOUTUBE OBTAINED DATA
SPEAKER IDENTIFICATION FROM YOUTUBE OBTAINED DATA Nitesh Kumar Chaudhary 1 and Shraddha Srivastav 2 1 Department of Electronics & Communication Engineering, LNMIIT, Jaipur, India 2 Bharti School Of Telecommunication,
Ericsson T18s Voice Dialing Simulator
Ericsson T18s Voice Dialing Simulator Mauricio Aracena Kovacevic, Anna Dehlbom, Jakob Ekeberg, Guillaume Gariazzo, Eric Lästh and Vanessa Troncoso Dept. of Signals Sensors and Systems Royal Institute of
Artificial Neural Network for Speech Recognition
Artificial Neural Network for Speech Recognition Austin Marshall March 3, 2005 2nd Annual Student Research Showcase Overview Presenting an Artificial Neural Network to recognize and classify speech Spoken
ACOUSTICAL CONSIDERATIONS FOR EFFECTIVE EMERGENCY ALARM SYSTEMS IN AN INDUSTRIAL SETTING
ACOUSTICAL CONSIDERATIONS FOR EFFECTIVE EMERGENCY ALARM SYSTEMS IN AN INDUSTRIAL SETTING Dennis P. Driscoll, P.E. and David C. Byrne, CCC-A Associates in Acoustics, Inc. Evergreen, Colorado Telephone (303)
TECHNICAL LISTENING TRAINING: IMPROVEMENT OF SOUND SENSITIVITY FOR ACOUSTIC ENGINEERS AND SOUND DESIGNERS
TECHNICAL LISTENING TRAINING: IMPROVEMENT OF SOUND SENSITIVITY FOR ACOUSTIC ENGINEERS AND SOUND DESIGNERS PACS: 43.10.Sv Shin-ichiro Iwamiya, Yoshitaka Nakajima, Kazuo Ueda, Kazuhiko Kawahara and Masayuki
Myanmar Continuous Speech Recognition System Based on DTW and HMM
Myanmar Continuous Speech Recognition System Based on DTW and HMM Ingyin Khaing Department of Information and Technology University of Technology (Yatanarpon Cyber City),near Pyin Oo Lwin, Myanmar Abstract-
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS PIERRE LANCHANTIN, ANDREW C. MORRIS, XAVIER RODET, CHRISTOPHE VEAUX Very high quality text-to-speech synthesis can be achieved by unit selection
A Comparison of Speech Coding Algorithms ADPCM vs CELP. Shannon Wichman
A Comparison of Speech Coding Algorithms ADPCM vs CELP Shannon Wichman Department of Electrical Engineering The University of Texas at Dallas Fall 1999 December 8, 1999 1 Abstract Factors serving as constraints
Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification
Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification Raphael Ullmann 1,2, Ramya Rasipuram 1, Mathew Magimai.-Doss 1, and Hervé Bourlard 1,2 1 Idiap Research Institute,
Solutions to Exam in Speech Signal Processing EN2300
Solutions to Exam in Speech Signal Processing EN23 Date: Thursday, Dec 2, 8: 3: Place: Allowed: Grades: Language: Solutions: Q34, Q36 Beta Math Handbook (or corresponding), calculator with empty memory.
Introduction to Pattern Recognition
Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University [email protected] CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
Voice Communication Package v7.0 of front-end voice processing software technologies General description and technical specification
Voice Communication Package v7.0 of front-end voice processing software technologies General description and technical specification (Revision 1.0, May 2012) General VCP information Voice Communication
Turkish Radiology Dictation System
Turkish Radiology Dictation System Ebru Arısoy, Levent M. Arslan Boaziçi University, Electrical and Electronic Engineering Department, 34342, Bebek, stanbul, Turkey [email protected], [email protected]
RANDOM VIBRATION AN OVERVIEW by Barry Controls, Hopkinton, MA
RANDOM VIBRATION AN OVERVIEW by Barry Controls, Hopkinton, MA ABSTRACT Random vibration is becoming increasingly recognized as the most realistic method of simulating the dynamic environment of military
Dynamic sound source for simulating the Lombard effect in room acoustic modeling software
Dynamic sound source for simulating the Lombard effect in room acoustic modeling software Jens Holger Rindel a) Claus Lynge Christensen b) Odeon A/S, Scion-DTU, Diplomvej 381, DK-2800 Kgs. Lynby, Denmark
Unlocking Value from. Patanjali V, Lead Data Scientist, Tiger Analytics Anand B, Director Analytics Consulting,Tiger Analytics
Unlocking Value from Patanjali V, Lead Data Scientist, Anand B, Director Analytics Consulting, EXECUTIVE SUMMARY Today a lot of unstructured data is being generated in the form of text, images, videos
TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS
TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS 1. Bandwidth: The bandwidth of a communication link, or in general any system, was loosely defined as the width of
Music technology. Draft GCE A level and AS subject content
Music technology Draft GCE A level and AS subject content July 2015 Contents The content for music technology AS and A level 3 Introduction 3 Aims and objectives 3 Subject content 4 Recording and production
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION P. Vanroose Katholieke Universiteit Leuven, div. ESAT/PSI Kasteelpark Arenberg 10, B 3001 Heverlee, Belgium [email protected]
Jitter Measurements in Serial Data Signals
Jitter Measurements in Serial Data Signals Michael Schnecker, Product Manager LeCroy Corporation Introduction The increasing speed of serial data transmission systems places greater importance on measuring
Carla Simões, [email protected]. Speech Analysis and Transcription Software
Carla Simões, [email protected] Speech Analysis and Transcription Software 1 Overview Methods for Speech Acoustic Analysis Why Speech Acoustic Analysis? Annotation Segmentation Alignment Speech Analysis
This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.
This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. Title Transcription of polyphonic signals using fast filter bank( Accepted version ) Author(s) Foo, Say Wei;
Functional Communication for Soft or Inaudible Voices: A New Paradigm
The following technical paper has been accepted for presentation at the 2005 annual conference of the Rehabilitation Engineering and Assistive Technology Society of North America. RESNA is an interdisciplinary
Advanced Speech-Audio Processing in Mobile Phones and Hearing Aids
Advanced Speech-Audio Processing in Mobile Phones and Hearing Aids Synergies and Distinctions Peter Vary RWTH Aachen University Institute of Communication Systems WASPAA, October 23, 2013 Mohonk Mountain
Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers
Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers Sung-won ark and Jose Trevino Texas A&M University-Kingsville, EE/CS Department, MSC 92, Kingsville, TX 78363 TEL (36) 593-2638, FAX
Sound Pressure Measurement
Objectives: Sound Pressure Measurement 1. Become familiar with hardware and techniques to measure sound pressure 2. Measure the sound level of various sizes of fan modules 3. Calculate the signal-to-noise
L2 EXPERIENCE MODULATES LEARNERS USE OF CUES IN THE PERCEPTION OF L3 TONES
L2 EXPERIENCE MODULATES LEARNERS USE OF CUES IN THE PERCEPTION OF L3 TONES Zhen Qin, Allard Jongman Department of Linguistics, University of Kansas, United States [email protected], [email protected]
Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.
Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet
Develop Software that Speaks and Listens
Develop Software that Speaks and Listens Copyright 2011 Chant Inc. All rights reserved. Chant, SpeechKit, Getting the World Talking with Technology, talking man, and headset are trademarks or registered
Fairfield Public Schools
Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity
The Effect of Long-Term Use of Drugs on Speaker s Fundamental Frequency
The Effect of Long-Term Use of Drugs on Speaker s Fundamental Frequency Andrey Raev 1, Yuri Matveev 1, Tatiana Goloshchapova 2 1 Speech Technology Center, St. Petersburg, RUSSIA {raev, matveev}@speechpro.com
Non-Data Aided Carrier Offset Compensation for SDR Implementation
Non-Data Aided Carrier Offset Compensation for SDR Implementation Anders Riis Jensen 1, Niels Terp Kjeldgaard Jørgensen 1 Kim Laugesen 1, Yannick Le Moullec 1,2 1 Department of Electronic Systems, 2 Center
Formant Bandwidth and Resilience of Speech to Noise
Formant Bandwidth and Resilience of Speech to Noise Master Thesis Leny Vinceslas August 5, 211 Internship for the ATIAM Master s degree ENS - Laboratoire Psychologie de la Perception - Hearing Group Supervised
LAB 7 MOSFET CHARACTERISTICS AND APPLICATIONS
LAB 7 MOSFET CHARACTERISTICS AND APPLICATIONS Objective In this experiment you will study the i-v characteristics of an MOS transistor. You will use the MOSFET as a variable resistor and as a switch. BACKGROUND
CBS RECORDS PROFESSIONAL SERIES CBS RECORDS CD-1 STANDARD TEST DISC
CBS RECORDS PROFESSIONAL SERIES CBS RECORDS CD-1 STANDARD TEST DISC 1. INTRODUCTION The CBS Records CD-1 Test Disc is a highly accurate signal source specifically designed for those interested in making
From Concept to Production in Secure Voice Communications
From Concept to Production in Secure Voice Communications Earl E. Swartzlander, Jr. Electrical and Computer Engineering Department University of Texas at Austin Austin, TX 78712 Abstract In the 1970s secure
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Quarterly Progress and Status Report. Measuring inharmonicity through pitch extraction
Dept. for Speech, Music and Hearing Quarterly Progress and Status Report Measuring inharmonicity through pitch extraction Galembo, A. and Askenfelt, A. journal: STL-QPSR volume: 35 number: 1 year: 1994
Establishing the Uniqueness of the Human Voice for Security Applications
Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 Establishing the Uniqueness of the Human Voice for Security Applications Naresh P. Trilok, Sung-Hyuk Cha, and Charles C.
Lecture 12: An Overview of Speech Recognition
Lecture : An Overview of peech Recognition. Introduction We can classify speech recognition tasks and systems along a set of dimensions that produce various tradeoffs in applicability and robustness. Isolated
How To Run Statistical Tests in Excel
How To Run Statistical Tests in Excel Microsoft Excel is your best tool for storing and manipulating data, calculating basic descriptive statistics such as means and standard deviations, and conducting
How To Recognize Voice Over Ip On Pc Or Mac Or Ip On A Pc Or Ip (Ip) On A Microsoft Computer Or Ip Computer On A Mac Or Mac (Ip Or Ip) On An Ip Computer Or Mac Computer On An Mp3
Recognizing Voice Over IP: A Robust Front-End for Speech Recognition on the World Wide Web. By C.Moreno, A. Antolin and F.Diaz-de-Maria. Summary By Maheshwar Jayaraman 1 1. Introduction Voice Over IP is
Timing Errors and Jitter
Timing Errors and Jitter Background Mike Story In a sampled (digital) system, samples have to be accurate in level and time. The digital system uses the two bits of information the signal was this big
Waves: Recording Sound Waves and Sound Wave Interference (Teacher s Guide)
Waves: Recording Sound Waves and Sound Wave Interference (Teacher s Guide) OVERVIEW Students will measure a sound wave by placing the Ward s DataHub microphone near one tuning fork A440 (f=440hz). Then
MICROPHONE SPECIFICATIONS EXPLAINED
Application Note AN-1112 MICROPHONE SPECIFICATIONS EXPLAINED INTRODUCTION A MEMS microphone IC is unique among InvenSense, Inc., products in that its input is an acoustic pressure wave. For this reason,
RUTHERFORD HIGH SCHOOL Rutherford, New Jersey COURSE OUTLINE STATISTICS AND PROBABILITY
RUTHERFORD HIGH SCHOOL Rutherford, New Jersey COURSE OUTLINE STATISTICS AND PROBABILITY I. INTRODUCTION According to the Common Core Standards (2010), Decisions or predictions are often based on data numbers
PCM Encoding and Decoding:
PCM Encoding and Decoding: Aim: Introduction to PCM encoding and decoding. Introduction: PCM Encoding: The input to the PCM ENCODER module is an analog message. This must be constrained to a defined bandwidth
MUSC 1327 Audio Engineering I Syllabus Addendum McLennan Community College, Waco, TX
MUSC 1327 Audio Engineering I Syllabus Addendum McLennan Community College, Waco, TX Instructor Brian Konzelman Office PAC 124 Phone 299-8231 WHAT IS THIS COURSE? AUDIO ENGINEERING I is the first semester
Energy savings in commercial refrigeration. Low pressure control
Energy savings in commercial refrigeration equipment : Low pressure control August 2011/White paper by Christophe Borlein AFF and l IIF-IIR member Make the most of your energy Summary Executive summary
PART 5D TECHNICAL AND OPERATING CHARACTERISTICS OF MOBILE-SATELLITE SERVICES RECOMMENDATION ITU-R M.1188
Rec. ITU-R M.1188 1 PART 5D TECHNICAL AND OPERATING CHARACTERISTICS OF MOBILE-SATELLITE SERVICES Rec. ITU-R M.1188 RECOMMENDATION ITU-R M.1188 IMPACT OF PROPAGATION ON THE DESIGN OF NON-GSO MOBILE-SATELLITE
Department of Electrical and Computer Engineering Ben-Gurion University of the Negev. LAB 1 - Introduction to USRP
Department of Electrical and Computer Engineering Ben-Gurion University of the Negev LAB 1 - Introduction to USRP - 1-1 Introduction In this lab you will use software reconfigurable RF hardware from National
TRINITY COLLEGE. Faculty of Engineering, Mathematics and Science. School of Computer Science & Statistics
UNIVERSITY OF DUBLIN TRINITY COLLEGE Faculty of Engineering, Mathematics and Science School of Computer Science & Statistics BA (Mod) Enter Course Title Trinity Term 2013 Junior/Senior Sophister ST7002
Convention Paper Presented at the 118th Convention 2005 May 28 31 Barcelona, Spain
Audio Engineering Society Convention Paper Presented at the 118th Convention 25 May 28 31 Barcelona, Spain 6431 This convention paper has been reproduced from the author s advance manuscript, without editing,
Construct User Guide
Construct User Guide Contents Contents 1 1 Introduction 2 1.1 Construct Features..................................... 2 1.2 Speech Licenses....................................... 3 2 Scenario Management
Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus
Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus Yousef Ajami Alotaibi 1, Mansour Alghamdi 2, and Fahad Alotaiby 3 1 Computer Engineering Department, King Saud University,
Introduction to Digital Audio
Introduction to Digital Audio Before the development of high-speed, low-cost digital computers and analog-to-digital conversion circuits, all recording and manipulation of sound was done using analog techniques.
The AV-LASYN Database : A synchronous corpus of audio and 3D facial marker data for audio-visual laughter synthesis
The AV-LASYN Database : A synchronous corpus of audio and 3D facial marker data for audio-visual laughter synthesis Hüseyin Çakmak, Jérôme Urbain, Joëlle Tilmanne and Thierry Dutoit University of Mons,
HIGH QUALITY AUDIO RECORDING IN NOKIA LUMIA SMARTPHONES. 1 Nokia 2013 High quality audio recording in Nokia Lumia smartphones
HIGH QUALITY AUDIO RECORDING IN NOKIA LUMIA SMARTPHONES 1 Nokia 2013 High quality audio recording in Nokia Lumia smartphones HIGH QUALITY AUDIO RECORDING IN NOKIA LUMIA SMARTPHONES This white paper describes
Speech Analysis for Automatic Speech Recognition
Speech Analysis for Automatic Speech Recognition Noelia Alcaraz Meseguer Master of Science in Electronics Submission date: July 2009 Supervisor: Torbjørn Svendsen, IET Norwegian University of Science and
ANALYZER BASICS WHAT IS AN FFT SPECTRUM ANALYZER? 2-1
WHAT IS AN FFT SPECTRUM ANALYZER? ANALYZER BASICS The SR760 FFT Spectrum Analyzer takes a time varying input signal, like you would see on an oscilloscope trace, and computes its frequency spectrum. Fourier's
Aircraft cabin noise synthesis for noise subjective analysis
Aircraft cabin noise synthesis for noise subjective analysis Bruno Arantes Caldeira da Silva Instituto Tecnológico de Aeronáutica São José dos Campos - SP [email protected] Cristiane Aparecida Martins
Manual Analysis Software AFD 1201
AFD 1200 - AcoustiTube Manual Analysis Software AFD 1201 Measurement of Transmission loss acc. to Song and Bolton 1 Table of Contents Introduction - Analysis Software AFD 1201... 3 AFD 1200 - AcoustiTube
PUMPED Nd:YAG LASER. Last Revision: August 21, 2007
PUMPED Nd:YAG LASER Last Revision: August 21, 2007 QUESTION TO BE INVESTIGATED: How can an efficient atomic transition laser be constructed and characterized? INTRODUCTION: This lab exercise will allow
APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA
APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA Tuija Niemi-Laitinen*, Juhani Saastamoinen**, Tomi Kinnunen**, Pasi Fränti** *Crime Laboratory, NBI, Finland **Dept. of Computer
