Autoregressive HMMs for speech synthesis. Interspeech 2009



1 William Byrne Interspeech 2009

2 Outline
1 Introduction: Highlights; Background
2
3 Experimental set-up; Results
4

3 Highlights
Autoregressive HMMs for speech synthesis:
- synthesis with established, excellent synthesis algorithms
- consistent: uses same model for training and synthesis, unlike standard HMM synthesis framework
- easy and efficient training using expectation maximization, unlike trajectory HMM
- performance comparable to standard HMM synthesis framework on Blizzard Challenge-style naturalness evaluation

4 Background
- HMM synthesis now rivals unit selection [1]
- a key breakthrough: respecting static-dynamic constraints during synthesis [2]
- standard HMM synthesis framework:
  - efficient EM training
  - but inconsistent: ignores static-dynamic constraints during training

[1] A.W. Black, H. Zen, and K. Tokuda. Statistical parametric speech synthesis. In Proc. ICASSP 2007.
[2] K. Tokuda, T. Kobayashi, and S. Imai. Speech parameter generation from HMM using dynamic features. In Proc. ICASSP 1995, volume 1, 1995.

5 Background (cont)
- trajectory HMM [3]:
  - respects static-dynamic constraints during training
  - improved synthesis quality
  - consistent: uses same model for training and synthesis
  - but training more complicated
- remains a challenge to find a model that can easily and consistently be used for both training and synthesis
- we investigate the autoregressive HMM [4, 5] for speech synthesis

[3] H. Zen, K. Tokuda, and T. Kitamura. An Introduction of Trajectory into HMM-Based Speech. In Proc. Fifth ISCA Workshop on Speech, 2004.
[4] C. Wellekens. Explicit time correlation in hidden Markov models for speech recognition. In Proc. ICASSP 1987, volume 12.
[5] P.C. Woodland. Hidden Markov models using vector linear prediction and discriminative output distributions. In Proc. ICASSP 1992, volume 1, 1992.




9 The model
- hidden state sequence $\theta = \theta_{1:T}$, e.g. states of full-context models (quinphones, POS, etc)
- observed acoustic feature vector sequence $c = c_{1:T}$, e.g. 40-dim static Mel-generalized cepstra

$$P(c, \theta) = \prod_t \underbrace{P(\theta_t \mid \theta_{t-1})}_{\text{transition probs}} \; \underbrace{P(c_t \mid c_{1:t-1}, \theta_t)}_{\text{state output dist}}$$

[Figure: graphical model over hidden states $\theta_1, \ldots, \theta_6$ and observations $c_1, \ldots, c_6$]
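The factorization on this slide can be sketched in code for a single scalar feature component. This is a minimal illustration, not the authors' implementation: the names `joint_logprob`, `mean_fn`, `log_init` and `log_trans` are hypothetical, and the first transition factor is assumed to be an initial-state distribution.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a scalar Gaussian N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def joint_logprob(c, theta, log_init, log_trans, mean_fn, var):
    """log P(c, theta) as a sum over frames of a transition term and a
    state output term that conditions on all past observations.

    c        : list of scalar observations c_1..c_T
    theta    : list of state indices theta_1..theta_T
    log_init : log_init[q], assumed initial-state log probability
    log_trans: log_trans[p][q] = log P(theta_t = q | theta_{t-1} = p)
    mean_fn  : mean_fn(q, past) -> mean of c_t given state q and past output
    var      : var[q], output variance of state q
    """
    total = 0.0
    for t, (c_t, q) in enumerate(zip(c, theta)):
        # transition probability (initial distribution for the first frame)
        total += log_init[q] if t == 0 else log_trans[theta[t - 1]][q]
        # state output distribution P(c_t | c_{1:t-1}, theta_t)
        total += gaussian_logpdf(c_t, mean_fn(q, c[:t]), var[q])
    return total
```

Any mean function of the past output can be plugged in as `mean_fn`; the autoregressive HMM restricts it to a linear map of summarizers.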

10 State output distributions
- state output distributions conditional Gaussian:
  $$P(c_t \mid c_{1:t-1}, \theta_t) = \mathcal{N}\left(c_t;\ \mu_{\theta_t}(c_{1:t-1}),\ \Sigma_{\theta_t}\right)$$
- mean functions ($\mu_q$) a linear map of a set of summarizers ($f^d$):
  $$\mu_q(c_{1:t-1}) = \sum_{d=1}^{D} A^d_q f^d(c_{1:t-1}) + \mu^0_q$$
  where each summarizer $f^d$ gives a vector-valued summary of past output $c_{1:t-1}$

11 State output distributions (cont)
- we use summarizers ($f^d$) a linear combination of past output:
  $$f^d(c_{1:t-1}) = \sum_{k=-K}^{-1} w^d_k c_{t+k}$$
- call ($w^d_k$) window coefficients
- use diagonal matrices $A^d_{q,ij} = a^d_{qi} \delta_{ij}$ and $\Sigma_{q,ij} = \sigma^2_{qi} \delta_{ij}$
- feature vector seq components ($c_i$) independent given state seq $\theta$


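13 Example of state output distributions
For window set (coefficients as implied by the mean functions below):

window   w_{-3}   w_{-2}   w_{-1}
d = 1       0        0        1
d = 2       0       -1        1
d = 3       1       -2        1

state output distributions for feature vector index $i$:
$$P(c_{ti} \mid c_{1:t-1}, \theta_t) = \mathcal{N}\left(c_{ti};\ \mu_{\theta_t i}(c_{1:t-1}),\ \sigma^2_{\theta_t i}\right)$$
where mean functions:
$$\mu_{qi}(c_{1:t-1}) = a^1_{qi}\, c_{(t-1)i} + a^2_{qi}\left(c_{(t-1)i} - c_{(t-2)i}\right) + a^3_{qi}\left(c_{(t-1)i} - 2c_{(t-2)i} + c_{(t-3)i}\right) + \mu^0_{qi}$$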
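The mean function in this example can be evaluated with a few lines of code. The sketch below handles one feature component and one state. The window coefficients in `WINDOWS` are read off from the three terms of the mean function above, and treating frames before the start of the utterance as zero is an assumption made here for illustration; the slides do not say how the boundary is handled.

```python
# Window coefficients w^d_k keyed by lag k (k = -1 is the previous frame).
# Read off from the mean function: value, first difference, second difference.
WINDOWS = [
    {-1: 1.0},
    {-2: -1.0, -1: 1.0},
    {-3: 1.0, -2: -2.0, -1: 1.0},
]

def summarizer(window, past):
    """f^d(c_{1:t-1}) = sum_k w^d_k c_{t+k}, where past = [c_1, ..., c_{t-1}].

    Frames before the start of the sequence are treated as zero (assumption).
    """
    total = 0.0
    for k, w in window.items():
        idx = len(past) + k  # position of c_{t+k} in the 0-based list
        if idx >= 0:
            total += w * past[idx]
    return total

def mean_function(a, mu0, past):
    """mu_qi(c_{1:t-1}) = sum_d a^d_qi f^d(c_{1:t-1}) + mu^0_qi for one state."""
    return sum(a_d * summarizer(w, past) for a_d, w in zip(a, WINDOWS)) + mu0
```

For example, with past output `[1.0, 2.0, 4.0]` the three summaries are 4.0, 2.0 and 1.0, so `mean_function([0.5, 0.25, 0.1], 1.0, [1.0, 2.0, 4.0])` combines them as 0.5·4 + 0.25·2 + 0.1·1 + 1.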

14 Expectation maximization:
- Forward-Backward algorithm for computing state occupancies $\gamma_q(t)$
- easy and efficient parameter re-estimation formulae (see paper)
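Because the state output distribution depends on past observations but not on past states, the usual Forward-Backward recursions carry over once the per-frame output likelihoods $P(c_t \mid c_{1:t-1}, \theta_t = q)$ have been evaluated. The following is a minimal sketch of the occupancy computation with per-frame scaling. The function names and the dense transition matrix are illustrative assumptions, and the re-estimation formulae themselves are in the paper and are not reproduced here.

```python
def normalise(values):
    """Scale a list of non-negative numbers to sum to one; also return the sum."""
    s = sum(values)
    return [v / s for v in values], s

def state_occupancies(init, trans, b):
    """Scaled Forward-Backward pass returning gamma[t][q] = P(theta_t = q | c).

    init : init[q], assumed initial-state probabilities
    trans: trans[p][q] = P(theta_t = q | theta_{t-1} = p)
    b    : b[t][q] = output likelihood of frame t under state q, already
           evaluated with the past output c_{1:t-1} plugged into the mean
    """
    T, Q = len(b), len(init)
    alpha = [None] * T
    scales = [0.0] * T
    alpha[0], scales[0] = normalise([init[q] * b[0][q] for q in range(Q)])
    for t in range(1, T):
        pred = [sum(alpha[t - 1][p] * trans[p][q] for p in range(Q))
                for q in range(Q)]
        alpha[t], scales[t] = normalise([pred[q] * b[t][q] for q in range(Q)])
    beta = [None] * T
    beta[T - 1] = [1.0] * Q
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(trans[q][r] * b[t + 1][r] * beta[t + 1][r]
                       for r in range(Q)) / scales[t + 1]
                   for q in range(Q)]
    return [[alpha[t][q] * beta[t][q] for q in range(Q)] for t in range(T)]
```

The scaling keeps the recursions numerically stable on long utterances, and each row of the result sums to one.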

15 For autoregressive HMM:
- $P(c \mid \theta)$ high-dimensional Gaussian over vector sequences
- can efficiently compute mean and variance of this Gaussian
- many current synthesis algorithms directly applicable:
  - synthesis using dynamic features [6]
  - synthesis considering global variance [7]

[6] K. Tokuda, T. Kobayashi, and S. Imai. Speech parameter generation from HMM using dynamic features. In Proc. ICASSP 1995, volume 1, 1995.
[7] T. Toda and K. Tokuda. Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech. In Proc. Interspeech 2005, 2005.
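One way to see why the mean of $P(c \mid \theta)$ is cheap to compute: each mean function is affine in the past output, so taking expectations frame by frame gives a forward recursion for the mean trajectory. The sketch below covers one feature component. The name `mean_trajectory`, the combined coefficients `coeffs[q][k]` (standing for $\sum_d a^d_q w^d_k$) and the zero treatment of frames before the start are assumptions for illustration, and this is not claimed to be the synthesis algorithm used in the paper.

```python
def mean_trajectory(theta, coeffs, mu0):
    """Mean of P(c | theta) for one feature component.

    theta : list of state indices theta_1..theta_T
    coeffs: coeffs[q] maps a negative lag k to sum_d a^d_q w^d_k
    mu0   : mu0[q], bias term of state q

    Uses E[c_t | theta] = sum_k coeffs[q_t][k] * E[c_{t+k} | theta] + mu0[q_t],
    which holds because the mean function is affine in the past output.
    Frames before the start are treated as zero (assumption).
    """
    mean = []
    for t, q in enumerate(theta):
        m = mu0[q]
        for k, coeff in coeffs[q].items():
            if t + k >= 0:
                m += coeff * mean[t + k]
        mean.append(m)
    return mean
```

The recursion costs on the order of $TK$ operations per component, where $K$ is the window depth.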


17 Experimental set-up
- Blizzard Challenge-style naturalness evaluation using MOS
- CMU ARCTIC database, speaker slt (approx. 1 hour)
- 4 systems:
  - natural speech
  - autoregressive HMM system (with synthesis considering GV)
  - baseline standard HMM synthesis framework system (with GV)
  - autoregressive HMM system (without GV)
- 50 utterances per listener
- 39 listeners completed (24 native, 15 non-native)

18 Systems in experiment
even for autoregressive system, only the spectral features were modelled with the autoregressive HMM:
- spectral (MGC): AR (AR system); standard (standard system)
- free params per state:
- log F0: standard
- band aperiodicity: standard
- clustering: standard(!)

implemented in HTS (easy to adapt existing code!)

19 Results (native listeners)
[Table: mean and median scores for systems A (natural), B (AR), C (standard), D (AR no GV)]
[Figure: score by system (A, B, C, D)]


21 Autoregressive HMMs for speech synthesis:
- consistent and efficient model for speech
- has advantages over standard HMM synthesis framework and trajectory HMM
- comparable performance to standard HMM synthesis framework on Blizzard Challenge-style naturalness evaluation
- easy to adapt existing code for autoregressive HMM

22 Acknowledgements
- research funded by the European Community's Seventh Framework Programme (FP7/ ), grant agreement (EMIME)
- many thanks to organizers of the Blizzard Challenge for providing scripts for our experimental evaluation

23 References
A.W. Black, H. Zen, and K. Tokuda. Statistical parametric speech synthesis. In Proc. ICASSP 2007.
T. Toda and K. Tokuda. Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech. In Proc. Interspeech 2005.
K. Tokuda, T. Kobayashi, and S. Imai. Speech parameter generation from HMM using dynamic features. In Proc. ICASSP 1995, volume 1.
C. Wellekens. Explicit time correlation in hidden Markov models for speech recognition. In Proc. ICASSP 1987, volume 12.
P.C. Woodland. Hidden Markov models using vector linear prediction and discriminative output distributions. In Proc. ICASSP 1992, volume 1.
H. Zen, K. Tokuda, and T. Kitamura. An Introduction of Trajectory into HMM-Based Speech. In Proc. Fifth ISCA Workshop on Speech, 2004.
