PROSODIC ANALYSIS OF INDIAN LANGUAGES AND ITS APPLICATIONS TO TEXT TO SPEECH SYNTHESIS


PROSODIC ANALYSIS OF INDIAN LANGUAGES AND ITS APPLICATIONS TO TEXT TO SPEECH SYNTHESIS

A THESIS submitted by

RAGHAVA KRISHNAN K

for the award of the degree of

MASTER OF SCIENCE (by Research)

DEPARTMENT OF ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
JULY 2015

THESIS CERTIFICATE

This is to certify that the thesis titled PROSODIC ANALYSIS OF INDIAN LANGUAGES AND ITS APPLICATIONS TO TEXT TO SPEECH SYNTHESIS, submitted by Raghava Krishnan K, to the Indian Institute of Technology, Madras, for the award of the degree of Master of Science, is a bona fide record of the research work done by him under our supervision. The contents of this thesis, in full or in parts, have not been submitted to any other Institute or University for the award of any degree or diploma.

Prof. S. Umesh
Research Guide
Dept. of Electrical Engineering
IIT-Madras, 600 036

Prof. Hema A. Murthy
Research Guide
Dept. of Computer Science and Engineering
IIT-Madras, 600 036

Place: Chennai
Date: 26th July, 2015

ACKNOWLEDGEMENTS

It would not have been possible to complete this thesis without the contribution of several people. I would like to express my gratitude to my advisor Prof. Hema A. Murthy for her guidance and unwavering support. She has been a constant source of encouragement and guidance and has played a major role in instilling confidence in me as a researcher. My interactions with her over the last five years have not only shaped my outlook on research, but on life as well. Her unabated energy and enthusiasm for research are qualities I truly admire and can only aspire to emulate. I am grateful to my co-advisor Prof. S. Umesh and the members of the GTC Committee, Prof. C. S. Ramalingam and Prof. C. Chandra Sekhar, for their insightful comments and suggestions with respect to my thesis. I would also like to express my gratitude to Prof. Kishore Prahallad for his valuable suggestions and criticism on various tasks that I undertook. I would like to thank Jom, Anusha, Aswin, Kasthuri, Akshay, Shreya and other members of Microsoft lab and Donlab for their support and encouragement over the years and for helping me conduct numerous listening tests. I would like to thank Anjana Babu, in particular, for having played an invaluable part in the initial work that we did on prosodic analysis. Lastly, I would like to thank my parents, my sister Jananie and Aunt Usha Rani for their unreserved support and encouragement. Knowing that I have them behind me has always made my life so much easier and has given me the strength and courage to pursue any path I wish to choose.

ABSTRACT

KEYWORDS: Syllable-based; Prosody; Pruning; Prosodic phrasing; Structural similarity; Rhythmic similarity.

Synthesis of natural-sounding speech for Indian languages has been a challenging task in the field of text-to-speech synthesis over the past few years. The quality of synthesised speech suffers mainly due to the presence of artifacts owing to mismatches in acoustic properties at both the segmental and suprasegmental levels. These artifacts affect the naturalness and intelligibility of speech, which in turn is reflected in poor mean opinion scores on listening tests. In this thesis, methods are proposed to improve the quality of speech synthesis by correcting errors in the speech database, and by manipulating the prosody of utterances to suit the given context.

Predicting prosody for text-to-speech synthesisers is heavily dependent on the punctuation marks present in the text and the part of speech (POS) tags of the words in it. Therefore, incorporating the appropriate prosody for a given text in a text-to-speech synthesis system is a challenging task, especially for Indian languages, which seldom have punctuation marks and do not have effective methods of POS tagging. Prosody in speech is characterised by rhythm, stress and intonation, and is primarily a suprasegmental feature. Suprasegmental refers to unit levels above the phoneme, such as syllables, words, phrases etc. In this work, we refer to features at the syllable level as segmental features, because syllables are the preferred units of synthesis for syllable-timed Indian languages. Suprasegmental therefore refers to levels above the syllable.

At the segmental level, bad units are discarded from the database using the acoustic properties of syllable units. This process is called pruning. It ensures that acoustic continuity is maintained in the database, and segmentation errors are also corrected. This method results in a considerable improvement in the quality of synthesis. Additionally, using the units remaining after pruning to initialise hidden Markov models for building a statistical parametric speech synthesiser is also helpful.

At the suprasegmental level, an analysis is conducted to understand the factors that affect the tones and breaks in a spoken utterance. A method to predict prosodic phrase breaks using cues from the text is proposed. The synthesis quality obtained from this system is superior to that of a system without prosodic phrase break prediction. The role played by the structure of a phrase in the prosody of an utterance is also analysed. A new measure called structural similarity, which attempts to correlate two phrases based on the structure of the text present in them, is presented. Structural similarity is also used to define a modified cost measure to select units in a syllable-based unit selection speech synthesiser. Further, the effect of syllable rhythm on the prosody of a spoken utterance is studied. A measure called rhythmic similarity, which correlates two phrases based on their syllabic rhythm patterns, is proposed. This analysis shows that rhythmically similar phrases exhibit similarities in prosodic characteristics.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS

1 Introduction
  1.1 Overview of Thesis
  1.2 Organisation of the thesis
  1.3 Contribution of the thesis

2 Theoretical Background of USS and HTS
  2.1 Description of Work done on USS for Indian languages
    Training
    Pre-clustering
    Fallback Units
    Synthesis Phase
  2.2 Description of Work done on HTS for Indian languages
    Training
    Synthesis
  2.3 Related Previous Work
    Previous work on maintaining consistency in speech databases and pruning them
    Previous work on Prosodic phrase break prediction
    Related work on prosody prediction
    Related work on speech rhythm
  2.4 Summary

3 A method to prune speech databases to improve the quality of Indian TTSes
  3.1 Pruning Technique for USS systems
  Training USS systems
  Preliminary Experiments and Results
  Using pruning to improve the quality of phone-based HTS systems
  Results of Listening tests conducted to evaluate effect of pruning on HTS system
  Pruning speech databases using likelihood as an additional cue
  Results of Listening tests conducted to evaluate the pruning using likelihood approach
  Summary

4 Prosodic Phrase Break Prediction
  Importance of Prosodic Phrase Break Prediction
  Challenges faced in prosodic phrase break prediction for Indian languages
    Lack of Punctuation and POS taggers
    Agglutinative nature of Indian languages
    Low resourcedness of Indian languages
  Case Markers for Prosodic phrasing
  Word terminal syllables for prosodic phrasing
  Experiments and Results
  Summary

5 Analysing the Effects of Phrase Structure and Syllable Rhythm on the Prosody of Syllable-Timed Indian Languages
  Structural Similarity
    Transplantation
    Application of structural similarity to USS
    Experiments and Results
    Summary
  5.2 Rhythmic Similarity
    5.2.1 Data Preparation
    Rhythmic Similarity and Transplantation
    Experiments and Results
    Summary

6 Conclusion
  Summary
  Criticism of the thesis
  Scope for future work

LIST OF TABLES

- Language databases used
- Small portion of the Common Label Set illustrating how a few examples have been mapped to their roman character equivalent
- Pairwise comparison tests for Hindi
- Pairwise comparison tests for Tamil
- Pairwise comparison tests for Hindi and Tamil to evaluate performance of the HTS system after initialising using pruned models
- Pairwise comparison test results for Hindi and Tamil to observe the performance of systems that use likelihood as a criterion
- Probabilities of Hindi case markers and Tamil word-terminal syllables (along with their notation in common label set format [1]) being followed by phrase breaks
- Results of pairwise comparison tests for Hindi and Tamil USS to compare systems with and without prosodic phrasing
- Similarity to the original utterance scores for Hindi and Tamil
- Results of DMOS and WER tests for Hindi
- Results of DMOS and WER tests for Tamil
- Similarity to the original utterance listening test results
- DTW distances between pitch and energy contours of Hindi phrases
- DTW distances between pitch and energy contours of Tamil phrases

LIST OF FIGURES

- Plot showing the long-tailed distribution of syllables in two languages, Hindi and Tamil
- Waveform segmented at the syllable level
- Flowchart of the hybrid segmentation algorithm
- An example portion of a CART for the unit sa_beg
- Overview of an HMM-based TTS
- (a) Histogram of duration difference between a pair of adjacent syllables in the database (b) Histogram of average f0 difference between a pair of adjacent syllables in the database (c) Histogram of average energy difference between a pair of adjacent syllables in the database
- Example waveform with artifact
- Example waveform and transcription with a segmentation error
- Distribution of acoustic parameters for the syllable /see/_end
- Comparison between syllable segments obtained using the two approaches
- An example of prosodic hierarchy
- A portion of the CART tree used for predicting phrase breaks for Hindi
- A portion of the CART tree used for predicting phrase breaks for Tamil
- (a) Pie chart showing the number of syllables belonging to each type structure for Hindi (b) Pie chart showing the number of syllables belonging to each type structure for Tamil
- (A) Waveform, pitch and energy contours of a Hindi phrase (B) Waveform, pitch and energy contours of a Hindi phrase which is structurally similar to (A) (C) Waveform, pitch and energy contours of the phrase obtained by transplanting the prosodic contour of (B) on (A)
- Plot showing the range of scores for Similarity to the original utterance tests for Hindi and Tamil
- Selecting a structurally similar phrase from the database
- (a) Similarity to the original utterance scores for Hindi (b) Similarity to the original utterance scores for Tamil
- (a) Correlation between duration of phrase and number of syllables per phrase for Hindi (b) Correlation between duration of phrase and number of syllables per phrase for Tamil
- (A) Pitch and energy contours of rhythmically similar phrases of Hindi (B) Pitch and energy contours of rhythmically dissimilar phrases of Hindi
- (A) Pitch and energy contours of rhythmically similar phrases of Tamil (B) Pitch and energy contours of rhythmically dissimilar phrases of Tamil
- (A) Waveform, pitch contour and energy contour of a Hindi phrase (B) Waveform, pitch contour and energy contour of the waveform in (A) when a rhythmically similar Hindi phrase is transplanted on it (C) Waveform, pitch contour and energy contour of the waveform in (A) when a rhythmically dissimilar Hindi phrase is transplanted on it
- (a) Similarity to the original utterance scores for Hindi (b) Similarity to the original utterance scores for Tamil
- (a) DTW alignment between f0 contours of two rhythmically similar phrases (b) DTW alignment between f0 contours of two rhythmically dissimilar phrases
- (a) DTW alignment between energy contours of two rhythmically similar phrases (b) DTW alignment between energy contours of two rhythmically dissimilar phrases

ABBREVIATIONS

TTS     Text to Speech
USS     Unit Selection Speech Synthesis
HMM     Hidden Markov Model
CDHMM   Context Dependent HMM
GPMF    Global Prosodic Mismatch Function
MSDHMM  Multi-Space Probability Distribution HMM
HTS     HMM-based Speech Synthesis System
CART    Classification and Regression Trees
STE     Short-Term Energy
MOS     Mean Opinion Score
DMOS    Degradation MOS
WER     Word Error Rate
ToBI    Tones and Break Indices
DTW     Dynamic Time Warping

CHAPTER 1
Introduction

Text-to-speech synthesis (TTS), as the name suggests, is the process of converting text input to speech output. The main focus of TTS research in the recent past has been to make synthetic speech sound more natural. Since the input to the TTS is only text, the challenge in building the system is finding ways of extracting appropriate information from the text that can be realised acoustically to make the synthesised speech output sound more natural and intelligible. State-of-the-art high-quality unit selection speech synthesisers (USS) for Indian languages have been built using the syllable as the basic unit. USS systems are based on concatenating actual speech units from the database to synthesise speech. The available literature shows that syllables are a better choice of sub-word unit than phones and diphones for USS [2], [3]. The reasons why syllables are a better choice of sub-word unit for speech synthesis can be summarised as follows:

- Indian languages belong to the category of syllable-timed languages [4].
- Syllables are the fundamental units of speech production [5].
- Syllables tend to capture the co-articulation between phonemes well.
- Being relatively large units, syllables reduce the number of concatenation points in speech synthesised using the concatenative framework.
- Syllable boundaries are regions of low energy, because of which spectral discontinuities at concatenation points are not perceived.

A major consortium effort on USS for Indian languages is based on syllable-like units [6]. Although syllable-based USS do perform well, the performance in many cases is inconsistent. The synthesised speech seems to lack the flow, continuity and rhythm of natural speech, and also suffers in terms of intelligibility due to the introduction of various artifacts. The feedback obtained from various listening tests conducted to evaluate these systems was that the synthesised speech lacked the prosody of natural speech.

Prosody plays a crucial role in making speech sound natural, and also in the comprehension of the syntax and semantics of a spoken utterance. As a field, the study of the prosody of Indian languages is still in its infancy. Therefore, the main focus of this work is to critically analyse the factors that affect the prosody of a spoken utterance. Nooteboom [7] defines prosody as the study of tone, melody and rhythm in speech. Prosody serves semantic purposes and cannot be represented using just the orthographic transcription and the sequence of phonemes. It has to be dealt with at a level higher than that of phonemes (the suprasegmental level), and has to be studied as a phenomenon caused by the combined effect of sub-word units.

Languages like English, Japanese and French use rule-based approaches to predict prosodic characteristics from certain cues in the text. These prosodic characteristics, called tones and break indices (ToBI), as described in [8], mainly encompass intonation contours and different levels of breaks in a spoken utterance. These methods have proven to be very successful in synthesising high-quality speech.

This work describes some of the efforts made at improving the synthesis quality of Indian language syllable-based TTS systems. Attempts have been made to study the prosody of syllable-timed Indian languages. The work is directed towards minimising errors in the speech database and extracting appropriate features from the input text that can be used to predict prosody, and in turn improve the quality of speech synthesis.

Statistical parametric speech synthesis using hidden Markov models (HMMs) [9] has gained a lot of popularity in the recent past. These systems differ from USS systems in that they model sub-word speech units and do not concatenate actual speech waveforms. Speech is synthesised by generating speech waveforms from the models built using context information. These systems use the context-dependent monophone as the basic sub-word unit. The automatic speech segmentation algorithm used to obtain syllable segments is also used to segment speech waveforms at the monophone level. The performance of these systems is again heavily dependent on the accuracy of segmentation. Techniques to improve the quality of synthesis of these systems are dealt with in this thesis in addition to the work done on USS.

1.1 Overview of Thesis

Synthesised speech usually suffers from many artifacts, such as spikes, overlaps, sudden variations in the acoustic properties of units across the synthesised utterance, buzziness etc. These artifacts are caused by errors in segmentation, inconsistencies in recording, poor prosody prediction, etc., and are usually introduced into the speech waveform due to errors at the sub-word unit level. Artifacts present in the synthesised speech output cause a degradation in the system's performance. Various approaches to reduce artifacts in synthesised speech have been proposed in [10], [11], [12], [13], [14] and [15]. These approaches use acoustic cues such as average f0, average short-term energy (STE) and duration of units to prune outlier units from the database. Along with these features, the approach in this work uses the likelihood obtained from the forced Viterbi alignment step as an additional feature for pruning.

Artifacts can also be introduced in synthesised speech due to erroneous segmentation. Segmentation of the speech waveforms in the case of speech synthesis has to be very precise: the accuracy of segmentation has a direct effect on the quality of speech synthesis, and small errors in segmentation can lead to degradation in quality due to the co-articulation in speech. Algorithms that produce precise syllable segments are therefore of great importance in building a TTS system. Although the algorithm used to segment the speech waveform in this thesis is very accurate, there are still cases with segmentation errors. A method is proposed in this thesis by which such segmentation errors can be corrected using the acoustic cues of individual units.

Prosody modeling for Indian languages is a particularly hard task. Prosody models for many languages are based on predicting tones and break indices (ToBI) using a set of rules. Accents are first predicted using the punctuation and part of speech (POS) tags of the text. The rules are then used to predict ToBI from these predicted accents. ToBI has been widely used in various speech applications and has been found to enhance their performance considerably. Developing rule-based methods for Indian languages is hard due to the lack of information present in the text. Indian languages are rarely punctuated, except for the punctuation denoting the end of a sentence.

Punctuation such as commas denoting prosodic phrase breaks is usually absent. Part of speech tags are additional cues from the text that are very useful in predicting prosodic phrase breaks, and tools to POS tag Indian language text are still not completely effective. These breaks therefore have to be predicted using cues directly from the text. A method is proposed in this thesis that uses word-level features of the text to predict prosodic phrase breaks. The rules to predict prosodic phrase breaks are learnt from the text using classification and regression trees (CART).

While predicting breaks, which have more to do with the rhythmic aspect of speech [16], has been successful, predicting the tonal aspects of speech remains a challenging task. The absence of punctuation and POS tags again proves to be a major hurdle here. A part of the work in this thesis aims at correlating the structure of a phrase with prosody. A new measure called structural similarity, which measures the similarity in structure between two phrases, is defined. It was observed that transplanting prosody between two structurally similar phrases did not degrade naturalness and intelligibility significantly. However, this measure was not effective when used as an additional criterion in the cost measure to select units for a syllable-based USS system. Further analysis showed that syllabic rhythm can also be used to correlate the prosody between two phrases. A new measure called rhythmic similarity has been defined, which shows that rhythmically similar phrases exhibit similarities in prosodic characteristics compared to rhythmically dissimilar phrases. This criterion has, however, not yet been incorporated in the synthesis paradigm.

1.2 Organisation of the thesis

The rest of the thesis is organised as follows. Chapter 2 gives a theoretical background of the popular TTS paradigms. Chapter 3 describes the technique used to minimise database errors and prune outlier units from the speech database. The subsequent chapters deal with modeling suprasegmental prosody. Chapter 4 describes the method used to predict prosodic phrases using cues from the text. Chapter 5 describes the work done on structural similarity and rhythmic similarity. Chapter 6 concludes the work and discusses issues in the proposed methods and prospects for future work.

1.3 Contribution of the thesis

The following are the major contributions of this research thesis:

- A method, called pruning, to discard acoustically inconsistent units, i.e. units whose acoustic properties are very different from the rest of the units of their class in the database
- Improving the quality of HMM-based TTS (HTS) synthesis by correcting segmentation errors
- Development of textual cues for the prediction of prosodic phrase breaks
- Analysis of structural similarity and its role in prosody
- Analysis of the role played by syllable rhythm in the prosody of an utterance

CHAPTER 2
Theoretical Background of USS and HTS

Text to speech synthesis (TTS) is the artificial production of human speech by a system that converts input text into output speech. The challenge is to derive maximum acoustic information from the text so that high-quality speech can be synthesised. TTS systems comprise two parts: (i) the text analysis part, and (ii) the speech synthesis part. There are various paradigms of speech synthesis based on the method of waveform synthesis. Two paradigms are mainly dealt with in this thesis and are described in this chapter: Unit Selection Speech Synthesis systems (USS) and HMM-based speech synthesis systems (HTS). The various parts of these two systems with respect to Indian languages are described here. Analysis has been carried out and systems built for two languages.

The data to build TTS systems has to be collected with great care, as the quality of the synthesised output depends very heavily on the quality of the data collected. Therefore, as a first step, the training text was chosen very carefully and longer words were avoided. The sentences chosen for recording were purely declarative. The text for speech recording was selected to maximise syllable coverage. It was also ensured that as many aksharas (C*V* units, where C stands for consonant and V for vowel) as possible and all monophones in the language were covered; these are the back-off units in the absence of a syllable. The complete details of data collection are given in [17]. The speech was recorded in a noise-free studio environment, sampled at 48 kHz with a resolution of 16 bits per sample, and downsampled to 16 kHz to build the USS systems. A native speaker of the language was chosen as the voice artist. The amount of data collected and details of the speaker are given in Table 2.1.

Table 2.1: Language databases used

Language | Hours of Data | Speaker
Hindi    | 6.45          | Male
Tamil    | 6             | Female

2.1 Description of Work done on USS for Indian languages

Unit selection is the simplest form of speech synthesis technology, based on the concatenation of sub-word units. The commonly chosen sub-word unit for a unit selection speech synthesiser is the phoneme; the sub-word unit chosen in the case of Indian languages is the syllable. Syllables are units of the form C*VC*, where C stands for consonant and V stands for vowel. Syllables are therefore usually composed of at least one vowel, and may or may not have consonants preceding and/or succeeding the vowel. The reasons for choosing the syllable as the sub-word unit are listed in Chapter 1. The training phase of building any speech synthesiser involves organising the speech database in such a way that selecting units or generating a speech waveform is easier during the synthesis phase. The various processes involved in the training and testing phases of a USS system are described below. The USS systems were built using the Festival speech synthesiser based on the FestVox platform.

Training

The training phase is when the speech database is organised into structures that make retrieving the most suitable unit for a given context easy. The structures used to organise the speech database in this case are classification and regression trees (CART). To build these structures, the text has to be broken down into its sub-word unit representation. This is followed by the waveforms being segmented at the syllable level corresponding to the sub-word unit representation. The CART structures are built using linguistic, acoustic and phonetic features. The details of the entire process are described below.

Letter to Sound Rules

Letter to sound rules are a set of rules that break the given text into its corresponding sub-word unit representation, i.e. convert the grapheme representation to the phoneme representation, or the written form to the spoken form. Indian languages being low-resourced, there is not enough transcribed text available from which letter to sound rules can be derived automatically.

The letter to sound rules in this case are a set of handwritten rules, written for two different language sub-categories: Aryan and Dravidian languages. The set of rules for Aryan languages can be directly adapted to any language under the Aryan sub-category, and the same applies to the Dravidian languages. The major differentiation when it comes to Aryan languages is the aspect of schwa (ə) deletion; the rest of the rules for the two sub-categories are more or less the same. Apart from the language-specific part, the rules have been written to break the given text into syllables. This syllabic representation is then broken down into aksharas or monophones. The system is trained using all three kinds of units for the same set of sentences.

Pre-clustering

After breaking the text into its sub-word unit representation, another set of rules is applied to the syllables constituting each word to tag them based on their position in the word and their occurrence in a geminate context. The positional context is retained because it was observed that syllables occurring at different positions within a word have different acoustic properties. The tags used for positional context are begin, middle and end. Monosyllabic words are also tagged with a begin tag. It was also found necessary to tag syllables based on whether they occur in a geminate context, because the segmentation algorithm segmented syllables corresponding to geminates erroneously in many cases. This is because syllables in geminate contexts are articulated very differently from syllables that are not. Therefore, it was deemed necessary to add an extra context indicating whether a syllable belongs to a geminate context or not. During synthesis, care was taken to use a syllable in its original context as far as possible.

Fallback Units

The syllable set for a language is finite but large in number. This can be seen from Figure 2.1, in which the distribution of syllables is observed to be long-tailed for the two languages, Hindi and Tamil.

Figure 2.1: Plot showing the long-tailed distribution of syllables in two languages, Hindi and Tamil

In the event that a syllable is not present in the database, fallback units are used. A three-level fallback is employed. When a syllable from a particular positional context is not available, it falls back to the same syllable with a different positional context; this is the first level of fallback. The second fallback level is aksharas, defined as the set of C*V and C units. The aksharas are also tagged with the positional context tags beg, mid and end. The third fallback level is monophones, which are used in the absence of an akshara. Since the database is designed to cover all monophones of a language, the system is capable of synthesising any arbitrary text with this three-level fallback.
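The fallback behaves like a cascade of dictionary lookups over the unit inventory. Below is a minimal sketch of the idea; the inventory, the unit names and the akshara splitter are hypothetical stand-ins for illustration, not the actual Festival implementation.

    POSITIONS = ("beg", "mid", "end")

    def split_into_aksharas(syllable):
        # Hypothetical splitter: cut after each vowel, approximating C*V units.
        # A real system would reuse the letter-to-sound rules.
        vowels, units, cur = set("aeiou"), [], ""
        for ch in syllable:
            cur += ch
            if ch in vowels:
                units.append(cur)
                cur = ""
        return units + ([cur] if cur else [])

    def candidates(name, pos, inventory):
        """Return unit keys for a target syllable, applying the three-level
        fallback: other positional context -> aksharas -> monophones."""
        if (name, pos) in inventory:                       # exact match
            return [(name, pos)]
        for alt in POSITIONS:                              # level 1
            if (name, alt) in inventory:
                return [(name, alt)]
        aksharas = split_into_aksharas(name)               # level 2
        if all((a, pos) in inventory for a in aksharas):
            return [(a, pos) for a in aksharas]
        return [(p, "any") for p in name]                  # level 3: monophones

    inventory = {("saa", "beg"): ["saa_001.wav"],
                 ("s", "any"): ["s_042.wav"], ("a", "any"): ["a_013.wav"]}
    print(candidates("saa", "mid", inventory))  # -> [('saa', 'beg')]  (level 1)
    print(candidates("sii", "beg", inventory))  # -> monophone back-off (level 3)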

A hybrid approach to segmenting speech waveforms at the syllable, akshara and monophone level

Speech segmentation refers to the process by which a waveform of continuous speech is segmented into the sub-word units it is composed of. Figure 2.2 shows a waveform segmented at the syllable level. Speech segmentation in the context of speech synthesis is very important, as the sub-word units need to be segmented very precisely to obtain high-quality synthesis. Segmenting speech waveforms into their corresponding sub-word units is a hard task, as the given text must be converted into its sub-word unit representation and correlated with the articulatory properties of the waveform. Earlier, a semi-automatic labeling tool was employed to segment the waveform at the syllable level [18]. This was still prone to errors, and the labeling had to be corrected a number of times.

Figure 2.2: Waveform segmented at the syllable level

The manual intervention required makes the syllable as a fundamental unit for synthesis a tall order. In this work, an automatic speech segmentation algorithm is used which is found to give reasonably accurate syllable boundaries. This is achieved using the hybrid segmentation algorithm, which employs hidden Markov models (HMMs) in tandem with the group delay algorithm to segment the speech waveform. The details of this algorithm are given in [19], and the overall algorithm is outlined in the flowchart given in Figure 2.3. In the hybrid segmentation approach, HMM-based segmentation and group delay (GD) based segmentation are performed iteratively to obtain accurate segmentation automatically. Parameters in GD-based segmentation are tuned to over-estimate the syllable boundaries; this results in many spurious boundaries, but the correct boundaries are not misplaced. The boundaries from flat-start initialised embedded training of monophone HMMs, followed by forced Viterbi alignment, are first given as input to the algorithm. The HMMs and the group delay algorithm are then used iteratively to correct the syllable and monophone boundaries: the group delay boundaries in the proximity of the boundaries given by the HMMs are taken as the correct syllable boundaries, and flat-start embedded training is performed on the monophones within each syllable, restricted to that syllable, to obtain accurate monophone labels. This process is performed iteratively to obtain accurate segmentation at both the syllable and monophone levels.

Figure 2.3: Flowchart of the hybrid segmentation algorithm

Labels for aksharas are obtained by concatenating the segments in the monophone label files corresponding to each akshara. The accuracy of segmentation was found to be crucial, and this algorithm gives reasonably accurate labels. These labels are then used in the voice-building process to build CARTs.
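The central refinement step of the hybrid algorithm, snapping each HMM boundary to a nearby group-delay candidate, can be sketched in a few lines. This is only an illustration of that single step with made-up boundary times, not the implementation of [19].

    def snap(hmm_bounds, gd_bounds, window=0.05):
        """Move each HMM boundary (in seconds) to the closest group-delay
        boundary within `window` seconds. GD boundaries are over-generated,
        so the true boundary is usually among the nearby candidates."""
        snapped = []
        for b in hmm_bounds:
            near = [g for g in gd_bounds if abs(g - b) <= window]
            snapped.append(min(near, key=lambda g: abs(g - b)) if near else b)
        return snapped

    # HMM alignment is roughly right but drifts; group delay proposes many
    # candidates, including spurious ones, which are simply ignored here.
    hmm_bounds = [0.31, 0.58, 0.93]
    gd_bounds = [0.05, 0.29, 0.42, 0.61, 0.77, 0.91, 1.10]
    print(snap(hmm_bounds, gd_bounds))  # -> [0.29, 0.61, 0.91]

In the full algorithm, this snapping alternates with re-training the HMMs on the corrected boundaries, at both the syllable and monophone levels.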

Clustering Syllables

Any database used for building a text to speech synthesiser will have multiple occurrences of each syllable. These syllables therefore have to be clustered, to reduce the search space and the time complexity during synthesis. Linguistic, acoustic and phonetic criteria are used to cluster the syllables during the training phase. The acoustic distance in Equation 2.1 used to cluster units is a weighted Mahalanobis distance between two units of the same class. Using the context information and the acoustic distance, the units are clustered using CART [2], [12]. Figure 2.4 shows a small portion of a CART. As can be seen from the figure, the leaf nodes in the CART are clusters. The number next to each indexed unit is the target cost for that particular unit, which is the acoustic distance of that unit from the cluster centre. The acoustic distance for clustering and the target cost is computed as follows: if $|V| > |U|$,

$$\mathrm{dist}(V, U) = \frac{W_D\,(|V| - |U|) + \sum_{i=1}^{|U|} \sum_{j=1}^{n} \dfrac{W_j \,\bigl| F_{ij}(U) - F_{(i|V|/|U|)\,j}(V) \bigr|}{SD_j}}{n\,|U|} \qquad (2.1)$$

where $|U|$ is the number of frames in $U$, $F_{ij}(U)$ is parameter $j$ of frame $i$ of unit $U$, $SD_j$ is the standard deviation of parameter $j$, $W_j$ is the weight for parameter $j$, and $W_D$ is the duration penalty weight. Equation 2.1 gives the mean weighted distance between the two units, with the shorter unit linearly interpolated to the longer unit.

The features used for acoustic cost computation are the mel-frequency cepstral coefficients (MFCC) and their velocity coefficients, f0, and absolute power. CART is a decision tree based on a set of yes/no questions. Every unit in the database has its own CART, which contains all occurrences of that unit in the database, clustered with respect to the context in which they occur in the training sentences.
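A direct numpy transcription of Equation 2.1, under the reconstruction given above, makes the frame mapping explicit; the feature matrices below are random stand-ins for real units.

    import numpy as np

    def acoustic_distance(U, V, W, SD, W_D):
        """Mean weighted distance of Equation 2.1 between two units.

        U, V: (frames x n_params) arrays of per-frame features.
        W: per-parameter weights W_j; SD: per-parameter standard
        deviations SD_j; W_D: duration penalty weight."""
        if len(V) < len(U):          # ensure |V| >= |U| as in the equation
            U, V = V, U
        nU, n = U.shape
        total = 0.0
        for i in range(nU):
            j = (i * len(V)) // nU   # frame i of U mapped onto V
            total += np.sum(W * np.abs(U[i] - V[j]) / SD)
        return (W_D * (len(V) - nU) + total) / (nU * n)

    rng = np.random.default_rng(0)
    U = rng.normal(size=(8, 4))      # 4 parameters per frame (MFCCs, f0, power, ...)
    V = rng.normal(size=(12, 4))
    print(acoustic_distance(U, V, W=np.ones(4), SD=np.ones(4), W_D=0.1))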

Figure 2.4: An example portion of a CART for the unit sa_beg

Synthesis Phase

The synthesis phase involves breaking the input text down using a set of LTS rules, and searching the CART for the unit that best suits the context in which the unit occurs in the sentence to be synthesised.

Letter to sound rules

The letter to sound rules for synthesis are very similar to the rules used during training. The language-specific rules and the syllabification process are the same as in the training phase. The main difference is that the synthesiser searches for the context closest to what is present in the input sentence. In the absence of a syllable, it is broken down into smaller sub-word units, or back-off units, which are substituted for the missing syllable. Two levels of back-off are performed: the syllable is first broken down into aksharas and searched for, and in the absence of an akshara, the monophones corresponding to the missing units are substituted.

Selection of units

Once the text has been broken down into the smaller sub-word units, a target specification is generated using the units that make up the sentence to be synthesised. Using this target specification, an appropriate cluster is found. A Viterbi search is performed through each of these candidate clusters to find the optimal set of units that can be used to synthesise the sentence.

The optimal set of units is found by minimising the cost

$$\sum_{i=1}^{N} \bigl[ C\mathrm{dist}(S_i) + W \cdot J\mathrm{cost}(S_i, S_{i-1}) \bigr] \qquad (2.2)$$

where $C\mathrm{dist}(S_i)$ is the distance of syllable $S_i$ from the centre of its cluster, known as the target cost; $J\mathrm{cost}(S_i, S_{i-1})$ is the cost of concatenating syllable $S_i$ with the previous syllable $S_{i-1}$; $W$ is used to weigh the join cost against the target cost; and $N$ is the number of syllables in the utterance to be synthesised. The target cost is computed as the acoustic distance given in Equation 2.1 between each unit in the candidate cluster and the cluster centre. The concatenation cost is computed by measuring the cepstral distance and the differences in absolute power and f0 at the point of concatenation of the pair of syllables.

Synthesis using the selected units

Once the units have been selected, they are concatenated using the windowed join method, in which the ends of the units are windowed and then concatenated. Concatenation using optimal coupling is usually preferred when phones are used as the sub-word units. Optimal coupling is disabled in the case of syllables, as the boundaries of syllables are usually acoustically stable.
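The search that minimises Equation 2.2 is an ordinary Viterbi pass over the candidate clusters. The following toy sketch uses made-up unit names and costs:

    def select_units(clusters, tdist, jcost, W=1.0):
        """Viterbi search minimising Equation 2.2 over candidate clusters.

        clusters: list (one per target syllable) of candidate unit ids
        tdist: dict unit -> target cost Cdist (distance from cluster centre)
        jcost: function (prev_unit, unit) -> concatenation cost"""
        # best[u] = lowest total cost of any path ending in unit u
        best = {u: tdist[u] for u in clusters[0]}
        back = [{u: None for u in clusters[0]}]
        for cands in clusters[1:]:
            new_best, ptr = {}, {}
            for u in cands:
                prev = min(best, key=lambda p: best[p] + W * jcost(p, u))
                new_best[u] = best[prev] + W * jcost(prev, u) + tdist[u]
                ptr[u] = prev
            best, back = new_best, back + [ptr]
        u = min(best, key=best.get)          # backtrack from the cheapest end
        path = [u]
        for ptr in reversed(back[1:]):
            u = ptr[u]
            path.append(u)
        return list(reversed(path))

    # Toy example: two target syllables, two candidates each.
    tdist = {"sa_1": 0.2, "sa_2": 0.5, "ri_1": 0.1, "ri_2": 0.3}
    jc = lambda p, u: 0.0 if (p, u) == ("sa_2", "ri_2") else 0.4
    print(select_units([["sa_1", "sa_2"], ["ri_1", "ri_2"]], tdist, jc))

The join cost pulls the search towards pairs that concatenate smoothly, while the target cost keeps each unit close to its cluster centre.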

2.2 Description of Work done on HTS for Indian languages

HTS systems, unlike USS, do not involve concatenation of waveforms. Instead, sub-word units are modeled using sequences of coded vectors, and speech parameter sequences are generated from these models. The model that best suits this purpose is the hidden Markov model (HMM), as it effectively captures sequential information. Once the system has been built, there is no need to retain the original speech waveforms, since speech parameter sequences are generated from the models; this results in a large reduction in footprint size. HMM-based speech synthesisers suffer mildly in terms of naturalness, however, as the models are an average representation of each sub-word unit.

Figure 2.5: Overview of an HMM-based TTS

This averaging results in the synthesised waveform sounding buzzy. This kind of synthesis, though, is highly intelligible, as a continuous stream of parameters is generated, while USS systems suffer in terms of intelligibility because the speech waveform is formed by concatenating smaller waveforms and the synthesised speech sounds discontinuous. Figure 2.5 illustrates the overview of an HMM-based speech synthesiser. The sub-word unit used in this case is the context-dependent monophone. The details of the procedure for building an HMM-based speech synthesiser are described below.

Training

Letter to sound rules

The letter to sound rules are the same as those described for USS training. The input text is first broken down into syllables using the language-specific rules followed by the syllabification rules. Once the words have been broken down into syllables, the syllables are broken down into their constituent monophones. To map the characters in the native language script to their corresponding phonemes in roman script, a mapping scheme has been proposed in [1].

In this scheme, common sounds from 13 different Indian languages are mapped to a common symbol in the roman script. This mapping is called the common label set. A small portion of the common label set for the two languages dealt with in this thesis is shown in Table 2.2.

Table 2.2: Small portion of the Common Label Set illustrating how a few examples have been mapped to their roman character equivalent

Common Label Set Notation | Hindi | Tamil
a                         | अ     | அ
i                         | इ     | இ
ii                        | ई     | ஈ
rq                        | ऋ, ॠ  | -
zh                        | -     | ழ
k                         | क     | க
kh                        | ख     | -

Segmentation of waveforms at the monophone level

The waveforms are segmented at the monophone level using the hybrid segmentation algorithm described in Section 2.1. Once the speech waveforms have been segmented, monophone labels (containing only time-stamps and the monophone transcription) and fullcontext labels (containing detailed segmental and suprasegmental context information about the utterance) are generated from the utterances. These label files are then used to initialise models for an HMM-based TTS system.

Feature Extraction

An HMM-based speech synthesiser is built by generating models of sub-word units using features of the speech waveform. Since the speech production mechanism can be viewed as glottal pulses exciting the vocal tract to produce speech output, the features extracted correspond to the excitation and vocal tract characteristics. The features used in this case are mel-generalised cepstral coefficients (MGC) and their dynamic features, log f0 (lf0) and its dynamic features (1+1+1: static, delta and delta-delta), and the duration of each sub-word unit. MGC here represent the vocal tract characteristics, and lf0 represents the excitation.
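To illustrate how the 1+1+1 lf0 stream is assembled, the sketch below computes delta and delta-delta trajectories with a simple +/-1 frame regression. The regression window is a common convention, not necessarily the one used here, and unvoiced frames (the reason MSDHMMs are needed) are ignored.

    import numpy as np

    def deltas(x):
        """First-order dynamic features via a +/-1 frame difference."""
        pad = np.pad(x, ((1, 1), (0, 0)), mode="edge")
        return 0.5 * (pad[2:] - pad[:-2])

    lf0 = np.log(np.linspace(100.0, 140.0, 6))[:, None]   # toy voiced lf0 track
    stream = np.hstack([lf0, deltas(lf0), deltas(deltas(lf0))])
    print(stream.shape)  # (6, 3): static, delta and delta-delta lf0 per frame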

Model Building

The model building process involves initialising and re-estimating context independent (CI) and context dependent (CD) monophone HMMs based on the maximum likelihood criterion:

$$\hat{\lambda} = \arg\max_{\lambda}\, p(O \mid W, \lambda) \qquad (2.3)$$

where $\lambda$ represents the model parameters, $O$ is the training data and $W$ is the set of transcriptions corresponding to the training data. The acoustic parameters are modeled using multi-space probability distribution HMMs (MSDHMMs), as MGC, lf0 and their dynamic parameters are modeled together using one HMM. Acoustic parameters and duration are modeled separately, using 5 states for each phoneme model.

Synthesis

An utterance structure is formed for the text to be synthesised using the letter to sound rules described above, and context information is extracted for it. Speech parameter sequences are generated for this sequence of phonemes using the CDHMMs:

$$\hat{o} = \arg\max_{o}\, p(o \mid w, \hat{\lambda}) \qquad (2.4)$$

where $o$ represents the speech parameters and $w$ is the transcription of the test sentence. The output speech is synthesised from these parameter sequences.

2.3 Related Previous Work

There have been several efforts in the past on pruning speech databases and on analysing the rhythmic and tonal characteristics of speech. In Sections 2.3.1 to 2.3.4, a brief review of these works is presented.

Figure 2.6: (a) Histogram of duration difference between a pair of adjacent syllables in the database (b) Histogram of average f0 difference between a pair of adjacent syllables in the database (c) Histogram of average energy difference between a pair of adjacent syllables in the database

2.3.1 Previous work on maintaining consistency in speech databases and pruning them

[11], [14] and [15] propose various methods for pruning speech databases to reduce the footprint size, to enable porting the TTS onto devices with smaller memory capabilities, and for various other applications. The initial impetus for the work done in this thesis was given by [21].

A Probabilistic Approach to Selecting Units for Speech Synthesis Based on Acoustic Similarity

In [21], a method is proposed in which the USS synthesises sentences based on a cost measure that focuses on reducing the acoustic variability between adjacent units. This reduces sudden prosodic variations in the synthesised sentence and makes the synthesised output sound more continuous. To synthesise speech, the differences in f0, energy and duration of consecutive pairs of units in the database were observed. Figure 2.6 shows the histograms of the differences in these three parameters for pairs of adjacent syllables. It was observed from this figure and many others that the distributions in most cases were Gaussian. These differences were converted to probability density functions. During synthesis, the differences in the three parameters were computed for all possible adjacent pairs of candidate units, and the units were then chosen based on the values of the three parameters that best fit the distributions. This work showed that it is very important to maintain acoustic continuity across a synthesised utterance, and that sudden jumps in acoustic properties between adjacent units in a synthesised utterance cause the system to score very low on listening tests.
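The essence of [21] can be captured in a few lines: fit a Gaussian to the observed adjacent-unit differences, then score candidate pairs by how typical their difference is. A minimal sketch on synthetic data, using the f0 difference alone:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    # Synthetic stand-in for the observed f0 differences between adjacent
    # syllables in a database; the histograms were found to be Gaussian.
    f0_diffs = rng.normal(loc=0.0, scale=12.0, size=5000)
    mu, sigma = f0_diffs.mean(), f0_diffs.std()

    def pair_score(f0_a, f0_b):
        """Higher when the f0 jump between two candidate units is typical."""
        return norm.pdf(f0_b - f0_a, loc=mu, scale=sigma)

    # A 5 Hz jump is far more probable than a 50 Hz jump, so the search
    # prefers acoustically continuous unit sequences.
    print(pair_score(120.0, 125.0) > pair_score(120.0, 170.0))  # True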

Automatic pruning of unit selection speech databases for synthesis without loss of naturalness

In [11], two methods of selecting the most suitable syllable unit for synthesis are described.

Selecting the average unit: The first method chooses the average unit from a number of realisations, such that it is prosodically neutral with minimal influence of context. This was done by computing the mean of the prosodic features of all the realisations of the unit in the speech database, and selecting the unit whose prosodic features were closest to the computed mean. The prosodic features used were pitch (f0), short-term energy (STE) and duration.

Selecting the optimal unit: In this approach, a measure called the global prosodic mismatch function (GPMF) is defined. This function is computed as follows:

$$\mathrm{GPMF}(X) = \sum_{i=1}^{N} \left\{ \left| 1 - \frac{P(A_i)}{P_X} \right| + \left| 1 - \frac{D(A_i)}{D_X} \right| + \left| 1 - \frac{E(A_i)}{E_X} \right| \right\} \qquad (2.5)$$

where $A_i$ is the unit under consideration, $N$ is the number of instances of the unit, $P(A_i)$, $D(A_i)$ and $E(A_i)$ are the pitch, duration and energy of the unit $A_i$, and $P_X$, $D_X$ and $E_X$ are the expected values of the pitch, duration and energy of $A_i$. This function measures the distance between candidate units in the database and the target specification predicted using the corresponding acoustic models. The ideal unit in this case is the unit with the minimum value of GPMF. Perceptual tests showed that GPMF worked better as a cost function than the average unit method. A large number of sentences were synthesised from a large text corpus and the GPMF cost was computed for all units in the database. The instances of a particular unit with minimum cost were added to the database, while the others were excluded. It was found that this method of pruning not only reduced the database size but also improved the quality of synthesis, since it removed bad units from the database. Further, using these corrected labels to build an HMM-based speech synthesiser also proved very helpful.
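Under the reconstructed reading of Equation 2.5, pruning amounts to scoring every instance of a unit against its expected prosodic targets and keeping the cheapest. A minimal sketch with made-up values:

    import numpy as np

    def gpmf_per_instance(pitch, dur, energy, p_x, d_x, e_x):
        """Per-instance summand of Equation 2.5 (as reconstructed above)."""
        return (np.abs(1 - pitch / p_x)
                + np.abs(1 - dur / d_x)
                + np.abs(1 - energy / e_x))

    # Three instances of one unit; the second is prosodically deviant.
    pitch = np.array([118.0, 150.0, 122.0])
    dur = np.array([0.21, 0.35, 0.20])
    energy = np.array([0.8, 1.6, 0.9])
    cost = gpmf_per_instance(pitch, dur, energy, p_x=120.0, d_x=0.20, e_x=0.85)
    print(int(np.argmin(cost)))  # -> 2: the cheapest instance; instance 1 is pruned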

A Statistical Method for Database Reduction for Embedded Unit Selection Speech Synthesis

[14] proposed a method to prune the database of a TTS to reduce the footprint size. The method relies on the statistics produced by unit selection speech synthesis on a large text corpus, using the frequency of occurrence of each unit and its acoustic cost [2] to discard redundant units from the database. A fitness vector was defined, formed from a combination of the frequency of occurrence of a unit during synthesis on text from a large corpus and the mean acoustic cost of the unit each time it was used for synthesis. This method effectively removes redundant units from the database because the fitness vector of each unit under consideration was multiplied by the score difference with respect to the previously selected unit; this reduced the fitness value of similar units and increased the fitness value of dissimilar units. Units with higher fitness values were added to the database and the others were excluded. It was observed that this method of pruning reduced the size of the database with minimal degradation in quality.

These approaches to pruning the speech database rely primarily on feedback from the synthesised output to decide which units have to be discarded. The method proposed in this thesis differs from the aforementioned approaches in that it discards bad units prior to the system building phase: the labels from the output of the segmentation phase are used, and syllable units are discarded based on their prosodic consistency with the rest of the units in the speech database.
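In contrast to these feedback-driven methods, consistency-based pruning can run before system building, directly on per-instance acoustics. The sketch below assumes that duration, average STE and average f0 have already been computed for every realisation of one unit class; the z-score rule is illustrative, not the exact criterion of Chapter 3.

    import numpy as np

    def prune_inconsistent(feats, z_max=2.5):
        """Keep instances of one unit class whose duration / average STE /
        average f0 lie within z_max standard deviations of the class mean.
        feats: (instances x 3) array; returns indices of retained instances."""
        mu = feats.mean(axis=0)
        sd = feats.std(axis=0) + 1e-9          # guard against zero variance
        z = np.abs((feats - mu) / sd)
        return np.where((z <= z_max).all(axis=1))[0]

    rng = np.random.default_rng(2)
    feats = rng.normal([0.2, 0.9, 120.0], [0.03, 0.1, 10.0], size=(50, 3))
    feats[7] = [0.55, 0.9, 120.0]              # a badly segmented, overlong instance
    print(7 in prune_inconsistent(feats))      # False: the outlier is discarded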

2.3.2 Previous work on Prosodic phrase break prediction

Prosodic phrases have traditionally been predicted using the ToBI approach, which uses POS tags to predict different levels of tones and breaks. Initially, punctuation marks in the text were assumed to be good predictors of phrase breaks. However, later research suggested that punctuation marks are often wrongly used in text and can lead to several errors in phrase break prediction. Therefore, several approaches that use different cues from the text were proposed to handle this problem of POS tagging. [22] compare and contrast various methods of phrase break prediction for the English language; these methods are briefly described in this section.

POS tag based approach

In this approach, the sentence for which phrase breaks have to be predicted is first grouped into one of the generalised POS tag groups from a bag of phrases. The sentence is then analysed further and classified into a more specific group that better matches the combination of words in it. Prosodic phrase breaks are then predicted using a set of rules for that particular combination of POS tags.

Prediction by example approach

In this approach, phrase breaks are predicted using an exemplar-based approach, which requires a very large and accurately annotated text corpus. A sentence similar to the given input sentence is searched for in the database, and breaks are placed in the test sentence at positions corresponding to those in the matching corpus sentence. Similarity is measured in terms of the combination of POS tags: a sentence is first reduced to a set of POS tags, this reduced set is searched for in the text corpus, and the best matching sentence is used to predict phrase breaks for the input sentence. The problem with this approach is that it needs an almost infinite text corpus, in addition to which it is very slow.

Prediction by phrase modelling

HMMs have also been used to predict phrase breaks. These models have been quite successful, mainly because HMMs [23] capture sequential information very effectively. The sentences in the training corpus are first POS tagged and clustered using k-means clustering [24]. Left-to-right HMMs are then built over these clusters, with word skipping allowed. The number of states is decided by the mean length of the phrases in each cluster. Given the POS tags for a sentence, phrase breaks are postulated at the junctures of these models.

Prediction using features from a syntactic parser

This approach uses a syntactic parser to output the best set of features for clustering. The syntactic parser takes an input sentence and gives as parsed output a set of features denoting various levels of syntactic information about the sentence. These features were then used to build decision trees, which were found to predict phrase breaks very effectively.
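A small decision tree makes the data-driven flavour of these approaches concrete. The sketch below uses scikit-learn on toy, hypothetical word-level features; the actual textual cues used for Indian languages are developed in Chapter 4.

    from sklearn.tree import DecisionTreeClassifier

    # Toy word-level features: (is_case_marker, syllables_in_word,
    # words_since_last_break). Label 1 means a phrase break follows the word.
    X = [
        [1, 2, 4], [0, 3, 1], [0, 2, 2], [1, 1, 5],
        [0, 4, 3], [1, 2, 6], [0, 1, 1], [0, 3, 2],
    ]
    y = [1, 0, 0, 1, 0, 1, 0, 0]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(tree.predict([[1, 2, 5], [0, 2, 1]]))  # -> [1 0]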

There have also been various efforts that have successfully predicted phrase breaks for Indian languages. Predicting phrase breaks for Indian languages is a hard task, as the text is seldom punctuated and POS tagging for Indian languages has not been perfected yet. [3] and [25] propose a method by which pauses can be predicted using cues from the text. The method proposed in this thesis uses these same cues to predict prosodic phrase breaks.

2.3.3 Related work on prosody prediction

The tonal aspect of prosody has been successfully predicted for many languages such as English, Japanese, Mandarin and Portuguese. These languages use a system of prosody labeling called ToBI.

Tones and break indices (ToBI)

This is a rule-based approach that was initially developed for English and has since been extended to various languages. ToBI predicts two kinds of prosodic aspects for a given text, namely tones and breaks. Tones are predicted from the accents associated with every word in the phrase; these accents are predicted using POS tags. There are different kinds of tone indices, such as word pitch accent, phrase pitch accent, etc., which are predicted based on a set of rules. Breaks are another prosodic aspect, again dependent on the POS tags. There are different levels of break indices as well, such as the breaks that separate words, the breaks at the ends of phrases, etc.

Although ToBI has been used successfully for various languages, adapting a similar set of rules for Indian languages is a hard task, as Indian languages lack punctuation and methods of POS tagging. Therefore, the method described in this thesis analyses prosody based on the structure of the text rather than its content. This criterion has also been included in the cost measure used to select units in a USS system; this cost function has been added as an additional criterion to the one described in [12].

2.3.4 Related work on speech rhythm

Rhythm is defined as any regularly recurring event. [7] states that rhythm in speech is defined either with respect to every syllable in an utterance or with respect to stressed syllables. Languages in which rhythm is defined as an equal duration between the production of successive syllables are called syllable-timed languages; in these languages, equal prominence is associated with every syllable, and they generally lack reduced vowels. Stress-timed languages are those in which there is an equal duration between stressed syllables, and the unstressed syllables are shortened or lengthened accordingly. In [26], though, a different interpretation of speech rhythm is presented: no language can be strictly classified as syllable-timed or stress-timed, all languages exhibit both varieties of timing, and the same speaker can exhibit different kinds of timing on different occasions. Further analysis showed that the syllable durations in syllable-timed languages are not uniform. For Hindi, the mean and standard deviation of syllable duration are around 196 ms and 76 ms respectively, and for Tamil around 199 ms and 91 ms, for the databases used in this thesis. This shows that there is a large variation in the duration of syllables spoken by one speaker of one language.

Since it is not possible to define rhythm on the basis of syllable and stress timing alone, a new criterion for analysing rhythm is proposed in this thesis. Syllabic rhythm is defined here as the rate of production of syllables, in terms of the number of syllables per word and the duration of each word. The analysis has been conducted at the intonational phrase level of the prosodic hierarchy [27]. The analysis is restricted to an intonational phrase mainly so that it remains within one prosodic contour; also, there is a considerable amount of co-articulation between syllables within a phrase, while co-articulation between phrases is not possible, as a significant pause separates two phrases.
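Under this definition, a phrase reduces to a vector of per-word syllable rates, and two phrases can be compared by aligning these vectors, for instance with DTW. A minimal sketch with made-up syllable counts and word durations; the thesis's exact similarity measure may differ.

    def rhythm_vector(syllable_counts, word_durations):
        """Syllables per second for each word in a phrase."""
        return [n / d for n, d in zip(syllable_counts, word_durations)]

    def dtw_distance(a, b):
        """Plain DTW between two 1-D sequences (absolute-difference cost)."""
        INF = float("inf")
        D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
        D[0][0] = 0.0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
        return D[len(a)][len(b)]

    p1 = rhythm_vector([3, 2, 4], [0.61, 0.40, 0.83])   # phrase A
    p2 = rhythm_vector([3, 2, 4], [0.58, 0.43, 0.80])   # rhythmically similar
    p3 = rhythm_vector([1, 5, 2], [0.45, 0.70, 0.35])   # rhythmically dissimilar
    print(dtw_distance(p1, p2) < dtw_distance(p1, p3))  # True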

There have been many other efforts over the years aimed at analysing the rhythmic aspect of speech. [28; 29] characterise rhythm and describe methods to generate rhythm patterns, particularly for the purpose of text-to-speech synthesis. [30; 31; 32; 33] detail the classification of languages based on rhythm into stress-timed and syllable-timed, while [34], [35], [36] and [37] study isochrony (the rhythmic division of time into equal portions by a language) from a production and perception perspective. The effect of syllable stress and rate on isochrony is explored in [34]. [35] analyses the role played by rhythmic and syntactic units in production and perception, and suggests that isochrony is more of a perceptual phenomenon. [38; 39; 40; 41] analyse the effects of speech rate on the perception of rhythm, and [42] studies the factors of a language, such as syllable structure, that can be used to characterise rhythm.

2.4 Summary

The procedures to build USS and HTS systems for Indian languages were first discussed. Previous efforts on removing artifacts focused on removing dirty units from the database based on their acoustic score during synthesis and their frequency of occurrence. Prosody prediction in previous work on languages such as English relied more on punctuation and POS tags, though there have also been efforts towards data-driven methods of phrase break prediction. The prediction of the tonal aspects of speech still relies more on rule-based approaches, which again depend on POS tags.

CHAPTER 3
A method to prune speech databases to improve the quality of Indian TTSes

The synthesis system described in the previous chapter has serious issues in terms of prosody, including errors due to artifacts and inaccuracy of segmentation. Speech synthesisers that work on unit concatenation suffer in quality mainly due to sudden variations in acoustic properties between adjacent units. An example of such an artifact in synthesised speech is shown in Figure 3.1, in which the unit थी (thee) is not articulated properly. Artifacts such as this cause the performance of the USS to degrade significantly. Also, incorporating the appropriate rules to select a unit from the database that has the right co-articulative influences for the given context is a hard task. This chapter describes methods by which the chances of selecting a wrongly segmented unit or a bad unit are minimised.

Figure 3.1: Example waveform with artifact

Another artifact commonly encountered in speech synthesised using a USS is an erroneously segmented unit. In these cases, a unit that is part of the utterance to be synthesised has retained a part of the adjacent unit from the original sentence it was selected from. This is due to errors in the speech segmentation process. An example of a unit which has been erroneously segmented is shown in Figure 3.2.

segmented is shown in Figure 3.2. Using the method described in this chapter, erroneous segmentation can be corrected, and many of the erroneously segmented units which have not been corrected can also be discarded.

Figure 3.2: Example waveform and transcription with a segmentation error

Another issue faced in building a TTS is inconsistency in the database. Despite hiring a professional voice talent to record the data, there are inconsistencies present in the recording. The recording of text is done over many sessions, and the conditions vary across sessions. In addition, there is the problem of maintaining a uniform syllable rate while speaking, especially when there are complex words in the text. The speaker has a tendency to shorten syllable durations while speaking. For example, the words पुरुषोत्तम अग्रवाल (purushottam agrawaal) are in some cases articulated as पुरुषोत्त मग्रवाल (purushotta magrawaal). Here the syllable length changes, and the syllabification is different from what is obtained using the text syllabification rules. When such syllables are picked up by the USS, the quality of synthesis degrades and the mean opinion score (MOS) drops significantly. It is essential that some uniformity in syllable parameters such as duration, average short-term energy (STE) and average pitch (f0) be maintained across the units in the database. This is achieved by pruning the database of outliers, making use of average STE, average f0 and duration criteria [1], or of a score that depends on a weighted sum of the three parameters [13], [11]. [12] and

[14] focus on reducing the size of the database by pruning units, with little or no degradation in synthesised speech quality or naturalness. The same is achieved by a vector quantisation technique in [15]. There have been a number of attempts aimed at (i) improving the quality of the unit selection system (USS) and (ii) reducing the size of the speech database [12], [13], [11], [14]. These have resulted, directly or indirectly, in the re-organisation of the classification and regression tree (CART). In this chapter, we use a syllable-based approach to USS and attempt to maintain consistency in acoustic parameters across an utterance by appropriate pruning of the database. Instances of a unit which vary significantly from the average acoustic properties of that unit are discarded. The first step is to prepare a syllable database with units that are chosen carefully. The transcription must be completely free of errors. Next, the syllable boundaries must be very accurate. Since the syllable is a larger unit than the phone, an effective segmentation algorithm, described earlier, is used. Once accurate syllable boundaries are available, different pruning techniques based on the acoustic properties of the syllables are employed (Section 3.1). In particular, the statistical characteristics of the duration, average STE and average pitch (f0) of the syllable are used to prune units from the database.

3.1 Pruning Technique for USS systems

Pruning is performed mainly to remove badly segmented units and to avoid the effect of inconsistencies in recording. The database is pruned using the combined effects of the acoustic cues duration, average STE and average f0. Syllable timing primarily corresponds to duration. Nevertheless, in the context of preserving the semantic content of the utterance, prosodic parameters like energy and f0 must also be considered. Energy and f0 play an important role mainly because, if the sequence of units chosen for synthesis has non-uniform pitch and energy, overlaps are perceived in the synthesised speech. The pruning performed thus ensures rhythmic and acoustic consistency in the database. The steps to select the units to be pruned are as follows:

- The duration, average f0 and average STE are computed for each unit using

all instances of that particular unit.
- The mean (µ) and standard deviation (σ) of these parameters are then computed for each unit.
- The units lying outside the region specified by some fraction of σ for all three parameters are tagged with a special symbol so that they do not participate in synthesis.
- Only units with more than 10 occurrences in the database are considered for pruning.
- If more than 50 occurrences of a unit remain after pruning, the first 50 occurrences are retained.
- For pruning back-off units (aksharas and monophones), only the units corresponding to the syllables that have not been pruned are retained.

The reason for choosing the first 50 occurrences of a unit, rather than adopting a k-nearest-neighbour approach, is mainly to prevent the synthesis from sounding too monotonous. The main aim of this work is to remove unwanted units from the database while at the same time making an effort to synthesise speech that sounds prosodically natural. This also reduces the search space considerably. Aksharas and monophones are the back-off units that are used when a syllable is absent from the database during synthesis. Syllable-based USS systems are built using the units that remain after pruning the speech database.
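A minimal sketch of this σ-based tagging follows. Each unit instance is represented as a dict with "name", "duration", "energy" and "f0" keys; this data layout and the default fraction k are illustrative assumptions.

# Sketch of the sigma-based pruning described above, applied per unit type.
from collections import defaultdict
from statistics import mean, stdev

def prune(instances, k=0.25, min_count=10, max_keep=50):
    by_unit = defaultdict(list)
    for inst in instances:
        by_unit[inst["name"]].append(inst)
    kept = []
    for name, insts in by_unit.items():
        if len(insts) <= min_count:          # too few occurrences: keep all
            kept.extend(insts)
            continue
        stats = {}
        for feat in ("duration", "energy", "f0"):
            vals = [i[feat] for i in insts]
            stats[feat] = (mean(vals), stdev(vals))
        survivors = [i for i in insts
                     if all(abs(i[f] - stats[f][0]) <= k * stats[f][1]
                            for f in ("duration", "energy", "f0"))]
        kept.extend(survivors[:max_keep])    # retain at most the first 50
    return kept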

Training USS systems

To test the claim, systems with two levels of pruning have been built for two Indian languages, Hindi and Tamil. Hindi is an Indo-Aryan language and Tamil is a Dravidian language. The unit selection TTS systems are built using the Festival speech synthesiser based on the Festvox framework, as described in Section 2.1. The systems are built with the syllable as the basic unit. A set of hand-written pronunciation rules is used to split the text into syllables, and CARTs are built using linguistic and acoustic context, as described earlier. The appropriate sequence of units for synthesising the output speech is chosen by performing a Viterbi search through a set of target clusters. The set of units with the optimum weighted sum of target and concatenation cost is chosen, and the units are concatenated to synthesise speech.

Prediction of prosodic phrases for Indian languages plays a very crucial role in synthesising good-quality speech, mainly because Indian-language text is seldom punctuated. It is therefore important to have a robust method of predicting prosodic phrase breaks; details are given in Chapter 4. Cues from the text are used to build CARTs to predict prosodic phrase breaks. Word-terminal syllables and case markers are used as features to build the CARTs, which are later used to predict pauses during synthesis.

3.2 Preliminary Experiments and Results

Systems with both pruned and unpruned databases for USS were built for the two languages. Pruning was based on the parameters duration, average STE and average f0. Pairwise comparison listening tests [43] were performed to evaluate the effect of pruning with different values of k in k σ. The results of the subjective listening tests for 0.25 σ and 0.75 σ are presented in Tables 3.1 and 3.2. Pruning was also performed for other values of the standard deviation, 0.5 and 1 times σ; based on informal listening tests, these systems were excluded from the subjective evaluation.

The systems were evaluated subjectively using pairwise comparison tests [43]. This test consists of giving a preference between synthesised sentences of two systems, A and B, with the text for both remaining the same. The first part of the test is an A-B test, where the synthesised sentences of system A are always played first against those of system B, and vice versa in the B-A test. The score A-B+B-A gives an overall preference for system A against system B and is calculated by the following formula:

A-B+B-A = (A-B + (1 - B-A)) / 2
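The combined score can be computed as in the sketch below; interpreting B-A as the fraction of listeners preferring system B in the reversed-order test is an assumption about the score's convention.

# Sketch: combined preference score from A-B and B-A listening tests.
def combined_preference(a_b, b_a):
    # a_b: fraction preferring system A in the A-B (A played first) test.
    # b_a: fraction preferring system B in the B-A (B played first) test.
    return (a_b + (1.0 - b_a)) / 2.0

# A value above 0.5 indicates an overall preference for system A.
print(combined_preference(0.6, 0.3))  # -> 0.65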

In our evaluation, the proposed system is always system A and the default system is always B. The test was conducted with 15 listeners on a set of 10 synthesised sentences. We follow these conventions when referring to the systems built using the pruned databases (System P and System Q).

Figure 3.3: Distribution of the acoustic parameters (duration, average energy and average f0) for the syllable सी_end (see_end); each histogram shows the number of syllables against the parameter value.

Table 3.1: Pairwise comparison tests for Hindi (A-B, B-A and A-B+B-A scores for Systems P and Q).

Table 3.2: Pairwise comparison tests for Tamil (A-B, B-A and A-B+B-A scores for Systems P and Q).

From the USS results, we observe that the overall preference is for the proposed systems, except in the case of system P against the default system for Tamil, where the preference is for the default system. The most preferred, though, is system S, which is pruned to 0.25 σ on all three parameters. This result is consistent with our initial observation that reducing acoustic variability improves synthesis quality. As seen from the Hindi USS results, pruning using a combination of the three parameters and excluding units lying outside the region of 0.25 σ definitely boosts system performance. Although there is an improvement for both Hindi and Tamil, the improvement for Tamil is not as significant as it is for Hindi. This is mainly due to the agglutinative nature of Tamil, because of which the language is replete with geminates; many segmentation errors can occur when geminates are encountered, which causes a drop in system performance. Figure 3.3 shows the distribution of duration, average STE and average f0 for all instances of the Hindi syllable सी_end (see_end). It can be seen from the figure

that the deviation of the above three parameters is not very high. Therefore, choosing a smaller value of the standard deviation would suffice.

3.3 Using pruning to improve the quality of phone-based HTS systems

Since pruning using acoustic cues helped in improving the quality of USS systems, it was proposed that it could also be used, indirectly, to improve the quality of HTS systems. Since HTS systems are phone-based, only the monophones corresponding to the unpruned syllables were used to initialise an HTS system. Once pruning using the method described in Section 3.1 was performed and the monophones corresponding to the pruned syllables were excluded, the remaining monophones were used to build monophone HMMs. Re-estimation and forced alignment were then performed iteratively to obtain accurate monophone labels. These monophone labels were then used to initialise the models for an HTS system. The building process is the same as the one mentioned in Section 2.2. Listening tests were performed to evaluate how important this initialisation is to an HMM-based system.

Results of listening tests conducted to evaluate the effect of pruning on the HTS system

Pairwise comparison listening tests, as described in Section 3.2, were conducted to evaluate the effect of initialising HMMs using the monophones of the pruned databases. The proposed system was compared with the system initialised without the pruned models; the results are given in Table 3.3. The test was conducted with 15 listeners on a set of 10 synthesised sentences.

Table 3.3: Pairwise comparison tests for Hindi and Tamil to evaluate the performance of the HTS system after initialisation using pruned models (A-B, B-A and A-B+B-A scores).

From the results above, it can be concluded that initialising an HTS using the monophones retained after syllable pruning does improve the performance of the HTS, and that this system is preferred over the HTS whose models are initialised using the output of the automatic segmentation algorithm. This shows that initialising the HTS with accurate segments is crucial to improving synthesis quality. Here again, the improvement in performance for Tamil is not as significant as it is for Hindi. This can again be attributed to the agglutinative nature of Tamil, which makes segmental correction very difficult.

3.4 Pruning speech databases using likelihood as an additional cue

It can be clearly seen from the results in Section 3.3 that pruning the speech database using acoustic cues has a significant effect on the quality of speech synthesis for USS. Although this is true for all the languages, further analysis showed that a few erroneous units were still retained, which made the performance of the system inconsistent. A slightly modified version of the algorithm described in Section 3.1 was therefore proposed. Since most of the units that remain after pruning the speech database can be assumed to be units with acoustic consistency and accurate segmentation, the monophones corresponding to these were used to initialise monophone models. Syllable-level forced Viterbi alignment was then performed using these models, and it was observed that the syllable labels obtained using this method were very accurately segmented. It can be seen from Figure 3.4 that the erroneous syllable boundaries given by the hybrid segmentation approach are corrected when syllable segments are obtained using the pruned monophones. The dotted black line indicates the erroneous syllable boundary, and the region in yellow enclosed

within the solid black lines indicates the correct segment for that same syllable.

Figure 3.4: Comparison between the syllable segments obtained using the two approaches

After correcting the syllable boundaries using the aforementioned approach, pruning was performed again, and this time the likelihood obtained while performing forced alignment at the syllable level was used as an additional cue. The algorithm for the two-pass pruning is as follows:

- The duration, average f0 and average STE are computed for each unit using all instances of that particular unit.
- The mean (µ) and standard deviation (σ) of these parameters are then computed for each unit.
- The units lying outside the region specified by some fraction of σ for all three parameters are tagged with a special symbol so that they do not participate in synthesis.
- Only units with more than 10 occurrences in the database are considered for pruning.
- If more than 50 occurrences of a unit remain after pruning, the first 50 occurrences are retained.
- For pruning back-off units (aksharas and monophones), only the units corresponding to the syllables that have not been pruned are retained.
- Initialise HMMs using the monophones that have not been excluded.
- Perform initialisation, re-estimation and forced alignment at the monophone level iteratively until the monophone segments obtained are accurate.

- Use these re-estimated monophone models to perform forced Viterbi alignment at the syllable level.
- The duration, average f0, average STE and likelihood are then extracted for each unit.
- The mean and standard deviation (σ) of the average f0, average STE, duration and likelihood are then computed for each unit.
- The units lying outside the region of 0.25 σ for average f0 and average STE, and those whose duration lies more than 0.25 σ below the mean, are tagged with a special symbol. The longer units are retained.
- Among the units that have not been excluded, the ones with a likelihood value more than 0.25 σ below the mean are excluded. The units with greater likelihood values are retained.
- Only units with more than 10 occurrences in the database are considered for pruning.
- If more than 50 occurrences of a unit remain after pruning, the first 50 occurrences lying within the specified fraction of σ are retained.
- For pruning back-off units, only the units corresponding to the syllables that have not been pruned are retained. The other back-off units, i.e. aksharas and monophones, are excluded.

In the second pass, the reason for not excluding units with longer durations was that previously conducted informal tests had suggested that using long, well-articulated syllable units improves synthesis quality. Such units would in any case be excluded by the likelihood criterion if they were erroneous segments.
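A minimal sketch of this second pruning pass follows, applied to the instances of one unit type. Each instance dict additionally carries a "loglik" key with the forced-alignment score; the data layout and the way the four criteria are combined in one pass are simplifying assumptions.

# Sketch of the second pruning pass with likelihood as an extra cue.
# As described above, only the lower bound is applied to duration (long
# units are retained) and to likelihood (low-scoring segments are dropped).
from statistics import mean, stdev

def second_pass(insts, k=0.25):
    # insts: all instances of one unit type; assumed to have several members.
    def bounds(feat):
        vals = [i[feat] for i in insts]
        m, s = mean(vals), stdev(vals)
        return m - k * s, m + k * s
    lo_f0, hi_f0 = bounds("f0")
    lo_e, hi_e = bounds("energy")
    lo_d, _ = bounds("duration")     # lower bound only: keep long units
    lo_l, _ = bounds("loglik")       # drop low-likelihood (bad) segments
    return [i for i in insts
            if lo_f0 <= i["f0"] <= hi_f0
            and lo_e <= i["energy"] <= hi_e
            and i["duration"] >= lo_d
            and i["loglik"] >= lo_l]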

USS systems were built using the labels obtained with this approach, and the results of the listening tests conducted are given below.

Results of listening tests conducted to evaluate pruning using the likelihood approach

Pairwise comparison listening tests were conducted to evaluate the new pruning technique. The evaluation methodology is as described in Section 3.2. Each of the systems was compared with the system built using the previous pruning technique. The results are given in Table 3.4. The test was conducted with 15 listeners on a set of 10 synthesised sentences. In this case, A is the system pruned using the additional acoustic cue of likelihood, while B is the system pruned using the old method described in Section 3.1.

Table 3.4: Pairwise comparison test results for Hindi and Tamil for the systems pruned using likelihood as a criterion (A-B, B-A and A-B+B-A scores).

From the results obtained, it can be seen that pruning using the two-pass technique results in a significant improvement in the performance of the USS. The improvement for Tamil is considerably more significant than for Hindi. This shows that the agglutinative nature of Tamil causes the segmentation to be erroneous in many cases, which is effectively corrected during forced Viterbi alignment using the pruned monophone models.

3.5 Summary

In this chapter, we propose a new approach to pruning speech databases for syllable-timed Indian languages. Given the various issues faced in developing a syllable-based USS system, this pruning technique proves to be very useful. It has been shown that pruning using appropriate prosodic criteria results in a database that can be used to build better-quality speech synthesisers. Pairwise comparison of the standard USS systems with the pruned versions shows that pruning using duration, STE and f0 is preferred, and that allowing only a very small deviation in the prosodic parameters of the individual units is sufficient. Further, using the unpruned units to initialise monophone HMMs for an HTS results in an improvement in the quality of the HTS. Since these models were precise enough to improve the HTS, they were also used to obtain syllable and monophone labels for a USS, whereby many segmentation errors were corrected. When these labels were pruned and used to build a USS, the result was better than with the old pruning technique. In the new technique, the likelihood score obtained from forced alignment was used as an extra cue in addition to the average f0, average STE and duration.

CHAPTER 4

Prosodic Phrase Break Prediction

In natural speech, humans tend to group words together, with noticeable pauses between the groups. These groups are called prosodic phrases, and the pause between them is called a prosodic phrase break. A prosodic phrase, known linguistically as the intonational phrase in the prosodic hierarchy (Figure 4.1), is a segment that occurs within one prosodic contour. Prosodic phrases help in understanding the semantics of a spoken utterance and have been found to be very important in the context of TTS. Prosodic phrase breaks also contribute to the rhythm of a spoken utterance [16] and have been found to enhance the quality of synthesised speech. This chapter describes a knowledge-based approach which uses cues from the text to predict prosodic phrase breaks.

Figure 4.1: An example of the prosodic hierarchy

4.1 Importance of Prosodic Phrase Break Prediction

Prosodic phrase break prediction is the task of breaking the given text into meaningful chunks of information. It improves the quality of text-to-speech synthesisers because it inserts pauses in the synthesised speech wherever required and makes the synthesised speech more meaningful. Previously conducted informal listening tests showed that inserting these pauses in the synthesised speech was crucial: if there was no pause, the listeners perceived the units of speech to be

overlapping wherever they expected a prosodic phrase break. This resulted in the TTS systems scoring very poorly on the mean opinion score (MOS) test. Therefore, an efficient strategy to predict prosodic phrase breaks was deemed necessary.

4.2 Challenges faced in prosodic phrase break prediction for Indian languages

Prosodic phrase break prediction is especially hard in the case of Indian languages because certain characteristics of Indian-language texts make it very difficult to disambiguate the semantics of the utterance to be spoken. The following are some of the challenges faced.

Lack of punctuation and POS taggers

The most crucial part of a text-to-speech synthesis system is to derive as much relevant information from the text as possible, which can then be used to achieve high-quality speech synthesis. Prosody prediction for languages such as English is handled using rule-based approaches that depend on punctuation and part-of-speech (POS) tags. These conventions, called Tones and Break Indices (ToBI), are widely used in English TTSes; they have also been developed for Japanese, French and many other languages. Indian languages, though, do not have well-developed POS taggers and generally lack punctuation except for the full stop at the end of sentences, which makes the task harder.

Agglutinative nature of Indian languages

Many of the languages in India are agglutinative; examples are Tamil and Telugu. Agglutinative languages are those in which multiple words can be combined and spoken as a single word. In these cases, the meaning does not change, while the prosodic characteristics can vary significantly. An example of this in Tamil is the three words வந்து-கொண்டு-இருக்கிறான் (vandu-kondu-irukkiraan), which can be spoken in isolation or as a complex word, as in வந்துகொண்டிருக்கிறான் (vandukondirukkiraan). Although these words, whether spoken in isolation or as a single word, mean the

exact same thing, the prosodic hierarchy in the two cases changes significantly.

Low-resourcedness of Indian languages

Most of the Indian languages are low-resourced, in that there are rarely any accurately annotated text corpora available for them. Therefore, developing approaches that use machine learning to learn the task from a huge text corpus is not feasible. It is due to issues such as these that knowledge-based approaches to prosodic phrasing were found necessary in the context of Indian-language TTSes. Prosodic phrase break prediction for Indian languages has been included in the Festvox framework. Since phrase break prediction in the Festvox framework happens at the word level, only features at the word level and higher were used to build CARTs. Phrase break prediction for the two languages dealt with in this thesis is described in the following sections.

4.3 Case Markers for Prosodic Phrasing

To understand the phrasing pattern in Hindi, the text transcription first needs to be very precise. Pauses were marked manually in the text by listening to the sentences in the database and marking commas in the text wherever the speaker paused. Corrections were also made to the text if there was a disparity between the text transcription and the recording. On doing this for Hindi, it was found that there were certain monosyllabic words in the text which had a very high probability of being followed by a pause. A list of a few of these words is given in Table 4.1. These words, known as case markers, were used as cues to perform phrase break prediction. A CART was built to predict pauses using the following textual features (a sketch of such a classifier follows this list):

- identity of the present word
- identity of the previous word
- identity of the next word
- position of the word in the phrase
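The sketch below illustrates a CART-style classifier over exactly these word-level features. scikit-learn is used here as a stand-in for the Festival CART tools, and the toy pause-annotated sentence and romanisation are assumptions for illustration.

# Sketch: CART-style phrase break prediction from word-level features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def features(words, i):
    return {"cur": words[i],
            "prev": words[i - 1] if i > 0 else "<s>",
            "next": words[i + 1] if i < len(words) - 1 else "</s>",
            "pos_in_phrase": i}

# Toy pause-annotated sentence: label 1 = word followed by a phrase break.
words = ["vaha", "ghar", "mein", "hei", "aur", "so", "rahaa", "thaa"]
labels = [0, 0, 0, 1, 0, 0, 0, 1]

vec = DictVectorizer()
X = vec.fit_transform(features(words, i) for i in range(len(words)))
tree = DecisionTreeClassifier().fit(X, labels)
print(tree.predict(vec.transform([features(words, 3)])))  # expect a break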

Figure 4.2: A portion of the CART used for predicting phrase breaks for Hindi

Table 4.1: Probabilities of Hindi case markers and Tamil word-terminal syllables (along with their notation in the common label set format [1]) being followed by phrase breaks

    Hindi                Tamil
    है (hei)     0.93    வே (we)       0.51
    थी (thii)    0.91    னால் (naal)    0.62
    था (thaa)    0.42    வும் (vum)     0.43
    पर (par)     0.44    க_v (ka)      0.34
    को (ko)      0.33    யை (yai)      0.46

An example portion of the CART used to predict pauses for Hindi is shown in Figure 4.2. In the figure, P(B) corresponds to the probability of a phrase break and P(NB) to the probability of no break. Using this CART, it was found that pauses could be marked with an accuracy of 89%.
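Probabilities such as those in Table 4.1 can be estimated directly from the pause-annotated transcriptions described above. A minimal sketch follows; the annotation convention (a comma marking a pause after the preceding word) and the minimum-count cutoff are assumptions.

# Sketch: estimating P(break | word) from pause-annotated text.
from collections import Counter

def break_probabilities(annotated_sentences, min_count=5):
    occurrences, breaks = Counter(), Counter()
    for sent in annotated_sentences:
        for token in sent.split():
            word = token.rstrip(",")
            occurrences[word] += 1
            if token.endswith(","):          # comma marks a pause here
                breaks[word] += 1
    return {w: breaks[w] / occurrences[w]
            for w in occurrences if occurrences[w] >= min_count}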

4.4 Word-terminal syllables for prosodic phrasing

Unlike Hindi, phrase breaks in Tamil could not be predicted using simple word-identity features alone. This is mainly because Tamil is an agglutinative language, and the identity of a word is lost when it is merged with other words. Therefore, in the case of Tamil, the identity of word-terminal syllables was used to build a CART to predict phrase breaks. Examples of word-terminal syllables used for Tamil are given in Table 4.1. The textual features used to build CARTs for Tamil are the same as those for Hindi. The only additional features are those for word-terminal syllables:

- identity of the word-terminal syllable of the present word
- identity of the word-terminal syllable of the previous word
- identity of the word-terminal syllable of the next word

Figure 4.3: A portion of the CART used for predicting phrase breaks for Tamil

An example portion of the CART used to predict pauses for Tamil is shown in Figure 4.3. Using this CART, it was found that pauses could be marked with an accuracy of 86%. Using this method to predict prosodic phrase breaks, it was found that the quality of synthesis improved considerably for both Hindi and Tamil.

4.5 Experiments and Results

Pairwise comparison listening tests were conducted to evaluate the performance of pause prediction. The results of the tests were evaluated as described in Section 3.2 and are given in Table 4.2. The listening test was conducted with 15 listeners on a set of 20 synthesised sentences. For both languages, A is the system with prosodic phrase break prediction and B is the system without it.

Table 4.2: Results of pairwise comparison tests for the Hindi and Tamil USS comparing systems with and without prosodic phrasing (A-B, B-A and A-B+B-A scores).

From Table 4.2 it can be seen that the system with phrase break prediction is given a higher preference for both Hindi and Tamil. The improvement for Tamil, however, is not as significant as it is for Hindi. This can be attributed to the agglutinative nature of Tamil, which makes prosody prediction for Tamil a hard task.

4.6 Summary

In this chapter, methods to predict prosodic phrase breaks using cues from the text are described. The lack of punctuation and of efficient methods of POS tagging for Indian languages makes this task harder; a knowledge-based approach therefore had to be developed. Case markers and word-terminal syllables were identified as cues to predict prosodic phrase breaks. It was found that these cues were effective in predicting prosodic phrases, and that the quality of synthesis improved when prosodic phrase break prediction was used.

CHAPTER 5

Analysing the Effects of Phrase Structure and Syllable Rhythm on the Prosody of Syllable-Timed Indian Languages

Stress, intonation, intensity and rhythm of speech are factors that generally characterise the prosody of a speaker [44]. Since prosodic features are suprasegmental, segmental correction of prosody alone is inadequate. Speech can be grouped into prosodic units called phrases. Chapter 4 describes breaking a given text into phrases using cues from the text alone; phrase break prediction deals more with the rhythmic aspect of speech. The work in this chapter analyses the textual factors that can be used to predict the tonal elements of a spoken utterance. Two criteria are proposed in this chapter to analyse the tonal aspects of speech. The first criterion focuses on correlating the acoustic elements of a phrase with similarities in text patterns, for a speech database of declarative sentences. This criterion is then used to define a modified acoustic cost measure, which is used along with the traditional acoustic cost to select units for a syllable-based unit selection text-to-speech synthesiser. The second criterion correlates two phrases based on the similarities in their syllable rhythm. This analysis showed that phrases with similar rhythmic patterns also have similar prosodic characteristics.

5.1 Structural Similarity

In [45] and [46], it is shown that features such as the syllable structure, the position of a syllable in a word, and the number of syllables in the word play a crucial role in prosody. A thorough analysis was performed to decide on the features to be extracted from the text. The aim was to find a pair of phrases that could be matched in terms of patterns in their text, and then to see whether they could be correlated

in any way. To analyse prosody, a new measure called structural similarity is proposed, which matches a pair of phrases in terms of the following parameters (a sketch of such a matching function follows this list):

1. Position of the phrase in the sentence: on analysing the prosodic characteristics of phrases uttered at different parts of a sentence, it was observed that the characteristics of phrases differ depending on their position in the sentence.

2. Number of words in the phrase: the number of words in a phrase is highly correlated with the duration of the phrase, and the duration of a phrase is indicative of the rhythm of the utterance.

3. Number of syllables in each word of the phrase: this can be used as a measure of rhythm; syllables towards the end of a word tend to get shorter as the word gets longer.

4. Syllable structure of each syllable in the phrase (V, CV, CVC, etc.): most Indian languages contain syllables with a very simple structure. It can be seen from Figure 5.1 that most of the syllables in Hindi and Tamil are of the form CV, VC, CVC and V. Also, [47] and [42] have shown that the structure of a syllable affects the rhythm of a spoken utterance.

5. Position of the syllable in the word: the articulatory properties of syllables change when they are uttered at different positions in the word.

Figure 5.1: (a) Pie chart showing the number of syllables belonging to each structure type for Hindi. (b) Pie chart showing the number of syllables belonging to each structure type for Tamil.
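The following is a minimal sketch of scoring two phrases on the parameters listed above. The phrase representation (a dict of position, word count, per-word syllable counts and per-syllable structure labels) and the unweighted scoring are illustrative assumptions; the position of each syllable in its word is captured implicitly by the element-wise comparisons.

# Sketch: structural similarity between two phrase descriptions.
def structural_similarity(p, q):
    score = 0
    score += p["position"] == q["position"]     # position of phrase in sentence
    score += p["n_words"] == q["n_words"]       # number of words in the phrase
    # number of syllables per word, compared word by word
    score += sum(a == b for a, b in zip(p["syls_per_word"], q["syls_per_word"]))
    # syllable structure (CV, CVC, ...), compared syllable by syllable
    score += sum(a == b for a, b in zip(p["structures"], q["structures"]))
    return score

def best_match(target, database):
    # Even without an exact match, the closest phrase is returned.
    return max(database, key=lambda q: structural_similarity(target, q))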

Figure 5.2: (A) Waveform, pitch and energy contours of a Hindi phrase. (B) Waveform, pitch and energy contours of a Hindi phrase which is structurally similar to (A). (C) Waveform, pitch and energy contours of the phrase obtained by transplanting the prosodic contour of (B) onto (A).

Transplantation

Initial experiments involved transplanting the prosodic characteristics of one structurally similar phrase onto another and observing whether there was any degradation in naturalness. Transplanting prosody means resynthesising one of the phrases using the prosodic features of the other. A phase vocoder was used to perform time-scale modification, and pitch-synchronous overlap-add was used for pitch-scale modification. Informal listening tests showed that the degradation in the naturalness of the resulting utterance was considerably small. An example of such a transplanted phrase is shown in Figure 5.2. In this figure, (C) is the resultant utterance and (B) is the utterance whose prosodic characteristics were transplanted onto (A). It can be seen from the figure that there are no striking similarities in the prosodic contours of the reference phrase and the structurally similar phrase, in spite of which transplanting prosody between them resulted in no significant loss in naturalness.
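A crude sketch of such a transplantation is given below. librosa's phase-vocoder time stretch and its pitch shifter stand in for the modification techniques used in this work, and matching only the overall duration and the median pitch is a simplifying assumption: the actual transplantation operates on the full prosodic contours. The file names are placeholders.

# Sketch: crude prosody transplantation between two phrases.
import librosa
import numpy as np
import soundfile as sf

src, sr = librosa.load("phrase_a.wav", sr=None)   # phrase to modify
ref, _ = librosa.load("phrase_b.wav", sr=sr)      # structurally similar phrase

# Match the overall duration of the reference phrase.
out = librosa.effects.time_stretch(src, rate=len(src) / len(ref))

# Match the median pitch (shift expressed in semitones).
def median_f0(y):
    return np.median(librosa.yin(y, fmin=60, fmax=400, sr=sr))

n_steps = 12 * np.log2(median_f0(ref) / median_f0(out))
out = librosa.effects.pitch_shift(out, sr=sr, n_steps=float(n_steps))
sf.write("transplanted.wav", out, sr)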

"Similarity to the original utterance" listening tests were conducted to verify this claim. In this test, listeners are asked to rate the phrase with transplanted prosody on a scale of 1-5, based on how similar they felt the transplanted phrase was to the original. The results are given in Table 5.1. The test was conducted with 15 listeners on a set of 15 phrases.

Figure 5.3: Plot showing the range of scores in the "similarity to the original utterance" tests for Hindi and Tamil

Table 5.1: "Similarity to the original utterance" scores for Hindi and Tamil.

It is evident from the results that transplanting prosody between structurally similar phrases results in a minimal loss of naturalness. From Figure 5.3 it can be seen that for Hindi the degradation from the original is considerably small, while for Tamil there is some degradation. The greater degradation for Tamil is probably due to its agglutinative nature. Since this criterion shows minimal degradation under transplantation, it was proposed that it be used to define a new cost measure for selecting units in a USS system.

Application of structural similarity to USS

Transplanting prosody between two structurally similar phrases results in minimal degradation in naturalness. It was therefore proposed that, during synthesis, acoustically similar units should be chosen from structurally similar phrases in the database.

Training the USS

To train the USS, the same steps as mentioned in Section 2.1 are followed. This system, though, has been built purely with syllables and will fail to synthesise any text in which there is a syllable that is not present in the database. The initial steps to

build the system are to use a set of hand-written letter-to-sound (LTS) rules to break the sentences in the text corpus into their respective syllables. These are then used to segment the speech waveforms using the hybrid segmentation algorithm described earlier. After obtaining syllable-level segmentation for all the waveforms in the database, the syllables are clustered using linguistic, acoustic and phonetic criteria, and CARTs are built for each unit. After the usual steps of building a USS, the additional steps to be carried out are pre-clustering the phrases based on the number of words in them and their position in a sentence. The syllable structure of each syllable belonging to each phrase is also stored. The prosodic phrase prediction module has also been included, because during synthesis the sentence to be synthesised has to first be broken down into phrases, after which a structurally similar phrase from the database has to be looked up.

Synthesis

During synthesis, the text to be synthesised is first broken down into phrases using the phrase prediction module described in Chapter 4. Once the phrases are obtained, an exemplar-based approach is used to find a structurally similar phrase in the database. Depending on the position of the phrase in the sentence and the number of words in the phrase, an appropriate phrase from the training corpus is looked up, and the units to be synthesised are selected based on their acoustic similarity to the corresponding units in the structurally similar phrase. The process of selecting a structurally similar phrase from the database is shown in Figure 5.4. The two phrases are matched in terms of the features mentioned in Section 5.1. In the figure, we can see that the two phrases are matched in terms of the number of words in the phrase, the number of syllables per word, and the structure of every syllable in the phrase. Even if an exact match is not found, the phrase in the database that matches the phrase to be synthesised most closely is selected. After selecting the structurally similar phrase and the cluster for every unit, a Viterbi search is performed through the candidate clusters to select the optimal sequence of units. The cost measure used to perform the Viterbi search is slightly modified compared to the cost described in [2], which, in this case,

is referred to as the traditional cost measure.

Figure 5.4: Selecting a structurally similar phrase from the database

Unit selection speech synthesisers traditionally use two cost measures to decide the optimal set of units for synthesis: the target cost and the concatenation cost. In equation (5.1), Cdist(S_i) is the distance of syllable S_i from the centre of its cluster, known as the target cost. Jcost(S_i, S_{i-1}) is the cost of concatenating syllable S_i with the previous syllable S_{i-1}. W is used to weigh the join cost against the target cost. N is the number of syllables in the utterance to be synthesised.

    sum_{i=1}^{N} [ Cdist(S_i) + W * Jcost(S_i, S_{i-1}) ]    (5.1)

The modified cost measure is as follows:

    sum_{i=1}^{N} [ Cdist(S_i) + W * Jcost(S_i, S_{i-1}) + DTW(S_i, S_i^r) ]    (5.2)

where DTW(S_i, S_i^r) is the dynamic-time-warped distance between each candidate unit S_i at position i and its corresponding unit S_i^r in the structurally similar phrase selected from the database. Using the modified cost, the optimal sequence of units is selected, and the units are concatenated to synthesise the output speech. For the process of concatenation, the syllable units are windowed at the unit
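A minimal sketch of the modified cost in equation (5.2) follows. The unit objects are assumed to expose a frames-by-dimensions feature matrix via a "features" attribute, and the target and join cost functions are assumed to be defined elsewhere; both are illustrative assumptions rather than the actual implementation.

# Sketch: the modified Viterbi cost of equation (5.2).
import numpy as np

def dtw_distance(A, B):
    # Classic dynamic time warping between two feature matrices
    # (frames x dims), with Euclidean frame distance.
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def modified_cost(units, ref_units, cdist, jcost, W):
    # units: candidate syllable sequence; ref_units: corresponding units of
    # the structurally similar phrase; cdist/jcost: target and join costs.
    total = 0.0
    for i, u in enumerate(units):
        total += cdist(u)
        if i > 0:
            total += W * jcost(u, units[i - 1])
        total += dtw_distance(u.features, ref_units[i].features)
    return total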


Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition Tim Morris School of Computer Science, University of Manchester 1 Introduction to speech recognition 1.1 The

More information

Aspects of North Swedish intonational phonology. Bruce, Gösta

Aspects of North Swedish intonational phonology. Bruce, Gösta Aspects of North Swedish intonational phonology. Bruce, Gösta Published in: Proceedings from Fonetik 3 ; Phonum 9 Published: 3-01-01 Link to publication Citation for published version (APA): Bruce, G.

More information

Teaching and Learning Mandarin Tones. 19 th May 2012 Rob Neal

Teaching and Learning Mandarin Tones. 19 th May 2012 Rob Neal Teaching and Learning Mandarin Tones 19 th May 2012 Rob Neal Aims of the presentation Reflect on why tones are so challenging for Anglophone learners Review several empirical studies which have examined

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus

Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus Yousef Ajami Alotaibi 1, Mansour Alghamdi 2, and Fahad Alotaiby 3 1 Computer Engineering Department, King Saud University,

More information

The sound patterns of language

The sound patterns of language The sound patterns of language Phonology Chapter 5 Alaa Mohammadi- Fall 2009 1 This lecture There are systematic differences between: What speakers memorize about the sounds of words. The speech sounds

More information

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University Grammars and introduction to machine learning Computers Playing Jeopardy! Course Stony Brook University Last class: grammars and parsing in Prolog Noun -> roller Verb thrills VP Verb NP S NP VP NP S VP

More information

Lecture 1-6: Noise and Filters

Lecture 1-6: Noise and Filters Lecture 1-6: Noise and Filters Overview 1. Periodic and Aperiodic Signals Review: by periodic signals, we mean signals that have a waveform shape that repeats. The time taken for the waveform to repeat

More information

Classifying Manipulation Primitives from Visual Data

Classifying Manipulation Primitives from Visual Data Classifying Manipulation Primitives from Visual Data Sandy Huang and Dylan Hadfield-Menell Abstract One approach to learning from demonstrations in robotics is to make use of a classifier to predict if

More information

APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA

APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA Tuija Niemi-Laitinen*, Juhani Saastamoinen**, Tomi Kinnunen**, Pasi Fränti** *Crime Laboratory, NBI, Finland **Dept. of Computer

More information

Tagging with Hidden Markov Models

Tagging with Hidden Markov Models Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,

More information

Music Mood Classification

Music Mood Classification Music Mood Classification CS 229 Project Report Jose Padial Ashish Goel Introduction The aim of the project was to develop a music mood classifier. There are many categories of mood into which songs may

More information

Speech Production 2. Paper 9: Foundations of Speech Communication Lent Term: Week 4. Katharine Barden

Speech Production 2. Paper 9: Foundations of Speech Communication Lent Term: Week 4. Katharine Barden Speech Production 2 Paper 9: Foundations of Speech Communication Lent Term: Week 4 Katharine Barden Today s lecture Prosodic-segmental interdependencies Models of speech production Articulatory phonology

More information

Language Modeling. Chapter 1. 1.1 Introduction

Language Modeling. Chapter 1. 1.1 Introduction Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

have more skill and perform more complex

have more skill and perform more complex Speech Recognition Smartphone UI Speech Recognition Technology and Applications for Improving Terminal Functionality and Service Usability User interfaces that utilize voice input on compact devices such

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

Unlocking Value from. Patanjali V, Lead Data Scientist, Tiger Analytics Anand B, Director Analytics Consulting,Tiger Analytics

Unlocking Value from. Patanjali V, Lead Data Scientist, Tiger Analytics Anand B, Director Analytics Consulting,Tiger Analytics Unlocking Value from Patanjali V, Lead Data Scientist, Anand B, Director Analytics Consulting, EXECUTIVE SUMMARY Today a lot of unstructured data is being generated in the form of text, images, videos

More information

STATISTICAL parametric speech synthesis based on

STATISTICAL parametric speech synthesis based on 1208 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 6, AUGUST 2009 Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis Junichi Yamagishi, Member, IEEE, Takashi Nose, Heiga

More information

A Segmentation Algorithm for Zebra Finch Song at the Note Level. Ping Du and Todd W. Troyer

A Segmentation Algorithm for Zebra Finch Song at the Note Level. Ping Du and Todd W. Troyer A Segmentation Algorithm for Zebra Finch Song at the Note Level Ping Du and Todd W. Troyer Neuroscience and Cognitive Science Program, Dept. of Psychology University of Maryland, College Park, MD 20742

More information

Developing an Isolated Word Recognition System in MATLAB

Developing an Isolated Word Recognition System in MATLAB MATLAB Digest Developing an Isolated Word Recognition System in MATLAB By Daryl Ning Speech-recognition technology is embedded in voice-activated routing systems at customer call centres, voice dialling

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper Parsing Technology and its role in Legacy Modernization A Metaware White Paper 1 INTRODUCTION In the two last decades there has been an explosion of interest in software tools that can automate key tasks

More information

Modern foreign languages

Modern foreign languages Modern foreign languages Programme of study for key stage 3 and attainment targets (This is an extract from The National Curriculum 2007) Crown copyright 2007 Qualifications and Curriculum Authority 2007

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

OCPS Curriculum, Instruction, Assessment Alignment

OCPS Curriculum, Instruction, Assessment Alignment OCPS Curriculum, Instruction, Assessment Alignment Subject Area: Grade: Strand 1: Standard 1: Reading and Language Arts Kindergarten Reading Process The student demonstrates knowledge of the concept of

More information

Artificial Neural Network for Speech Recognition

Artificial Neural Network for Speech Recognition Artificial Neural Network for Speech Recognition Austin Marshall March 3, 2005 2nd Annual Student Research Showcase Overview Presenting an Artificial Neural Network to recognize and classify speech Spoken

More information

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

4. Multiple Regression in Practice

4. Multiple Regression in Practice 30 Multiple Regression in Practice 4. Multiple Regression in Practice The preceding chapters have helped define the broad principles on which regression analysis is based. What features one should look

More information

MUSICAL INSTRUMENT FAMILY CLASSIFICATION

MUSICAL INSTRUMENT FAMILY CLASSIFICATION MUSICAL INSTRUMENT FAMILY CLASSIFICATION Ricardo A. Garcia Media Lab, Massachusetts Institute of Technology 0 Ames Street Room E5-40, Cambridge, MA 039 USA PH: 67-53-0 FAX: 67-58-664 e-mail: rago @ media.

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

EFFECTS OF TRANSCRIPTION ERRORS ON SUPERVISED LEARNING IN SPEECH RECOGNITION

EFFECTS OF TRANSCRIPTION ERRORS ON SUPERVISED LEARNING IN SPEECH RECOGNITION EFFECTS OF TRANSCRIPTION ERRORS ON SUPERVISED LEARNING IN SPEECH RECOGNITION By Ramasubramanian Sundaram A Thesis Submitted to the Faculty of Mississippi State University in Partial Fulfillment of the

More information

Web-Conferencing System SAViiMeeting

Web-Conferencing System SAViiMeeting Web-Conferencing System SAViiMeeting Alexei Machovikov Department of Informatics and Computer Technologies National University of Mineral Resources Mining St-Petersburg, Russia amachovikov@gmail.com Abstract

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information