A Direct Approach for Speech to Speech Translation Project Report
Kishore Prahallad
Language Technologies Institute, Carnegie Mellon University
skishore@cs.cmu.edu

I. Introduction to Speech Translation

The objective of a speech-to-speech translation (SST) system is to convert speech in one language into speech in another. A typical SST system consists of three components:

1. A speech recognition (ASR) system, which converts speech in the source language into the corresponding text,
2. A machine translation (MT) system, which translates text in the source language into text in the target language, and
3. A speech synthesis (SS) system, which converts text in the target language into its spoken form.

The conventional architecture of such an SST system is shown in Fig. 1. It is a cascade architecture in which the ASR, MT and SS systems are loosely coupled: the output of the ASR system is given as input to the MT system, and the output of the MT system is given as input to the SS system. Given perfect ASR and MT systems, this architecture may be sufficient to achieve the goal of an SST system. However, owing to the limitations of current state-of-the-art ASR and MT technology, there are errors and ambiguities in the output of these components which are propagated to the components that follow. The conventional cascade architecture provides no efficient coordination or feedback among the components to achieve optimal performance. To improve the performance of SST systems, many researchers are working on integrated models using finite state machines [1][2].

In this report, we adopt a different perspective on speech-to-speech translation. We view the source-language speech vectors as an observation sequence produced by the target-language word models. Our speech-to-speech translation system has only two components:

1. Source-speech to target-text recognition (including word ordering and syntax), and
2. Target-text to speech conversion.
To model source-speech to target-text translation we estimate P(T_S | S_S), where S_S is the sequence of source speech vectors and T_S is the target sentence. By Bayes' rule, P(T_S | S_S) ∝ P(S_S | T_S) P(T_S), where P(S_S | T_S) is the likelihood and P(T_S) is the language model of the target language. P(S_S | T_S) directly models the cross-language correspondence between target text and source speech, without any intermediate steps. The cross-language word models can be trained using a fully connected trellis and continuous distribution models. In this report we demonstrate how such a direct model can be built for an SST system, and we report the performance of the direct model for a limited-domain Telugu-English SST system. This report is organized as follows: Section II formulates the direct model approach
for the task of speech translation. Section III gives an intuitive account of why a direct model should work. Section IV describes the limited-domain Telugu-English system built using this direct model. Section V discusses the performance of the direct model for speech translation.

Fig. 1. Cascade architecture of a typical speech to speech translation system.

II. Formulation of a Direct Model

Given the acoustics A_s, the goal of an SST system is to obtain the acoustics A_t of the corresponding target-language utterance such that P(A_t | A_s) is maximized. To achieve this goal, conventional SST systems use three components: ASR, MT and TTS systems. The ASR system models P(W_s | A_s), where W_s is the source-language text corresponding to the acoustics A_s. P(W_s | A_s) ∝ P(A_s | W_s) P(W_s), where P(A_s | W_s) is the acoustic model and P(W_s) is the language model of the source language. The MT system models P(W_t | W_s), where W_t is the translation of the source text W_s. P(W_t | W_s) ∝ P(W_s | W_t) P(W_t), where P(W_s | W_t) is the translation model and P(W_t) is the language model of the target language. The TTS system models P(A_t | W_t), where A_t is the acoustic sequence corresponding to the target text W_t. Current state-of-the-art TTS systems use unit selection or HMM-based speech synthesis techniques to obtain natural speech. Using these three components, P(A_t | A_s) is obtained as a product of factored probabilities: P(A_t | A_s) = P(A_t | W_t) P(W_t | W_s) P(W_s | A_s).

Our proposal is to integrate the MT model P(W_t | W_s) and the ASR model P(W_s | A_s) into a direct model given by P(W_t | A_s) ∝ P(A_s | W_t) P(W_t), where P(A_s | W_t) is the cross-language acoustic model and P(W_t) is the language model of the target language. Using this direct model, P(A_t | A_s) = P(A_t | W_t) P(W_t | A_s). This approach uses one cross-language acoustic model and one language model, as opposed to the one acoustic model, one translation model and two language models used in a cascaded SST system.

III. How the Direct Model Works

To understand how a direct model can be built for speech translation, let us review the HMM-based approach used in Statistical Machine Translation (SMT) systems. Consider two example languages S1 and T1, where S1 has two words x1 and x2, and T1 has three words y1, y2 and y3. To build an SMT system translating S1 into T1, we need a set of parallel sentences, which could be as follows:

t1: y3 y1 y2    s1: x1 x2
t2: y1 y2       s2: x1
t3: y3          s3: x2

Each word (y1, y2 and y3) in the target language is represented by one HMM state and is referred to as a word model. These models are trained on the parallel data as follows. Given a translation pair, say t1 and s1, a sentence model is built for t1 from the word models y1, y2 and y3, which are fully connected, i.e., each word is connected to every other word with equal transition probabilities, as shown in Fig. 2(a). This sentence model is aligned with the source text s1 using the Baum-Welch training algorithm. It should be noted that we use a fully connected sentence model because the positional correspondence between target words and source words is not known a priori. Using such a fully connected sentence model, the SMT system gradually learns the association between target words and source words. For example, after observing the sentence pairs (t2, s2) and (t3, s3), the probability counts of y1 and y2 become biased towards x1, while y3 becomes biased towards x2.

Fig. 2. (a) A fully-connected trellis as used in SMT systems. (b) A sequentially connected trellis as used in typical ASR systems.
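As a toy illustration of the fully-connected sentence model described above, the following sketch (variable names are ours, not from the report) builds the trellis for t1 with one HMM state per target word and equal transition probabilities:

```python
import numpy as np

# Toy sentence model for t1 = "y3 y1 y2": one HMM state per target word.
# Because the positional correspondence between target and source words is
# unknown a priori, every state connects to every state (including itself)
# with equal transition probability, as in Fig. 2(a).
words = ["y3", "y1", "y2"]
n = len(words)
A = np.full((n, n), 1.0 / n)   # fully-connected, uniform transitions
pi = np.full(n, 1.0 / n)       # uniform initial-state distribution

# Every row of A is a valid probability distribution.
assert np.allclose(A.sum(axis=1), 1.0)
```

Baum-Welch training then sharpens these initially uniform transition and emission probabilities as associations between target words and source words are accumulated.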
A. Vector Representation for Source Language Words

Let us assume that each source-language word is represented by a set of vectors which have a direct correspondence with the word, either in the text domain or in some transformed domain. Let x1 = g_1, g_2, ..., g_n and x2 = h_1, h_2, ..., h_n. Using these vectors, the parallel data become:

t1: y3 y1 y2    s1: g_1..g_n h_1..h_n
t2: y1 y2       s2: g_1..g_n
t3: y3          s3: h_1..h_n

In this parallel data, each target word is associated with a sequence of source-language vectors. When the vector g_1 is observed on the source-language side, the corresponding target word (say y1 or y2) is likely to emit the next n-1 vectors as well. In other words, a sentence model would be built with word models that are fully connected, but whose self-transition probabilities are higher than the other transition probabilities. This sentence model is similar to the one used in standard SMT (Fig. 2(a)), with the following differences:

- The self-transition probabilities are higher than the other transition probabilities.
- Since g and h are sequences of vectors, the emission probabilities are obtained from a continuous distribution model such as a Gaussian mixture model.

So far, we have not defined the vectors g and h. They could be any features representing the words x1 and x2. For a speech translation system, these vectors are features representing the spoken forms of the words x1 and x2. The process of obtaining these vectors from a speech signal is explained in Section IV-A. However, given the nature of the speech signal, the major issues in building a direct model for a speech translation system are the following:

- The spoken form of a given word is of varying length.
- Each time a word is spoken, it does not yield the same sequence of vectors; rather, the vectors are drawn from an unknown distribution.
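The modified trellis for continuous source-speech vectors can be sketched in the same toy style; the boosted self-transition value and the single-Gaussian emission density below are illustrative choices standing in for the learned transitions and the Gaussian mixture model described above:

```python
import numpy as np

def make_transitions(n_states, p_self=0.8):
    # Fully-connected transition matrix with boosted self-transitions, so a
    # state tends to absorb the whole run of vectors for one spoken word.
    off = (1.0 - p_self) / (n_states - 1)  # remainder shared equally
    A = np.full((n_states, n_states), off)
    np.fill_diagonal(A, p_self)
    return A

def log_gaussian(x, mean, var):
    # Diagonal-covariance Gaussian log-density: a one-component stand-in
    # for the Gaussian mixture emission model used for continuous vectors.
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

A = make_transitions(3)
assert np.allclose(A.sum(axis=1), 1.0)
assert A[0, 0] > A[0, 1]  # self-transition dominates the off-transitions
```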
To accommodate the varying observation length, each word model has its own self-transition probabilities, which can be learned during training. Each word may also be modeled with more than one state, but a method for determining the number of states remains to be explored. To show how this training differs from standard (mono-language) speech recognition training, Fig. 2(b) shows the trellis of the sentence model for t1 when trained on its own acoustics. In mono-language training the sentence model is built as a sequence of word models, so only sequential connections are allowed; moreover, the number of states in each word model differs and is known a priori.

IV. Building a Direct Model for Telugu-English SST

To explore the effectiveness of this approach, we have built a direct model for a limited-domain Telugu-English application. Since no benchmark corpus is available to evaluate the direct model approach, we have developed a limited-domain Telugu-English speech corpus recorded by a single speaker. The corpus contains 78 parallel sentences corresponding to a Telugu-English travel guide application. The 78 Telugu sentences were read out by a male speaker. Each utterance was recorded three times so
that there is more than one example with which to train a statistical model. The recording was done in a typical lab environment using the multimedia facilities available with a Linux desktop. The primary purpose of this corpus is to study whether a direct model can be trained and to explore its effectiveness.

A. Features Representing the Speech Signal

Given a speech signal, features are extracted by short-time (10-30 ms) processing of the signal and are referred to as segmental features. Examples of segmental features are linear prediction cepstral coefficients, Mel-cepstral coefficients, log spectral energy values, etc. [3]. These features represent the short-term spectrum of the speech signal. The spectrum of a speech segment is determined primarily by the shape of the vocal tract. In this work, the spectral features are represented by linear prediction cepstral coefficients [4].

A.1 Pre-processing of the Speech Signal

The speech signal x(n) is pre-emphasized to counteract the spectral roll-off due to glottal closure in voiced speech [5]: x'(n) = x(n) - α x(n-1), where α = 1. Differencing the speech signal in the time domain multiplies the signal spectrum by a linear filter that emphasizes the high-frequency components [3].

A.2 Extraction of Mel-Frequency Cepstral Coefficients

The characteristics of the speech signal are assumed to be stationary over a short duration of time (10-30 ms) [3]. The differenced speech signal is segmented into frames of 10 ms using a Hamming window with a shift of 5 ms. Each frame is passed through a set of Mel-frequency filters to obtain 13 cepstral coefficients. Thus each frame of speech data is represented by a vector of 13 coefficients.

A.3 Feature Scaling

Modeling and recognition are easier if all the acoustic features have roughly the same numerical range. A standard method of scaling the feature vectors is to normalize the features to have zero mean and a specified variance.
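The pre-emphasis, framing, and normalization steps of this section can be sketched as follows (the sampling rate and function names are illustrative, not from the report):

```python
import numpy as np

def preemphasize(x, alpha=1.0):
    # x'(n) = x(n) - alpha * x(n-1); with alpha = 1 this is plain
    # differencing, which emphasizes the high-frequency components.
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frames(x, fs, win_ms=10, shift_ms=5):
    # Segment into 10 ms Hamming-windowed frames with a 5 ms shift.
    win, shift = int(fs * win_ms / 1000), int(fs * shift_ms / 1000)
    n = 1 + (len(x) - win) // shift
    out = np.stack([x[i * shift:i * shift + win] for i in range(n)])
    return out * np.hamming(win)

def scale_features(feats, k=1.0):
    # Normalize each feature dimension to zero mean and a variance set by k.
    return k * (feats - feats.mean(axis=0)) / feats.std(axis=0)

f = frames(preemphasize(np.sin(np.arange(1600) / 10.0)), fs=16000)
print(f.shape)  # 19 frames of 160 samples each at a 16 kHz sampling rate
```

In an actual front end, each windowed frame would then pass through the Mel filter bank to yield the 13 cepstral coefficients per frame described above.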
If X_k is a feature vector, the normalized feature vector is given by X'_k = k (X_k - X̄)/σ, where X̄ is the mean vector, σ is the standard deviation, and k is a constant scaling factor.

B. Training Word Models

To build a direct model, we need to train the target word models with the source-language acoustics. The word models are trained as follows:

1. Given: a target sentence (sequence of words) and the corresponding source-language acoustics.
2. A target sentence model is constructed by concatenating the target word models.
- Each target word model has one or more states with left-to-right transitions and no skip states.
- All the word models are interconnected (i.e., a transition is allowed from any word to any other word). We therefore call the trellis fully connected.
3. The target sentence model is aligned with the source acoustics using the forward-backward algorithm.
4. Steps 2 and 3 are repeated for all the training pairs, and the probability counts are accumulated.
5. Steps 2, 3 and 4 constitute one iteration; the models are re-estimated using the accumulated probability counts.
6. The re-estimated models are used to repeat Steps 2-5 until the stopping criterion is met.
7. For all the experiments reported in this study, we used 10 iterations as the stopping criterion of the training process.

V. Results and Discussion

Once the target word models are trained, the simple evaluation criterion taken up in this project is to study the alignments produced by the direct model. Since this approach has not been studied in the literature, we want to inspect the alignments of the target word models with respect to the source-language acoustics. A sample alignment for the sentence "I want to know." is shown in Table 1.

Table 1: An Example Alignment (time stamps omitted)

Machine-Labeled   Hand-Labeled
#                 #
SIL               SIL
i                 i
know              know
want              want
to
SIL               SIL

In Table 1, the first column gives the machine-generated English labels for the Telugu acoustics, and the second column gives the hand-labeled words for the same acoustic signal. The machine-labeled data demonstrate the advantages of the direct model: the model is able to learn the acoustics across languages. Moreover, the time stamp generated for "to" indeed corresponds to a morpheme change of the word "want" in Telugu, where the marker is attached to the end of the word; hence "to" is aligned to the end of the speech segment corresponding to "want". To evaluate the alignments automatically, we have hand-labeled some of the training sentences.
It should be noted that this labeling is cross-language labeling, i.e., for a given Telugu speech segment the corresponding English word has to be written. Given the effort this takes, we could hand-label 30 sentences in this fashion. Thus the models are trained on 234 (78 × 3) utterances and evaluated on 30 utterances. The evaluation proceeds as follows.
Given the machine-generated time stamps for the words, the nearest time stamps in the reference labels are found. If the machine-labeled word matches the reference word, the score is incremented by one. Performance is reported as (score / total words) × 100.

Table 2 shows the performance of the direct model for different numbers of states and Gaussian components per target word model.

Table 2: Performance of the Direct Model in terms of alignment accuracy

States/Word   Gaussians/State   Accuracy (%)
(the numeric entries of this table did not survive transcription)

As can be observed from Table 2, the alignment accuracy increases with the number of states per word model. The highest accuracy is obtained using 5 states per word model. The last row of Table 2 refers to an experiment in which we used a non-uniform number of states per word model, assigned manually for each word; this setting proved suboptimal compared with using 5 states for every word model.

VI. Conclusions

This report proposed a direct model for a speech to speech translation system which integrates the ASR and MT components. A set of experiments was carried out on a limited-domain speech corpus to investigate whether such models can be trained. The empirical results have shown that they can. Quantitative analysis has shown that an alignment accuracy of about 30% (with reference to hand-labeled data) can be obtained. The evaluation metric used here is naive and the decision is binary (it does not take deviations into account). While this study has shown that the direct model can be trained, a number of issues remain to be resolved.

How to decide on the number of states per word model. In this report, we empirically used uniform or non-uniform numbers of states. A better method might be to train a predictor that predicts the number of states for each target word model.
How to compensate for word models (specifically articles and prepositions) whose acoustics need not be present in the source language. In this study, we took no specific measure to account for this phenomenon. Adding null observations (similar to the use of null states in statistical machine translation) could be explored.

How to build a decoder for this direct model. The direct model captures the acoustics of the target words but does not capture the word order of the target language. A decoder for this
approach would need to resolve this issue; word-spotting techniques, a stack decoder, and sentence-level decoding are some possible directions to explore.

VII. Software Resources

Recording and labeling (required for testing) of the speech data were done in the Festvox environment. Mel-frequency cepstral coefficients (MFCCs) were extracted from the speech signal using the software available from skishore > software. To train the word models, an HMM trainer and decoder were written in Perl specifically for this project. The HMM trainer uses the Baum-Welch algorithm to train the models, and the decoder uses the Viterbi algorithm to find the alignment. The software supports an arbitrary number of states per word model.

VIII. Acknowledgment

I would like to thank Stephan Vogel, Dr. Alan Black, Dr. Robert Frederking, Dr. Ravi Mosur and Dr. James K. Baker for useful discussions and suggestions which have led to the refinement of this idea.

References
[1] H. Ney, "Speech translation: Coupling of recognition and translation," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, 1999.
[2] F. Casacuberta, E. Vidal, and J. M. Vilar, "Architectures for speech-to-speech translation using finite-state models," in Proceedings of the Workshop on Speech-to-Speech Translation: Algorithms and Systems, Philadelphia, 2002.
[3] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, 1993.
[4] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 29, Apr. 1981.
[5] D. O'Shaughnessy, Speech Communication: Human and Machine. Addison-Wesley, 1987.
Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet
More informationMiSeq: Imaging and Base Calling
MiSeq: Imaging and Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and. This course takes 35 minutes to complete. Click Next to continue. Please
More informationLess naive Bayes spam detection
Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems
More informationMachine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
More informationTracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object
More informationTurker-Assisted Paraphrasing for English-Arabic Machine Translation
Turker-Assisted Paraphrasing for English-Arabic Machine Translation Michael Denkowski and Hassan Al-Haj and Alon Lavie Language Technologies Institute School of Computer Science Carnegie Mellon University
More informationMachine Translation. Agenda
Agenda Introduction to Machine Translation Data-driven statistical machine translation Translation models Parallel corpora Document-, sentence-, word-alignment Phrase-based translation MT decoding algorithm
More informationOpen-Source, Cross-Platform Java Tools Working Together on a Dialogue System
Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Oana NICOLAE Faculty of Mathematics and Computer Science, Department of Computer Science, University of Craiova, Romania oananicolae1981@yahoo.com
More informationEfficient Recovery of Secrets
Efficient Recovery of Secrets Marcel Fernandez Miguel Soriano, IEEE Senior Member Department of Telematics Engineering. Universitat Politècnica de Catalunya. C/ Jordi Girona 1 i 3. Campus Nord, Mod C3,
More informationVEHICLE TRACKING USING ACOUSTIC AND VIDEO SENSORS
VEHICLE TRACKING USING ACOUSTIC AND VIDEO SENSORS Aswin C Sankaranayanan, Qinfen Zheng, Rama Chellappa University of Maryland College Park, MD - 277 {aswch, qinfen, rama}@cfar.umd.edu Volkan Cevher, James
More informationSpeech recognition for human computer interaction
Speech recognition for human computer interaction Ubiquitous computing seminar FS2014 Student report Niklas Hofmann ETH Zurich hofmannn@student.ethz.ch ABSTRACT The widespread usage of small mobile devices
More informationSOFTWARE FOR GENERATION OF SPECTRUM COMPATIBLE TIME HISTORY
3 th World Conference on Earthquake Engineering Vancouver, B.C., Canada August -6, 24 Paper No. 296 SOFTWARE FOR GENERATION OF SPECTRUM COMPATIBLE TIME HISTORY ASHOK KUMAR SUMMARY One of the important
More informationComp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition
Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition Tim Morris School of Computer Science, University of Manchester 1 Introduction to speech recognition 1.1 The
More informationEM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
More informationMusic Genre Classification
Music Genre Classification Michael Haggblade Yang Hong Kenny Kao 1 Introduction Music classification is an interesting problem with many applications, from Drinkify (a program that generates cocktails
More informationAn Arabic Text-To-Speech System Based on Artificial Neural Networks
Journal of Computer Science 5 (3): 207-213, 2009 ISSN 1549-3636 2009 Science Publications An Arabic Text-To-Speech System Based on Artificial Neural Networks Ghadeer Al-Said and Moussa Abdallah Department
More informationNon-Data Aided Carrier Offset Compensation for SDR Implementation
Non-Data Aided Carrier Offset Compensation for SDR Implementation Anders Riis Jensen 1, Niels Terp Kjeldgaard Jørgensen 1 Kim Laugesen 1, Yannick Le Moullec 1,2 1 Department of Electronic Systems, 2 Center
More informationUnderstanding CIC Compensation Filters
Understanding CIC Compensation Filters April 2007, ver. 1.0 Application Note 455 Introduction f The cascaded integrator-comb (CIC) filter is a class of hardware-efficient linear phase finite impulse response
More informationREAL TIME TRAFFIC LIGHT CONTROL USING IMAGE PROCESSING
REAL TIME TRAFFIC LIGHT CONTROL USING IMAGE PROCESSING Ms.PALLAVI CHOUDEKAR Ajay Kumar Garg Engineering College, Department of electrical and electronics Ms.SAYANTI BANERJEE Ajay Kumar Garg Engineering
More informationCreating voices for the Festival speech synthesis system.
M. Hood Supervised by A. Lobb and S. Bangay G01H0708 Creating voices for the Festival speech synthesis system. Abstract This project focuses primarily on the process of creating a voice for a concatenative
More information5. Binary objects labeling
Image Processing - Laboratory 5: Binary objects labeling 1 5. Binary objects labeling 5.1. Introduction In this laboratory an object labeling algorithm which allows you to label distinct objects from a
More information(Refer Slide Time: 01:52)
Software Engineering Prof. N. L. Sarda Computer Science & Engineering Indian Institute of Technology, Bombay Lecture - 2 Introduction to Software Engineering Challenges, Process Models etc (Part 2) This
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationMFCC-Based Voice Recognition System for Home Automation Using Dynamic Programming
International Journal of Science and Research (IJSR) MFCC-Based Voice Recognition System for Home Automation Using Dynamic Programming Sandeep Joshi1, Sneha Nagar2 1 PG Student, Embedded Systems, Oriental
More informationConditional Random Fields: An Introduction
Conditional Random Fields: An Introduction Hanna M. Wallach February 24, 2004 1 Labeling Sequential Data The task of assigning label sequences to a set of observation sequences arises in many fields, including
More informationCHANWOO KIM (BIRTH: APR. 9, 1976) Language Technologies Institute School of Computer Science Aug. 8, 2005 present
CHANWOO KIM (BIRTH: APR. 9, 1976) 2602E NSH Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 Phone: +1-412-726-3996 Email: chanwook@cs.cmu.edu RESEARCH INTERESTS Speech recognition system,
More informationCODED SOQPSK-TG USING THE SOFT OUTPUT VITERBI ALGORITHM
CODED SOQPSK-TG USING THE SOFT OUTPUT VITERBI ALGORITHM Daniel Alam Department of Electrical Engineering & Computer Science University of Kansas Lawrence, KS 66045 danich@ku.edu Faculty Advisor: Erik Perrins
More informationAlgorithm & Flowchart & Pseudo code. Staff Incharge: S.Sasirekha
Algorithm & Flowchart & Pseudo code Staff Incharge: S.Sasirekha Computer Programming and Languages Computers work on a set of instructions called computer program, which clearly specify the ways to carry
More informationSpeech Processing Applications in Quaero
Speech Processing Applications in Quaero Sebastian Stüker www.kit.edu 04.08 Introduction! Quaero is an innovative, French program addressing multimedia content! Speech technologies are part of the Quaero
More informationMarathi Interactive Voice Response System (IVRS) using MFCC and DTW
Marathi Interactive Voice Response System (IVRS) using MFCC and DTW Manasi Ram Baheti Department of CSIT, Dr.B.A.M. University, Aurangabad, (M.S.), India Bharti W. Gawali Department of CSIT, Dr.B.A.M.University,
More informationA MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS
A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS Charanma.P 1, P. Ganesh Kumar 2, 1 PG Scholar, 2 Assistant Professor,Department of Information Technology, Anna University
More informationEmail Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
More informationCROP CLASSIFICATION WITH HYPERSPECTRAL DATA OF THE HYMAP SENSOR USING DIFFERENT FEATURE EXTRACTION TECHNIQUES
Proceedings of the 2 nd Workshop of the EARSeL SIG on Land Use and Land Cover CROP CLASSIFICATION WITH HYPERSPECTRAL DATA OF THE HYMAP SENSOR USING DIFFERENT FEATURE EXTRACTION TECHNIQUES Sebastian Mader
More informationCONATION: English Command Input/Output System for Computers
CONATION: English Command Input/Output System for Computers Kamlesh Sharma* and Dr. T. V. Prasad** * Research Scholar, ** Professor & Head Dept. of Comp. Sc. & Engg., Lingaya s University, Faridabad, India
More informationComparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
More informationLog-Likelihood Ratio-based Relay Selection Algorithm in Wireless Network
Recent Advances in Electrical Engineering and Electronic Devices Log-Likelihood Ratio-based Relay Selection Algorithm in Wireless Network Ahmed El-Mahdy and Ahmed Walid Faculty of Information Engineering
More informationTime Domain and Frequency Domain Techniques For Multi Shaker Time Waveform Replication
Time Domain and Frequency Domain Techniques For Multi Shaker Time Waveform Replication Thomas Reilly Data Physics Corporation 1741 Technology Drive, Suite 260 San Jose, CA 95110 (408) 216-8440 This paper
More informationSemantic Video Annotation by Mining Association Patterns from Visual and Speech Features
Semantic Video Annotation by Mining Association Patterns from and Speech Features Vincent. S. Tseng, Ja-Hwung Su, Jhih-Hong Huang and Chih-Jen Chen Department of Computer Science and Information Engineering
More informationB. Raghavendhar Reddy #1, E. Mahender *2
Speech to Text Conversion using Android Platform B. Raghavendhar Reddy #1, E. Mahender *2 #1 Department of Electronics Communication and Engineering Aurora s Technological and Research Institute Parvathapur,
More informationDATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7
DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7 UNDER THE GUIDANCE Dr. N.P. DHAVALE, DGM, INFINET Department SUBMITTED TO INSTITUTE FOR DEVELOPMENT AND RESEARCH IN BANKING TECHNOLOGY
More informationReconfigurable Low Area Complexity Filter Bank Architecture for Software Defined Radio
Reconfigurable Low Area Complexity Filter Bank Architecture for Software Defined Radio 1 Anuradha S. Deshmukh, 2 Prof. M. N. Thakare, 3 Prof.G.D.Korde 1 M.Tech (VLSI) III rd sem Student, 2 Assistant Professor(Selection
More informationThe XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 Yidong Chen, Xiaodong Shi Institute of Artificial Intelligence Xiamen University P. R. China November 28, 2006 - Kyoto 13:46 1
More informationFace Model Fitting on Low Resolution Images
Face Model Fitting on Low Resolution Images Xiaoming Liu Peter H. Tu Frederick W. Wheeler Visualization and Computer Vision Lab General Electric Global Research Center Niskayuna, NY, 1239, USA {liux,tu,wheeler}@research.ge.com
More informationBEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents
More informationFinding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm
R. Sridevi et al Int. Journal of Engineering Research and Applications RESEARCH ARTICLE OPEN ACCESS Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm R. Sridevi,*
More informationA Comparison of Speech Coding Algorithms ADPCM vs CELP. Shannon Wichman
A Comparison of Speech Coding Algorithms ADPCM vs CELP Shannon Wichman Department of Electrical Engineering The University of Texas at Dallas Fall 1999 December 8, 1999 1 Abstract Factors serving as constraints
More information