
Improving Artificial Neural Network Estimates of Posterior Probabilities of Speech Sounds

Doctoral Thesis Proposal

Samuel Thomas
Department of Electrical and Computer Engineering
Johns Hopkins University

Hynek Hermansky (Advisor)
Aren Jansen
Mounya Elhilali

October 14, 2011

Abstract

Speech contains information from at least three sources: the message that is being communicated, the speaker who is communicating, and the environment. In this work we propose several approaches to improve the recognition of the speech sounds that convey information about the message. We use phonemes, which span a few tens of milliseconds of the speech signal, as basic units. Improvements in recognizing these units result in considerable performance gains in applications like automatic speech recognition (ASR), where the goal is to transcribe the message into text, and automatic speaker verification, which uses information in the speaker component to verify the speaker's claimed identity. We propose several approaches to improve phoneme posterior estimates from artificial neural networks. These include the combination of information from multiple acoustic streams and different neural network training architectures. For speech recognition, especially in low-resource scenarios where the amount of training data is limited (for example, 1 hour of training data), features extracted from better phoneme posteriors using the proposed techniques provide significant word recognition improvements. For speaker recognition, these posteriors are used in a recently proposed neural network architecture to give considerable improvements over earlier neural network based approaches. In future work we would like to investigate the enhancement of phoneme posteriors for these applications in noisy environments. In a multistream speech recognition framework, we propose to use statistics derived from phoneme posteriors to determine the reliability of individual streams and how the streams can be effectively combined to derive better posteriors. We would also like to explore how these posteriors can be used for reliable voice activity detection in such environments.

Contents

1 Introduction
  1.1 Overview of Speech and Speaker Recognition Systems
  1.2 Main Contributions
2 Deriving Phoneme Posteriors
  2.1 From the Acoustic Signal to Features
    2.1.1 Spectral Envelope Features
    2.1.2 Modulation Features
  2.2 Estimating Phoneme Posteriors
3 Phoneme Posteriors for Speech Recognition
  3.1 Posterior Features for Low-resource Languages
    3.1.1 Mapping Languages with a Common Phone set
    3.1.2 Training MLPs with Language Specific Output Layers
    3.1.3 Enhancing Acoustic Features with Out-of-language Posteriors
4 Phoneme Posteriors for Speaker Verification
  4.1 Traditional Approaches to Speaker Verification
  4.2 Improvements to Neural Networks for Speaker Verification
5 Conclusions and Future Directions
  5.1 Conclusions
  5.2 Future Directions
    5.2.1 Robust Estimation of Posteriors using Multistream Processing
    5.2.2 Posteriors for Voice Activity Detection

1 Introduction

1.1 Overview of Speech and Speaker Recognition Systems

Acoustic models play an important role in both automatic speech recognition (ASR) and speaker verification tasks. Traditionally, generative models, for example Gaussian mixture models (GMMs), have been used to model the underlying distribution of basic acoustic units in speech. In ASR, GMMs are used with hidden Markov models (HMMs), along with separate modules that model the language and pronunciation, to decode the message [1]. For speaker verification, GMMs are first used to train a universal background model (UBM) that captures the general acoustic space of all speakers [2]. The UBM, which is trained on large amounts of data, is then adapted for each enrolled speaker. During test, acoustic evidence in the form of scores from both the UBM and the claimed speaker model are compared to verify whether the claim is true. In advanced speaker verification systems, factor analysis techniques are used with supervectors formed from GMMs to model speakers [3].

More recently, artificial neural networks are being used as acoustic models for these applications. For speech recognition, discriminatively trained multi-layer perceptrons (MLPs) are used to generate acoustic evidence in the form of posterior probabilities of basic speech units like phonemes. These posteriors are used directly in hybrid HMM-ANN systems [4] or converted to features in the Tandem approach for ASR [5]. Apart from being discriminatively trained, MLPs can derive posteriors from high dimensional features without placing any assumptions on the parametric distributions or statistical independence of the features. Another class of neural networks, auto-associative neural networks (AANNs), has recently been proposed as an alternative acoustic model to GMMs for speaker verification [6]. An AANN is a feed-forward neural network trained to reconstruct its input at its output through a hidden compression layer. Similar to MLPs, AANNs also have several advantages over GMMs when used to model the acoustic space: they relax the assumptions on the distributions of feature vectors and can capture higher order moments. In [7] this neural network approach has been extended to use phoneme posteriors, which provide additional evidence of phonetic classes to better model the acoustic space.

1.2 Main Contributions

In this work we focus on improving speech recognition and speaker verification systems using phoneme posteriors derived from MLPs. This is done by first improving the phoneme posteriors at various levels and then integrating the improved posteriors with each application. The process is carried out at multiple levels.

A. Combining evidence from multiple acoustic feature streams
Phoneme posteriors are derived from short-term spectral envelope and long-term modulation frequency features. These features are derived from sub-band temporal envelopes of speech estimated using Frequency Domain Linear Prediction (FDLP) [8]. While the spectral envelope features are obtained by short-term integration of the sub-band envelopes, the modulation frequency components are derived from the sub-band envelopes in long windows [9]. These features are combined at the phoneme posterior level (Section 2).

B. Different MLP training architectures and schemes
We investigate approaches for building large vocabulary continuous speech recognition (LVCSR) systems for new languages or new domains using limited amounts of transcribed training data. In these low-resource conditions, the performance of conventional LVCSR systems degrades significantly. We propose to train low-resource LVCSR systems with additional sources of information like annotated data from other languages (German and Spanish) and various acoustic feature streams (short-term and modulation features). We train multilayer perceptrons (MLPs) in different configurations on these sources of information for low-resource LVCSR (Section 3).

C. Integration of posteriors with speech recognition and verification systems
For speech recognition systems the improved phoneme posteriors are converted back to features using the Tandem approach. For speaker verification these posteriors are used directly with a mixture of AANNs. The mixture consists of several AANNs connected using posterior probabilities of various broad phoneme classes. Since neural networks are not density models, these posteriors are obtained from a separate MLP classifier trained to estimate the posterior probabilities of the phoneme classes (Section 4). For multistream speech recognition, phoneme posterior probabilities are estimated from separate MLPs trained on different spectro-temporal modulations of speech. Statistics derived from the phoneme posteriors are then used to determine the reliability of the individual streams (Section 5).

Fig. 1 is a schematic of the proposed approach for using MLP based phoneme posteriors for speech and speaker recognition.

Figure 1: Applications of phoneme posteriors for speech and speaker recognition.

2 Deriving Phoneme Posteriors

2.1 From the Acoustic Signal to Features

To extract acoustic features we first analyze speech signals in frequency sub-bands over long temporal segments of the signal. This is done by estimating temporal envelopes in frequency sub-bands using the dual of conventional time domain linear prediction (TDLP). In the same way as TDLP fits an all-pole model to the power spectrum of the signal, the frequency domain linear prediction (FDLP) technique fits an all-pole model to the squared Hilbert envelope. These representations of the speech signal are able to capture fine temporal events associated with transients like stop bursts while at the same time summarizing the signal's gross temporal evolution on timescales of several hundred milliseconds. For phoneme recognition, the FDLP technique is implemented in several steps: first, the discrete cosine transform (DCT) is applied to long segments of speech to obtain a real valued spectral representation of the signal; then, linear prediction is performed on the DCT coefficients to obtain a parametric model of the temporal envelope (a short sketch of this estimation appears at the end of this subsection). After the sub-band temporal envelopes are estimated using FDLP, these envelopes are converted into spectral envelope and modulation frequency features.

2.1.1 Spectral Envelope Features

The Hilbert envelope, which is the squared magnitude of the analytic signal, represents the instantaneous energy of a signal in the time domain. Since the integration of signal energy is identical in the time and frequency domains, the sub-band Hilbert envelopes can equivalently be used for obtaining the sub-band energy based short-term spectral envelope features. This is achieved by integrating the sub-band temporal envelopes in short-term frames (of the order of 25 ms with a shift of 10 ms). These short-term sub-band energies are then converted into 13 cepstral features along with their first and second derivatives.

2.1.2 Modulation Features

The long-term sub-band envelopes from FDLP form a compact representation of the temporal dynamics over long regions of the speech signal. The sub-band temporal envelopes are compressed using a static compression scheme (the logarithmic function) and a dynamic compression scheme. The compressed temporal envelopes are divided into 200 ms segments with a shift of 10 ms. Discrete cosine transforms of both the static and the dynamic segments of the temporal envelope yield the static and the dynamic modulation spectrum respectively. We use 14 modulation frequency components from each cosine transform, yielding a modulation spectrum in the 0-70 Hz region with a resolution of 5 Hz [10].
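The duality between TDLP and FDLP can be made concrete with a minimal single-band sketch in Python. The actual system applies this per frequency sub-band (by windowing the DCT coefficients), and the model order here is an illustrative assumption, not a setting used in our experiments:

    import numpy as np
    from scipy.fft import dct
    from scipy.linalg import solve_toeplitz
    from scipy.signal import freqz

    def fdlp_envelope(segment, order=40, n_points=512):
        # DCT of a long speech segment gives a real-valued spectral representation
        c = dct(segment, norm='ortho')
        # autocorrelation of the DCT coefficients, lags 0..order
        r = np.correlate(c, c, mode='full')[len(c) - 1 : len(c) + order]
        # solve the linear-prediction normal equations (Levinson-Durbin step)
        a = solve_toeplitz(r[:-1], -r[1:])
        # the all-pole spectrum of this model approximates the squared
        # Hilbert envelope of the segment
        _, h = freqz(1.0, np.concatenate(([1.0], a)), worN=n_points)
        return np.abs(h) ** 2

    # hypothetical usage on one sub-band signal of a long analysis segment:
    # envelope = fdlp_envelope(subband_signal, order=160)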

2.2 Estimating Phoneme Posteriors

Once these FDLP based short-term spectral features and long-term modulation features have been extracted, they are used to train a phoneme posterior probability estimator. In our case we use a three layered multilayer perceptron (MLP) to estimate the phoneme posterior probabilities. The network is trained using the standard back propagation algorithm with a cross entropy error criterion. The learning rate and stopping criterion are controlled by the error of the frame-based phoneme classification on the cross validation data. For our phoneme recognition experiments we use MLPs along with the FDLP based features. Each frame is appended with the neighboring 8 frames. The static and adaptive modulation features for each sub-band are stacked together and used as the modulation features.

Since the output of each MLP is a posterior vector for each frame, the posteriors can be combined using different probability combination rules. We combine the posteriors using the Dempster-Shafer (DS) theory of evidence [12] to form a joint posterior feature set. This combination technique weights each stream using an entropy based reliability measure. Fig. 2 shows the schematic of the proposed feature extraction technique for estimating phoneme posteriors.

Figure 2: Schematic of estimating posterior vectors. The final posterior probabilities are derived by combining posteriors from two different representations (spectral features, FDLP-S, and statically and adaptively compressed modulation features, FDLP-M).

Phoneme recognition experiments are conducted on the TIMIT database. The phoneme recognition system in our experiments is based on a hybrid HMM/MLP approach, where the posterior probability estimates of the various phonemes are converted to scaled likelihoods to model the HMM states. In these experiments, posterior probabilities are estimated in a hierarchical manner [11]. Table 1 shows the phoneme recognition accuracies that we obtain using the improved posteriors.

                         FDLP-S    FDLP-M    FDLP-S + FDLP-M
    Phoneme Accuracy

    Table 1: Phoneme recognition accuracies (%) on TIMIT.
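The full Dempster-Shafer combination of [12] is more involved; as a simplified stand-in, the sketch below illustrates the underlying idea of entropy based reliability weighting with an inverse-entropy weighted average of the per-frame stream posteriors. The function name and the exact weighting rule are illustrative assumptions:

    import numpy as np

    def merge_streams(streams, eps=1e-10):
        # streams: list of (frames, n_phonemes) posterior arrays, one per
        # feature stream. A confident (low entropy) frame gets a large weight;
        # this is a stand-in for the Dempster-Shafer combination in the text.
        weights = [1.0 / (-(p * np.log(p + eps)).sum(axis=1, keepdims=True) + eps)
                   for p in streams]
        total = sum(weights)
        merged = sum((w / total) * p for w, p in zip(weights, streams))
        return merged / merged.sum(axis=1, keepdims=True)  # renormalize per frame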

3 Phoneme Posteriors for Speech Recognition

For speech recognition, these improved posteriors are converted back to features using the Tandem approach. The phoneme posteriors are first gaussianized using the log function and then decorrelated using the Karhunen-Loeve Transform (KLT) [5]. This reduces the dimensionality of the feature vectors by retaining only the feature components which contribute most to the variance of the data (a minimal sketch of this conversion appears at the end of this section). We use 25 dimensional features in our Tandem representations, similar to [13]. The proposed features are compared with three other feature extraction techniques: PLP features [14] with a 9 frame context [15], which are similar to the spectral envelope features derived using FDLP (FDLP-S), and M-RASTA features [16] and Modulation SpectroGram (MSG) features [17] with a 9 frame context, which are both similar to the modulation frequency features (FDLP-M). We combine the FDLP-S features with the FDLP-M features using the DS theory of evidence to obtain a joint spectro-temporal feature set (FDLP-S+FDLP-M). Similarly, we derive two more feature sets by combining PLP features with M-RASTA features (PLP+M-RASTA) and MSG features (PLP+MSG). 25 dimensional Tandem representations of these features are used for our experiments. We also experiment with 39 dimensional PLP features without any Tandem processing (PLP-D).

    Features           TOT    AMI    CMU    ICSI    NIST    VT
    PLP-D
    PLP
    FDLP-S
    M-RASTA
    MSG
    FDLP-M
    PLP+M-RASTA
    PLP+MSG
    FDLP-S+FDLP-M

    Table 2: Word Error Rates (%) on RT05 Meeting data, for different feature extraction techniques. TOT - total WER (%) for all test sets; AMI, CMU, ICSI, NIST, VT - WER (%) on the individual test sets [18].

We use these features on an LVCSR task using the AMI LVCSR system for meeting transcription [18]. The training data for this system uses individual headset microphone (IHM) data from four meeting corpora: NIST (13 hours), ISL (10 hours), ICSI (73 hours) and a preliminary part of the AMI corpus (16 hours). MLPs are trained on the whole training set in order to obtain estimates of phoneme posteriors for each of the feature sets. The acoustic models are phonetically state tied triphone models trained using standard HTK maximum likelihood training procedures. The recognition experiments are conducted on the NIST RT05 [19] evaluation data. The Juicer large vocabulary decoder is used for recognition with a pruned trigram language model [20], along with the reference speech segments provided by NIST for decoding and the pronunciation dictionary used in the AMI NIST RT05s system [18]. Table 2 shows the word error rates for these techniques on the RT05 meeting corpus. The proposed features (FDLP-S+FDLP-M) obtain a significant relative reduction of about 14% in WER for the LVCSR task (compared to a relative reduction of 5% for the PLP+M-RASTA and PLP+MSG features).
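A minimal sketch of the Tandem conversion described at the start of this section, under the simplifying assumption that the KLT is estimated on the same posteriors it transforms (in practice it would be estimated on training data):

    import numpy as np

    def tandem_features(posteriors, keep=25, eps=1e-10):
        # gaussianize with the log, decorrelate with the KLT (eigenvectors of
        # the covariance), and keep the highest-variance components
        logp = np.log(posteriors + eps)
        logp -= logp.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(logp, rowvar=False))
        return logp @ vecs[:, np.argsort(vals)[::-1][:keep]]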

3.1 Posterior Features for Low-resource Languages

An important factor that impacts the performance of posterior features for LVCSR is the amount of data used to train the MLP systems. For new languages with only a few hours of transcribed data, the performance of these data driven features is low. A potential solution to this problem is to use transcribed data available from other languages to build models which can be shared with the low-resource language. However, training such systems requires all the multilingual data to be transcribed using a common phone set across the different languages. This common phone set can be derived either in a data driven fashion or using phonetic sets such as the International Phonetic Alphabet (IPA) [21]. More recently, cross-lingual training with Subspace Gaussian Mixture Models (SGMMs) [22] has also been proposed for this task. We propose three different approaches to improve posteriors for low-resource languages.

3.1.1 Mapping Languages with a Common Phone set

In the first approach we explore a data driven approach for finding a common phone set across different languages [23]. In this method we initially train a cross-lingual MLP on data from multiple languages using an available phone set that covers the phonemes of those languages. However, this phone set might be different from that of the low-resource language for which we need to build the LVCSR system. In order to describe the low-resource training data in terms of the cross-lingual phone set, we use a count based approach: the accumulated posterior outputs can be considered as soft counts corresponding to the presence or absence of different phoneme classes. The first step in this approach is to forward pass the low-resource (in-language) training data through the cross-lingual MLP to obtain phoneme posteriors. Using these posterior probabilities (described in terms of the cross-lingual phone set) and their true labels from the low-resource phone set, we estimate the following counts:

c(x) - total instances when a particular label x of the low-resource phone set is present in the input.
c(y) - accumulated posterior value for cross-lingual phoneme y.
c(x,y) - accumulated posterior value for cross-lingual phoneme y when x is the true label.

With these counts, we compute C(x,y) = c(x,y) / (c(x) c(y)). For each label x, the more frequently a particular label y co-occurs with it, the higher the value of C(x,y). This measure can hence be used to map a label in the cross-lingual phone set to a particular label in the low-resource phone set.

In our experiments we first train a cross-lingual MLP using German and Spanish data on a set of 52 phones (the combined set of phonemes which covers the German and Spanish data). One hour of English data (considered as the low-resource language) is forward passed through the cross-lingual MLP to obtain phoneme posteriors in terms of the 52 cross-lingual phones. The true labels for the English data contain 47 English phonemes. Using the mapping technique described above we then determine which phoneme in the German-Spanish set each English phoneme maps to. Each English phoneme is mapped to the phone which gives the highest score in the German-Spanish set. Once the English data has been mapped, the cross-lingual MLPs are adapted using 1 hour of English data. We adapt the MLP by retraining it on the new data after initializing it with its original weights.
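The count-based mapping can be sketched in a few lines; the array names and shapes are illustrative assumptions:

    import numpy as np

    def map_phone_sets(cross_posteriors, true_labels, n_lowres):
        # cross_posteriors: (frames, n_cross) outputs of the cross-lingual MLP
        # for the low-resource training data; true_labels: (frames,)
        # low-resource phone indices for the same frames
        c_xy = np.zeros((n_lowres, cross_posteriors.shape[1]))
        for x in range(n_lowres):
            c_xy[x] = cross_posteriors[true_labels == x].sum(axis=0)  # c(x, y)
        c_x = np.bincount(true_labels, minlength=n_lowres).astype(float)  # c(x)
        c_y = cross_posteriors.sum(axis=0)                                # c(y)
        C = c_xy / (c_x[:, None] * c_y[None, :])  # C(x, y) = c(x, y) / (c(x) c(y))
        return C.argmax(axis=1)  # best cross-lingual phone for each low-resource phone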

Fig. 3 is a schematic of the proposed approach.

Figure 3: Deriving cross-lingual and multi-stream posterior features for low-resource LVCSR systems. Cross-lingual MLPs trained on German and Spanish data and adapted with 1 hour of English data are combined with low-resource MLPs on both modulation and spectral envelope feature streams; the merged posteriors are Tandem processed into features for ASR.

All the data for these experiments are from the LDC CallHome Corpus. We use 30 dimensional Tandem features to train the subsequent single pass HTK based recognizer with 600 tied states and 4 mixtures per state. Table 3 shows the improvements we get by using posterior features over conventional PLP features in a low-resource setting with only 1 hour of data.

    Baseline PLP features                           28.8
    Multi-stream Cross-lingual Tandem features      36.5

    Table 3: Word Recognition Accuracies (%) using multi-stream cross-lingual posterior features.

3.1.2 Training MLPs with Language Specific Output Layers

In our second approach we propose a different MLP architecture and training method for deriving posteriors for low-resource languages. The primary advantage of this new architecture is that it does not require the multilingual data to be mapped to a common phoneme set across the various languages. In the proposed architecture, we train a 4 layer multilayer perceptron. The MLP has a linear input layer with a size corresponding to the dimension of the input feature vector, followed by two non-linear hidden layers and a final linear layer with a size corresponding to the phoneme set of the language the MLP is being trained on. Similar to bottleneck MLPs or the HATS approach, while the dimension of the first hidden layer is high, the second hidden layer is low dimensional and is known as the bottleneck layer. While training on multiple languages with different phoneme sets, the first 3 layers are shared; the last layer, which is specific to the phoneme set of each language, is then modified.

Modifying only this layer allows us to train across different languages. Fig. 4 is a schematic of the proposed architecture for two languages, a high-resource language (with several hours of data) and a low-resource language (with only a few hours of data), each having a different phoneme set.

Figure 4: Block schematic of the proposed MLP training scheme for low-resource languages. The input layer (sized to the input feature set, e.g. PLP features), the expansion layer and the bottleneck layer are common across languages; the bottleneck is introduced to let the network learn a common low dimensional representation among languages. The network is first trained on the high-resource language with its phone set at an intermediate output layer, and the final output layer is then trained on the low-resource language with its phone set. Two kinds of features are derived, from the bottleneck and from the final layer.

We derive two kinds of features for the LVCSR task from these networks:

A. Tandem features - These features are derived from the posteriors estimated by the MLP at the fourth layer. When networks are trained on multiple feature representations, better posterior estimates can be derived by combining the outputs from the different systems using posterior probability combination rules. The phoneme posteriors are then converted to features by gaussianizing the posteriors using the log function and decorrelating them. A dimensionality reduction is also performed by retaining only the feature components which contribute most to the variance of the data.

B. Bottleneck features - Unlike Tandem features, bottleneck features are derived as the linear outputs of the neurons in the bottleneck layer. These outputs are used directly as features for LVCSR without applying any transforms. When bottleneck features are derived from multiple feature representations, the features are appended together and a dimensionality reduction is performed using the KLT to retain only the relevant components.
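A minimal forward-pass sketch of this shared-layer architecture follows; training, which would update the shared layers on data from every language but only the active language's output layer, is omitted, and all sizes and names are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def layer(n_in, n_out):
        return rng.normal(scale=0.1, size=(n_in, n_out))

    class MultilingualMLP:
        # shared input -> expansion -> bottleneck weights, plus one
        # language-specific output layer per language (biases omitted)
        def __init__(self, n_in, phone_sets, n_expand=1000, n_bottleneck=40):
            self.shared = [layer(n_in, n_expand), layer(n_expand, n_bottleneck)]
            self.out = {lang: layer(n_bottleneck, n)
                        for lang, n in phone_sets.items()}

        def forward(self, x, lang):
            h = np.tanh(x @ self.shared[0])
            bn = h @ self.shared[1]                # linear bottleneck outputs = bottleneck features
            logits = np.tanh(bn) @ self.out[lang]  # language-specific output layer
            e = np.exp(logits - logits.max(axis=1, keepdims=True))
            return bn, e / e.sum(axis=1, keepdims=True)  # features, posteriors

    # hypothetical usage with the phone sets from the text:
    # net = MultilingualMLP(351, {'german_spanish': 52, 'english': 47})
    # bottleneck_feats, posteriors = net.forward(frames, 'english')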

Both of these MLP features are derived using two acoustic feature representations: short-term spectral PLP features and long-term modulation features derived using frequency domain linear prediction (FDLP-M). Table 4 summarizes the word recognition accuracies for the same LVCSR task described earlier, with 2 languages (Spanish and German) along with 1 hour of English in a low-resource setting.

    Baseline PLP features                                               28.8
    Tandem features derived from posteriors using Spanish and German
    with 1 hour of English and 2 feature representations                35.8
    Bottleneck features with the same setup                             37.2

    Table 4: Word Recognition Accuracies (%) using multi-stream cross-lingual posterior features.

3.1.3 Enhancing Acoustic Features with Out-of-language Posteriors

In this approach, the acoustic features used to train the low-resource MLPs are enhanced with posteriors derived from large amounts of out-of-language data. Fig. 5 is a schematic of the proposed approach, where posterior features from separate MLPs trained on large amounts of out-of-language data (200 hours of Spanish) are used to enhance the acoustic features used to train low-resource MLPs on smaller amounts of data (M hours of English). The Spanish MLPs are trained on two feature streams: short-term spectral PLP features and long-term modulation features derived using FDLP (FDLP-M). Posterior features from the two acoustic streams (PLP and FDLP-M) are combined at the posterior level. This allows us to obtain more accurate and robust estimates of the out-of-language posteriors for LVCSR. 25 dimensional Tandem representations of these features are appended to the 351 dimensional PLP features used to train the low-resource English nets, as shown in Fig. 5.

Figure 5: Low-resource MLP systems trained with acoustic features enhanced with out-of-language posteriors from multiple acoustic representations.

The comparison of LVCSR results for the baseline HMM-GMM setup (CTS data from CallHome English) and the performance of MLP systems trained with enhanced posteriors from out-of-language MLPs is shown in Fig. 6. The plot summarizes the effect of the enhanced posterior features as a function of the equivalent amount of additional in-language training data. The dotted lines indicate the correspondence of enhanced posteriors to the equivalent performance of the baseline system using conventional PLP features with higher amounts of in-language data. With the enhancement of out-of-language posteriors on 1 hour of in-language data, we obtain an LVCSR performance equivalent to 4 hours of in-language data, an increase of 300% in the amount of in-language training data.

Figure 6: Word accuracy improvements for low-resource LVCSR systems with out-of-language posteriors. The plot shows word recognition accuracy (%) against the amount of in-language data (hours), for in-language data (English) alone and for in-language data enhanced with out-of-language posteriors (Spanish), with dotted lines marking the equivalent in-language performance.

However, the improvements with higher amounts of in-language training data are subsequently lower (for example, starting with 5 hours of in-language data, the improvement using out-of-language posteriors is equivalent to 8 hours of in-language training, an additional increase of 60% on the original 5 hours).

4 Phoneme Posteriors for Speaker Verification

4.1 Traditional Approaches to Speaker Verification

The goal of speaker verification is to verify the truth of a speaker's claimed identity. The majority of current speaker verification systems model the overall acoustic feature vector space using a Universal Background Model (UBM), trained on large amounts of data from multiple speakers. In the enrollment phase, the UBM is adapted to model each enrollment speaker using a limited amount of speech from the speaker. During test, the likelihoods of the test utterance under both the UBM and the claimed speaker model are derived. If the claimed identity of the speaker is true, the likelihood from the claimed speaker model is assumed to be higher than the likelihood of the utterance under the UBM, and vice versa if false. The likelihood ratio of the adapted speaker model and the UBM is hence commonly used as an indication of the target speaker. Both the UBM and the speaker-specific models are typically multivariate single-state GMMs with a large number of mixture components. The GMM assumes that the data is composed of normally distributed clusters, each cluster representing a group of speech sounds. During the adaptation, the components that represent sounds present in the adaptation data are well adapted; the other components remain close to what they were in the UBM. The UBM-GMM method has proved successful and has remained in use for the past two decades.
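A hedged sketch of this UBM-GMM likelihood-ratio scoring, where re-fitting a GMM initialized from the UBM's parameters stands in for the MAP adaptation of [2], and the mixture size and data names are illustrative:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_models(ubm_data, speaker_data, n_mix=512):
        # UBM on pooled multi-speaker data; the speaker model is re-fit from
        # the UBM's parameters - a crude stand-in for MAP adaptation
        ubm = GaussianMixture(n_mix, covariance_type='diag').fit(ubm_data)
        spk = GaussianMixture(n_mix, covariance_type='diag',
                              weights_init=ubm.weights_, means_init=ubm.means_,
                              precisions_init=ubm.precisions_).fit(speaker_data)
        return ubm, spk

    def verification_score(ubm, spk, test_data):
        # average per-frame log-likelihood ratio of the claimed model vs. the UBM
        return spk.score(test_data) - ubm.score(test_data)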

Figure 7: Block schematic of the proposed AANN based speaker verification system. An MLP estimates broad class posteriors from the acoustic features of the test utterance; these posteriors connect the component AANNs of the UBM model and the speaker model, whose average reconstruction errors feed the decision logic.

4.2 Improvements to Neural Networks for Speaker Verification

A more recently proposed alternative for modeling the data distribution is the Auto-Associative Neural Network (AANN). AANNs are feed-forward neural networks with an equal number of input and output nodes, trained to learn an identity mapping from the input to the output layer with a restricted number of nodes at a hidden compression layer. In [6], these networks have been used instead of GMMs for speaker verification. However, the performance of AANN speaker verification systems has so far not matched that of the GMM based systems. We attribute this to the relatively unconstrained way in which AANNs are adapted to target speakers. We propose the following improvements to train these models better:

- Forming several independent class-specific AANNs as a UBM. The composite UBM-AANN is additionally trained using side information about the class of sounds present in the data. We use estimates of the posterior probabilities of 5 broad phoneme classes (vowels, fricatives, plosives, nasals and silence) from a multilayer perceptron (MLP) as this side information.
- Training separate AANNs on different channel conditions - microphone and telephone.
- Adapting the parameters of each class-specific AANN only after the compression layer instead of retraining the entire network, since the adaptation data is limited.

The performance of the proposed modeling technique is evaluated on a decimated set of the 8 core conditions of the NIST 2008 speaker recognition evaluations. We train gender specific UBM-AANNs for both the microphone and telephone conditions. The FDLP features described earlier are used to train these models. For each speaker in the enrollment set we adapt the UBM-AANN to create a speaker specific AANN. The difference between the average reconstruction errors of the UBM and the claimed model is used as the score for each test speaker. The final recognition performance is then computed by finding the EER on these scores.
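The posterior-weighted scoring of the mixture of AANNs can be sketched as below; the networks here carry untrained random weights, training and adaptation are omitted, and the layer sizes are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def init_aann(dim, expand=200, compress=20):
        # expansion -> compression -> linear reconstruction (biases omitted)
        return [rng.normal(scale=0.1, size=s)
                for s in [(dim, expand), (expand, compress), (compress, dim)]]

    def frame_errors(w, x):
        h = np.tanh(x @ w[0])   # expansion layer
        z = np.tanh(h @ w[1])   # compression layer
        return ((x - z @ w[2]) ** 2).sum(axis=1)  # squared reconstruction error

    def mixture_score(class_aanns, x, broad_post):
        # broad_post: (frames, 5) broad-class posteriors from the MLP; each
        # frame's error is attributed to the class-specific AANNs in
        # proportion to these posteriors (soft counts)
        errs = np.stack([frame_errors(w, x) for w in class_aanns], axis=1)
        return (broad_post * errs).sum() / len(x)

    # verification score = UBM average error minus claimed model average error:
    # score = mixture_score(ubm_aanns, x, post) - mixture_score(spk_aanns, x, post)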

In order to train the composite UBMs and create the speaker specific models, posteriors from MLPs trained on large amounts of conversational telephone and microphone speech are used. We use the proposed features to train these MLPs. The phoneme posteriors obtained at the outputs of these networks are combined appropriately to obtain 5 broad phonetic class posteriors corresponding to vowels, fricatives, plosives, nasals and silence. Table 6 shows the NIST detection cost function (DCF) scores for all 8 conditions (Table 5) using both the conventional GMM based system and the proposed AANN system. A simple weighted combination of scores from both systems improves the performance still further by minimizing the DCF.

    Cond.  Task
    1      Interview speech in training and test.
    2      Interview speech from the same microphone type in training and test.
    3      Interview speech from different microphone types in training and test.
    4      Interview training speech and telephone test speech.
    5      Telephone training speech and noninterview microphone test speech.
    6      Telephone speech in training and test from multiple languages.
    7      English language telephone speech in training and test.
    8      English language telephone speech spoken by a native U.S. English speaker in training and test.

    Table 5: Core evaluation conditions in the NIST 2008 SRE task.

    System     Cond. 1   Cond. 2   Cond. 3   Cond. 4   Cond. 5   Cond. 6   Cond. 7   Cond. 8
    GMM
    AANN
    Combined

    Table 6: Performance of the various systems in terms of min DCF (x 10^-3).

More recently, factor analysis of GMMs has been used as a front-end for extracting lower dimensional representations of the mean supervectors, known as i-vectors, that capture both the speaker and channel variabilities. In a simple i-vector system, the cosine distance between the test and enrollment i-vectors is used as a score. Similarly, we model the adaptation parameters (last layer weights) of the mixture of AANNs in a lower dimensional subspace that captures both speaker and channel variabilities. The learning of the subspace is formulated as a regularized weighted least squares problem. Posterior probabilities play a significant role by serving as soft counts that determine the number of points aligned with each component of the composite AANN in this formulation [7]. The results using the proposed factor analysis are summarized in Table 7 for conditions 6, 7 and 8 of NIST 2008. We use the same UBMs described above. Gender specific 300 dimensional subspaces are trained for the mixture of AANNs. A 400 dimensional total variability space of GMMs is also trained as a baseline. For both approaches, the cosine distance between test and enrollment i-vectors is used as the score [24]. The proposed subspace approach improves on the basic mixture of AANNs system (see Table 6) and combines well with the state-of-the-art GMM i-vector system, yielding a 10% relative improvement in DCF.
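The cosine distance scoring used by both back-ends reduces to a one-liner; the argument names are illustrative:

    import numpy as np

    def cosine_score(w_enroll, w_test):
        # cosine similarity between enrollment and test i-vectors;
        # higher scores support the claimed identity
        return float(w_enroll @ w_test /
                     (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))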

    System                                   Cond. 6   Cond. 7   Cond. 8
    Mixture of AANNs (300 dim. i-vectors)
    GMM (400 dim. i-vectors)
    Score combination

    Table 7: Performance in terms of min DCF (x 10^-3) using subspace approaches.

5 Conclusions and Future Directions

5.1 Conclusions

We have presented novel methods for improving phoneme posteriors, by combining posteriors from different streams and through new training methods for MLPs. We have applied the improved posteriors to a variety of tasks in speech and speaker recognition. The results show the usefulness of the proposed techniques for these applications. In the future we propose to extend this work to noisy environments.

5.2 Future Directions

5.2.1 Robust Estimation of Posteriors using Multistream Processing

In the multistream recognition paradigm for processing corrupted signals, various representations of the signal from different frequency bands of the spectrum are processed and classified in separate processing channels [25]. This is done to adaptively suppress the corrupted channels while preserving the uncorrupted channels for further processing. We pursue this approach by deriving several streams from the power spectrum of speech using a bank of 2D Gabor filters tuned to different spectral (scale) and temporal (rate) modulations [26]. We train MLPs on each of these streams and derive statistics from the estimated posteriors by computing their autocorrelation matrix. The diagonal elements of this autocorrelation matrix reflect the occurrence frequency of each phoneme and the off-diagonal values correspond to the coactivation of different phoneme posteriors. The autocorrelation does not tell us anything about whether the posterior estimates were correct; it merely reflects the first order (diagonal) and second order (off-diagonal) statistics of the estimated posteriograms. However, the off-diagonal elements reflect confusions among phoneme classes, because an ideal posteriogram has only one phoneme active at each time instant. The autocorrelation matrix computed from posteriograms of undistorted speech also summarizes the behavior of each stream in the clean condition. Any additional distortion of the posteriogram, due to any factor, results in a change of these statistics. Thus, computing a measure of similarity between the autocorrelation matrices derived from the clean signal and from the corrupted signal indicates the degradation of the stream due to the distortion. In our initial experiments, using the Pearson's correlation as the measure of similarity seems to be an effective predictor of a stream's recognition accuracy in both clean and distorted cases [27]. We propose to investigate how posteriors from the individual streams can be combined based on these reliability measures to obtain better posterior estimates in noise.
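The proposed reliability measure can be sketched directly from this description; the function names are illustrative:

    import numpy as np

    def posteriogram_stats(posteriors):
        # autocorrelation matrix of a (frames, n_phonemes) posteriogram:
        # the diagonal reflects phoneme occurrence frequencies and the
        # off-diagonal entries reflect phoneme co-activations
        return (posteriors.T @ posteriors) / len(posteriors)

    def stream_reliability(clean_stats, test_posteriors):
        # Pearson correlation between the clean-condition statistics and the
        # statistics of the (possibly distorted) test posteriogram
        a = clean_stats.ravel()
        b = posteriogram_stats(test_posteriors).ravel()
        return np.corrcoef(a, b)[0, 1]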

5.2.2 Posteriors for Voice Activity Detection

In most speech processing systems, the first step in dealing with a speech signal is the reliable detection of speech activity. We propose to explore the use of MLP posteriors for voice activity detection (VAD). For VAD, the MLP phoneme posteriors corresponding to the speech classes can be merged to give a two class posterior probability vector with speech/non-speech probabilities. These probabilities can then be hard thresholded into speech/non-speech decisions. A Viterbi decoder can further be used to smooth the decisions. However, this VAD decoder performs well only under matched training and test conditions. In order to improve the applicability of MLP based VAD in mismatched scenarios, there is a need to develop robust phoneme posterior estimation techniques (especially in noisy and low-resource settings). By deriving improved posteriors using some of the approaches described above, we plan to improve VAD performance for various tasks like speech recognition and speaker verification.
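A minimal sketch of this posterior-based VAD, with median filtering as a simple stand-in for the Viterbi smoothing, and with an illustrative threshold and filter length:

    import numpy as np
    from scipy.signal import medfilt

    def vad(posteriors, silence_idx, threshold=0.5, kernel=11):
        # merging all speech classes is equivalent to 1 minus the silence
        # posterior; hard-threshold, then smooth the binary decisions
        speech_prob = 1.0 - posteriors[:, silence_idx]
        decisions = (speech_prob > threshold).astype(float)
        return medfilt(decisions, kernel_size=kernel) > 0.5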

References

[1] F. Jelinek, Statistical Methods for Speech Recognition, The MIT Press, 1997.
[2] D. Reynolds, T. Quatieri and R. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing, 2000.
[3] P. Kenny, G. Boulianne, P. Ouellet and P. Dumouchel, Joint factor analysis versus eigenchannels in speaker recognition, IEEE Transactions on Audio, Speech, and Language Processing, 2007.
[4] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Springer, 1994.
[5] H. Hermansky, D.P.W. Ellis and S. Sharma, Tandem connectionist feature extraction for conventional HMM systems, in IEEE ICASSP, 2000.
[6] B. Yegnanarayana and S. Kishore, AANN: an alternative to GMM for pattern recognition, Neural Networks, 2002.
[7] G.S.V.S. Sivaram, S. Thomas and H. Hermansky, Mixture of Auto-Associative Neural Networks for Speaker Verification, in ISCA Interspeech, 2011.
[8] M. Athineos and D.P.W. Ellis, Frequency-domain linear prediction for temporal features, in IEEE ASRU, 2003.
[9] S. Thomas, S. Ganapathy and H. Hermansky, Phoneme Recognition Using Spectral Envelope and Modulation Frequency Features, in IEEE ICASSP, 2009.
[10] S. Ganapathy, S. Thomas and H. Hermansky, Modulation Frequency Features For Phoneme Recognition In Noisy Speech, JASA Express Letters, 2009.
[11] J. Pinto, G.S.V.S. Sivaram, M. Magimai-Doss, H. Hermansky and H. Bourlard, Analyzing MLP Based Hierarchical Phoneme Posterior Probability Estimator, IEEE Transactions on Audio, Speech, and Language Processing, 2011.
[12] F. Valente and H. Hermansky, Combination of Acoustic Classifiers based on Dempster-Shafer Theory of Evidence, in IEEE ICASSP, 2007.
[13] Q. Zhu, B. Chen, N. Morgan and A. Stolcke, On using MLP features in LVCSR, in ISCA Interspeech, 2004.
[14] H. Hermansky, Perceptual Linear Predictive (PLP) Analysis of Speech, JASA, 1990.
[15] J. Pinto, B. Yegnanarayana, H. Hermansky and M.M. Doss, Exploiting contextual information for improved phoneme recognition, in ISCA Interspeech, 2008.
[16] H. Hermansky and P. Fousek, Multi-resolution RASTA filtering for TANDEM-based ASR, in ISCA Interspeech, 2005.
[17] B. Kingsbury, Perceptually-inspired signal processing strategies for robust speech recognition in reverberant environments, Ph.D. thesis, University of California, Berkeley, 1998.
[18] T. Hain et al., The 2005 AMI system for the transcription of speech in meetings, NIST RT05 Workshop, 2005.
[19] The NIST Rich Transcription Spring 2005 Evaluation, Online Web Link:
[20] D. Moore et al., Juicer: A weighted finite state transducer speech decoder, Lecture Notes in Computer Science, 2006.
[21] H. Lin, L. Deng, D. Yu, Y. Gong, A. Acero and C. Lee, A study on Multilingual Acoustic Modeling for Large Vocabulary ASR, in IEEE ICASSP, 2009.
[22] L. Burget et al., Multilingual Acoustic Modeling for Speech Recognition based on Subspace Gaussian Mixture Models, in IEEE ICASSP, 2010.
[23] S. Thomas, S. Ganapathy and H. Hermansky, Cross-lingual and Multi-stream Posterior Features for Low-resource LVCSR Systems, in ISCA Interspeech, 2010.
[24] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel and P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, 2010.
[25] H. Hermansky, S. Timberwala and M. Pavel, Towards ASR on partially corrupted speech, in ICSLP, 1996.
[26] T. Chi, P. Ru and S.A. Shamma, Multiresolution spectrotemporal analysis of complex sounds, JASA, 2005.
[27] N. Mesgarani, S. Thomas and H. Hermansky, Toward Optimizing Stream Fusion in Multistream Recognition of Speech, JASA Express Letters, 2011.


MPEG Unified Speech and Audio Coding Enabling Efficient Coding of both Speech and Music ISO/IEC MPEG USAC Unified Speech and Audio Coding MPEG Unified Speech and Audio Coding Enabling Efficient Coding of both Speech and Music The standardization of MPEG USAC in ISO/IEC is now in its final

More information

Automatic Evaluation Software for Contact Centre Agents voice Handling Performance

Automatic Evaluation Software for Contact Centre Agents voice Handling Performance International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 Automatic Evaluation Software for Contact Centre Agents voice Handling Performance K.K.A. Nipuni N. Perera,

More information

VEHICLE TRACKING USING ACOUSTIC AND VIDEO SENSORS

VEHICLE TRACKING USING ACOUSTIC AND VIDEO SENSORS VEHICLE TRACKING USING ACOUSTIC AND VIDEO SENSORS Aswin C Sankaranayanan, Qinfen Zheng, Rama Chellappa University of Maryland College Park, MD - 277 {aswch, qinfen, rama}@cfar.umd.edu Volkan Cevher, James

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Lecture 12: An Overview of Speech Recognition

Lecture 12: An Overview of Speech Recognition Lecture : An Overview of peech Recognition. Introduction We can classify speech recognition tasks and systems along a set of dimensions that produce various tradeoffs in applicability and robustness. Isolated

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering

More information

Separation and Classification of Harmonic Sounds for Singing Voice Detection

Separation and Classification of Harmonic Sounds for Singing Voice Detection Separation and Classification of Harmonic Sounds for Singing Voice Detection Martín Rocamora and Alvaro Pardo Institute of Electrical Engineering - School of Engineering Universidad de la República, Uruguay

More information

On sequence kernels for SVM classification of sets of vectors: application to speaker verification

On sequence kernels for SVM classification of sets of vectors: application to speaker verification On sequence kernels for SVM classification of sets of vectors: application to speaker verification Major part of the Ph.D. work of In collaboration with Jérôme Louradour Francis Bach (ARMINES) within E-TEAM

More information

Speech and Network Marketing Model - A Review

Speech and Network Marketing Model - A Review Jastrzȩbia Góra, 16 th 20 th September 2013 APPLYING DATA MINING CLASSIFICATION TECHNIQUES TO SPEAKER IDENTIFICATION Kinga Sałapa 1,, Agata Trawińska 2 and Irena Roterman-Konieczna 1, 1 Department of Bioinformatics

More information

THE goal of Speaker Diarization is to segment audio

THE goal of Speaker Diarization is to segment audio 1 The ICSI RT-09 Speaker Diarization System Gerald Friedland* Member IEEE, Adam Janin, David Imseng Student Member IEEE, Xavier Anguera Member IEEE, Luke Gottlieb, Marijn Huijbregts, Mary Tai Knox, Oriol

More information

MISSING FEATURE RECONSTRUCTION AND ACOUSTIC MODEL ADAPTATION COMBINED FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

MISSING FEATURE RECONSTRUCTION AND ACOUSTIC MODEL ADAPTATION COMBINED FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION MISSING FEATURE RECONSTRUCTION AND ACOUSTIC MODEL ADAPTATION COMBINED FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION Ulpu Remes, Kalle J. Palomäki, and Mikko Kurimo Adaptive Informatics Research Centre,

More information

Multisensor Data Fusion and Applications

Multisensor Data Fusion and Applications Multisensor Data Fusion and Applications Pramod K. Varshney Department of Electrical Engineering and Computer Science Syracuse University 121 Link Hall Syracuse, New York 13244 USA E-mail: varshney@syr.edu

More information

Speech Signal Processing: An Overview

Speech Signal Processing: An Overview Speech Signal Processing: An Overview S. R. M. Prasanna Department of Electronics and Electrical Engineering Indian Institute of Technology Guwahati December, 2012 Prasanna (EMST Lab, EEE, IITG) Speech

More information

Leveraging Large Amounts of Loosely Transcribed Corporate Videos for Acoustic Model Training

Leveraging Large Amounts of Loosely Transcribed Corporate Videos for Acoustic Model Training Leveraging Large Amounts of Loosely Transcribed Corporate Videos for Acoustic Model Training Matthias Paulik and Panchi Panchapagesan Cisco Speech and Language Technology (C-SALT), Cisco Systems, Inc.

More information

ADVANCES IN ARABIC BROADCAST NEWS TRANSCRIPTION AT RWTH. David Rybach, Stefan Hahn, Christian Gollan, Ralf Schlüter, Hermann Ney

ADVANCES IN ARABIC BROADCAST NEWS TRANSCRIPTION AT RWTH. David Rybach, Stefan Hahn, Christian Gollan, Ralf Schlüter, Hermann Ney ADVANCES IN ARABIC BROADCAST NEWS TRANSCRIPTION AT RWTH David Rybach, Stefan Hahn, Christian Gollan, Ralf Schlüter, Hermann Ney Human Language Technology and Pattern Recognition Computer Science Department,

More information

An Arabic Text-To-Speech System Based on Artificial Neural Networks

An Arabic Text-To-Speech System Based on Artificial Neural Networks Journal of Computer Science 5 (3): 207-213, 2009 ISSN 1549-3636 2009 Science Publications An Arabic Text-To-Speech System Based on Artificial Neural Networks Ghadeer Al-Said and Moussa Abdallah Department

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Keywords: Image complexity, PSNR, Levenberg-Marquardt, Multi-layer neural network.

Keywords: Image complexity, PSNR, Levenberg-Marquardt, Multi-layer neural network. Global Journal of Computer Science and Technology Volume 11 Issue 3 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc. (USA) Online ISSN: 0975-4172

More information

Gender Identification using MFCC for Telephone Applications A Comparative Study

Gender Identification using MFCC for Telephone Applications A Comparative Study Gender Identification using MFCC for Telephone Applications A Comparative Study Jamil Ahmad, Mustansar Fiaz, Soon-il Kwon, Maleerat Sodanil, Bay Vo, and * Sung Wook Baik Abstract Gender recognition is

More information

Weighting and Normalisation of Synchronous HMMs for Audio-Visual Speech Recognition

Weighting and Normalisation of Synchronous HMMs for Audio-Visual Speech Recognition ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing 27 (AVSP27) Hilvarenbeek, The Netherlands August 31 - September 3, 27 Weighting and Normalisation of Synchronous HMMs for

More information

How to Improve the Sound Quality of Your Microphone

How to Improve the Sound Quality of Your Microphone An Extension to the Sammon Mapping for the Robust Visualization of Speaker Dependencies Andreas Maier, Julian Exner, Stefan Steidl, Anton Batliner, Tino Haderlein, and Elmar Nöth Universität Erlangen-Nürnberg,

More information

Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus

Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus Yousef Ajami Alotaibi 1, Mansour Alghamdi 2, and Fahad Alotaiby 3 1 Computer Engineering Department, King Saud University,

More information

SUCCESSFUL PREDICTION OF HORSE RACING RESULTS USING A NEURAL NETWORK

SUCCESSFUL PREDICTION OF HORSE RACING RESULTS USING A NEURAL NETWORK SUCCESSFUL PREDICTION OF HORSE RACING RESULTS USING A NEURAL NETWORK N M Allinson and D Merritt 1 Introduction This contribution has two main sections. The first discusses some aspects of multilayer perceptrons,

More information

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability Classification of Fingerprints Sarat C. Dass Department of Statistics & Probability Fingerprint Classification Fingerprint classification is a coarse level partitioning of a fingerprint database into smaller

More information

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS B.K. Mohan and S. N. Ladha Centre for Studies in Resources Engineering IIT

More information

Annotated bibliographies for presentations in MUMT 611, Winter 2006

Annotated bibliographies for presentations in MUMT 611, Winter 2006 Stephen Sinclair Music Technology Area, McGill University. Montreal, Canada Annotated bibliographies for presentations in MUMT 611, Winter 2006 Presentation 4: Musical Genre Similarity Aucouturier, J.-J.

More information

Automatic slide assignation for language model adaptation

Automatic slide assignation for language model adaptation Automatic slide assignation for language model adaptation Applications of Computational Linguistics Adrià Agustí Martínez Villaronga May 23, 2013 1 Introduction Online multimedia repositories are rapidly

More information

Solutions to Exam in Speech Signal Processing EN2300

Solutions to Exam in Speech Signal Processing EN2300 Solutions to Exam in Speech Signal Processing EN23 Date: Thursday, Dec 2, 8: 3: Place: Allowed: Grades: Language: Solutions: Q34, Q36 Beta Math Handbook (or corresponding), calculator with empty memory.

More information

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29. Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

Thirukkural - A Text-to-Speech Synthesis System

Thirukkural - A Text-to-Speech Synthesis System Thirukkural - A Text-to-Speech Synthesis System G. L. Jayavardhana Rama, A. G. Ramakrishnan, M Vijay Venkatesh, R. Murali Shankar Department of Electrical Engg, Indian Institute of Science, Bangalore 560012,

More information

Comparison Between Multilayer Feedforward Neural Networks and a Radial Basis Function Network to Detect and Locate Leaks in Pipelines Transporting Gas

Comparison Between Multilayer Feedforward Neural Networks and a Radial Basis Function Network to Detect and Locate Leaks in Pipelines Transporting Gas A publication of 1375 CHEMICAL ENGINEERINGTRANSACTIONS VOL. 32, 2013 Chief Editors:SauroPierucci, JiříJ. Klemeš Copyright 2013, AIDIC ServiziS.r.l., ISBN 978-88-95608-23-5; ISSN 1974-9791 The Italian Association

More information

Strategies for Training Large Scale Neural Network Language Models

Strategies for Training Large Scale Neural Network Language Models Strategies for Training Large Scale Neural Network Language Models Tomáš Mikolov #1, Anoop Deoras 2, Daniel Povey 3, Lukáš Burget #4, Jan Honza Černocký #5 # Brno University of Technology, Speech@FIT,

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Novelty Detection in image recognition using IRF Neural Networks properties

Novelty Detection in image recognition using IRF Neural Networks properties Novelty Detection in image recognition using IRF Neural Networks properties Philippe Smagghe, Jean-Luc Buessler, Jean-Philippe Urban Université de Haute-Alsace MIPS 4, rue des Frères Lumière, 68093 Mulhouse,

More information

ARMORVOX IMPOSTORMAPS HOW TO BUILD AN EFFECTIVE VOICE BIOMETRIC SOLUTION IN THREE EASY STEPS

ARMORVOX IMPOSTORMAPS HOW TO BUILD AN EFFECTIVE VOICE BIOMETRIC SOLUTION IN THREE EASY STEPS ARMORVOX IMPOSTORMAPS HOW TO BUILD AN EFFECTIVE VOICE BIOMETRIC SOLUTION IN THREE EASY STEPS ImpostorMaps is a methodology developed by Auraya and available from Auraya resellers worldwide to configure,

More information

CROP CLASSIFICATION WITH HYPERSPECTRAL DATA OF THE HYMAP SENSOR USING DIFFERENT FEATURE EXTRACTION TECHNIQUES

CROP CLASSIFICATION WITH HYPERSPECTRAL DATA OF THE HYMAP SENSOR USING DIFFERENT FEATURE EXTRACTION TECHNIQUES Proceedings of the 2 nd Workshop of the EARSeL SIG on Land Use and Land Cover CROP CLASSIFICATION WITH HYPERSPECTRAL DATA OF THE HYMAP SENSOR USING DIFFERENT FEATURE EXTRACTION TECHNIQUES Sebastian Mader

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Automatic Transcription of Conversational Telephone Speech

Automatic Transcription of Conversational Telephone Speech IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 6, NOVEMBER 2005 1173 Automatic Transcription of Conversational Telephone Speech Thomas Hain, Member, IEEE, Philip C. Woodland, Member, IEEE,

More information

EFFECTS OF BACKGROUND DATA DURATION ON SPEAKER VERIFICATION PERFORMANCE

EFFECTS OF BACKGROUND DATA DURATION ON SPEAKER VERIFICATION PERFORMANCE Uludağ Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi, Cilt 18, Sayı 1, 2013 ARAŞTIRMA EFFECTS OF BACKGROUND DATA DURATION ON SPEAKER VERIFICATION PERFORMANCE Cemal HANİLÇİ * Figen ERTAŞ * Abstract:

More information

APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA

APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA Tuija Niemi-Laitinen*, Juhani Saastamoinen**, Tomi Kinnunen**, Pasi Fränti** *Crime Laboratory, NBI, Finland **Dept. of Computer

More information

AS indicated by the growing number of participants in

AS indicated by the growing number of participants in 1960 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 7, SEPTEMBER 2007 State-of-the-Art Performance in Text-Independent Speaker Verification Through Open-Source Software Benoît

More information

Biometric Authentication using Online Signatures

Biometric Authentication using Online Signatures Biometric Authentication using Online Signatures Alisher Kholmatov and Berrin Yanikoglu alisher@su.sabanciuniv.edu, berrin@sabanciuniv.edu http://fens.sabanciuniv.edu Sabanci University, Tuzla, Istanbul,

More information

203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

More information

Convention Paper Presented at the 135th Convention 2013 October 17 20 New York, USA

Convention Paper Presented at the 135th Convention 2013 October 17 20 New York, USA Audio Engineering Society Convention Paper Presented at the 135th Convention 2013 October 17 20 New York, USA This Convention paper was selected based on a submitted abstract and 750-word precis that have

More information

Objective Speech Quality Measures for Internet Telephony

Objective Speech Quality Measures for Internet Telephony Objective Speech Quality Measures for Internet Telephony Timothy A. Hall National Institute of Standards and Technology 100 Bureau Drive, STOP 8920 Gaithersburg, MD 20899-8920 ABSTRACT Measuring voice

More information

Accurate and robust image superresolution by neural processing of local image representations

Accurate and robust image superresolution by neural processing of local image representations Accurate and robust image superresolution by neural processing of local image representations Carlos Miravet 1,2 and Francisco B. Rodríguez 1 1 Grupo de Neurocomputación Biológica (GNB), Escuela Politécnica

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information