Data driven design of filter bank for speech recognition

Lukáš Burget (1,2) and Hynek Heřmanský (2,3)

(1) Oregon Graduate Institute, Anthropic Signal Processing Group, 2 NW Walker Rd., Beaverton, Oregon, USA, {hynek,lukas}@ece.ogi.edu
(2) International Computer Science Institute, 1947 Center Street Suite 6, Berkeley, CA, USA, hynek@icsi.berkeley.edu
(3) Brno Univ. of Technology, Inst. of Radioelectronics, Purkyňova 118, 612, Brno, Czech Republic, burget@urel.fee.vutbr.cz

Abstract. The filter bank approach is commonly used in the feature extraction phase of speech recognition (e.g. Mel frequency cepstral coefficients). The filter bank modifies the magnitude spectrum according to physiological and psychological findings. However, since the mechanism of the human auditory system is not fully understood, the optimal filter bank parameters are not known. This work presents a method where the filter bank, optimized for discriminability between phonemes, is derived directly from phonetically labeled speech data using Linear Discriminant Analysis. This work can be seen as further evidence that incorporating psychoacoustic findings into feature extraction can lead to better recognition performance.

1 Introduction

Feature extraction is an important part of the speech recognition process, in which the input waveform is prepared for the subsequent pattern classification. While classification is usually based on stochastic approaches where models are trained on data, feature extraction is generally based on knowledge and beliefs. Current methods of feature extraction are mostly based on the short-term Fourier spectrum and its changes in time. Auditory-like modifications inspired by physiological and psychological findings are performed on the spectrum of each speech frame in the sequence. Mel frequency cepstral coefficients [2] are a commonly used feature extraction method in which energies in the spectrum are integrated by a set of band-limited triangular weighting functions (filter bank).
These weighting functions are distributed equidistantly over the mel scale, following psychoacoustic findings that spectral resolution is better at lower frequencies than at higher frequencies. The log of the integrated spectral energies is taken (which corresponds to human perception of loudness), and finally a projection onto cosine bases is performed. However, since the mechanism of the human auditory system is not fully understood, the optimal system for feature extraction is not known. Moreover, psychoacoustic findings often describe limitations of the human auditory system, and we do not know whether modeling those limitations is useful for speech recognition. This work presents a method where the filter bank is derived directly from phonetically labeled speech data. The method yields both the frequency warping and the shape of the individual weighting functions of the filter bank.

2 Linear Discriminant Analysis

The method is based on Linear Discriminant Analysis (LDA) proposed by Hunt [3]. LDA looks for a linear transform that reduces the dimension of the input data while preserving the information important for linear discrimination among input vectors belonging to different classes. The output of LDA is a set of linearly independent vectors which form the basis of a linear transform and which are sorted by their importance for discrimination among the classes. Since the importance of each basis vector is known, we can keep only the first few basis vectors, which preserve almost all of the variability in the data that matters for discriminability. In other words, the resulting transformation matrix contains only the first few columns of the matrix obtained by LDA.

Fig. 1. Linear discriminant analysis for 2-dimensional data (axes x, y and z; classes C_1 and C_2 with mean vectors m_1 and m_2)

Figure 1 demonstrates the effect of LDA for 2-dimensional data vectors which belong to two classes. The grey and the empty ellipses represent the data distributions of two different classes $C_1$ and $C_2$ with mean vectors $m_1$ and $m_2$. The axes X and Y are the coordinates of the original space. A large overlap of the class distributions can be seen along both of these original coordinates. The axis Z shows the direction obtained by LDA. The classes are well separated after projection onto this direction. Since this example deals with just two classes, and since LDA assumes that the distributions of all classes are Gaussian with the same covariance matrix, no other direction gives better discrimination.

The basis vectors of the LDA transform are given by the eigenvectors of the matrix $\Sigma_{wc}^{-1} \Sigma_{ac}$. The within-class covariance matrix $\Sigma_{wc}$ represents the unwanted variability in the data and is computed as the weighted mean of the covariance matrices of the classes:

$\Sigma_{wc} = E[\Sigma_p]$

where $\Sigma_p$ is the covariance matrix of a particular class. The across-class covariance matrix $\Sigma_{ac}$ represents the wanted variability in the data and is computed as an estimate of the covariance matrix of the class mean vectors:

$\Sigma_{ac} = E[(\mu_p - \mu)(\mu_p - \mu)^T]$

where $\mu_p$ is the mean vector of a particular class and $\mu$ is the global mean vector. The eigenvalue associated with an eigenvector represents the amount of variability (necessary for discriminability) preserved by projecting the input vectors onto that eigenvector. If LDA is to be used for dimension reduction, only the few eigenvectors corresponding to the highest eigenvalues are used.

3 Filter bank derived from data

The filter bank is derived directly from phonetically labeled speech data using the LDA described in the previous section. In this case the magnitude Fourier spectra of all training data frames are used directly to compute the across-class and within-class covariance matrices.
In our speech recognition task, we want to distinguish between different phonemes, so spectra of speech frames labeled with the same phoneme belong to one class. Examples of across-class and within-class covariance matrices derived this way from speech data from the TIMIT database are shown in figure 2. Half of the symmetric magnitude spectrum (129 points) was used as the input vector for deriving these covariance matrices. Figure 3 shows the first 5 LDA spectral bases, given by the eigenvectors of the matrix $\Sigma_{wc}^{-1} \Sigma_{ac}$. The eigenvalues in figure 3a indicate that almost all of the variability in the data important for class separability is preserved by projection onto only the first few basis vectors. The linear transform is performed by multiplying an input vector by a matrix M whose columns are the basis vectors. In our case, we choose only the first 13 basis vectors, so the transform matrix M has 129 rows and 13 columns.
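The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `class_covariances` and `lda_bases` are hypothetical, and weighting each class by its share of the frames is an assumption (the text only says "weighted mean"). The eigenvectors of $\Sigma_{wc}^{-1} \Sigma_{ac}$ are obtained through a Cholesky factor of $\Sigma_{wc}$, which assumes that $\Sigma_{wc}$ is positive definite (more frames than spectral points).

```python
import numpy as np


def class_covariances(spectra, labels):
    """Return (sigma_wc, sigma_ac) for frames `spectra` (n_frames, n_bins).

    Assumption: each class is weighted by its fraction of the frames.
    """
    spectra = np.asarray(spectra, dtype=float)
    labels = np.asarray(labels)
    n_frames, n_bins = spectra.shape
    global_mean = spectra.mean(axis=0)
    sigma_wc = np.zeros((n_bins, n_bins))
    sigma_ac = np.zeros((n_bins, n_bins))
    for c in np.unique(labels):
        frames = spectra[labels == c]
        weight = len(frames) / n_frames
        class_mean = frames.mean(axis=0)
        centered = frames - class_mean
        # Covariance of this class, added with its weight.
        sigma_wc += weight * (centered.T @ centered) / len(frames)
        # Outer product of the class-mean offset from the global mean.
        offset = (class_mean - global_mean)[:, None]
        sigma_ac += weight * (offset @ offset.T)
    return sigma_wc, sigma_ac


def lda_bases(sigma_wc, sigma_ac, n_bases=13):
    """Eigenvalues (descending) and first `n_bases` eigenvectors of
    inv(sigma_wc) @ sigma_ac, returned as the columns of M."""
    chol = np.linalg.cholesky(sigma_wc)        # sigma_wc = chol @ chol.T
    tmp = np.linalg.solve(chol, sigma_ac)      # inv(chol) @ sigma_ac
    sym = np.linalg.solve(chol, tmp.T)         # inv(chol) @ sigma_ac @ inv(chol).T
    sym = (sym + sym.T) / 2.0                  # remove rounding asymmetry
    eigvals, u = np.linalg.eigh(sym)           # ascending order
    eigvals, u = eigvals[::-1], u[:, ::-1]     # sort descending
    vecs = np.linalg.solve(chol.T, u)          # map back: v = inv(chol).T @ u
    return eigvals, vecs[:, :n_bases]
```

With 129-point magnitude spectra and `n_bases=13`, the returned matrix has 129 rows and 13 columns, matching the transform matrix M described in the text; a frame `x` is then reduced with `x @ M`.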

Fig. 2. Across-class and within-class covariance matrix computed from magnitude spectrum

Fig. 3. Basis derived using LDA from magnitude spectrum: a) eigenvalues, b) 1st eigenvector, c) 2nd eigenvector, d) 3rd eigenvector, e) 4th eigenvector, f) 5th eigenvector

3.1 Smoothing of speech spectra

The projection of the magnitude spectrum of one speech frame onto the selected basis vectors results in a new vector (13 points) which should contain almost the same information for correct recognition as the original spectrum. Since the basis vectors are linearly independent, it is possible to obtain another transform which projects the reduced vector back into the original space, i.e. a spectrum 129 points long. This transform is given by the pseudoinverse of the transform matrix, denoted $M^{-1}$. We obtain the final transform by joining (multiplying) the two matrices, $M M^{-1}$. This transform projects the magnitude spectrum onto its smoothed version, from which the information useless for discriminating among phonemes has been removed. Each column of the final transformation matrix represents a weighting function for integrating a band of frequencies around the point corresponding to the index of that column. Every 5th of these weighting functions is shown in figure 4a. The resulting weighting functions for integrating lower frequencies are very narrow (integrating only a few points of the spectrum and preserving more detail), while the functions integrating higher frequencies are much wider. This also agrees with psychoacoustic findings about human frequency resolution.

3.2 Deriving the filter bank

It is also possible to derive the frequency warping by measuring and integrating the bandwidths (widths) of consecutive weighting functions (figures 4b and 4c). The smoothed spectrum can be represented by only some of its samples without losing any information. This means that we can keep only a few of the weighting functions and project the original spectrum onto them. Their selection must follow the derived warping. This way we end up with a set of weighting functions which are very similar to the commonly used Mel filter bank (figure 4d).
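The smoothing transform of section 3.1 can be illustrated with a short NumPy sketch. It is an illustration under stated assumptions rather than the authors' code: the function name `smoothing_matrix` is hypothetical, spectra are treated as row vectors multiplied by M (as in the text), and the pseudoinverse is computed with `numpy.linalg.pinv`.

```python
import numpy as np


def smoothing_matrix(M):
    """Return S = M @ pinv(M), a square (n_bins, n_bins) matrix.

    M holds the selected basis vectors as columns (n_bins, n_bases).
    Column k of S is the weighting function that produces point k
    of the smoothed spectrum.
    """
    return M @ np.linalg.pinv(M)


def smooth_spectrum(spectrum, S):
    """Project a magnitude spectrum (row vector) onto its smoothed version."""
    return np.asarray(spectrum, dtype=float) @ S
```

Two properties follow from the definition of the pseudoinverse and are easy to check numerically: smoothing twice changes nothing (`S @ S` equals `S`), and the smoothed spectrum has the same reduced representation as the original (`smooth_spectrum(x, S) @ M` equals `x @ M`). The second property is the sense in which smoothing discards only what the selected basis vectors do not capture.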
4 Limitations of the method and conclusions

Our experience shows that recognizers based on feature extraction inspired by psychoacoustic findings about the nonuniform frequency resolution of human hearing can perform better than those based on the pure short-term Fourier spectrum. This work can be seen as further evidence that incorporating those psychoacoustic findings into feature extraction leads to better separability among phonemes in a low-dimensional feature space and also to better recognition performance. However, the LDA technique expects that the data belonging to the individual classes have Gaussian distributions with the same covariance, and that the mean values of the classes also obey a Gaussian distribution. Of course, this is not true for magnitude spectra of speech. The quest for the optimal filter bank for speech recognition is therefore still open.

Fig. 4. Filter bank and warping derived using LDA: a) every 5th weighting function, b) bandwidth of weighting functions, c) estimated warping of spectrum (axis: warped spectrum, 129 ~ 4 kHz), d) derived filter bank

References

1. B. Gold and N. Morgan. Speech and Audio Signal Processing. New York.
2. S. B. Davis and P. Mermelstein. Comparison of parametric representation for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. on Acoustics, Speech & Signal Processing, vol. 28, no. 4.
3. M. J. Hunt. A statistical approach to metrics for word and syllable recognition. J. Acoust. Soc. Am., vol. 66(S1), S35(A).
4. N. Malayath. Data-Driven Methods for Extracting Features from Speech. Ph.D. thesis, Oregon Graduate Institute, Portland, USA.
5. H. Hermansky and N. Malayath. Spectral Basis Functions from Discriminant Analysis. In Proceedings ICSLP 98, Sydney, Australia, November.
6. L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Signal Processing. Prentice Hall, Englewood Cliffs, NJ.
7. S. Young. The HTK Book. Entropics Ltd., 1999.