Tamkang Journal of Science and Engineering, Vol. 11, No. 4, pp. 357–366 (2008)

Robust Speaker Identification System Based on Two-Stage Vector Quantization

Wan-Chen Chen 1,2, Ching-Tang Hsieh 2* and Chih-Hsu Hsu 3

1 Department of Electronic Engineering, St. John's University, Taipei, Taiwan 251, R.O.C.
2 Department of Electrical Engineering, Tamkang University, Tamsui, Taiwan 251, R.O.C.
3 Department of Information Technology, Ching Kuo Institute of Management and Health, Keelung, Taiwan 203, R.O.C.
*Corresponding author. E-mail: hsieh@ee.tku.edu.tw

Abstract

This paper presents an effective method for speaker identification. Based on the wavelet transform, the input speech signal is decomposed into several frequency bands, and the linear predictive cepstral coefficients (LPCC) of each band are calculated. Furthermore, the cepstral mean normalization technique is applied to all computed features in order to provide similar parameter statistics in all acoustic environments. To effectively utilize these multi-band speech features, we propose a multi-band 2-stage vector quantization (VQ) as the recognition model, in which different 2-stage VQ classifiers are applied independently to each band and the errors of all 2-stage VQ classifiers are combined to yield a total error and a global recognition decision. Finally, the KING speech database is used to evaluate the proposed method for text-independent speaker identification. The experimental results show that the proposed method gives better performance than other recognition models proposed previously in both clean and noisy environments.

Key Words: Speaker Identification, Wavelet Transform, Linear Predictive Cepstral Coefficient (LPCC), 2-Stage Vector Quantization

1. Introduction

In general, speaker recognition can be divided into two tasks: speaker verification and speaker identification. Speaker verification refers to the process of determining whether or not the speech samples belong to a specific speaker. In speaker identification, the goal is to determine which one of a group of known voices best matches the input voice sample. In both tasks, the speech can be either text-dependent or text-independent. Text-dependent means that the text used in the test system must be the same as that used in the training system, while text-independent means that there is no limitation on the text used in the test system.

Certainly, the method used to extract and model the speaker-dependent characteristics of the speech signal seriously affects the performance of a speaker recognition system. Much research has been done on feature extraction for speech. The linear predictive cepstral coefficients (LPCC) have been used because of their simplicity and effectiveness in speaker/speech recognition [1,2]. Another widely used set of feature parameters, the mel-scale frequency cepstral coefficients (MFCC) [3], is calculated with a filter-bank approach in which the set of filters has equal bandwidths with respect to the mel-scale frequencies. This method is based on the fact that human perception of the frequency content of sounds does not follow a linear scale. These two most commonly used feature extraction techniques do not provide an invariant parameterization of speech; the representation of the speech signal tends to change under various noise conditions. The performance of these speaker identification systems may therefore be severely degraded when a mismatch between the training and testing environments occurs. Various types of speech enhancement and noise elimination techniques have been applied to feature extraction. Typically, Furui [4] used the cepstral mean normalization (CMN) technique to eliminate the channel bias by subtracting the global average cepstral vector from each cepstral vector.

In past studies on recognition models, the hidden Markov model (HMM) [5,6], the Gaussian mixture model (GMM) [7–10], and vector quantization (VQ) [11–13] were used to perform speaker recognition. HMM [5,6] is widely used in speech recognition and is also commonly used in text-dependent speaker verification. GMM [7–10] provides a probabilistic model of the underlying sounds of a person's voice. It is computationally more efficient than HMM and has been widely used in text-independent speaker recognition. It has been shown that VQ [11–13] is very effective for speaker recognition. Although the performance of VQ is not as good as that of GMM [7], VQ is computationally more efficient. VQ is also widely used in the coding of speech signals, and multi-stage VQ (MSVQ) [14] can achieve very low bit rates and storage complexity in comparison with conventional VQ.

Conventionally, feature extraction is carried out by computing acoustic feature vectors over the full band of the spectral representation of speech. The major drawback of this approach is that even a partial band-limited noise corruption affects all feature vector components. The multi-band approach deals with this problem by performing acoustic feature analysis independently on a set of frequency subbands [15]. Since the resulting coefficients are computed independently, a band-limited noise signal does not spread over the entire feature space. In our previous works [16,17], we proposed a multi-band feature extraction method in which the LPCC features extracted from the full band and different subbands are combined to form a single feature vector. This feature extraction method was evaluated in a speaker identification system using VQ and GMM as identifiers. The experimental results showed that this multi-band feature was more effective and robust than the full-band LPCC and MFCC features, particularly in noisy environments. In other previous works, we proposed the multi-layer eigencodebook VQ (MLECVQ) [18] and likelihood combination GMM (LCGMM) [19] recognition models. The general idea was to split the whole frequency band into the full band and a few subbands, on which different eigencodebook vector quantization (ECVQ) and GMM classifiers were independently applied and then recombined to yield global likelihood scores and a global recognition decision. The experimental results showed that the MLECVQ and LCGMM recognition models were more effective and robust than the conventional GMM using full-band LPCC and MFCC features, particularly in noisy environments.

In this study, the multi-band linear predictive cepstral coefficients (MBLPCC) proposed previously [16–19] were used as the front end of the speaker identification system. Then, cepstral mean normalization was applied to these multi-band speech features to provide similar parameter statistics in all acoustic environments.
To effectively utilize these multi-band speech features, we propose a multi-band 2-stage VQ as the recognition model. Different 2-stage VQ classifiers are applied independently to each band, and the errors of all 2-stage VQ classifiers are combined to yield the total error and a global recognition decision. Evaluation shows that the proposed method outperforms the MLECVQ and LCGMM recognition models proposed previously.

This paper is organized as follows. The proposed algorithm for extracting speech features is described in section 2. The multi-band speaker recognition model is presented in section 3. Experimental results and comparisons with the baseline GMM, MLECVQ, and LCGMM models are presented in section 4. Concluding remarks are made in section 5.

2. Multi-Band Features Based on Wavelet Transform

The recent interest in the multi-band feature extraction approach has mainly been attributed to Allen's paper [20], which argues that the human auditory system processes features from different subbands independently and merges them at some higher point of processing to produce a final decision. The advantages of using multi-band processing are manifold and have been described in earlier publications [21–23].

The major drawback of a pure subband-based approach may be that information about the correlation between different subbands is lost. Therefore, we suggest that the full-band features not be ignored but be combined with the subband features to maximize recognition accuracy. A similar approach that combined information from the full band and subbands at the recognition stage was found to improve recognition performance [23]. It is not a trivial matter to decide at which temporal level the subband features should be combined. In the multi-band approaches of [21,22], a different recognizer was applied to each band, and likelihood recombination was done at the HMM state, phone, or word level to yield a global score and a global recognition decision. In our approach, both the full-band and the subband features are used in the recognition model.

Based on time-frequency multi-resolution analysis, the effective and robust MBLPCC features are used as the front end of the speaker identification system. First, the LPCC is extracted from the full-band input signal. Then the wavelet transform is applied to decompose the input signal into two frequency subbands: a lower frequency subband and a higher frequency subband. To capture the characteristics of an individual speaker, the LPCC of the lower frequency subband is calculated. There are two main reasons for using the LPCC parameters: their good representation of the envelope of the speech spectrum of vowels, and their simplicity. Based on this mechanism, we can easily extract multi-resolution features from all lower frequency subband signals simply by iteratively applying the wavelet transform to decompose the lower frequency subband signals, as depicted in Figure 1. As shown in Figure 1, the wavelet transform can be realized with a pair of finite impulse response (FIR) filters, h and g, which are low-pass and high-pass filters respectively, followed by a down-sampling operation (↓2). The down-sampling operation discards the odd-numbered samples of a sample sequence after filtering is performed.

The schematic flow of the proposed feature extraction method is shown in Figure 2. After the full-band LPCC is extracted from the input speech signal, the discrete wavelet transform (DWT) is applied to decompose the input signal into a lower frequency subband, and the subband LPCC is extracted from this lower frequency subband. The recursive decomposition process enables us to easily acquire the multi-band features of the speech signal. The number of subbands depends on the level of the decomposition process. If speech signals band-limited from 0 to 4000 Hz are recursively decomposed twice, three band signals, 0–4000 Hz, 0–2000 Hz, and 0–1000 Hz, will be generated. Since the spectra of the three bands overlap in the lower frequency region, the proposed multi-band feature extraction method emphasizes the spectrum of the speech signal in the low-frequency region, similar to MFCC features.
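As a concrete illustration of this pipeline, the following Python sketch computes MBLPCC for a single analysis frame. The Haar analysis pair for the low-pass branch and the textbook Levinson-Durbin and LPC-to-cepstrum recursions are illustrative assumptions (the paper specifies neither its wavelet filters nor its LPC implementation), and the helper names are ours; in the paper's own setting, 20 LPC orders are computed and the first cepstral coefficient is discarded to obtain 19 dimensions.

import numpy as np

def levinson(r, order):
    # Levinson-Durbin recursion: autocorrelations r[0..order] -> LPC
    # polynomial A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] += k * a[i - 1:0:-1]  # RHS is a fresh temporary, safe in-place
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpcc(frame, order=20, n_ceps=19):
    # LPCC of one frame: LPC by Levinson-Durbin, then the standard
    # LPC-to-cepstrum recursion (valid here since n_ceps <= order).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = levinson(r, order)
    c = np.zeros(n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = a[m]
        for k in range(1, m):
            acc += (k / m) * c[k] * a[m - k]
        c[m] = -acc
    return c[1:]

def haar_lowband(x):
    # One DWT level on the approximation branch: Haar low-pass filter h,
    # then down-sampling by 2 (the (↓2) operation in Figure 1).
    n = len(x) - len(x) % 2
    return (x[0:n:2] + x[1:n:2]) / np.sqrt(2.0)

def mblpcc(frame, n_bands=3, order=20, n_ceps=19):
    # Full-band LPCC first, then the LPCC of each successively lower
    # subband obtained by recursive decomposition (Figure 2).
    feats, band = [], np.asarray(frame, dtype=float)
    for _ in range(n_bands):
        feats.append(lpcc(band, order, n_ceps))
        band = haar_lowband(band)
    return feats  # one n_ceps-dimensional vector per band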
Finally, cepstral mean normalization is applied to normalize the feature vectors so that their short-time means are zero:

    \hat{X}_k(t) = X_k(t) - \mu_k                                        (1)

where X_k(t) is the kth component of the feature vector at time (frame) t, and \mu_k is the mean of the kth component of the feature vectors of a specific speaker's utterance.

Figure 1. Two-band analysis tree for a discrete wavelet transform.
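In matrix form, Eq. (1) is a one-line operation; a minimal numpy rendering, assuming one T x D array of frame features per band, could be:

import numpy as np

def cmn(features):
    # Eq. (1): subtract the per-component mean over the utterance so that
    # every cepstral component has zero mean; applied separately to each
    # band's feature stream (features: T frames x D dimensions).
    return features - features.mean(axis=0)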

Figure 2. Feature extraction algorithm of the MBLPCC.

3. Multi-Band Speaker Recognition Models

3.1 2-Stage Vector Quantization

In MSVQ, each stage consists of a stage codebook. A reproduction of an input vector x is formed by selecting a stage code vector q_i from the codebook of each stage i and forming their sum, i.e.,

    \hat{x} = \sum_{i=1}^{P} q_i

where P denotes the number of stages. For simplicity, the MSVQ implemented in this paper is restricted to two stages. The basic structure of the 2-stage VQ is depicted in Figure 3. The first-stage quantizer Q_1 uses a codebook Y = {y_1, y_2, ..., y_N} and the second-stage quantizer Q_2 uses a codebook Z = {z_1, z_2, ..., z_M}.

In this paper, a sequential search procedure is used in the 2-stage VQ. The input vector x is quantized with the first-stage codebook, producing the selected first-stage code vector y = Q_1(x), where Q_i(·) denotes the ith-stage quantization operation. The code vector y in codebook Y nearest in Euclidean distance to the input x is found such that

    \|x - y\|^2 \le \|x - y_i\|^2,  i = 1, 2, ..., N                     (2)

Figure 3. Structure of the 2-stage VQ.

A residual vector e_1 is formed by subtracting y from x:

    e_1 = x - y                                                          (3)

The residual e_1 is then quantized with the code vector z = Q_2(e_1), which minimizes the Euclidean distance over all code vectors in codebook Z:

    \|e_1 - z\|^2 \le \|e_1 - z_j\|^2,  j = 1, 2, ..., M                 (4)

The input vector x is thus quantized to \hat{x}, where

    \hat{x} = y + z                                                      (5)

The error d(x, \hat{x}) between the original vector x and the reproduction vector \hat{x} of the 2-stage VQ equals the quantization error of the last (second) stage:

    d(x, \hat{x}) = \|x - \hat{x}\| = \|x - y - z\| = \|e_1 - z\|        (6)

In this paper, the stage codebooks for the sequential-search MSVQ are designed in the conventional way under a mean-squared error (MSE) overall distortion criterion, by applying the Linde-Buzo-Gray (LBG) algorithm [24] stage by stage to a given training set of input vectors. The LBG algorithm is first applied to the training set to design the first-stage codebook. A training set for the second stage is then constructed by pooling the residual vectors resulting from quantization of the original training set with the first-stage codebook. The second-stage codebook is then designed with the LBG algorithm applied to this derived residual training set.
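A sketch of this training and encoding procedure follows, with scipy's k-means (which minimizes the same MSE criterion) standing in for LBG and codebook sizes taken from the paper's best setting (N = 128, M = 32); the function names are ours, not the authors'.

import numpy as np
from scipy.cluster.vq import kmeans2

def train_two_stage_vq(train_vecs, n_first=128, n_second=32):
    # Stage-by-stage design: first-stage codebook on the training vectors,
    # second-stage codebook on the pooled first-stage residuals.
    Y, labels = kmeans2(train_vecs, n_first, minit="++")
    Z, _ = kmeans2(train_vecs - Y[labels], n_second, minit="++")
    return Y, Z

def encode_error(x, Y, Z):
    # Sequential search, Eqs. (2)-(6): quantize x with Y, quantize the
    # residual with Z, and return the final quantization error ||e1 - z||.
    y = Y[np.argmin(((Y - x) ** 2).sum(axis=1))]   # Eq. (2)
    e1 = x - y                                     # Eq. (3)
    z = Z[np.argmin(((Z - e1) ** 2).sum(axis=1))]  # Eq. (4)
    return np.linalg.norm(e1 - z)                  # Eq. (6)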

3.2 Multi-Band 2-Stage VQ Recognition Model

As described in Section 1, MSVQ is widely used in the coding of speech signals and can achieve very low bit rates and storage complexity in comparison with conventional VQ. Here, we use the 2-stage VQ as the classifier. Our multi-band scheme combines the errors of the independent 2-stage VQs for each band, as illustrated in Figure 4. We name this recognition model the multi-band 2-stage VQ. First, the input signal is decomposed into L subbands. Then the LPCC features extracted from each band are normalized to zero mean by the cepstral mean normalization technique. Finally, different 2-stage VQ classifiers are applied independently to each band, and the errors of all 2-stage VQ classifiers are combined to yield the total error and a global recognition decision.

For speaker identification, a group of S speakers is represented by the multi-band 2-stage VQs \lambda_1, \lambda_2, ..., \lambda_S. Assume a given test utterance X is divided into T frames, X = {X^{(1)}, X^{(2)}, ..., X^{(T)}}, and each frame X^{(j)} is decomposed into L subbands. Let X_i^{(j)} and \lambda_{ki} be the feature vector of the jth frame in band i and the associated 2-stage VQ of a specific speaker k in band i, respectively. The jth frame error d(X^{(j)}, \lambda_k) for the multi-band 2-stage VQ of a specific speaker k is the sum of the errors of all bands:

    d(X^{(j)}, \lambda_k) = \sum_{i=0}^{L} d(X_i^{(j)}, \lambda_{ki})    (7)

where i = 0 denotes the full band, so that d(X_0^{(j)}, \lambda_{k0}) is the jth frame error between the full-band feature and the associated 2-stage VQ for speaker k. The total error for the T-frame test utterance X is therefore

    d(X, \lambda_k) = \sum_{j=1}^{T} d(X^{(j)}, \lambda_k) = \sum_{j=1}^{T} \sum_{i=0}^{L} d(X_i^{(j)}, \lambda_{ki})    (8)

The utterance X is classified as belonging to the speaker \hat{S} with the minimum error:

    \hat{S} = \arg\min_{1 \le k \le S} d(X, \lambda_k)                   (9)

Figure 4. Structure of the multi-band 2-stage VQ model.
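Given the encode_error() helper sketched above, the decision rule of Eqs. (7)-(9) reduces to a few lines; the data layout here (a list of per-band frame arrays, and a per-band codebook pair per speaker) is an illustrative assumption.

import numpy as np

def utterance_error(frames_per_band, speaker_codebooks):
    # Eqs. (7)-(8): sum the 2-stage VQ errors over all bands and frames.
    # frames_per_band[i]: T x D array of band-i feature vectors;
    # speaker_codebooks[i]: this speaker's (Y, Z) codebook pair for band i.
    return sum(
        encode_error(x, Y, Z)
        for i, (Y, Z) in enumerate(speaker_codebooks)
        for x in frames_per_band[i]
    )

def identify(frames_per_band, all_speakers):
    # Eq. (9): the speaker whose multi-band 2-stage VQs yield the minimum
    # total error (all_speakers: list of per-band codebook lists).
    errors = [utterance_error(frames_per_band, s) for s in all_speakers]
    return int(np.argmin(errors))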

4. Experimental Results

This section presents experiments conducted to evaluate the application of the proposed multi-band 2-stage VQ to closed-set text-independent speaker identification. The first experiment studied the effect of the number of bands used in the proposed recognition model. The next experiment evaluated the effect of the number of code vectors in the first- and second-stage codebooks under both clean and noisy conditions. The last experiment compared the performance of the multi-band 2-stage VQ with that of the MLECVQ [18] and LCGMM [19] models proposed previously.

4.1 Database Description and Parameter Settings

The proposed methods were evaluated using the KING speech database [25] for closed-set text-independent speaker identification. The KING database is a collection of conversational speech from 51 male speakers. For each speaker there are 10 sections of conversational speech recorded at different times, each consisting of about 30 seconds of actual speech. The speech of each section was recorded locally using a microphone and was also transmitted over a long-distance telephone link, providing a high-quality (clean) version and a telephone-quality version of the speech. The speech signals were recorded at 8 kHz and 16 bits per sample. In our experiments, noisy speech was generated by adding Gaussian noise to the clean version at the desired signal-to-noise ratio (SNR). To eliminate silence segments from an utterance, simple segmentation based on the signal energy of each speech frame was applied.

All experiments were performed using five sections of speech from 20 speakers. For each speaker, 90 seconds of speech cut from four clean-version sections provided the training utterances. Typically, utterances longer than 2 seconds are needed to achieve adequate accuracy in speaker identification [7]. Hence, the remaining section was divided into non-overlapping segments 2 seconds in length, which provided the test utterances.

The average speaker identification performance was evaluated in the following manner. The identified speaker of each test segment was compared with the actual speaker of the test utterance, and the number of segments correctly identified was tabulated. These steps were repeated for all test segments from each speaker in the population of 20 speakers. The average identification rate was computed as

    identification rate = N_correct / N_total                            (10)

where N_correct is the number of correctly identified segments and N_total is the total number of test segments.

In all experiments conducted in this study, each frame of an analyzed utterance had 256 samples with 192 overlapping samples. Furthermore, 20 orders of LPCC were calculated for each frequency band and the first-order coefficient was discarded, leading to a 19-dimensional feature vector. For the multi-band approach, we used 2, 3, and 4 bands as follows: 2 bands: (0–4000), (0–2000) Hz; 3 bands: (0–4000), (0–2000), (0–1000) Hz; 4 bands: (0–4000), (0–2000), (0–1000), (0–500) Hz.

4.2 Contribution of the Multi-Band

As explained in section 2, the number of subbands depends on the decomposition level of the wavelet transform. The first experiment evaluated the effect of the number of bands used in the multi-band 2-stage VQ model. In this experiment, the 2-stage VQ in each band had 128 code vectors in the first stage and 32 code vectors in the second stage. The experimental results are shown in Figure 5. The single band using only the full-band signal had the poorest performance. The identification rate initially increased as the number of bands increased, and the best performance was achieved with three bands. Because the signals were decomposed into several frequency subbands and the spectra of the subbands overlapped in the lower frequency region, the success of the MBLPCC features can be attributed to the emphasis on the spectrum of the signal in the low-frequency region. Increasing the number of bands beyond three not only increased the computation time but also decreased the identification rate. In this case, the signals of the lowest frequency subband were located in the very low frequency region, which put too much emphasis on the lower frequency spectrum of speech. In addition, the number of samples within the lowest frequency subband was so small that the spectral characteristics of speech could not be estimated accurately. Consequently, the poor result in the lowest frequency subband degraded the system performance. The experimental results give a clear guide for choosing a good decomposition level for the wavelet transform.

Figure 5. Effect of the number of bands on the identification performance of the multi-band 2-stage VQ model with 128 code vectors in the first stage and 32 code vectors in the second stage, in a clean environment.
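Building on the identify() sketch from Section 3.2, the tabulation behind Eq. (10) is a short loop; test_segments and true_speakers are illustrative containers for the 2-second segments and their actual speaker labels.

def identification_rate(test_segments, true_speakers, all_speakers):
    # Eq. (10): the fraction of test segments whose identified speaker
    # matches the actual speaker of the utterance.
    n_correct = sum(
        identify(seg, all_speakers) == truth
        for seg, truth in zip(test_segments, true_speakers)
    )
    return n_correct / len(test_segments)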

4.3 Effects of the Number of Code Vectors in the First- and Second-Stage Codebooks

This experiment evaluated the effect of the number of code vectors in the first- and second-stage codebooks of the 3-band 2-stage VQ in both clean and noisy environments. The experimental results are shown in Table 1. The 3-band 2-stage VQ with 128 code vectors in the first stage and 32 code vectors in the second stage achieved the best identification rate except under the 10 dB SNR condition. Consequently, the experimental results give a clear guide for choosing a good number of code vectors in the first- and second-stage codebooks.

Table 1. Effects of the number of code vectors in the first- and second-stage codebooks on identification rates for the 3-band 2-stage VQ in both clean and noisy environments

First-stage     Second-stage    Clean     20 dB     15 dB     10 dB     5 dB
code vectors    code vectors
 80              80             94.35%    89.7%     85.38%    72.43%    49.17%
100              60             95.35%    90.7%     83.72%    73.42%    53.16%
128              32             95.35%    93.02%    86.38%    75.08%    53.49%
144              16             95.02%    90.37%    84.72%    76.08%    53.16%
152               8             94.35%    90.37%    84.39%    75.08%    51.83%

4.4 Comparison with Other Existing Models

First, we compare the performance of the 2-stage VQ model with that of the GMM and ECVQ [18] models using full-band LPCC features under Gaussian noise corruption. The 2-stage VQ model used 128 code vectors in the first stage, 32 code vectors in the second stage, and 19-dimensional LPCC feature vectors. The GMM model used 50 mixtures and 19-dimensional LPCC feature vectors. The ECVQ model used 64 code vectors, one projection basis vector, 15-dimensional LPCC feature vectors, and an adjustment factor R_j = 0.03. The experimental results are shown in Figure 6. The performance of the GMM model was seriously degraded by Gaussian noise corruption. On the other hand, the 2-stage VQ and ECVQ models were more robust under Gaussian noise corruption, and the 2-stage VQ performed slightly better than the ECVQ model except under the 5 dB SNR condition.

In the next experiment, we compare the proposed model with other multi-band models. The performance of the 3-band 2-stage VQ recognition model with 128 code vectors in the first stage and 32 code vectors in the second stage was compared with that of the baseline GMM, MLECVQ [18], and LCGMM [19] models proposed previously under Gaussian noise corruption.

Figure 6. Identification rates for the GMM, ECVQ, and 2-stage VQ models using full-band LPCC features with white noise corruption.

The baseline GMM model used 50 mixtures, full-band features, and 19-dimensional LPCC feature vectors. The MLECVQ model used 64 code vectors, one projection basis vector, features of four bands, 15-dimensional LPCC feature vectors, and an adjustment factor R_j = 0.03. The LCGMM model used 50 mixtures, features of three bands, and 19-dimensional LPCC feature vectors. The experimental results are shown in Table 2. The performance of the baseline GMM was seriously degraded as the SNR decreased and was the poorest among all the models, consistent with our previous work [19], which showed that the MBLPCC features are more robust than the full-band LPCC features used by the baseline GMM. The 3-band 2-stage VQ yielded at least comparable performance in the clean and 5 dB SNR conditions while providing the best robustness among all models on noisy speech. The 3-band LCGMM achieved better performance in the clean, 20 dB, and 15 dB SNR conditions but poorer performance in lower SNR conditions compared with the 4-band MLECVQ. Since the preceding experiment indicated that the ECVQ was more robust than the GMM under lower SNR conditions in the single-band scheme, the MLECVQ can likewise be expected to be more robust than the LCGMM under lower SNR conditions in the multi-band scheme. Based on these results, it can be concluded that the 3-band 2-stage VQ is effective and robust for representing the characteristics of individual speakers under additive Gaussian noise.

Table 2. Identification rates for the baseline GMM, 4-band MLECVQ, 3-band LCGMM, and 3-band 2-stage VQ with white noise corruption

Method               Clean     20 dB     15 dB     10 dB     5 dB
Baseline GMM         95.02%    77.23%    54.87%    27.95%     8.49%
4-band MLECVQ        95.02%    88.04%    82.39%    70.43%    54.82%
3-band LCGMM         96.68%    92.36%    83.72%    65.12%    37.87%
3-band 2-stage VQ    95.35%    93.02%    86.38%    75.08%    53.49%

Wu et al. [26] proposed an auditory model based on the mechanism of the human auditory periphery and cochlear nucleus, from which four channels of auditory features were extracted. A GMM classifier was independently applied to each channel of auditory features, and the scores of the channels were combined to yield global likelihood scores. They used speech utterances of 49 speakers from the clean and telephone versions of the KING speech database to evaluate the performance of the auditory model for text-independent speaker identification. For each speaker, 3 sections of speech (about 90 seconds) provided the training utterances, and the segment length of the test utterances was 6.4 seconds. The experimental results are shown in Table 3. The auditory model achieved more robust performance below 20 dB SNR but poorer performance in the clean and 30 dB SNR conditions compared with the GMM using MFCC features. Since the performance of the 3-band 2-stage VQ exceeds that of the baseline GMM from clean speech down to 5 dB SNR, the proposed 3-band 2-stage VQ compares favorably with the auditory model [26].

Table 3. Identification rates for the auditory computational model [26] and GMM with white noise corruption

Method                Clean     30 dB     20 dB     10 dB
GMM + MFCC            88.09%    88.16%    43.76%    14.67%
Auditory model [26]   76.83%    73.10%    65.03%    50.59%

5. Conclusions

In this study, the effective and robust MBLPCC features were used as the front end of a speaker identification system. To effectively utilize these multi-band speech features, we proposed a multi-band 2-stage VQ as the recognition model, in which different 2-stage VQ classifiers were applied independently to each band and the errors of all 2-stage VQ classifiers were combined to yield the total error.

Finally, the KING speech database was used to evaluate the proposed method for text-independent speaker identification. The experimental results show that the proposed method is more effective and robust than the baseline GMM, MLECVQ, and LCGMM models proposed previously.

References

[1] Atal, B., Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification, Journal of the Acoustical Society of America, Vol. 55, pp. 1304–1312 (1974).
[2] White, G. M. and Neely, R. B., Speech Recognition Experiments with Linear Prediction, Bandpass Filtering, and Dynamic Programming, IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 24, pp. 183–188 (1976).
[3] Vergin, R., O'Shaughnessy, D. and Farhat, A., Generalized Mel Frequency Cepstral Coefficients for Large-Vocabulary Speaker-Independent Continuous-Speech Recognition, IEEE Trans. on Speech and Audio Processing, Vol. 7, pp. 525–532 (1999).
[4] Furui, S., Cepstral Analysis Technique for Automatic Speaker Verification, IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 29, pp. 254–272 (1981).
[5] Tishby, N. Z., On the Application of Mixture AR Hidden Markov Models to Text Independent Speaker Recognition, IEEE Trans. on Signal Processing, Vol. 39, pp. 563–570 (1991).
[6] Yu, K., Mason, J. and Oglesby, J., Speaker Recognition Using Hidden Markov Models, Dynamic Time Warping and Vector Quantisation, IEE Proceedings - Vision, Image and Signal Processing, Vol. 142, pp. 313–318 (1995).
[7] Reynolds, D. A. and Rose, R. C., Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models, IEEE Trans. on Speech and Audio Processing, Vol. 3, pp. 72–83 (1995).
[8] Miyajima, C., Hattori, Y., Tokuda, K., Masuko, T., Kobayashi, T. and Kitamura, T., Text-Independent Speaker Identification Using Gaussian Mixture Models Based on Multi-Space Probability Distribution, IEICE Trans. on Information and Systems, Vol. E84-D, pp. 847–855 (2001).
[9] Alamo, C. M., Gil, F. J. C., Munilla, C. T. and Gomez, L. H., Discriminative Training of GMM for Speaker Identification, Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 1996), Vol. 1, pp. 89–92 (1996).
[10] Pellom, B. L. and Hansen, J. H. L., An Efficient Scoring Algorithm for Gaussian Mixture Model Based Speaker Identification, IEEE Signal Processing Letters, Vol. 5, pp. 281–284 (1998).
[11] Soong, F. K., Rosenberg, A. E., Rabiner, L. R. and Juang, B. H., A Vector Quantization Approach to Speaker Recognition, Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 1985), Vol. 10, pp. 387–390 (1985).
[12] Burton, D. K., Text-Dependent Speaker Verification Using Vector Quantization Source Coding, IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 35, pp. 133–143 (1987).
[13] Zhou, G. and Mikhael, W. B., Speaker Identification Based on Adaptive Discriminative Vector Quantisation, IEE Proceedings - Vision, Image and Signal Processing, Vol. 153, pp. 754–760 (2006).
[14] Juang, B. H. and Gray, A. H., Multiple Stage Vector Quantization for Speech Coding, Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 1982), Vol. 7, pp. 597–600 (1982).
[15] Hermansky, H., Tibrewala, S. and Pavel, M., Toward ASR on Partially Corrupted Speech, Proc. of Int. Conf. on Spoken Language Processing, pp. 462–465 (1996).
[16] Hsieh, C. T. and Wang, Y. C., A Robust Speaker Identification System Based on Wavelet Transform, IEICE Trans. on Information and Systems, Vol. E84-D, pp. 839–846 (2001).
[17] Hsieh, C. T., Lai, E. and Wang, Y. C., Robust Speaker Identification System Based on Wavelet Transform and Gaussian Mixture Model, Journal of Information Science and Engineering, Vol. 19, pp. 267–282 (2003).
[18] Hsieh, C. T., Lai, E. and Chen, W. C., Robust Speaker Identification System Based on Multilayer Eigen-Codebook Vector Quantization, IEICE Trans. on Information and Systems, Vol. E87-D, pp. 1185–1193 (2004).
[19] Chen, W. C., Hsieh, C. T. and Lai, E., Multi-Band Approach to Robust Text-Independent Speaker Identification, Journal of Computational Linguistics and Chinese Language Processing, Vol. 9, pp. 63–76 (2004).
[20] Allen, J. B., How Do Humans Process and Recognize Speech?, IEEE Trans. on Speech and Audio Processing, Vol. 2, pp. 567–577 (1994).
[21] Bourlard, H. and Dupont, S., A New ASR Approach Based on Independent Processing and Recombination of Partial Frequency Bands, Proc. of Int. Conf. on Spoken Language Processing, Vol. 1, pp. 426–429 (1996).
[22] Tibrewala, S. and Hermansky, H., Sub-Band Based Recognition of Noisy Speech, Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 1997), Vol. 2, pp. 1255–1258 (1997).
[23] Mirghafori, N. and Morgan, N., Combining Connectionist Multiband and Full-Band Probability Streams for Speech Recognition of Natural Numbers, Proc. of Int. Conf. on Spoken Language Processing, Vol. 3, pp. 743–747 (1998).
[24] Linde, Y., Buzo, A. and Gray, R. M., An Algorithm for Vector Quantizer Design, IEEE Trans. on Communications, Vol. 28, pp. 84–95 (1980).
[25] Godfrey, J., Graff, D. and Martin, A., Public Databases for Speaker Recognition and Verification, Proc. of ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 39–42 (1994).
[26] Wu, X., Luo, D. and Chi, H., Biomimetics Speaker Identification Systems for Network Security Gatekeepers, Proc. of Int. Joint Conf. on Neural Networks, Vol. 4, pp. 3189–3194 (2003).

Manuscript Received: Jul. 8, 2007
Accepted: Mar. 10, 2008