CHAPTER 5 SPEAKER IDENTIFICATION USING SPEAKER-SPECIFIC-TEXT

5.1 MOTIVATION FOR USING SPEAKER-SPECIFIC-TEXT

Better classification accuracy can be achieved if the training technique captures the unique features of a class, that is, the features that discriminate one class from another. In Chapter 4, a GMM technique was proposed that equips a classifier to capture the unique features of a class and to make decisions based on those features alone. During testing, the feature vectors that are unique to a class are derived, and the classification accuracy thereby increases. One drawback is that if the test utterance does not contain the unique features, the classification accuracy can drop drastically. Another drawback is that the unique features have to be identified from the test utterances during testing, which increases the computation time.

If the speaker were able to utter a word that contains only the unique features, the computation time would be reduced. However, even when the unique feature vectors are known, one cannot expect or force a speaker to utter speech segments that contain these features alone. On the other hand, if the list of unique phonemes is known a priori, one can formulate a text to be uttered using such phonemes alone. In this chapter, we investigate the effect on a speaker recognition task of a subset of phonemes that are unique to a speaker in the acoustic sense. The proposed technique involves three main steps:

1. Find the confusing speaker for each speaker.
2. Derive the set of acoustically dissimilar phonemes for each speaker with respect to his/her confusing speaker.
3. Test the system using utterances that contain the maximum number of acoustically dissimilar phonemes.

The proposed technique is evaluated on a speaker identification task using the TIMIT speech corpus. The results are compared with the performance of a conventional GMM-based classifier.

5.2 EXPERIMENTAL SETUP

The TIMIT speech corpus is used for both training and testing. The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT, and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST).

Each speaker has 10 utterances, and each utterance is approximately 3 seconds long. For the current study, only the female speakers (192 in number) are considered, because the classification accuracy for female data is inferior to that for male data. For each speaker, the first 8 of the ten sentences are used for training and the last 2 are used for testing.
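The first step of the technique, finding each speaker's confusing speaker, amounts to ranking speaker models by log-likelihood. The following is a minimal sketch, not the implementation used in this work. It assumes the scores are already available as a nested dictionary (a hypothetical layout), and it takes the confusing speaker to be the best-scoring model other than the speaker's own. The speaker names and score values are invented for illustration.

```python
def find_confusing_speakers(scores):
    """Map each speaker to her closest rival.

    scores[s][m] is the average log-likelihood of speaker s's training
    utterances under the model of speaker m (assumed input layout).
    """
    confusing = {}
    for speaker, per_model in scores.items():
        # Consider every model except the speaker's own, then keep the
        # rival with the highest log-likelihood.
        rivals = [(ll, model) for model, ll in per_model.items() if model != speaker]
        confusing[speaker] = max(rivals)[1]
    return confusing


# Toy scores for three hypothetical speakers (not TIMIT results).
scores = {
    "spk_a": {"spk_a": -41.0, "spk_b": -47.5, "spk_c": -52.0},
    "spk_b": {"spk_a": -46.0, "spk_b": -40.2, "spk_c": -44.9},
    "spk_c": {"spk_a": -55.1, "spk_b": -45.3, "spk_c": -39.8},
}
confusing = find_confusing_speakers(scores)
print(confusing)
```

Note that the relation need not be symmetric: in the toy data, spk_b is the confusing speaker of spk_c, while spk_b's own confusing speaker is spk_c only by coincidence of the chosen values.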

The total number of training utterances is 1536 and the total number of test utterances is 384. For each speaker, a GMM with 64 mixture components is trained on Mel-frequency cepstral coefficients (13 static + 13 dynamic + 13 acceleration) as features. The training utterances of each speaker are scored against all 192 speaker models. Based on the log-likelihoods, the two best-scoring models are retained, and the second-best speaker is taken as the closely resembling (confusing) speaker. This process is repeated for all 192 speakers to derive a confusing speaker list.

To derive the speaker-specific-text of a speaker, the first step is to find her acoustically dissimilar phonemes. The phonemes (i.e., the corresponding speech segments) that are common to the speaker and her confusing speaker in the training utterances are tested with her model and with her confusing speaker's model. The average log-likelihood (see footnote 2) of each phoneme is computed for the speaker and for her confusing speaker. If the difference between the two means is greater than a specific threshold, the phoneme is considered an acoustically dissimilar phoneme. The same process is repeated for the phonemes of all the speakers.

During testing, the speaker-specific-text (utterances that contain acoustically dissimilar phonemes) is used. Since the TIMIT speech corpus consists of pre-recorded read sentences, a speaker-specific-text cannot be formulated using only the acoustically dissimilar phonemes. Therefore, the speaker-specific-text is derived from the two test utterances by taking the words that have the maximum number of acoustically dissimilar phonemes. Results for these words are compared with results for words chosen without considering the acoustically dissimilar phonemes. When the system is tested using speech utterances that correspond to the speaker-specific-text, the confusion error is considerably lower than that of the conventional GMM-based classification technique, as discussed in the next section.

Footnote 2: Since the number of examples for each of the phonemes used in this work is small, the product of likelihood-Gaussians used in the feature-level approach cannot be used.

5.3 PERFORMANCE ANALYSIS

The performance of the system has been analyzed using acoustically dissimilar phonemes. Various values of the threshold (the average log-likelihood difference between the speaker and her confusing speaker) are set, and different constraints are used for testing the performance of the system. Since the TIMIT corpus is used, the test text cannot be formulated using only the acoustically dissimilar phonemes. To capture speaker characteristics, the constraint set in our work is that each test utterance (word) should have at least six phonemes. Of these, a minimum of three should be acoustically dissimilar phonemes (ADPs); i.e., a six-phoneme word should contain at least 50% ADPs. For each speaker, one word that satisfies these constraints is chosen for testing. The performance of such a system is tabulated in Table 5.1.

Table 5.1 Speaker identification performance of the system based on different thresholds and constraints (ADP: acoustically dissimilar phoneme)

Case | Threshold | Constraints (no. of phonemes in the test utterance / no. of ADPs) | No. of speakers satisfying the constraints | No. of speakers recognized correctly | Identification accuracy (%)
1 | >=9 | 6 / {>3} | | |
2 | >=10 | 6 / {>3} | | |
3 | >=11 | 6 / {>3} | | |
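The derivation of acoustically dissimilar phonemes and the word-selection constraint can be sketched as follows. This is one possible reading of the procedure, not the code used in the experiments. It assumes that per-segment log-likelihoods of each phoneme under the speaker's model and under her confusing speaker's model are already available, that the threshold is applied as "greater than or equal to" (as the threshold column of Table 5.1 suggests), and that every occurrence of an ADP in a word counts towards the ADP total. The function names, phoneme labels, words and scores are all hypothetical.

```python
def dissimilar_phonemes(own_ll, rival_ll, threshold):
    """Return the phonemes whose mean log-likelihood gap meets the threshold.

    own_ll[p] and rival_ll[p] are lists of per-segment log-likelihoods of
    phoneme p under the speaker's model and the confusing speaker's model.
    """
    adps = set()
    for phoneme in own_ll.keys() & rival_ll.keys():  # common phonemes only
        own_mean = sum(own_ll[phoneme]) / len(own_ll[phoneme])
        rival_mean = sum(rival_ll[phoneme]) / len(rival_ll[phoneme])
        if own_mean - rival_mean >= threshold:  # assumed comparison
            adps.add(phoneme)
    return adps


def select_test_word(words, adps, min_phonemes=6, min_adps=3):
    """Pick the word with the most ADPs among words meeting both constraints.

    Returns (word, adp_count), or None if no word qualifies.
    """
    best = None
    for word, phonemes in words.items():
        n_adps = sum(1 for p in phonemes if p in adps)
        if len(phonemes) >= min_phonemes and n_adps >= min_adps:
            if best is None or n_adps > best[1]:
                best = (word, n_adps)
    return best


# Toy log-likelihoods for one speaker (invented values).
own_ll = {"aa": [-30.0, -32.0], "s": [-40.0], "iy": [-35.0, -33.0],
          "t": [-50.0], "k": [-45.0, -47.0]}
rival_ll = {"aa": [-42.0, -44.0], "s": [-41.0], "iy": [-45.0, -46.0],
            "t": [-62.0], "k": [-50.0, -52.0]}
adps = dissimilar_phonemes(own_ll, rival_ll, threshold=10)

# Toy words with illustrative phoneme sequences from the two test utterances.
words = {
    "water": ["w", "ao", "t", "er"],
    "suitcase": ["s", "uw", "t", "k", "ey", "s"],
    "artistic": ["aa", "r", "t", "ih", "s", "t", "ih", "k"],
    "tahiti": ["t", "aa", "hh", "iy", "t", "iy"],
}
chosen = select_test_word(words, adps)
print(sorted(adps), chosen)
```

With a speaker for whom no word meets both constraints, `select_test_word` returns `None`, which corresponds to that speaker being left out of the "speakers satisfying the constraints" count.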

From Table 5.1, it can be noted that even with a single word that contains three or more acoustically dissimilar phonemes, the classification accuracy is reasonably good (i.e., above 80%). Further, the performance varies only slightly across the thresholds (see footnote 3). This shows that the performance of the system is not very sensitive to the threshold.

Speaker identification performance is next compared between utterances with acoustically dissimilar phonemes and utterances chosen without considering the acoustically dissimilar phonemes. As before, the constraint set in our work is that the test utterances (words) should have at least six phonemes. Each phoneme lasts approximately 80 ms, so six phonemes span roughly 480 ms. Therefore, each test utterance is divided into 500 ms speech segments, which are given for testing. Such a 500 ms segment may contain both acoustically similar and dissimilar phonemes (segments corresponding to silences longer than 100 ms are not considered).

Table 5.2 Speaker identification performance of the system without considering the acoustically dissimilar phonemes (the speakers that satisfy the constraints given in cases 1, 2 and 3 of Table 5.1 are considered for testing)

Case | No. of speakers for testing (as in Table 5.1) | No. of 500 ms speech utterances | No. of times recognized correctly | Identification accuracy (%)
1 | | | |
2 | | | |
3 | | | |

Footnote 3: Since the TIMIT corpus is used, the authors do not have control over the number of speakers who satisfy the constraints.
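The 500 ms segmentation used for the baseline can be illustrated with a short sketch that works on a time-aligned phonetic transcription. It is only an illustration under stated assumptions: the labels treated as silence ("h#", "pau", "epi"), the choice to keep silences of 100 ms or less, and the choice to drop a trailing segment shorter than 500 ms are all assumptions, and the function name and sample boundaries are hypothetical.

```python
SAMPLE_RATE = 16000                     # TIMIT sampling rate in Hz
SILENCE_LABELS = {"h#", "pau", "epi"}   # assumed silence/pause labels


def speech_chunks(segments, chunk_ms=500, max_silence_ms=100):
    """Split the non-silent part of an utterance into fixed-length chunks.

    segments: list of (start_sample, end_sample, label) tuples taken from a
    time-aligned phonetic transcription.
    Returns a list of chunks; each chunk is a list of (start, end) sample
    ranges whose lengths add up to chunk_ms of audio.
    """
    chunk_len = SAMPLE_RATE * chunk_ms // 1000
    max_sil = SAMPLE_RATE * max_silence_ms // 1000
    chunks, current, filled = [], [], 0
    for start, end, label in segments:
        if label in SILENCE_LABELS and end - start > max_sil:
            continue  # silences longer than 100 ms are not considered
        while start < end:
            take = min(end - start, chunk_len - filled)
            current.append((start, start + take))
            filled += take
            start += take
            if filled == chunk_len:
                chunks.append(current)
                current, filled = [], 0
    return chunks  # a trailing partial chunk is dropped (assumption)


# Toy alignment: 150 ms silence, speech, a 50 ms pause, speech, 250 ms silence.
segments = [
    (0, 2400, "h#"),
    (2400, 6400, "sh"),
    (6400, 9600, "iy"),
    (9600, 10400, "pau"),
    (10400, 20000, "ae"),
    (20000, 24000, "h#"),
]
chunks = speech_chunks(segments)
print(len(chunks), chunks)
```

Each returned chunk covers 8000 samples (500 ms at 16 kHz), so the number of chunks per speaker corresponds to the "No. of 500 ms speech utterances" column of Table 5.2.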

From Table 5.1 and Table 5.2, it can be noted that using speaker-specific-text yields a 16% performance improvement, as specified in row 2 of Table 5.1.

The speaker identification performance is also measured as a function of the number of acoustically dissimilar phonemes in the test utterance. From each test utterance, the words with a minimum of six phonemes and at most two acoustically dissimilar phonemes are taken for testing. Similarly, the words with a minimum of six phonemes and at least three acoustically dissimilar phonemes are taken for testing. The number of speakers taken for this experiment is 40. The results are tabulated in Table 5.3.

Table 5.3 Speaker identification performance based on number of acoustically dissimilar phonemes

Case | No. of acoustically dissimilar phonemes | No. of speakers | No. of speakers recognized correctly | Identification accuracy (%)
1 | <=2 | | |
2 | >=3 | | |

From Table 5.3, it can be noted that the classification performance improves as the number of acoustically dissimilar phonemes increases.

Finally, the speaker identification performance is measured by comparing the acoustically similar phonemes and the acoustically dissimilar phonemes in the test utterance. Feature vectors of acoustically similar and dissimilar phonemes are extracted and given for testing; that is, testing is done with feature vectors extracted from the utterance of a single phoneme. The experimental results show that even with a single acoustically dissimilar phoneme, the speakers can be identified with reasonable accuracy, as shown in Figure 5.1.

Figure 5.1 Comparison between acoustically similar and dissimilar phonemes (ADP: acoustically dissimilar phonemes, ASP: acoustically similar phonemes)

From Figure 5.1, it can be noted that the acoustically dissimilar phonemes yield higher accuracy than the acoustically similar phonemes. Speakers 9, 10 and 11 have lower accuracy for the acoustically dissimilar phonemes. However, the majority of the speakers were identified even with a single acoustically dissimilar phoneme. This result shows that if the test utterance contains only the acoustically dissimilar phonemes, the confusion error can be reduced and the classification accuracy increased. Computation time is also reduced because only the unique features (acoustically dissimilar phonemes), identified before testing, are considered; i.e., testing is done using only speech utterances that correspond to a speaker-specific-text. Further, this shows that the duration of the test utterances can be reduced drastically without compromising the classification accuracy.
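Testing with the feature vectors of a single phoneme reduces to scoring a handful of frames against every speaker model and choosing the model with the highest average log-likelihood. The sketch below shows this decision rule with a small hand-written GMM scorer. It is a toy under stated assumptions: diagonal covariances are assumed (the covariance type is not stated in this chapter), the models have 2 components over 2-dimensional features rather than 64 components over 39-dimensional MFCCs, and all parameter values and speaker names are invented.

```python
import math


def gmm_avg_loglik(frames, gmm):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    gmm: list of (weight, means, variances) tuples, one per component.
    """
    total = 0.0
    for x in frames:
        comp_logs = []
        for weight, means, variances in gmm:
            log_p = math.log(weight)
            for xi, mu, var in zip(x, means, variances):
                # Log of a one-dimensional Gaussian density, summed over dims.
                log_p += -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
            comp_logs.append(log_p)
        peak = max(comp_logs)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comp_logs))
    return total / len(frames)


def identify(frames, models):
    """Return the speaker whose model gives the highest average log-likelihood."""
    return max(models, key=lambda spk: gmm_avg_loglik(frames, models[spk]))


# Toy models for two hypothetical speakers.
models = {
    "spk_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "spk_b": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [7.0, 7.0], [1.0, 1.0])],
}
# Frames standing in for the feature vectors of one phoneme segment.
phoneme_frames = [[0.2, -0.1], [1.8, 2.1]]
winner = identify(phoneme_frames, models)
print(winner)
```

Because only the frames of one phoneme are scored, the cost per test is proportional to the phoneme's duration times the number of speaker models, which is the source of the reduction in computation time noted above.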

5.4 SUMMARY

In this chapter, we have proposed using speech utterances that correspond to a speaker-specific-text for speaker recognition tasks. Here, the speaker-specific-text is formed from the unique phonemes of a speaker, in other words, a set of phonemes that are acoustically dissimilar when compared with those of a competing (acoustically closely resembling) speaker. We have shown that the classification accuracy in a speaker identification task is considerably higher than that of a conventional GMM-based technique if speech utterances that correspond to the unique phonemes are used. Further, we have shown that even with a single phoneme, if it is unique to a speaker, the classification accuracy is quite satisfactory. These results show that the duration of the test utterances can also be reduced considerably without compromising accuracy.