Text Independent Speaker Verification System

Project Advisor: Professor Lawrence Saul

Abstract:

User identification and verification are essential to any security system today, as attackers find more and more ways to defeat even the most complex security measures. Biometric recognition systems are in demand because they rely on human features that are unique to a person and cannot easily be forged, such as the face, fingerprints and voice. Like a fingerprint, a person's voice has unique features, and this voiceprint can be used to verify their identity. The goal of my project is to design and implement a text-independent speaker verification system. This means that regardless of what the user says, the system should be able to verify whether they are the person they claim to be. Such a system would be useful in banks, at ATMs, and in telephone-based applications, where there is no way to identify a user by fingerprint or face.

Related Work:

Speech recognition is not a new subject, but it is a growing industry and new methods of exploiting the human voice are continuously being developed. A lot of research has been done on text-independent speaker verification systems using Gaussian mixture models, and my project is a simple implementation of that approach. I will be using published papers on this topic to assist me in my goal.

Technical Approach:

The objective of this project is to implement a single-speaker verification system. Statistically speaking, it is a test between two hypotheses about an utterance Y:

H0: Y is from the hypothesized speaker S
H1: Y is not from the hypothesized speaker S

The decision is made with a likelihood ratio test:

$$\frac{p(Y \mid H_0)}{p(Y \mid H_1)} \begin{cases} > \theta, & \text{accept } H_0 \\ < \theta, & \text{accept } H_1 \end{cases}$$

(Figure taken from A Tutorial on Text-Independent Speaker Verification [1].)

The output of front-end processing is a sequence of feature vectors $X = \{x_1, x_2, \ldots, x_T\}$, where $x_t$ is a feature vector indexed at discrete time $t \in [1, 2, 3, \ldots, T]$. These features are then used to compute the likelihoods of H0 and H1. The log of the likelihood ratio above is then:

$$\Lambda(X) = \log p(X \mid H_0) - \log p(X \mid H_1)$$

We need to generate two models for this test to work: the speaker model and the background model. I have planned three stages for implementing this system:

Training Phase: Generate the background model
Tuning Phase: Generate the individual speaker models
Testing Phase: Test the system using new wave files from test speakers

I am using the Gaussian mixture model (GMM) for the likelihood function, so the mixture density for a D-dimensional feature vector x is a weighted sum of M Gaussian components:

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, p_i(x), \qquad p_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i)\right)$$

where the mixture weights $w_i$ sum to one and $\lambda$ denotes the set of weights, means $\mu_i$ and covariances $\Sigma_i$.

[1] A Tutorial on Text-Independent Speaker Verification
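The log-likelihood ratio Λ(X) described above could be computed along the following lines. This is a minimal sketch, assuming diagonal-covariance GMMs stored as NumPy arrays; the function names, the per-frame averaging of the log-likelihood, and the `log_theta` parameter are illustrative assumptions and not the scripts used in the project.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log p(x | lambda) for a diagonal-covariance GMM.

    X: (T, D) feature vectors; weights: (M,); means, variances: (M, D).
    Averaging over frames is an assumption made to normalise for clip length.
    """
    X = np.asarray(X, dtype=float)
    D = X.shape[1]
    # log N(x_t | mu_i, diag(var_i)) for every frame t and component i -> (T, M)
    diff = X[:, None, :] - means[None, :, :]
    log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=1))
    log_comp = log_norm[None, :] - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)
    # log sum_i w_i p_i(x_t), computed stably with the log-sum-exp trick
    a = log_comp + np.log(weights)[None, :]
    a_max = a.max(axis=1, keepdims=True)
    frame_ll = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    return float(frame_ll.mean())

def llr_score(X, speaker, background):
    """Lambda(X) = log p(X | H0) - log p(X | H1); each model is (weights, means, variances)."""
    return gmm_log_likelihood(X, *speaker) - gmm_log_likelihood(X, *background)

def verify(X, speaker, background, log_theta):
    """Accept H0 (claimed speaker) when the score exceeds the log-domain threshold."""
    return bool(llr_score(X, speaker, background) > log_theta)
```

Working in the log domain means the threshold θ on the ratio becomes log θ on Λ(X), and the log-sum-exp step avoids numerical underflow when many components have very small densities.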

The GMM parameters (mean, variance, etc.) are calculated using the Expectation-Maximization (EM) algorithm. It is an iterative process that monotonically increases the likelihood of the estimated model for the observed feature vectors, so that for iterations k and k+1,

$$p(X \mid \lambda^{(k+1)}) \ge p(X \mid \lambda^{(k)})$$

The weight, mean, and variance parameters are re-estimated at each iteration.
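One EM iteration for a GMM can be sketched as follows. This is an illustration under stated assumptions, not the training code used in the project: it assumes diagonal covariances, uses the standard maximum-likelihood re-estimation formulas for the weights, means and variances, and adds a hypothetical `var_floor` parameter to keep the variances away from zero.

```python
import numpy as np

def log_component_densities(X, means, variances):
    """log N(x_t | mu_i, diag(var_i)) for all frames and components -> (T, M)."""
    D = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]
    log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.log(variances).sum(axis=1))
    return log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)

def em_step(X, weights, means, variances, var_floor=1e-3):
    """One EM iteration for a diagonal-covariance GMM.

    Returns the updated (weights, means, variances) and the total
    log-likelihood of X under the parameters that were passed in.
    """
    X = np.asarray(X, dtype=float)
    # E-step: posterior probability of each component for each frame
    a = log_component_densities(X, means, variances) + np.log(weights)[None, :]
    a_max = a.max(axis=1, keepdims=True)
    log_px = a_max + np.log(np.exp(a - a_max).sum(axis=1, keepdims=True))  # (T, 1)
    resp = np.exp(a - log_px)                                              # (T, M)
    # M-step: re-estimate weights, means and variances from the posteriors
    n = resp.sum(axis=0)                                                   # (M,)
    new_weights = n / X.shape[0]
    new_means = (resp.T @ X) / n[:, None]
    new_vars = (resp.T @ (X ** 2)) / n[:, None] - new_means ** 2
    new_vars = np.maximum(new_vars, var_floor)  # assumption: simple variance floor
    return new_weights, new_means, new_vars, float(log_px.sum())
```

Calling `em_step` repeatedly and recording the returned log-likelihood gives a sequence that should not decrease, which is the monotonicity property stated above. The sketch assumes every component keeps some probability mass; a component whose posteriors all vanish would need to be re-initialised.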

Data Collection:

To implement this system, test data is required. I have recorded clips from 25 speakers. Each speaker's data set consists of 15 speech clips of varying lengths. I split this data set into three categories: Training, Tuning and Testing. These are the three phases of the project, and data is required in each of them. I used 9 of the 15 clips for training, 3 for tuning and the remaining 3 for testing the application.

To record the clips, I used a microphone and recording software called GoldWave. One factor that affected the results was the distance between the microphone and the speaker's mouth. If the microphone was too close or too far away, the results were skewed. I realized this at a later stage, and so had to ask a few speakers to record more test clips.

Training Phase:

In the training phase, the background model is created. The background model is essentially one large Gaussian mixture model built from a pool of all the sample data. I have converted the wave files into a different format so that they can be used for this analysis. The wave file is a continuous signal, which must be broken down into discrete parameter vectors. Each vector covers about 10 ms of audio, because we assume that the signal is stationary over this duration. This is not strictly true, but it is a reasonable approximation to make. The format I have used is MFCC, which stands for Mel Frequency Cepstral Coefficients. The conversion can be done as follows [2]:

1. Divide signal into frames.
2. For each frame, obtain the amplitude spectrum.
3. Take the logarithm.
4. Convert to Mel (a perceptually-based) spectrum.
5. Take the discrete cosine transform (DCT).

However, instead of doing this manually, I used the HTK Toolkit to automate the process. Once the files are in the correct format, it is important to discard the silent frames and keep only the speech frames. This produces the mfcc.speech files.

[2] Logan, Beth. Mel Frequency Cepstral Coefficients for Music Modeling
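The five MFCC steps listed above can be sketched in NumPy as follows. This is an illustration only; the project used the HTK Toolkit for the conversion, not this code. The 25 ms window with a 10 ms step, the 512-point FFT, the 26 filters and the 12 coefficients are all assumed values, and the sketch applies the Mel filterbank before taking the logarithm, which is the order commonly used in implementations.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale -> (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):       # rising edge of the triangle
            fbank[i, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):      # falling edge of the triangle
            fbank[i, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc(signal, sample_rate, frame_ms=25.0, step_ms=10.0,
         n_fft=512, n_filters=26, n_ceps=12):
    """Return an (n_frames, n_ceps) array; assumes the frame length does not exceed n_fft."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    n_frames = 1 + (len(signal) - frame_len) // step
    # 1. Divide the signal into overlapping, Hamming-windowed frames
    idx = np.arange(frame_len)[None, :] + step * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Amplitude spectrum of each frame
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # 3-4. Mel-scale filterbank, then logarithm (small constant avoids log(0))
    log_mel = np.log(spectrum @ mel_filterbank(n_filters, n_fft, sample_rate).T + 1e-10)
    # 5. DCT-II of the log Mel spectrum; coefficient 0 is skipped
    basis = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1),
                                    np.arange(n_filters) + 0.5) / n_filters)
    return log_mel @ basis.T
```

For a one-second clip sampled at 16 kHz, `mfcc(signal, 16000)` returns one 12-dimensional vector for every 10 ms step.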

One of the features extracted is energy, which corresponds to the loudness or softness of the speaker's voice. To prevent this from distorting the results, I removed the energy feature from the speech files. The speech files can now be combined to generate the background model file.

This model must now be trained, and we must decide on the number of Gaussians to work with. To make that decision, I looked at the log-likelihood values at the end of the training process and compared them. For example:

[Table: Number of samples | Log_Likelihood in Loop 4 | Number of Gaussians]

The optimal number of Gaussians is the one at which the log-likelihood value drops for the first time, because this means that the likelihood is actually increasing. During my earlier training phase, the optimal number of Gaussians was 300, with the lowest log-likelihood value. However, as the number of samples increased, I decided to continue testing with higher numbers of Gaussians and finally achieved the best results at 600 Gaussians. As the system is scaled for use by a large number of speakers, this number will increase substantially. I keep the number of Gaussians fixed for the background model and the speaker models.

Tuning Phase:

In the tuning phase, the individual speaker models are generated. The process for generating these models is very similar to that for generating the background model, with a few minor changes. I take the mfcd.speech files (these are the mfcc files with the energy feature removed) and use them to generate a model for that speaker. I keep the number of Gaussians the same as in the background model, in this case 600 Gaussians.

The purpose of this system is to test whether a given voiceprint belongs to the person the speaker claims to be. To achieve this, I needed to devise a method of calculating a threshold value that would make it easy to tell the speaker from an imposter. An imposter is a user who claims to be somebody else in order to cheat the system. To

do this, I used three test files from each user. I compared each file of the speaker to the speaker model and, based on how well the features matched, calculated the likelihood value of the test recording belonging to that speaker. For each speaker, I compared not only the speaker's own test files but also the files from the other speakers in the background model. This provided a range of values that would be useful in calculating a threshold. Below is a sample of the data I got from running the above test.

[Table: likelihood value of each speech file against each speaker model. Dat files (models): dip, divye, jiten, khush, madhu. Speech files: three each from dip, divye, jiten, khush and madhu.]

Each speech file belongs to some speaker, and the highlighted likelihood values are the results of comparing a speaker's test file to the same speaker's model. The most important point I noticed in the tuning phase results was that the likelihood value of a test file is positive when the file actually belongs to the speaker and negative when the file belongs to an imposter.

I decided that the threshold had to be some function of the average of the likelihood values of the speaker's files that also takes the imposter values into account. The threshold function I used is:

$$\mu + x\sigma$$

where µ is the mean and σ is the standard deviation of all the likelihood values, and x is an integer whose value can be varied. I varied x, starting with x = 2. Using this threshold function, I computed the thresholds of all the speakers in my background model.
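The threshold function µ + xσ can be computed as in the following sketch. The function names and the dictionary layout (one list of tuning scores per speaker model) are assumptions for illustration, and the population standard deviation is assumed since the text does not say which estimator was used.

```python
import numpy as np

def speaker_threshold(scores, x=2):
    """Return mu + x * sigma over all tuning likelihood values for one speaker model.

    scores: likelihood values of every tuning file (the speaker's own files
    and the other speakers' files) scored against this speaker's model.
    """
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean() + x * scores.std())  # assumption: population std

def all_thresholds(score_table, x=2):
    """score_table maps a speaker name to that model's list of tuning scores."""
    return {name: speaker_threshold(scores, x) for name, scores in score_table.items()}
```

Because most of the tuning scores for a model come from imposters and are negative, µ sits low and the multiplier x controls how far above the imposter scores the threshold is placed.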

Testing Phase:

Once we have all the threshold values and speaker models, it is time to test the remaining files. This will help us determine whether the analysis done above is accurate enough. Using the threshold values calculated in the tuning phase, I tested the remaining speaker files.

To ensure that the system is accurate when verifying users, we need to test the threshold values in two ways: for false alarms and for false rejections. If the likelihood value of an imposter file is higher than the threshold for the speaker being tested, the system will validate the imposter as the speaker. This is a false alarm. On the other hand, a speaker's own file may sometimes not have a likelihood value higher than the threshold, so the speaker is falsely identified as an imposter. This is a false rejection. An optimal threshold value would minimize both of these counts, keeping the error rate low.

I maintain a summary file for each user, which is generated when the testing scripts are run. It records the likelihood values and the mean, variance and standard deviation of the results. The summary file also tracks the number of false alarms and rejections:

    mean =
    var =
    stdev =
    threshold for khush is
    number of false alarms with threshold are 1
    number of false rejections with threshold are 0

As mentioned above, I started with x = 2. This threshold gave a very high error rate, allowing many imposters to be validated as another speaker. However, there were very few false rejections. So I experimented by raising the value of x to 3 and then finally to 4. Currently, I have fixed the value of x at 4. However, this value would change as the number of speakers increases.
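Counting false alarms and false rejections for one speaker can be sketched as follows. The function names and the split of scores into genuine and imposter lists are assumptions for illustration and do not reproduce the project's testing scripts.

```python
def count_errors(genuine_scores, imposter_scores, threshold):
    """Return (false_alarms, false_rejections) for one speaker model.

    A false alarm is an imposter file scoring above the threshold.
    A false rejection is a genuine file that does not score above it.
    """
    false_alarms = sum(1 for s in imposter_scores if s > threshold)
    false_rejections = sum(1 for s in genuine_scores if not s > threshold)
    return false_alarms, false_rejections

def error_rates(genuine_scores, imposter_scores, threshold):
    """Same counts expressed as fractions of the imposter and genuine trials."""
    fa, fr = count_errors(genuine_scores, imposter_scores, threshold)
    return fa / len(imposter_scores), fr / len(genuine_scores)
```

Raising the threshold can only lower the false alarm count and raise the false rejection count, which matches the trade-off seen when x was increased from 2 to 4.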

The User Interface:

To make this system user friendly, I have developed a GUI application that is simple and hides the underlying complexity from the user. It has two parts: one for training a new speaker and one for testing a returning user. Both operations need to complete quickly when the application is being demonstrated. I have incorporated a recorder in the GUI, so that no separate recording software is required.

In theory, a new speaker would be added to the system offline. The background model need not contain all the users that are added to the system, but if there were a large discrepancy between the actual number of users and the user data in the background sample pool, the results would be skewed. For the purpose of demonstration, however, the background model is not modified when a new user is added.

The entire procedure is automated using Perl scripts. Once the user records a voice clip and chooses either to be added to the system or to be verified as a particular speaker, all the processing steps are run and the result is shown on the screen.

A new speaker is added by the following procedure:

- Speaker records a voice clip.
- Voice clip is converted into a speech file of the correct format.

- Using the data, the speaker model is generated.
- Using the same speech file and the tuning files of the existing users, the threshold value of the speaker is generated.

A speaker's identity is verified by the following procedure:

- User records a voice clip.
- Voice clip is converted into a speech file.
- User selects his username from a drop-down menu.
- Based on the user's selection, the likelihood value of the speech file is compared with the threshold value of the selected identity.
- If the likelihood value is higher than the threshold value, the user is identified as the speaker.
- If the likelihood value is below the threshold value, the user is identified as an imposter.

Conclusion:

The aim of this project was to implement an application that would verify a speaker's identity by using the speaker's voiceprint, the characteristics that distinguish the speaker from other speakers. I wanted to implement a simple application using algorithms already in existence.

Data collection was a very important aspect of this project. It was a challenge to figure out how many speakers I should use. I initially had about 10, but then I increased that number to 25. It was also important to figure out what kind of data I should work with. Should I have multiple files or just one with a lot of speech? How many files should go to the testing phase and how many to the tuning phase? I decided the details of data collection after a lot of trial and error.

One of the challenges was understanding how the Hidden Markov Model Toolkit (HTK) worked. It was important to extract the features that I needed for my experiments and to be able to manipulate the data the right way. One of the features extracted from the voice recording is energy. This energy corresponds to the loudness of the speaker's voice and would skew results if taken into account. So I had to figure out how to remove the energy feature from the feature vectors that HTK generated.

While the application gives fairly accurate results, it works well only under certain environmental conditions. I recorded most of the data in a room with very little background disturbance. This is meant to be a single-speaker verification system, so no other speakers should be heard in the background. Also, the same microphone was used for all the test speakers, placed at a fixed distance from the speaker's mouth while recording the clip. Using a different microphone or changing the distance between the microphone and the speaker's mouth causes the results to be skewed. So the application works under this scenario, but not necessarily under any other circumstances. I would have liked to make it robust to such changes, but I was not successful.

Overall I enjoyed working on this project since it was a topic that interested me. A blessing in disguise was my lack of information and awareness in this field, as it forced me to read and learn a lot on my own. I also learnt how to work on a large project with very little structure. It was important to set deadlines for myself and keep working towards the end goal. There were times when everything went wrong and it was important not to give up. I am glad that I was able to achieve the goals I set for myself.

References:

Logan, Beth. Mel Frequency Cepstral Coefficients for Music Modeling.
Schmidt, Regina. Identity Confirmed, Access Permitted: The Basics On Voice Authentication, Security And Consumer Use Of An Emerging Biometric. BiometriTech. 3 Sep
A Tutorial on Text-Independent Speaker Verification. EURASIP Journal on Applied Signal Processing, 2004.
Reynolds, Douglas A., Quatieri, Thomas F., Dunn, Robert B. Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing.
The HTK Book.