Developing an Isolated Word Recognition System in MATLAB




By Daryl Ning

Speech-recognition technology is embedded in voice-activated routing systems at customer call centres, voice dialling on mobile phones, and many other everyday applications. A robust speech-recognition system combines accuracy of identification with the ability to filter out noise and adapt to other acoustic conditions, such as the speaker's speech rate and accent. Designing a robust speech-recognition algorithm is a complex task requiring detailed knowledge of signal processing and statistical modeling.

Products used: MATLAB, Data Acquisition Toolbox, Signal Processing Toolbox, Statistics Toolbox

This article demonstrates a workflow that uses built-in functionality in MATLAB and related products to develop the algorithm for an isolated digit recognition system. The system is speaker-dependent; that is, it recognizes speech only from one particular speaker's voice.

Classifying Speech-Recognition Systems

Most speech-recognition systems are classified as isolated or continuous. Isolated word recognition requires a brief pause between each spoken word, whereas continuous speech recognition does not. Speech-recognition systems can be further classified as speaker-dependent or speaker-independent. A speaker-dependent system recognizes speech only from one particular speaker's voice, whereas a speaker-independent system can recognize speech from anybody.

The Development Workflow

There are two major stages within isolated word recognition: a training stage and a testing stage. Training involves teaching the system by building its dictionary, an acoustic model for each word that the system needs to recognize. In our example, the dictionary comprises the digits zero to nine. In the testing stage, we use the acoustic models of these digits to recognize isolated words with a classification algorithm.

The development workflow consists of three steps: speech acquisition, speech analysis, and user interface development.

Acquiring Speech

For training, speech is acquired from a microphone and brought into the development environment for offline analysis. For testing, speech is continuously streamed into the environment for online processing.

During the training stage, it is necessary to record repeated utterances of each digit in the dictionary. For example, we repeat the word "one" many times with a pause between each utterance. Using the following MATLAB code with a standard PC sound card, we capture ten seconds of speech from a microphone input at 8000 samples per second:

    Fs = 8000;        % Sampling frequency (Hz)
    Duration = 10;    % Duration (sec)
    y = wavrecord(Duration*Fs, Fs);

We then save the data to disk as mywavefile.wav.
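The save call is not shown in the original listing; a minimal sketch using wavwrite, the WAV-file writer available in the MATLAB release this article targets (current releases use audiowrite instead), would be:

    % Write the recorded samples in y to disk at the same sampling rate
    wavwrite(y, Fs, 'mywavefile.wav');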

This approach works well for training data. In the testing stage, however, we need to continuously acquire and buffer speech samples and, at the same time, process the incoming speech frame by frame, that is, in continuous groups of samples. We use Data Acquisition Toolbox to set up continuous acquisition of the speech signal and simultaneously extract frames of data for processing.

The MATLAB code shown in Figure 1 uses a Windows sound card to capture data at a sampling rate of 8000 Hz. Data is acquired and processed in frames of 80 samples. The process continues until the RUNNING flag is set to zero.

    % Define system parameters
    framesize = 80;     % Frame size (samples)
    Fs = 8000;          % Sampling frequency (Hz)
    RUNNING = 1;        % A flag to continue data capture

    % Set up data acquisition from the sound card
    ai = analoginput('winsound');
    addchannel(ai, 1);

    % Configure the analog input object
    set(ai, 'SampleRate', Fs);
    set(ai, 'SamplesPerTrigger', framesize);
    set(ai, 'TriggerRepeat', inf);
    set(ai, 'TriggerType', 'immediate');

    % Start acquisition
    start(ai)

    % Keep acquiring data until RUNNING is set to zero
    while RUNNING
        % Acquire new input samples
        newdata = getdata(ai, ai.SamplesPerTrigger);

        % Do some processing on newdata
        <DO_SOMETHING>

        % Set RUNNING to zero if we are done
        if <WE_ARE_DONE>
            RUNNING = 0;
        end
    end

    % Stop acquisition
    stop(ai);

    % Disconnect and clean up
    delete(ai);
    clear ai;

Figure 1. MATLAB code for capturing speech data.

Analyzing the Acquired Speech

We begin by developing a word-detection algorithm that separates each word from the ambient noise. We then derive an acoustic model that gives a robust representation of each word at the training stage. Finally, we select an appropriate classification algorithm for the testing stage.

Developing a Speech-Detection Algorithm

The speech-detection algorithm is developed by processing the prerecorded speech frame by frame within a simple loop. For example, the MATLAB code in Figure 2 continuously reads 160-sample frames from the data in speech.

To detect isolated digits, we use a combination of signal energy and zero-crossing counts for each speech frame. Signal energy works well for detecting voiced signals, while zero-crossing counts work well for detecting unvoiced signals. Calculating these metrics is simple using core MATLAB mathematical and logical operators. To avoid identifying ambient noise as speech, we assume that each isolated word will last at least 25 milliseconds.
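The article does not list these calculations. A minimal per-frame sketch, applied inside each pass of the framing loop shown in Figure 2 below, might look like the following; frame holds the current frame of samples, and the thresholds energyThresh and zcThresh are assumed values tuned on recorded background noise:

    % Per-frame detection metrics (sketch; threshold values are assumptions)
    energy = sum(frame.^2);                           % short-term signal energy
    zerocrossings = sum(abs(diff(sign(frame)))) / 2;  % zero-crossing count
    % Flag the frame as speech if it is energetic (voiced) or busy (unvoiced)
    isSpeech = (energy > energyThresh) | (zerocrossings > zcThresh);

A word is then accepted only when enough consecutive frames are flagged to cover the 25-millisecond minimum duration.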

    % Define system parameters
    seglength = 160;                  % Length of frames (samples)
    overlap = seglength/2;            % Number of samples to overlap
    stepsize = seglength - overlap;   % Frame step size
    nframes = floor(length(speech)/stepsize) - 1;

    % Initialize frame start and end indices
    samp1 = 1;
    samp2 = seglength;

    for i = 1:nframes
        % Get current frame for analysis
        frame = speech(samp1:samp2);

        % Do some analysis
        <DO_SOMETHING>

        % Step up to the next frame of speech
        samp1 = samp1 + stepsize;
        samp2 = samp2 + stepsize;
    end

Figure 2. MATLAB code for reading a speech sample frame by frame.

Developing the Acoustic Model

A good acoustic model should be derived from speech characteristics that enable the system to distinguish between the different words in the dictionary. We know that different sounds are produced by varying the shape of the human vocal tract and that these different sounds can have different frequencies. To investigate these frequency characteristics, we examine the power spectral density (PSD) estimates of various spoken digits.

Since the human vocal tract can be modeled as an all-pole filter, we use the Yule-Walker parametric spectral estimation technique from Signal Processing Toolbox to calculate these PSDs. After importing an utterance of a single digit into the variable speech, we use the following MATLAB code to visualize the PSD estimate:

    order = 12;
    nfft = 512;
    Fs = 8000;
    pyulear(speech, order, nfft, Fs);

Because the Yule-Walker algorithm fits an autoregressive linear prediction filter model to the signal, we must specify an order for this filter. We select a value of 12, which is typical in speech applications. Figures 3a and 3b plot the PSD estimates of three different utterances of the words "one" and "two". We can see that the peaks in the PSD remain consistent for a particular digit but differ between digits. This means that we can derive the acoustic models in our system from spectral features.

From the linear predictive filter coefficients, we can obtain several feature vectors using Signal Processing Toolbox functions, including reflection coefficients, log area ratio parameters, and line spectral frequencies. One set of spectral features commonly used in speech applications because of its robustness is Mel Frequency Cepstral Coefficients (MFCCs). MFCCs give a measure of the energy within overlapping frequency bins of a spectrum with a warped (Mel) frequency scale [1].

Since speech can be considered short-term stationary, MFCC feature vectors are calculated for each frame of detected speech. Using many utterances of a digit and combining all the feature vectors, we can estimate a multidimensional probability density function (PDF) of the vectors for a specific digit. Repeating this process for each digit gives us the acoustic model for each digit. During the testing stage, we extract the MFCC vectors from the test speech and use a probabilistic measure to determine the source digit with maximum likelihood. The challenge then becomes selecting an appropriate PDF to represent the MFCC feature vector distributions.

[1] We modified an MFCC implementation from the Auditory Toolbox, available via the MATLAB Central Link Exchange: www.mathworks.com/auditory-toolbox
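As a sketch of how the feature vectors for a single digit might be pooled before this density estimation (the mfcc call stands in for the modified Auditory Toolbox implementation of footnote [1]; utterances and the variable names are assumptions, not code from the article):

    % Pool MFCC feature vectors from every training utterance of one digit
    % (sketch; mfcc() is the modified Auditory Toolbox routine from [1],
    % and utterances is an assumed cell array of detected words)
    MFCCtraindata = [];
    for u = 1:numel(utterances)
        coeffs = mfcc(utterances{u}, Fs);           % one column of MFCCs per frame
        MFCCtraindata = [MFCCtraindata; coeffs'];   % stack frames as rows
    end

One such pooled matrix per digit is what the density estimation described in the next section consumes.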

Figure 3a. Yule-Walker PSD estimates of three different utterances of the word ONE.

Figure 3b. Yule-Walker PSD estimates of three different utterances of the word TWO.
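Plots like these can be reproduced by overlaying the Yule-Walker PSD estimates of several utterances of the same digit. A minimal sketch, assuming the three utterances of "one" are stored in a cell array utterOne (an assumed variable, not from the article), is:

    % Overlay Yule-Walker PSD estimates of three utterances of "one"
    % (sketch; utterOne is an assumed cell array of recorded utterances)
    order = 12; nfft = 512; Fs = 8000;
    hold on
    for k = 1:3
        [Pxx, F] = pyulear(utterOne{k}, order, nfft, Fs);
        plot(F, 10*log10(Pxx));
    end
    hold off
    xlabel('Frequency (Hz)');
    ylabel('Power/frequency (dB/Hz)');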

Figure 4a shows the distribution of the first dimension of MFCC feature vectors extracted from the training data for the digit "one". We could use dfittool in Statistics Toolbox to fit a PDF, but the distribution looks quite arbitrary, and standard distributions do not provide a good fit. One solution is to fit a Gaussian mixture model (GMM), a sum of weighted Gaussians (Figure 4b).

Figure 4a. Distribution of the first dimension of MFCC feature vectors for the digit "one".

Figure 4b. Overlay of the estimated Gaussian components (red) and the overall Gaussian mixture model (green) on the distribution in Figure 4a.

The complete Gaussian mixture density is parameterized by the mixture weights, mean vectors, and covariance matrices of all component densities. For isolated digit recognition, each digit is represented by the parameters of its GMM.

To estimate the parameters of a GMM for a set of MFCC feature vectors extracted from training speech, we use an iterative expectation-maximization (EM) algorithm to obtain a maximum likelihood (ML) estimate. Given some MFCC training data in the variable MFCCtraindata, we use the Statistics Toolbox gmdistribution function to estimate the GMM parameters. This function is all that is required to perform the iterative EM calculations:

    % Number of Gaussian component densities
    M = 8;
    model = gmdistribution.fit(MFCCtraindata, M);

Selecting a Classification Algorithm

After estimating a GMM for each digit, we have a dictionary for use at the testing stage. Given some test speech, we again extract the MFCC feature vectors from each frame of the detected word. The objective is to find the digit model with the maximum a posteriori probability for the set of test feature vectors, which reduces to maximizing a log-likelihood value. Given a digit model gmmmodel and some test feature vectors testdata, the log-likelihood value is easily computed using the posterior function in Statistics Toolbox:

    [P, log_like] = posterior(gmmmodel, testdata);

We repeat this calculation using each digit's model. The test speech is classified as the digit whose GMM produces the maximum log-likelihood value.
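A minimal sketch of that comparison, assuming the ten trained models are stored in a cell array models indexed as digit+1 (bookkeeping the article does not show), follows. Note that the second output of posterior is documented as the negative log-likelihood, so the maximum-likelihood digit corresponds to the smallest returned value:

    % Score the test feature vectors against every digit's GMM and pick
    % the most likely digit (sketch; models is an assumed cell array)
    nlogl = zeros(1, numel(models));
    for d = 1:numel(models)
        [P, nlogl(d)] = posterior(models{d}, testdata);  % negative log-likelihood
    end
    [dummy, idx] = min(nlogl);     % smallest negative log-likelihood wins
    recognizedDigit = idx - 1;     % map the index back to the digit 0-9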

Building the User Interface

After developing the isolated digit recognition system in an offline environment with prerecorded speech, we migrate the system to operate on streaming speech from a microphone input. We use MATLAB GUIDE tools to create an interface that displays the time-domain plot of each detected word as well as the classified digit (Figure 5).

Figure 5. Interface to the final application.

Extending the Application

The algorithm described in this article can be extended to recognize isolated words other than digits, or to recognize words from several speakers by developing a speaker-independent system.

If the goal is to implement the speech-recognition algorithm in hardware, we could use MATLAB and related products to simulate fixed-point effects, automatically generate embedded C code, and verify the generated code.

For More Information

Demo: Acoustic Echo Cancellation: www.mathworks.com/acoustic_demo
Article: Developing Speech-Processing Algorithms for a Cochlear Implant: www.mathworks.com/cochlear_implant
Download: Auditory Toolbox and the MFCC function: www.mathworks.com/auditory-toolbox