LPC ANALYSIS AND SYNTHESIS


Chapter 3 LPC ANALYSIS AND SYNTHESIS

3.1 INTRODUCTION

Speech signals are analyzed to obtain their spectral information. Speech analysis is employed in a variety of systems, such as voice recognition systems and digital speech coding systems. Established methods of analyzing speech make use of linear predictive coding (LPC). Linear prediction is a powerful tool for the analysis of speech signals: the human vocal tract is modeled as an infinite impulse response system that produces the speech signal. Voiced regions of speech have a resonant structure and a high degree of similarity across time shifts that are multiples of the pitch period, so LPC modeling produces an efficient representation for this type of speech. In LPC, the current sample of the speech signal is estimated as a linear combination of weighted past samples. The set of weights constitutes the LPC coefficients, which are used as filter coefficients in the encoding and decoding stages of a coder. Today many voice recognition and speech coding systems use LPC analysis to generate the required spectral information of the speech signal. Voice recognition systems use LPC techniques to produce observation vectors (LPC coefficients), which are then used to recognize the spoken utterances. Voice recognition has applications in

various industries, including telephony and consumer electronics; for example, voice recognition is used in mobile telephony for hands-free (voice) dialing. LPC analysis is usually carried out at the transmitting end for each frame of the speech signal to determine the voiced/unvoiced decision for the frame, the pitch of the frame, and the parameters needed to build the synthesis filter for that frame. This information is transmitted to the receiving end, where the receiver performs LPC synthesis using the received parameters.

In LPC analysis the input speech signal, sampled at 8000 samples per second, is divided into frames of 160 samples, i.e., each frame represents 20 ms of the input speech signal. The reason for framing is that speech is a non-stationary signal whose properties change with time, which makes direct application of the Discrete Fourier Transform (DFT) or autocorrelation techniques over the whole signal impractical. For most phonemes, however, the properties of the speech signal remain approximately invariant over a short period of time (5-100 ms), so traditional signal processing methods can be applied successfully over such intervals. Most speech processing is done in this manner. This short segment of the signal is called a frame, and the frame length used here is 20 ms. Framing breaks the dependency between samples belonging to adjacent frames. To avoid this loss of dependency, adjacent frames are overlapped, with an overlap of 50% on both sides. Framing, in turn, leaves signal discontinuities at the beginning and at the end of

each frame. To reduce these discontinuities, each frame is multiplied by a window [25-30].

3.2 WINDOWING

A window is a function that is zero everywhere except over the region of interest. The purpose of windowing is to smooth the estimated power spectrum and to avoid abrupt transitions in the spectral estimates between adjacent frames. Windowing a speech signal involves multiplying the signal, frame by frame, by a window whose length equals the frame length. Multiplying a frame by a finite-length window is equivalent to convolving the power spectrum of the frame with the frequency response of the window, so the side lobes of the window's frequency response have an averaging effect on the power spectrum of the frame. The windows commonly used are the rectangular, Hamming, Hanning and Blackman windows; the most widely used in speech analysis are the Hamming and Hanning windows. The rectangular window has the highest frequency resolution because it has the narrowest main lobe, but it also has the largest spectral leakage. This leakage is due to its large side lobes, makes the spectral estimate noisier, and tends to offset the benefit of the high frequency resolution. Hence the rectangular window is not widely used in speech analysis. The Hamming, Hanning and Blackman windows have lower

frequency resolution and less spectral leakage, so they are widely used in speech analysis. These windows taper smoothly towards zero at the ends and are close to one in the middle; the smooth ends and broad middle section introduce less distortion into the signal. In this thesis a Hamming window of 160 samples, equal to the frame length, is used.

The window length is another important parameter that affects the smoothing. A very long window gives better frequency resolution, but the spectral properties of the speech signal change over long durations. The frame must therefore be short enough that the speech signal can be considered stationary over it, and the window correspondingly short. Making the window short, however, has some disadvantages [22, 30-31]: the frame rate increases, which means more information is processed than necessary, increasing the computational complexity; the spectral estimates become less reliable because of the stochastic nature of the speech signal; and, since the pitch frequency typically lies between 80 and 500 Hz, a pitch pulse occurs roughly every 2 to 12 ms, so if the window is small compared to the pitch period a pitch pulse will sometimes be present in the window and sometimes not.
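
As an illustration of the framing and windowing just described, a minimal Python/NumPy sketch might look as follows; the frame length (160 samples), 50% overlap and Hamming window follow the text, while the function and variable names are purely illustrative:

    import numpy as np

    def frame_and_window(speech, frame_len=160, overlap=0.5):
        """Split a speech signal into 50%-overlapping frames and apply a Hamming window.

        speech    : 1-D array of samples (8000 samples/s assumed, so 160 samples = 20 ms)
        frame_len : samples per frame
        overlap   : fraction of overlap between adjacent frames
        """
        hop = int(frame_len * (1.0 - overlap))      # frame advance (80 samples here)
        window = np.hamming(frame_len)              # tapered ends, close to one in the middle
        n_frames = 1 + max(0, (len(speech) - frame_len) // hop)
        frames = np.empty((n_frames, frame_len))
        for k in range(n_frames):
            start = k * hop
            frames[k] = speech[start:start + frame_len] * window
        return frames

    # Example: 1 s of a synthetic 200 Hz tone at 8 kHz gives 99 windowed frames
    fs = 8000
    t = np.arange(fs) / fs
    frames = frame_and_window(np.sin(2 * np.pi * 200 * t))
    print(frames.shape)    # (99, 160)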

3.3 CHOOSING THE ORDER OF THE FILTER

Linear predictive coding is a time-domain technique that models the speech signal as a linear combination of weighted, delayed past speech samples. The LPC order is an important parameter in linear prediction and affects the quality of the synthesized speech, since the order determines how many weighted past samples are used to predict the current sample. In this thesis an LPC order of 10 is chosen, meaning the past 10 speech samples are used to estimate the current sample; the resulting model is called a 10th-order LPC model. The number of prediction coefficients required to model the speech signal adequately depends on the spectral content of the source. Each formant (spectral peak) is represented by a pair of complex-conjugate poles, and each pole requires one linear prediction coefficient. In human speech, roughly one formant is observed per 1000 Hz of bandwidth, so the best LPC order depends on the bandwidth of the sampled speech signal. In narrowband speech coding the speech signal is band-limited to about 4 kHz with a low-pass filter, so its spectrum contains about four formants. Modeling these four formants requires eight complex poles, so the filter order must be at least eight; in practice two additional poles are included to help minimize the residual energy. A total of ten poles is therefore used to represent a narrowband speech signal, and the LPC order is chosen as 10. For an LPC order of 10, the

number of LPC coefficients is 11, and the first term of the 10th-order polynomial is always taken to be 1, which is an important assumption in LPC analysis [31-32].

3.4 LINEAR PREDICTIVE MODELING OF SPEECH SIGNALS

Linear prediction analysis is one of the most powerful speech analysis methods. In it, the short-term correlation that exists between the samples of a speech signal (the formant structure) is modeled and removed using a low-order filter.

3.4.1 Source Filter Model of Speech Production

The source filter model of speech production is used as a means for the analysis of speech signals. The block diagram of the source filter model [31] is shown in Fig 3.1: an impulse train generator (driven by the pitch period) and a random noise generator feed a voiced/unvoiced switch; the selected excitation x(n), scaled by a gain G, drives a time-varying filter whose coefficients are the LPC coefficients, and the filter output is the speech signal.

Fig 3.1 Source filter model of speech production

The excitation in this model is a train of impulses for voiced segments of speech and random noise for unvoiced segments of speech.

The combined spectral contributions of the glottal flow, the vocal tract and the radiation at the lips are represented by a time-varying filter with the steady-state system function

    H(z) = \frac{S(z)}{X(z)} = \frac{G\,\bigl(1 + \sum_{j=1}^{M} b_j z^{-j}\bigr)}{1 - \sum_{i=1}^{N} a_i z^{-i}}    (3.1)

Equation (3.1) is the transfer function of a filter containing both poles and zeros, where S(z) is the Z-transform of the vocal tract output and X(z) is the Z-transform of the vocal tract input. If the order of the denominator is sufficiently high, H(z) can be approximated by the all-pole model

    H(z) = \frac{G}{1 - \sum_{j=1}^{p} a_j z^{-j}} = \frac{G}{A(z)}    (3.2)

where p is the order of the filter and

    A(z) = 1 - \sum_{j=1}^{p} a_j z^{-j}    (3.3)

Transforming equation (3.2) into the sampled time domain gives

    s(n) = G\, x(n) + \sum_{j=1}^{p} a_j\, s(n-j)    (3.4)

Equation (3.4) is the LPC difference equation. It states that the present speech sample s(n) is obtained by summing the present input G x(n) and a weighted sum of the past speech

sample values. If \alpha_j denotes the estimate of a_j, then the error signal is the difference between the input and the predicted speech signal and is given by equation (3.5)

    e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{j=1}^{p} \alpha_j\, s(n-j)    (3.5)

The estimates are determined by minimizing the mean squared error given by equation (3.6)

    E\bigl[e^2(n)\bigr] = E\Bigl[\bigl(s(n) - \sum_{j=1}^{p} \alpha_j\, s(n-j)\bigr)^{2}\Bigr]    (3.6)

Setting the partial derivative of equation (3.6) with respect to \alpha_j to zero for j = 1, ..., p gives

    E\Bigl[\bigl(s(n) - \sum_{j=1}^{p} \alpha_j\, s(n-j)\bigr)\, s(n-i)\Bigr] = 0,   i = 1, ..., p    (3.7)

That is, e(n) is orthogonal to s(n-i) for i = 1, ..., p. Equation (3.7) can be arranged as

    \sum_{j=1}^{p} \alpha_j\, \phi_n(i, j) = \phi_n(i, 0)    (3.8)

where

    \phi_n(i, j) = E\bigl[s(n-i)\, s(n-j)\bigr]    (3.9)

3.4.2 Solution to LPC Analysis

The speech signal is a time-varying signal whose properties change slowly with time. To model this time-varying nature, the analysis is restricted to short segments of speech called frames. This is done by replacing the expectations in equation (3.8) by summations over finite limits, as in equation (3.10)

    \phi_n(i, j) = E\bigl[s(n-i)\, s(n-j)\bigr] = \sum_{m} s_n(m-i)\, s_n(m-j),   i = 1, ..., p,  j = 0, ..., p    (3.10)

The normal equations built from equation (3.10) can be solved using two methods, namely the autocorrelation method and the covariance method.

3.4.2.1 Determination of LPC Coefficients

In this thesis the LPC coefficients are determined using the autocorrelation method [31, 33].

3.4.2.1.1 Autocorrelation Method

In this method the speech signal is considered stationary over a short period of time and is assumed to be zero outside the interval 0 \le m \le N-1, where N is the length of the sample sequence. With this limit, equation (3.10) becomes

    \phi_n(i, j) = \sum_{m=0}^{N+p-1} s_n(m-i)\, s_n(m-j),   1 \le i \le p,  0 \le j \le p    (3.11)

Equation (3.11) can also be expressed as

    \phi_n(i, j) = \sum_{m=0}^{N-1-(i-j)} s_n(m)\, s_n(m + i - j),   1 \le i \le p,  0 \le j \le p    (3.12)

From equation (3.12) it is observed that \phi_n(i, j) is simply the short-time autocorrelation function evaluated at lag i - j, so it reduces to

    \phi_n(i, j) = R_n(i - j),   i = 1, ..., p,  j = 0, ..., p    (3.13)

where

    R_n(j) = \sum_{m=0}^{N-1-j} s_n(m)\, s_n(m + j)    (3.14)

Using the autocorrelation method, equation (3.8) is expressed as

    \sum_{j=1}^{p} \alpha_j\, R_n(|i - j|) = R_n(i),   1 \le i \le p    (3.15)

or, in matrix form,

    \begin{bmatrix}
    R_n(0)   & R_n(1)   & \cdots & R_n(p-1) \\
    R_n(1)   & R_n(0)   & \cdots & R_n(p-2) \\
    \vdots   & \vdots   & \ddots & \vdots   \\
    R_n(p-1) & R_n(p-2) & \cdots & R_n(0)
    \end{bmatrix}
    \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{bmatrix}
    =
    \begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(p) \end{bmatrix}

The matrix above is symmetric and all elements along each diagonal are equal, i.e., it is a Toeplitz matrix. Equation (3.15) could be solved by inverting the p x p matrix directly, but this results in computational errors. The practical solution to equation (3.15) is to exploit the Toeplitz structure and use an efficient recursive procedure. The most widely used recursive procedure is Durbin's recursive algorithm, which is as follows

    E^{(0)} = R_n(0)    (3.16)

    k_i = \frac{R_n(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)}\, R_n(i-j)}{E^{(i-1)}},   1 \le i \le p    (3.17)

    \alpha_i^{(i)} = k_i    (3.18)

    \alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\, \alpha_{i-j}^{(i-1)},   1 \le j \le i-1    (3.19)

    E^{(i)} = \bigl(1 - k_i^2\bigr)\, E^{(i-1)}    (3.20)

After solving equations (3.17) to (3.20) recursively for i = 1, 2, ..., p, where p is the prediction order, the prediction parameters \alpha_j are obtained as

    \alpha_j = \alpha_j^{(p)},   1 \le j \le p    (3.21)

3.5 VOICED AND UNVOICED DETERMINATION

According to the LPC-10 standard, before the voiced/unvoiced decision for a frame is made, each frame is passed through a low-pass filter of 1 kHz bandwidth. The voiced/unvoiced decision for a frame is important because of the difference between the waveforms of voiced and unvoiced speech. This difference creates the need for two different excitation signals at the input of the LPC filter during synthesis (decoding): one excitation signal for voiced speech and another for unvoiced speech.
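
As a summary of the analysis procedure of Sections 3.4.2 and 3.4.2.1, a minimal Python/NumPy sketch of per-frame LPC coefficient estimation, using the autocorrelation of equation (3.14) and the Durbin recursion of equations (3.16)-(3.21), might look as follows; the function names are illustrative and not taken from any library:

    import numpy as np

    def autocorrelation(frame, p):
        """R_n(0..p) of one windowed frame, as in equation (3.14)."""
        N = len(frame)
        return np.array([np.dot(frame[:N - j], frame[j:]) for j in range(p + 1)])

    def levinson_durbin(R, p):
        """Solve the Toeplitz normal equations (3.15) via Durbin's recursion (3.16)-(3.21).

        Returns the prediction coefficients alpha_1..alpha_p and the final prediction error.
        """
        E = R[0]                                   # equation (3.16)
        alpha = np.zeros(p + 1)                    # alpha[j] holds the j-th coefficient
        for i in range(1, p + 1):
            k = (R[i] - np.dot(alpha[1:i], R[i - 1:0:-1])) / E   # equation (3.17)
            new_alpha = alpha.copy()
            new_alpha[i] = k                                     # equation (3.18)
            new_alpha[1:i] = alpha[1:i] - k * alpha[i - 1:0:-1]  # equation (3.19)
            alpha = new_alpha
            E = (1.0 - k * k) * E                                # equation (3.20)
        return alpha[1:], E                        # equation (3.21)

    # Usage on one 160-sample windowed frame with a 10th-order model:
    # R = autocorrelation(frame, 10)
    # a, err = levinson_durbin(R, 10)              # a[0..9] are alpha_1..alpha_10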

Voiced speech has distinct resonant (formant) frequencies. The voiced, unvoiced and silence portions of an utterance ("telephone banking") are shown in Fig 3.2. The voiced portions of the utterance have large amplitude and low frequencies, while the unvoiced portions have smaller amplitude (less energy) and higher frequencies than the voiced speech, as can be observed from Fig 3.2. To decide whether a frame is voiced or unvoiced, one looks at the energy of the frame and the number of zero-crossings in that frame. The zero-crossing rate is an important cue: voiced speech is produced by excitation of the vocal tract by a periodic flow of air through the vocal cords and therefore has a low zero-crossing rate, whereas unvoiced speech is produced by a turbulent flow of air (the vocal cords do not vibrate) and has a high zero-crossing rate.

Fig 3.2 Voiced, unvoiced and silence regions of the utterance "telephone banking" (amplitude vs. time)

For voiced speech most of the energy is concentrated at low frequencies, while for unvoiced speech most of the energy lies at higher frequencies. High frequencies imply a high zero-crossing rate and low frequencies a low zero-crossing rate, so a strong relationship exists between the zero-crossing rate and the distribution of energy with frequency. A reasonable generalization is that a frame with a high zero-crossing rate and low energy is considered unvoiced, while a frame with a low zero-crossing rate and high energy is considered voiced. The classification of a frame as voiced or unvoiced is shown in Fig 3.3 [34-36]: the speech signal is processed frame by frame through a Hamming window, the short-time energy (E) and the short-time average zero-crossing rate (ZCR) of the frame are computed, and if the ZCR is small and E is high the frame is labelled voiced, otherwise unvoiced.

Fig 3.3 Voiced and unvoiced decision of a frame

The ideal-world categorization of the speech signal into voiced, unvoiced and silence is shown in Table 3.1

Table 3.1 Ideal-world categorization scheme

    Short-time energy    Zero-crossings    Label
    High                 Approx. 12        Voiced
    Low                  Approx. 5         Unvoiced
    0                    0                 Silence

In practice the categorization of sounds into voiced, unvoiced and silence follows Table 3.2.

Table 3.2 Real-world categorization scheme

    Short-time energy    Zero-crossings    Label
    Approx. 0            Approx. 0         Silence
    Low                  High              Unvoiced
    High                 Low               Voiced
    High                 Approx. 0         Voiced
    High                 High              Voiced
    Low                  Low               Voiced
    Low                  Approx. 0         Unvoiced
    Approx. 0            High              Silence
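
A rough sketch of the frame classification logic of Fig 3.3 is given below; the short-time energy and zero-crossing thresholds are illustrative assumptions only (they are not taken from the LPC-10 standard) and would normally be tuned on real data:

    import numpy as np

    def classify_frame(frame, energy_voiced=0.01, energy_silence=1e-4, zcr_voiced=0.1):
        """Label a windowed frame as 'voiced', 'unvoiced' or 'silence' from E and ZCR."""
        energy = np.mean(frame ** 2)                     # short-time energy E
        signs = np.sign(frame)
        zcr = np.mean(np.abs(np.diff(signs)) > 0)        # fraction of sign changes (ZCR)

        if energy < energy_silence:
            return "silence"
        if zcr < zcr_voiced and energy > energy_voiced:  # low ZCR and high E -> voiced
            return "voiced"
        return "unvoiced"                                # high ZCR and/or low E

    # Example: a 200 Hz sinusoid frame reads as voiced, low-level white noise as unvoiced
    fs, n = 8000, 160
    t = np.arange(n) / fs
    print(classify_frame(np.sin(2 * np.pi * 200 * t)))   # voiced
    print(classify_frame(0.05 * np.random.randn(n)))     # unvoiced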

Real speech signals are not noise-free; they contain some amount of background noise. Apart from the background noise, it is not easy to detect silent portions of the speech signal, because the short-time energy of a breath can easily be confused with the short-time energy of a fricative sound [37].

3.6 PITCH DETECTION

3.6.1 Introduction

The process of estimating the pitch period, or fundamental frequency, of a periodic signal such as speech is referred to as pitch detection. During pitch period estimation, voiced speech is regarded as being produced by passing quasi-periodic pulses through the LPC filter. The interval between the pulses of the excitation signal is called the pitch period, denoted T0. The estimation of the pitch period strongly influences the quality of the reconstructed speech signal; an incorrect estimate greatly degrades it. Pitch detection algorithms fall into two classes: frequency-domain and time-domain algorithms. Frequency-domain algorithms estimate the pitch period from windowed segments of the speech signal after converting them from the time domain to the frequency domain, usually with the Fast Fourier Transform (FFT). Methods of this type include the cepstrum method, the maximum likelihood method and the harmonic product spectrum method. In time-

domain methods the pitch period is estimated by finding the glottal closure instants (GCI) and measuring the time between successive events. Time-domain methods include the Average Magnitude Difference Function (AMDF) method, the Average Squared Mean Difference Function (ASMDF) method and the autocorrelation method. Traditionally, autocorrelation-based methods are widely used in speech coders. Voiced speech is produced by the vibration of the vocal cords, and the rate of this vibration determines the pitch of the voiced speech. During the production of unvoiced speech the vocal cords do not vibrate; they remain open and carry no pitch information. The estimation of the pitch period and the voiced/unvoiced decision for a frame greatly influence the quality of the reconstructed speech signal. If a voiced frame is classified as unvoiced, the reconstructed speech is less intelligible and sounds rough; on the other hand, if an unvoiced frame is classified as voiced, the reconstructed speech sounds annoyingly metallic or robotic [38-45].

3.6.2 Pitch Detection Algorithm

The excitation used in the source filter model depends greatly on precise estimation of the pitch parameters, as incorrect pitch estimation reduces the quality and intelligibility of the reconstructed speech signal by introducing artifacts into it. Intelligibility conveys whether the speech signal is clearly understood

or not. Therefore, the pitch estimation algorithm chosen greatly influences the quality of the reconstructed speech signal. The pitch period is the interval between two voiced excitations; it varies from cycle to cycle, evolves slowly and can be estimated. Estimating the pitch period is easy for highly periodic speech, but some segments of speech do not exhibit this periodicity. Some segments contain both voiced and unvoiced information, and pitch estimation becomes inaccurate for such segments. The presence of strong formants also creates problems, since the speech becomes highly resonant and the pitch estimate inaccurate. Large amounts of background noise likewise make the pitch estimation inaccurate. In this thesis the pitch period is estimated using the time-domain autocorrelation method.

3.6.2.1 Autocorrelation Method of Pitch Detection

The autocorrelation method is frequently used for pitch period estimation. The autocorrelation measures how well the input signal matches a time-shifted version of itself, and its maxima occur at intervals of the pitch period. The autocorrelation method involves a large amount of computation (multiplications and additions), but it is easy to implement in real-time digital signal processing systems because of the regular form of the computation. Another advantage of the autocorrelation pitch determination algorithm is that it is insensitive to phase. Hence it

performs well in estimating the pitch of speech that suffers from some degree of phase distortion. One of the major limitations of the autocorrelation function is that it retains too much of the information in the speech signal [22, 31, 44-45].

A direct distance measurement is the most popular way to measure the similarity between two signals, and is expressed as

    E(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} \bigl[s(n) - s(n-\tau)\bigr]^2    (3.22)

where s(n) represents the speech samples from n = 0 to N-1. Equation (3.22) assumes that the average signal level is fixed, which is not true at signal onsets and offsets. A distance measure that accounts for the non-stationary behaviour of the speech signal is

    E(\tau, \beta) = \frac{1}{N} \sum_{n=0}^{N-1} \bigl[s(n) - \beta\, s(n-\tau)\bigr]^2    (3.23)

where \beta is the scaling factor, or pitch gain, which accounts for changes in the signal level. When the speech signal is assumed stationary, the error of equation (3.22) can be written as

    E(\tau) = \frac{2}{N}\bigl[R(0) - R(\tau)\bigr]    (3.24)

where

    R(\tau) = \sum_{n=0}^{N-1} s(n)\, s(n-\tau)    (3.25)

Minimizing the error E(\tau) of equation (3.22) is therefore equivalent to maximizing the autocorrelation (or cross-correlation) R(\tau), where \tau denotes the lag, or delay, and the maximizing lag equals the pitch period. The

pitch gain is obtained by setting \partial E(\tau, \beta)/\partial\beta = 0 in equation (3.23), giving

    \beta = \frac{\sum_{n=0}^{N-1} s(n)\, s(n-\tau)}{\sum_{n=0}^{N-1} s^2(n-\tau)}    (3.26)

Substituting this pitch gain back into the error function of equation (3.23), the pitch is estimated by minimizing

    E(\tau, \beta) = \frac{1}{N}\left[\sum_{n=0}^{N-1} s^2(n) - \frac{\bigl(\sum_{n=0}^{N-1} s(n)\, s(n-\tau)\bigr)^2}{\sum_{n=0}^{N-1} s^2(n-\tau)}\right]    (3.27)

which is equivalent to maximizing the second term on the right-hand side,

    R_n^2(\tau) = \frac{\bigl(\sum_{n=0}^{N-1} s(n)\, s(n-\tau)\bigr)^2}{\sum_{n=0}^{N-1} s^2(n-\tau)}    (3.28)

Direct use of equation (3.28) may give errors, because the square of the correlation can produce a maximum even when the correlation itself is negative, leading to ineffective pitch estimates. To overcome this problem the square root of equation (3.28) is taken. This removes the square of the correlation and thus removes the possibility of lags with negative correlation being

selected as the pitch. The final normalized autocorrelation function is then given by equation (3.29)

    R_n(\tau) = \frac{\sum_{n=0}^{N-1} s(n)\, s(n-\tau)}{\sqrt{\sum_{n=0}^{N-1} s^2(n-\tau)}}    (3.29)

3.6.2.2 Autocorrelation of Center-Clipped Speech

Speech is not a purely periodic signal, and vocal tract resonances produce additional maxima in the autocorrelation. Applying the autocorrelation method directly to the speech signal therefore yields multiple maxima, and it is difficult to determine which maximum corresponds to the true pitch period. To suppress these local maxima a method called center-clipping is used. The center-clipped speech is obtained by the transformation [46]

    y(n) = C[s(n)]    (3.30)

where C is the center-clipping function shown in Fig 3.4; its output is zero for inputs whose magnitude lies below a threshold level C_L.

Fig 3.4 Center-clipping function

For samples with amplitude above C_L the output of the center-clipper is equal to the input minus the clipping level; for samples with amplitude below the clipping level the output of the center-clipper is zero. Fig 3.5 shows the peaks obtained when the autocorrelation method is used to extract the pitch of a segment of speech before and after applying the center-clipping function. The duration of the speech segment is 960 ms. The blue curve shows the peaks in the autocorrelation when center-clipping is not used and the green curve shows the peaks when center-clipping is used.

Fig 3.5 Peaks in the autocorrelation of a speech signal before and after center-clipping
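
A minimal sketch of the pitch estimation procedure described above, combining center-clipping (equation (3.30), with the clipping level set to half the maximum amplitude) with the normalized autocorrelation of equation (3.29), might look as follows; the 80-500 Hz search range follows the text, while the function names and the example frame are illustrative:

    import numpy as np

    def center_clip(frame, ratio=0.5):
        """Center-clip a frame: zero below the threshold, shifted towards zero above it."""
        cl = ratio * np.max(np.abs(frame))       # clipping level: half the peak amplitude
        out = np.zeros_like(frame)
        out[frame > cl] = frame[frame > cl] - cl
        out[frame < -cl] = frame[frame < -cl] + cl
        return out

    def estimate_pitch(frame, fs=8000, f_min=80.0, f_max=500.0):
        """Estimate the pitch period T0 (in samples) of one frame, as in eq. (3.29)."""
        x = center_clip(frame)
        tau_min = int(fs / f_max)                # shortest lag searched (highest pitch)
        tau_max = int(fs / f_min)                # longest lag searched (lowest pitch)
        best_tau, best_r = tau_min, -np.inf
        for tau in range(tau_min, min(tau_max, len(x) - 1) + 1):
            num = np.dot(x[tau:], x[:-tau])                      # sum of s(n) s(n - tau)
            den = np.sqrt(np.dot(x[:-tau], x[:-tau])) + 1e-12
            r = num / den                                        # normalized autocorrelation
            if r > best_r:
                best_tau, best_r = tau, r
        return best_tau

    # Example: a 125 Hz voiced-like frame at 8 kHz gives T0 of about 64 samples
    # (a frame longer than 160 samples is used here purely for illustration)
    fs, n = 8000, 480
    t = np.arange(n) / fs
    frame = np.sin(2 * np.pi * 125 * t) + 0.3 * np.sin(2 * np.pi * 250 * t)
    print(estimate_pitch(frame, fs))             # ~64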

Fig 3.5 shows that when center-clipping is used, the autocorrelation produces more prominent peaks at the pitch period by reducing the peaks due to local maxima. The peaks in the autocorrelation of the center-clipped speech are much more distinguishable than those in the autocorrelation of the original speech without center-clipping, so the center-clipping operation improves the reliability of the pitch estimate. However, when the speech signal contains noise or is only mildly periodic, the center-clipping operation removes useful information from the speech signal and the estimate degrades. For speech with rapidly changing energy, setting an appropriate clipping level is difficult. Here the clipping level is taken as half the maximum amplitude of the speech signal.

3.7 LPC SYNTHESIS

3.7.1 Introduction

Speech is a primary means of communication between humans, conveying an almost infinite range of thoughts and concepts. Continuous speech is a complicated audio signal, which makes producing it artificially difficult. Speech synthesis is the artificial production of speech. The quality of the speech produced by a synthesis system is judged by two characteristics: intelligibility and naturalness. Intelligibility conveys whether the output of a speech synthesizer is easily understood. Naturalness conveys whether the output of a

speech synthesizer sounds like the speech of a real person. An ideal speech synthesizer is both intelligible and natural, and every synthesis technique tries to maximize both characteristics. In modern systems much attention has been devoted to producing high-quality, natural-sounding speech; nowadays naturalness is often of greater concern than mere intelligibility. Some speech synthesis systems are better at naturalness and others at intelligibility, and the goal of the synthesis determines which system is used. There are three main methods for generating speech waveforms synthetically: concatenative synthesis, formant synthesis and articulatory synthesis [32, 47-49]. In this thesis, the formant synthesis technique is used to generate speech artificially. The three methods are briefly explained below.

Concatenative synthesis: This method is based on the concatenation of segments of recorded speech and generates the most natural-sounding speech. However, automatic methods for segmenting the speech waveforms, together with the natural variation in speech, occasionally produce audible glitches in the output, which detract from the naturalness of the synthesized speech.

Formant synthesis: Formant synthesis produces speech from an acoustic model without using any human speech samples. Parameters such as voicing information, pitch and noise levels are varied frame by frame to produce the speech artificially. Many systems based on formant synthesis generate

artificial, robotic-sounding speech, since utmost naturalness is not always the target of a speech synthesis system. Systems based on formant synthesis have some advantages over systems using concatenative synthesis: formant synthesis produces speech that is highly intelligible and free of the audible glitches that are more common in concatenative systems, and, because formant synthesizers do not use a database of speech samples, formant synthesis programs are often much smaller than concatenative systems, which must store such a database.

Articulatory synthesis: Articulatory synthesis has attracted wide interest in recent years. It is based on the articulation process occurring in the vocal tract and on computational models of the human vocal tract. Few of these models are yet computationally efficient and mature enough for use in commercial speech synthesis systems.

3.7.2 Linear Predictive Coding Synthesis

Linear predictive coding (LPC) synthesis is an efficient technique for generating speech artificially, in which the vocal tract parameters are represented by a set of LPC coefficients. LPC techniques are based on a spectral (frequency-domain) representation of the speech signal: a time-domain speech

signal is transformed into the frequency domain to extract the parameters of the speech signal using a suitable model. The LPC synthesis technique has several advantages over other speech synthesis techniques: LPC requires lower data rates, so less storage is needed and the signals are well suited to narrowband transmission, which is why LPC methods are used in telecommunications, teaching aids and consumer products. An LPC synthesizer produces high-quality speech at bit-rates around 2.4 kbps; in this thesis, good-quality output speech is produced with the LPC synthesizer at bit-rates from 1.2 kbps down to 1 kbps. The structure of the linear predictive synthesizer is shown in Fig 3.6 [22]: an impulse generator (controlled by the pitch period) and a white noise generator feed a voiced/unvoiced switch; the selected excitation u(n), scaled by the gain G, drives the all-pole LPC filter, realized as a chain of unit delays z^{-1} with coefficients \alpha_1, ..., \alpha_p, whose output is the synthesized speech \hat{s}(n).

Fig 3.6 Linear predictive synthesizer

The time-varying parameters needed by the synthesizer are the pitch period, the voiced/unvoiced information, the gain and the linear prediction coefficients. The speech signal consists of both voiced and unvoiced portions. In the LPC synthesizer, an impulse generator produces a train of unit-amplitude impulses, one at the beginning of each pitch period in a voiced segment, and this impulse train is used as the excitation signal for voiced speech, while a random noise generator produces uncorrelated, uniformly distributed random samples with zero mean and unit standard deviation, which are used as the excitation signal for unvoiced sounds. The selection between the voiced and unvoiced sources is made by the voiced/unvoiced switch, and the gain control G determines the amplitude of the excitation signal. The synthetic speech samples are computed from equation (3.31)

    \hat{s}(n) = \sum_{j=1}^{p} \alpha_j\, \hat{s}(n-j) + G\, u(n)    (3.31)

In Fig 3.6 the LPC synthesis filter is excited by an impulse train or by random noise according to the voiced/unvoiced decision. The interval between the pulses of the impulse train equals the pitch period. The gain G represents the loudness and is therefore multiplied with the excitation signal to give it the proper intensity. The filter network of Fig 3.6 is a direct-form filter, which gives a simple and straightforward method of obtaining synthetic speech from the prediction parameters. A total of p multiplications

and p additions are required to generate one output sample, where p is the order of the filter. In this model the synthesis parameters vary with time: during voiced portions of speech they are estimated at regular intervals and updated at the start of each pitch period, while for unvoiced speech they are simply updated once per frame. Updating the parameters at the beginning of each pitch period is called pitch-synchronous synthesis and is found to be more effective than asynchronous synthesis, in which the parameters are updated once per frame. The quality of the reconstructed speech depends on the accuracy of the extracted parameters. The main advantage of LPC is its simplicity and ease of implementation. Its main drawback is that it requires significant computational precision, because the filter is a direct-form recursive structure which tends to be quite sensitive to changes in its coefficients; the extracted speech parameters must therefore be highly accurate. The LPC coefficients shown in Fig 3.6 are the all-pole filter coefficients used in modeling the speech signal. In practice it is not possible to model the speech signal exactly from delayed past sample values, so there is a discrepancy, or error, in the reconstructed speech signal. The goal of LPC analysis is therefore to find the set of LPC coefficients that minimizes the mean square error E[e^2(n)]. When this is achieved, the spectrum of the error signal becomes flat, i.e., it does not change with frequency. There are only two

types of time signals that have a flat spectrum: an impulse train and random noise (i.e., a signal generated from random numbers). For this reason, in LPC synthesis the filter is excited by an impulse train for voiced sounds and by random noise for unvoiced sounds. The linear prediction coefficients used in the synthesis filter represent the combined spectral contribution of the vocal tract, the glottal flow and the radiation at the lips. The waveform of a typical input speech signal is shown in Fig 3.7.

Fig 3.7 Input speech signal

In the speech waveform plots, the X-axis is calibrated in time (milliseconds) and the Y-axis in amplitude (dB); the amplitude of a speech signal is expressed on a decibel scale because it correlates best with perceived loudness.
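
Before looking at the reconstructed waveforms, the per-frame synthesis of equation (3.31) can be sketched as follows; this is an illustrative outline assuming scipy.signal.lfilter for the all-pole filter 1/A(z), and the coefficients in the example are invented for demonstration rather than taken from real speech:

    import numpy as np
    from scipy.signal import lfilter

    def synthesize_frame(alpha, gain, voiced, pitch_period, frame_len=160, state=None):
        """One frame of equation (3.31): s_hat(n) = sum_j alpha_j s_hat(n-j) + G u(n).

        alpha        : LPC coefficients alpha_1..alpha_p of the frame
        gain         : excitation gain G
        voiced       : True -> impulse-train excitation, False -> noise excitation
        pitch_period : pitch period T0 in samples (used only when voiced)
        state        : filter memory carried over from the previous frame
        """
        if voiced:
            u = np.zeros(frame_len)
            u[::pitch_period] = 1.0              # one unit impulse per pitch period
        else:
            # zero-mean, unit-standard-deviation uniform noise
            u = np.random.uniform(-np.sqrt(3), np.sqrt(3), frame_len)

        a = np.concatenate(([1.0], -alpha))      # A(z) = 1 - sum_j alpha_j z^-j
        if state is None:
            state = np.zeros(len(alpha))
        s_hat, state = lfilter([gain], a, u, zi=state)
        return s_hat, state

    # Example: one voiced frame from a hypothetical (stable) 10th-order model
    alpha = np.array([0.8, -0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    frame, mem = synthesize_frame(alpha, gain=0.1, voiced=True, pitch_period=64)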

Fig 3.8 shows the speech signal reconstructed from the speech parameters: pitch, gain and linear prediction coefficients. The speech signal contains 166 frames of 160 samples each. The number of linear prediction coefficients used in the reconstruction is 11 per frame, since the order is 10. Pitch and gain are calculated frame by frame and used accordingly in the reconstruction.

Fig 3.8 Speech signal reconstructed using speech parameters

From Figs 3.7 and 3.8 it is observed that the reconstructed speech signal does not have the same shape as the input speech signal, although it sounds like a synthetic version of the input. This is because parametric estimation of the speech signal does not lead to accurate reconstruction of the speech waveform. The encoded speech signal, called the residue, is shown in Fig 3.9 and the speech signal reconstructed using the residue is shown in Fig

3.10. From Fig 3.10, the speech signal reconstructed using the residue has the same shape as the input speech signal and sounds the same as the input. This is because waveform approximation methods give essentially perfect reconstruction of the speech signal without any loss in quality.

Fig 3.9 Residue (encoded) speech signal

Fig 3.10 Speech signal reconstructed using residue
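
For completeness, a minimal sketch of how the residue could be obtained by inverse filtering the speech with A(z) and then used for exact reconstruction through 1/A(z) is given below; it assumes scipy.signal.lfilter, and the coefficients and frame in the example are illustrative only:

    import numpy as np
    from scipy.signal import lfilter

    def lpc_residual(frame, alpha):
        """Inverse-filter a frame with A(z) = 1 - sum_j alpha_j z^-j to get the residue e(n)."""
        a = np.concatenate(([1.0], -alpha))
        return lfilter(a, [1.0], frame)          # e(n) = s(n) - sum_j alpha_j s(n-j)

    def reconstruct_from_residual(residual, alpha):
        """Pass the residue back through the all-pole filter 1/A(z) to recover the frame."""
        a = np.concatenate(([1.0], -alpha))
        return lfilter([1.0], a, residual)

    # Round trip on a toy frame: waveform reconstruction from the residue is exact
    alpha = np.array([0.8, -0.2])
    frame = np.random.randn(160)
    res = lpc_residual(frame, alpha)
    rec = reconstruct_from_residual(res, alpha)
    print(np.max(np.abs(rec - frame)))           # ~0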

Comparing Figs 3.8 and 3.10, the waveform in Fig 3.10 looks the same as the input speech signal, with no loss in quality of the reconstructed speech, whereas in Fig 3.8 there is a loss of quality and the waveform looks different from the input. It can therefore be concluded that with parametric methods (speech reconstructed from the speech parameters) the bit-rate can be reduced greatly, but at the cost of some loss in speech quality, whereas with waveform approximation methods (speech reconstructed from the residue) the bit-rate cannot be reduced as much, but the quality of the reconstructed speech remains the same as the input. To achieve low bit-rates one therefore uses parametric methods; if quality must be retained, one uses waveform approximation methods.