Signal Processing for Speech Recognition

Once a signal has been sampled, we have huge amounts of data, often 20,000 numbers a second! We need to find ways to concisely capture the properties of the signal that are important for speech recognition before we can do much else. We have seen that a spectral representation of the signal, as seen in a spectrogram, contains much of the information we need. We can obtain the spectral information from a segment of the speech signal using an algorithm called the Fast Fourier Transform. But even a spectrogram is far too complex a representation to base a speech recognizer on. This section describes some methods for characterizing the spectra in more concise terms.

1. Filter Bank Methods

One way to characterize the signal more concisely is with a filter bank. We divide the frequency range of interest (say 100 Hz to 8000 Hz) into N bands and measure the overall intensity in each band. This could be done using hardware or digital filters directly on the incoming signal, or be computed from a spectral analysis (again derived using hardware, or software such as the Fast Fourier Transform). In a uniform filter bank, each frequency band is of equal size. For instance, if we used 8 ranges, the bands might cover the frequency ranges 100Hz-1000Hz, 1000Hz-2000Hz, 2000Hz-3000Hz, ..., 7000Hz-8000Hz. Consider a uniform filter bank representation of an artificially generated spectrum similar to that for the vowel IY, shown in Figure 1. We can measure the intensity in each band by computing the area under the curve. With a discrete set of sample points, we could simply add up all the values in the range, or compute a power measure by summing the squares of the values. With the signal shown in Figure 1, we would get a representation of the spectrum that consists of a vector of

Figure 1: Uniform Filter Bank on Spectra for Vowel IY
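The band-intensity computation can be sketched as follows. This is a minimal illustration using NumPy; the function name and the exact band edges are my own choices, not from the text.

```python
import numpy as np

def uniform_band_energies(signal, sample_rate, low=100.0, high=8000.0, n_bands=8):
    """Split [low, high] Hz into n_bands equal ranges and sum the
    power spectrum (squared FFT magnitude) within each range."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2                # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)  # bin frequencies in Hz
    edges = np.linspace(low, high, n_bands + 1)                # uniform band edges
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])
```

For a pure 500 Hz tone, almost all of the energy lands in the lowest band, as we would expect.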

eight numbers, in this case (1.26, .27, 2.61, .62, .05, .04, .03, .02). How would we know whether this is a good representation? We'd need to compare it to representations of other vowels and see whether the vector reflects differences in the vowels. If we do this, we'll see there are some problems with a uniform filter bank. Specifically, we know that much key information in vowels centers on the formants, which should show up as intensity peaks in various bands of the filter bank. But if we look at the frequency ranges of typical vowels in Table 1, we see a problem. The first formant of all vowels will always be in range 1, so the frequency differences in the F1 of the vowels will not be reflected in the representation. The second formant will almost always be in range 2, and the third formant is typically in range 3. Thus we get little information to help distinguish the vowels. If we classify each vowel by three numbers indicating the filter banks that its formants fall into using the uniform bank, this encoding separates the vowels into only four classes, with most vowels falling into one class (all with their formants falling in banks 1, 2 and 3 respectively). A better alternative is to organize the ranges using a logarithmic scale, which also agrees better with human perceptual capabilities. We set the first range to have width W, and subsequent widths are a^n * W. If a = 2 and W is 200 Hz, we get widths of 200 Hz, 400 Hz, 800 Hz, 1600 Hz, and so on. Our frequency ranges would now be 100Hz-300Hz, 300Hz-700Hz, 700Hz-1500Hz, 1500Hz-3100Hz, 3100Hz-6300Hz. With this set of banks, the F1 of vowels varies over three ranges, but F2 falls in only two ranges, and F3 in one. If we lower the value of a (say to 1.5), we get ranges of widths 200, 300, 450, 675, 1012, and 1518 Hz, with frequency bands starting at 100 Hz, 300 Hz, 600 Hz, 1050 Hz, 1725 Hz, 2737 Hz, and 4255 Hz.
With this, each of the formants falls across three frequency ranges, giving us good discrimination.

ARPABET  Typical Word   F1    F2    F3    filter #s
IY       beet           270   2290  3010  1, 3, 4
IH       bit            390   1990  2550  1, 2, 3
EH       bet            530   1840  2480  1, 2, 3
AE       bat            660   1720  2410  1, 2, 3
AH       but            520   1190  2390  1, 2, 3
AA       hot            730   1090  2440  1, 2, 3
AO       bought         570   840   2410  1, 1, 3
UH       foot           440   1020  2240  1, 2, 3
UW       boot           300   870   2240  1, 1, 3
ER       bird           490   1350  1690  1, 2, 2
Table 1: Typical formants for vowels and where they fall in a uniform filter bank
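The geometric band layouts above (first band of width W, each later band a times wider) can be generated directly. This is a small sketch; the function name is my own.

```python
def log_band_edges(start, width, a, n_bands):
    """Band edges where the first band is `width` Hz wide and each
    subsequent band is `a` times wider than the previous one."""
    edges = [start]
    w = width
    for _ in range(n_bands):
        edges.append(edges[-1] + w)  # close the current band
        w *= a                       # next band is a times wider
    return edges
```

With start=100, width=200, and a=2 this reproduces the 100, 300, 700, 1500, 3100, 6300 Hz layout from the text; with a=1.5 it reproduces the second layout (up to rounding).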

The third method is to design a non-uniform set of frequency bands that has no simple mathematical characterization but better reflects the responses of the ear as determined from experimentation. One very common design is based on perceptual studies that define critical bands in the spectra. A commonly used critical band scale is called the Mel scale, which is essentially linear up to 1000 Hz and logarithmic after that. For instance, we might start the ranges at 200 Hz, 400 Hz, 630 Hz, 920 Hz, 1270 Hz, 1720 Hz, 2320 Hz, and 3200 Hz. The frequency bands now look as shown in Figure 2. The Mel scale typically gives good discrimination based on expected formant intensity peaks.

Figure 2: A set of Mel scale frequency ranges

2. Designing a Good Window Function

If we simply take the samples as they are in a segment, then when we apply a spectral analysis technique like the Fast Fourier Transform, it acts as if it is operating on a signal that is zero before the segment starts, abruptly jumps to the signal values during the segment, and then drops back to zero when the segment ends. When the signal is 0 outside of the window and then jumps to the actual values within the window, this introduces significant distortion of the signal, making it appear as though there is significant high-frequency noise at the beginning and end points of the window. The typical method to alleviate this problem is to not count all values in a window equally. We apply a function to the samples in the window so that the samples near the beginning and end of the window taper down to zero. More specifically, if the original signal intensity is s(i) at time i, we can represent the windowed signal as

s'(i) = s(i) * w(i)
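The Mel scale's linear-then-logarithmic shape is often captured with an analytic formula. The particular formula below, mel = 2595 * log10(1 + f/700), is one common convention, not taken from the text; the text's band edges are a hand-picked approximation of the same idea.

```python
import math

def hz_to_mel(f):
    # Roughly linear below 1 kHz, logarithmic above, matching
    # the perceptual behavior described in the text.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse mapping, useful for placing band edges evenly on the Mel scale.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Note that equal steps in Mel correspond to ever-wider steps in Hz above 1 kHz, which is exactly why Mel-spaced filter banks resolve the formant region well.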

where w(i) is the window function. In our segments above, w(i) was a square function that jumped to 1 inside the segment and was 0 outside. Figure 4 shows the signal for a simple sinusoidal curve after applying such a window function.

Figure 4: The effect of windowing with a square window function

A common window function that works well is called the Hamming window and is defined by the formula

w(i) = 0.54 - 0.46 * cos(2*pi*i/(n-1)), for i = 0, ..., n-1

where n is the number of samples in the window. This function is shown in Figure 5.

Figure 5: The Hamming Window Function

Applying this to our sinusoidal wave above, it now looks as shown in Figure 6. Compare this to Figure 4 and note that the new signal is considerably smoother at the ends. This produces a much superior spectral analysis.

Figure 6: Applying the Hamming Window to the Sinusoidal Function
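A sketch of the Hamming window and its application to a segment, in plain Python (the function names are mine):

```python
import math

def hamming(n):
    # Hamming window: w(i) = 0.54 - 0.46*cos(2*pi*i/(n-1)), i = 0..n-1.
    # Endpoints are 0.08 (not quite zero), with a smooth peak of 1 in the middle.
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def apply_window(samples):
    # Elementwise product s'(i) = s(i) * w(i), tapering the segment's ends.
    w = hamming(len(samples))
    return [s * wi for s, wi in zip(samples, w)]
```

Because the taper suppresses the abrupt jumps at the segment boundaries, the spurious high-frequency energy described above largely disappears from the FFT.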

3. LPC Analysis

Another method for encoding a speech signal is called Linear Predictive Coding (LPC). LPC is a popular technique because it provides a good model of the speech signal and is considerably more efficient to implement than the digital filter bank approach. With ever faster computers, however, this is becoming less of an issue. The basic idea of LPC is to represent the value of the signal over some window at time t, s(t), in terms of an equation of the past p samples, i.e.,

s(t) = a1*s(t-1) + a2*s(t-2) + ... + ap*s(t-p)

Of course, we usually can't find a set of ai's that gives an exact answer for every sample in the window, so we must settle for the best approximation, s'(t), that minimizes the error. We typically measure the error by the least-squares difference over a window of w samples, i.e.,

S t=1,w (s(t) - s'(t))^2

(where S denotes summation). If the signal is periodic and hence repeats in a pattern over the window, we can get very good estimates. For instance, consider a 20 Hz square wave over a 100 ms window sampled at 200 Hz. There are then 20 samples in a window and the wave contains two cycles (one every 50 ms). The signal s(t) has the values s(0) = 0, s(1) = 1, s(2) = 1, s(3) = 1, etc., producing the sequence of values

1, 1, 1, 1, 0, -1, -1, -1, -1, 0, 1, 1, 1, 1, 0, -1, -1, -1, -1, 0

Consider the errors obtained from some simple displacement approximations Di(t) = s(t - i), i.e., ai = 1 and all other aj = 0. The error for the approximation D(1) would be

S t=1,20 (s(t) - s(t-1))^2 = (1-0)^2 + (1-1)^2 + (1-1)^2 + (1-1)^2 + (0-1)^2 + (-1-0)^2 + ...

Most of these terms are zero (a consequence of the flat regions of square waves), but 8 of them are not and contribute an error of 1 each. Thus the error for this approximation is 8. Notice that we either need to start i samples into the window or we will need some values of the signal before the window. For the moment, we allow ourselves to peek at the signal before the window.
With a displacement of two, we get an even larger error from terms such as the one at position 6 with value (-1-1)^2 = 4. Column 2 of Table 2 below shows the error for displacements 1 through 10.
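The displacement errors for this square-wave example can be computed directly. This is a small sketch; the periodic extension before the window corresponds to the "peek" allowed in the text.

```python
# Square wave from the text: 20 Hz sampled at 200 Hz, 20-sample window.
pattern = [1, 1, 1, 1, 0, -1, -1, -1, -1, 0]    # one 10-sample cycle

def s(t):
    # The signal is periodic, so values before the window repeat the pattern.
    return pattern[(t - 1) % 10]

def displacement_error(i, window=20):
    # Error for the approximation s(t) ~ s(t - i), summed over the window.
    return sum((s(t) - s(t - i)) ** 2 for t in range(1, window + 1))

errors = [displacement_error(i) for i in range(1, 11)]
# errors == [8, 24, 40, 56, 64, 56, 40, 24, 8, 0]
```

The error falls to exactly 0 at a displacement of 10, the period of the wave at this sampling rate.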

Displacement   Error
1              8
2              24
3              40
4              56
5              64
6              56
7              40
8              24
9              8
10             0
Table 2: Errors for different displacements

Notice that we get an exact representation of the signal with D10, since s(t) = s(t - 10) (because it is a 20 Hz signal and thus takes 1/20 of a second, or 10 samples at a 200 Hz sampling rate, to repeat). Of course, when using LPC methods, we look for the best approximation with no restriction to simple displacement functions. Typically, all coefficients ai will be non-zero. Each coefficient indicates the contributions of frequencies that repeat every i samples. We haven't yet discussed what value of p is reasonable. Given the relationship between the displacement and the frequencies it detects, we should pick a p large enough to capture the lowest frequencies we care about. If this is 200 Hz, then each cycle takes 1/200 of a second, or 5 ms. If we sample at 10 kHz, then 1/200 of a second involves 50 samples, so p should be at least 50. In practice, systems using LPC use far fewer parameters, with typical values between 8 and 16.

4. Building Effective Vector Representations of Speech

Whether we use the filter bank approach or the LPC approach, we end up with a small set of numbers that characterize the signal. For instance, if we used the Mel scale dividing the spectrum into 7 frequency ranges, we have reduced the representation of the signal over a 20 ms segment to a vector consisting of seven numbers. With a 10 ms shift between segments, we are representing the signal by one of these vectors every 10 ms. This is certainly a dramatic reduction in the space needed to represent the signal. Rather than 10,000 or 20,000 numbers per second, we now represent the signal by 700 numbers a second! Just using the seven spectral measures, however, is not sufficient for large-vocabulary speech recognition tasks. Additional measurements are often taken that capture aspects of the signal not adequately represented in the spectrum.
Here are a few additional measurements that are often used:

Power: a measure of the overall intensity. If the segment S_k contains N samples of the signal, s(0), ..., s(N-1), then the power power(S_k) is computed using

power(S_k) = S i=0,N-1 s(i)^2

An alternative that doesn't create such a wide difference between loud and soft sounds

uses the absolute value: S i=0,N-1 |s(i)|. One problem with direct power measurements is that the representation is very sensitive to how loudly the speaker is speaking. To adjust for this, the power can be normalized by an estimate of the maximum power. For instance, if P is the maximum power within the last 2 seconds, the normalized power of the new segment would be power(S_k)/P. The power is an excellent indicator of the voiced/unvoiced distinction and, if the signal is especially noise-free, can be used to separate silence from low-intensity speech such as unvoiced fricatives.

Power Difference: The spectral representation captures the static aspects of a signal over the segment, but we have seen that there is much information in the transitions in speech. One way to capture some of this is to add a measure to each segment that reflects the change in power surrounding it. For instance, we could set PowerDiff(S_k) = power(S_k+1) - power(S_k-1). Such a measure would be very useful for detecting stops.

Spectral Shifts: Besides shifts in overall intensity, we saw that frequency shifts in the formants can be quite distinctive, especially with diphthongs and in looking at the effects of consonants next to vowels. We can capture some of this information by looking at the difference in the spectral measures in each frequency band. For instance, if we have eight frequency intensity measures for segment S_k, f_k(1), ..., f_k(8), then we can define the spectral change for each segment as with the power difference, i.e.,

df_k(i) = f_k+1(i) - f_k-1(i)

With all these measurements, we would end up with an 18-number vector: the eight spectral band measures, eight spectral band differences, the overall power, and the power difference. This is a reasonable approximation of the types of representations used in current state-of-the-art speech recognition systems.
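The power measures just described can be sketched as follows (function names are mine):

```python
def power(seg):
    # power(S_k) = sum of squared sample values over the segment
    return sum(x * x for x in seg)

def power_diff(segments, k):
    # PowerDiff(S_k) = power(S_{k+1}) - power(S_{k-1}):
    # the change in power surrounding segment k.
    return power(segments[k + 1]) - power(segments[k - 1])
```

A sharp positive power_diff marks a burst of energy such as a stop release; normalizing power by a running maximum, as the text suggests, makes the measure robust to overall loudness.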
Some systems add another set of values that represent the acceleration, which would be computed by calculating the differences between the df_k values. Note that we do not need to explicitly store the delta parameters, since they can be computed quickly when needed. For instance, consider the following sequence of vectors:

     Power m1  m2  m3  m4  m5  m6  m7  m8
v1   (5,   0,  1,  0,  1,  0,  1,  0,  1)
v2   (3,   0,  0,  1,  0,  0,  1,  1,  1)
v3   (6,   2,  1,  1,  0,  1,  2,  1,  2)
v4   (30,  10, 10, 3,  1,  4,  6,  6,  6)
v5   (50,  15, 15, 5,  3,  8,  12, 12, 13)
v6   (52,  16, 15, 4,  3,  9,  13, 11, 13)
v7   (48,  15, 15, 6,  3,  9,  9,  11, 10)

We can expand these out on the fly to extend the current vector with the delta coefficients:

Dv2  (1,   2,  0,  1,  -1, 1,  1,  1,  1)
Dv3  (27,  10, 10, 2,  1,  4,  5,  5,  5)
Dv4  (44,  13, 14, 4,  3,  7,  10, 11, 11)
Dv5  (22,  6,  5,  1,  2,  5,  7,  5,  7)
Dv6  (-2,  0,  0,  1,  0,  1,  -3, -1, -3)
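Computing the delta vectors on the fly, Dv_k = v_k+1 - v_k-1 elementwise, can be sketched as:

```python
def delta(frames, k):
    # Elementwise difference between the following and preceding frames.
    # frames is 0-indexed, so frames[k] holds the vector v_{k+1} from the text.
    return [a - b for a, b in zip(frames[k + 1], frames[k - 1])]

frames = [
    (5, 0, 1, 0, 1, 0, 1, 0, 1),         # v1
    (3, 0, 0, 1, 0, 0, 1, 1, 1),         # v2
    (6, 2, 1, 1, 0, 1, 2, 1, 2),         # v3
    (30, 10, 10, 3, 1, 4, 6, 6, 6),      # v4
    (50, 15, 15, 5, 3, 8, 12, 12, 13),   # v5
    (52, 16, 15, 4, 3, 9, 13, 11, 13),   # v6
    (48, 15, 15, 6, 3, 9, 9, 11, 10),    # v7
]
```

For example, delta(frames, 1) computes Dv2 as v3 - v1, and delta(frames, 3) computes Dv4 as v5 - v3.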

This example shows what we would expect to see in a voiced stop: a rapid increase in energy, peaking at Dv4, which then levels out.

5. Improving Vector Representations in Speech Recognition

If the vector representation is to be useful for speech recognition, then we'll need to define a distance metric that indicates how similar two vectors sound to each other. Note that there are many possible distance metrics, and only some will correspond to perceptual differences in the signal. This section explores some of these issues. A standard distance metric is a function d(v_i, v_j) with the following properties:

The distance between two identical points is zero (i.e., d(v_i, v_i) = 0)
The distance between two different points is greater than 0 (i.e., d(v_i, v_j) > 0 for all i different from j)
Distances are symmetric (i.e., d(v_i, v_j) = d(v_j, v_i), for all i and j)
Distances satisfy the triangle inequality (i.e., d(v_i, v_k) <= d(v_i, v_j) + d(v_j, v_k))

A very common distance metric is the Euclidean distance, which is defined as

d(v_i, v_j) = SQRT(S x=1,n (v_i(x) - v_j(x))^2)

where SQRT is the square root function. Another common distance measure is the city block measure, where distance is measured in terms of straight-line distances along the axes. In two dimensions, this means you can only move vertically or horizontally. The metric is

d(v_i, v_j) = S x=1,n |v_i(x) - v_j(x)|

(i.e., the sum of the absolute values of the differences in each dimension). To enable accurate speech recognition, we would like to have a vector representation and a distance measure that reflect perceptual differences found in humans. The representation used above falls short in a number of critical areas and would classify spectra that are perceptually quite similar to humans as quite different. For example, it makes little perceptual difference to a human whether a person is speaking slightly more loudly or softly.
A person may be able to notice the difference, but it is not relevant to the speech sounds perceived. Yet our current vector distance measure is quite sensitive to intensity changes. For the sake of keeping the examples simple, let's consider a representation consisting of a 5-element vector indicating the overall magnitude and 4 Mel scale frequency ranges: (30, 22, 10, 13, 5). If this same sound were uttered with twice the intensity (say the microphone was turned up slightly), we might get measures that are essentially double, i.e., (60, 44, 20, 26, 10). A Euclidean distance measure between these gives the value 40.96, the same as the distance between the original vector and absolute silence (i.e., (0, 0, 0, 0, 0)). Clearly this is not a good measure!
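Both metrics, and the intensity-sensitivity problem just described, are easy to check numerically (a small sketch; the variable names are mine):

```python
import math

def euclidean(u, v):
    # Euclidean distance: square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def city_block(u, v):
    # City block (Manhattan) distance: sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(u, v))

quiet = (30, 22, 10, 13, 5)
loud = (60, 44, 20, 26, 10)    # the same sound at double intensity
silence = (0, 0, 0, 0, 0)
```

Here euclidean(quiet, loud) and euclidean(quiet, silence) both come out to about 40.96, which is exactly the problem the text identifies: doubling the volume looks as different as total silence.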

There are several things that can be done to get around this problem. First, we might observe that human perceptual response to signals is more closely related to a logarithmic scale than a linear scale. This suggests that we use a log filter bank model, which uses a logarithmic scale for the values. One method would be to simply take the log of all values when we construct the vectors. For the above example, our first vector would now be (1.48, 1.34, 1, 1.11, .7) and the vector at double the intensity would be (1.78, 1.68, 1.3, 1.43, 1). Now the Euclidean distance between the two vectors is .67, compared to the distance of 2.59 between the first vector and absolute silence. So this is a good step in the right direction. Another method for compensating for differences in intensity is to normalize all intensities relative to the mean (or maximum) of the intensities found in the signal, as we did with power in the last section. As before, we can estimate the mean by finding the mean of the intensity measures over some previous duration of speech, say 2 seconds. If the mean vector at time i is F_m, and the filter bank values at i are the vector F_i, then our vector representation would be F_i - F_m, i.e., (F_i(1) - F_m(1), ..., F_i(k) - F_m(k)). This introduces considerable robustness against differences caused by different speakers, different microphones, and the normal intensity variations within the speech of a single person over time. Perceptual studies have found other distortions of the signal that appear to have little perceptual relevance to recognizing what was said, including:

Eliminating all the frequencies below the first formant
Eliminating all frequencies above the third formant
Notch filtering: removing any narrow range of frequencies

On the other hand, other apparently small changes in the spectra can make large perceptual differences. Changing a formant frequency by as little as 5% can change the perceived phoneme.
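The log filter bank step can be checked directly. This is a sketch; log base 10 is assumed, since it reproduces the numbers in the example above.

```python
import math

def log_compress(v):
    # Log filter bank: replace each intensity with its base-10 logarithm
    # (assumes strictly positive intensities).
    return [math.log10(x) for x in v]
```

Applied to (30, 22, 10, 13, 5), this gives roughly (1.48, 1.34, 1, 1.11, .7), and the Euclidean distance to the doubled-intensity vector shrinks to about .67, since doubling every intensity just adds a constant log10(2) to every component.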
Clearly such a difference is not going to create much distinction in our filter bank representation. On the other hand, the three non-significant changes above could make significant changes to the values in our representation. There are several ways to compensate for this problem. One is to use a weighted Euclidean distance measure (a commonly used version is the Mahalanobis distance), in which changes in some bands count much more than changes in others. For example, with our 8-band Mel scale filter bank, we might give the most weight to the bands likely to contain the first formant, and downplay the weight of the band above 3200 Hz. Thus we define the distance between vectors v_i and v_x as

d(v_i, v_x) = SQRT(S k w_k * (v_i(k) - v_x(k))^2)
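A sketch of the weighted distance (the weights passed in are illustrative placeholders, not values from the text):

```python
import math

def weighted_euclidean(u, v, w):
    # d = sqrt( sum_k w_k * (u_k - v_k)^2 ): bands with larger w_k
    # contribute more to the distance, bands with w_k = 0 are ignored.
    return math.sqrt(sum(wk * (a - b) ** 2 for a, b, wk in zip(u, v, w)))
```

With all weights equal to 1 this reduces to the plain Euclidean distance; setting a band's weight to 0 makes the metric blind to changes in that band, which is exactly the behavior we want for perceptually irrelevant regions.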

Of course, pulling the weights w_k out of a hat is unlikely to produce the best performance; a good set of weights would be derived empirically by testing how different weights affect the recognition rate. Another technique that has proven effective in practice is to compute a different set of vectors based on what are called the Mel Frequency Cepstral Coefficients (MFCC). These coefficients provide a different characterization of the spectra than filter banks and work better in practice. To compute them, we start with a log filter bank representation of the spectrum. Since we are using the banks as an intermediate representation, we can use a larger number of banks to get a better representation of the spectrum. For instance, we might use a Mel scale over 14 banks (ranges starting at 200, 260, 353, 493, 698, 1380, 1880, 2487, 3192, 3976, 4823, 5717, 6644, and 7595 Hz). The MFCC are then computed using the following formula:

c_i = S j=1,14 m_j * cos(pi * i * (j - 0.5)/14), for i = 0, ..., N-1

where N is the desired number of coefficients and m_j is the j-th log filter bank value. What this is doing is computing a weighted sum over the filter banks based on a cosine curve. The first coefficient, c_0, is simply the sum of all the filter banks, since i = 0 makes the argument to the cosine function 0 throughout, and cos(0) = 1. In essence it's an estimate of the overall intensity of the spectrum, weighting all frequencies equally. The coefficient c_1 uses a weighting that is one half of a cosine cycle, so it computes a value that compares the low frequencies to the high frequencies. These first two weighting functions are shown in Figure 7.

Figure 7: The weighting given to the spectrum for c_0 and c_1

Expanding out the equation for c_1, we get

c_1 = 0.99*m_1 + 0.94*m_2 + 0.85*m_3 + 0.71*m_4 + 0.53*m_5 + 0.33*m_6 + 0.11*m_7 - 0.11*m_8 - 0.33*m_9 - 0.53*m_10 - 0.71*m_11 - 0.85*m_12 - 0.94*m_13 - 0.99*m_14

The function for c_2 is one cycle of the cosine function, while for c_3 it is one and a half cycles, and so on. The functions for c_2 and c_3 are shown in Figure 8.
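The MFCC formula can be sketched as follows, in plain Python (the function name is mine; m is the list of log filter bank values):

```python
import math

def mfcc(m, n_coeffs):
    # c_i = sum_{j=1..J} m_j * cos(pi * i * (j - 0.5) / J), i = 0..n_coeffs-1,
    # where J is the number of log filter bank values.
    J = len(m)
    return [sum(mj * math.cos(math.pi * i * (j - 0.5) / J)
                for j, mj in enumerate(m, start=1))
            for i in range(n_coeffs)]
```

As the text notes, c_0 is the plain sum of the banks (cos(0) = 1 throughout), and for a perfectly flat spectrum every higher coefficient is zero, since the positive and negative cosine weights cancel exactly.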

Figure 8: The weighting given to the spectrum for c_2 and c_3

This technique empirically gives superior performance, and this representation is used in many state-of-the-art recognition systems. In other words, an 8-element vector based on the MFCC will generally produce better recognition than an 8-element vector based on Mel scale filter banks plus the overall power. We don't need a separate overall power measure in the MFCC, since the power is estimated well by the c_0 coefficient. When using the MFCC we also need delta coefficients. These can be computed in the same way, by subtracting the values of the previous frame from the values of the following frame. Combining the MFCC and its delta coefficients, we end up with a 16-element vector representing each frame of the signal, which is quite effective in practice.


Fourier Analysis Evan Sheridan, Chris Kervick, Tom Power 11367741 Novemeber 19 01 Abstract Various properties of the Fourier Transform are investigated using the Cassy Lab software, a microphone, electrical

2-1 Position, Displacement, and Distance

2-1 Position, Displacement, and Distance In describing an object s motion, we should first talk about position where is the object? A position is a vector because it has both a magnitude and a direction:

USE OF SYNCHRONOUS NOISE AS TEST SIGNAL FOR EVALUATION OF STRONG MOTION DATA PROCESSING SCHEMES

USE OF SYNCHRONOUS NOISE AS TEST SIGNAL FOR EVALUATION OF STRONG MOTION DATA PROCESSING SCHEMES Ashok KUMAR, Susanta BASU And Brijesh CHANDRA 3 SUMMARY A synchronous noise is a broad band time series having

PERSONAL COMPUTER SOFTWARE VOWEL TRAINING AID FOR THE HEARING IMPAIRED

PERSONAL COMPUTER SOFTWARE VOWEL TRAINING AID FOR THE HEARING IMPAIRED A. Matthew Zimmer, Bingjun Dai, Stephen A. Zahorian Department of Electrical and Computer Engineering Old Dominion University Norfolk,

Automatic Evaluation Software for Contact Centre Agents voice Handling Performance

International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 Automatic Evaluation Software for Contact Centre Agents voice Handling Performance K.K.A. Nipuni N. Perera,

On-Track Tire Characterization

On-Track Tire Characterization Introduction Using wheel force transducers can be a useful way to generate tire data. Traditional laboratory tire testing characterizes the tire when running on the surface

2011 Intensity - 1 INTENSITY OF SOUND. To understand the concept of sound intensity and how it is measured.

2011 Intensity - 1 INTENSITY OF SOUND The objectives of this experiment are: To understand the concept of sound intensity and how it is measured. To learn how to operate a Sound Level Meter APPARATUS:

FIR Filter Design. FIR Filters and the z-domain. The z-domain model of a general FIR filter is shown in Figure 1. Figure 1

FIR Filters and the -Domain FIR Filter Design The -domain model of a general FIR filter is shown in Figure. Figure Each - box indicates a further delay of one sampling period. For example, the input to

Chapter 4: Problem Solutions

Chapter 4: Problem s Digital Filters Problems on Non Ideal Filters à Problem 4. We want to design a Discrete Time Low Pass Filter for a voice signal. The specifications are: Passband F p 4kHz, with 0.8dB

Order analysis in ArtemiS SUITE 1

08/16 in ArtemiS SUITE 1 Introduction 1 Requirement: RPM information 3 Calculating an Order Spectrum vs. RPM with the Variable DFT Size algorithm 4 Basics 4 Configuration possibilities in the Properties

Electronic Communications Committee (ECC) within the European Conference of Postal and Telecommunications Administrations (CEPT)

Page 1 Electronic Communications Committee (ECC) within the European Conference of Postal and Telecommunications Administrations (CEPT) ECC RECOMMENDATION (06)01 Bandwidth measurements using FFT techniques

Identification of Exploitation Conditions of the Automobile Tire while Car Driving by Means of Hidden Markov Models

Identification of Exploitation Conditions of the Automobile Tire while Car Driving by Means of Hidden Markov Models Denis Tananaev, Galina Shagrova, Victor Kozhevnikov North-Caucasus Federal University,

Crash course in acoustics

Crash course in acoustics 1 Outline What is sound and how does it travel? Physical characteristics of simple sound waves Amplitude, period, frequency Complex periodic and aperiodic waves Source-filter

Speech Authentication based on Audio Watermarking

S.Saraswathi S.Saraswathi Assistant Professor, Department of Information Technology, Pondicherry Engineering College, Pondicherry, India swathimuk@yahoo.com Abstract With the rapid advancement of digital

Moving Average Filters

CHAPTER 15 Moving Average Filters The moving average is the most common filter in DSP, mainly because it is the easiest digital filter to understand and use. In spite of its simplicity, the moving average

SPEAKER ADAPTATION FOR HMM-BASED SPEECH SYNTHESIS SYSTEM USING MLLR

SPEAKER ADAPTATION FOR HMM-BASED SPEECH SYNTHESIS SYSTEM USING MLLR Masatsune Tamura y, Takashi Masuko y, Keiichi Tokuda yy, and Takao Kobayashi y ytokyo Institute of Technology, Yokohama, 6-8 JAPAN yynagoya

Music instrument categorization using multilayer perceptron network Ivana Andjelkovic PHY 171, Winter 2011

Music instrument categorization using multilayer perceptron network Ivana Andjelkovic PHY 171, Winter 2011 Abstract Audio content description is one of the key components to multimedia search, classification

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS 1. Bandwidth: The bandwidth of a communication link, or in general any system, was loosely defined as the width of

RightMark Audio Analyzer

RightMark Audio Analyzer Version 2.5 2001 http://audio.rightmark.org Tests description 1 Contents FREQUENCY RESPONSE TEST... 2 NOISE LEVEL TEST... 3 DYNAMIC RANGE TEST... 5 TOTAL HARMONIC DISTORTION TEST...

Trigonometric functions and sound

Trigonometric functions and sound The sounds we hear are caused by vibrations that send pressure waves through the air. Our ears respond to these pressure waves and signal the brain about their amplitude

Kinematics is the study of motion. Generally, this involves describing the position, velocity, and acceleration of an object.

Kinematics Kinematics is the study of motion. Generally, this involves describing the position, velocity, and acceleration of an object. Reference frame In order to describe movement, we need to set a

Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers

Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers Sung-won ark and Jose Trevino Texas A&M University-Kingsville, EE/CS Department, MSC 92, Kingsville, TX 78363 TEL (36) 593-2638, FAX

Advanced 3G and 4G Wireless Communication Prof. Aditya K. Jagannatham Department of Electrical Engineering Indian Institute of Technology, Kanpur

Advanced 3G and 4G Wireless Communication Prof. Aditya K. Jagannatham Department of Electrical Engineering Indian Institute of Technology, Kanpur Lecture - 3 Rayleigh Fading and BER of Wired Communication

Speech in an Hour. CS 188: Artificial Intelligence Spring Other Noisy-Channel Processes. The Speech Recognition Problem

CS 188: Artificial Intelligence Spring 2006 Lecture 19: Speech Recognition 3/23/2006 Speech in an Hour Speech input i an acoutic wave form p ee ch l a b l to a tranition: Dan Klein UC Berkeley Many lide

FEATURE EXTRACTION MEL FREQUENCY CEPSTRAL COEFFICIENTS (MFCC) MUSTAFA YANKAYIŞ

FEATURE EXTRACTION MEL FREQUENCY CEPSTRAL COEFFICIENTS (MFCC) MUSTAFA YANKAYIŞ CONTENTS SPEAKER RECOGNITION SPEECH PRODUCTION FEATURE EXTRACTION FEATURES MFCC PRE-EMPHASIS FRAMING WINDOWING DFT (FFT) MEL-FILTER

Lecture 1-4: Harmonic Analysis/Synthesis

Lecture 1-4: Harmonic Analysis/Synthesis Overview 1. Classification of waveform shapes: we can broadly divide waveforms into the classes periodic (repeat in time) and aperiodic (don't repeat in time).

Wave Hydro Dynamics Prof. V. Sundar Department of Ocean Engineering Indian Institute of Technology, Madras

Wave Hydro Dynamics Prof. V. Sundar Department of Ocean Engineering Indian Institute of Technology, Madras Module No. # 02 Wave Motion and Linear Wave Theory Lecture No. # 03 Wave Motion II We will today

10.3 POWER METHOD FOR APPROXIMATING EIGENVALUES

55 CHAPTER NUMERICAL METHODS. POWER METHOD FOR APPROXIMATING EIGENVALUES In Chapter 7 we saw that the eigenvalues of an n n matrix A are obtained by solving its characteristic equation n c n n c n n...

Curve Fitting. Next: Numerical Differentiation and Integration Up: Numerical Analysis for Chemical Previous: Optimization.

Next: Numerical Differentiation and Integration Up: Numerical Analysis for Chemical Previous: Optimization Subsections Least-Squares Regression Linear Regression General Linear Least-Squares Nonlinear

ADC Familiarization. Dmitry Teytelman and Dan Van Winkle. Student name:

Dmitry Teytelman and Dan Van Winkle Student name: RF and Digital Signal Processing US Particle Accelerator School 15 19 June, 2009 June 16, 2009 Contents 1 Introduction 2 2 Exercises 2 2.1 Measuring 48

chapter Introduction to Digital Signal Processing and Digital Filtering 1.1 Introduction 1.2 Historical Perspective

Introduction to Digital Signal Processing and Digital Filtering chapter 1 Introduction to Digital Signal Processing and Digital Filtering 1.1 Introduction Digital signal processing (DSP) refers to anything

Speaker Change Detection using Support Vector Machines

Speaker Change Detection using Support Vector Machines V. Kartik and D. Srikrishna Satish and C. Chandra Sekhar Speech and Vision Laboratory Department of Computer Science and Engineering Indian Institute

Speech Synthesis by Artificial Neural Networks (AI / Speech processing / Signal processing)

Speech Synthesis by Artificial Neural Networks (AI / Speech processing / Signal processing) Christos P. Yiakoumettis Department of Informatics University of Sussex, UK (Email: c.yiakoumettis@sussex.ac.uk)

ELINT: Pulsed Radar Signals Kyle A. Davidson, M.A.Sc. his lab serves as an introduction to the Agilent 89600 Vector Signal Analysis software, and how it can be used as a tool for analyzing pulsed radar

IB Mathematics SL :: Checklist. Topic 1 Algebra The aim of this topic is to introduce students to some basic algebraic concepts and applications.

Topic 1 Algebra The aim of this topic is to introduce students to some basic algebraic concepts and applications. Check Topic 1 Book Arithmetic sequences and series; Sum of finite arithmetic series; Geometric

RANDOM VIBRATION AN OVERVIEW by Barry Controls, Hopkinton, MA

RANDOM VIBRATION AN OVERVIEW by Barry Controls, Hopkinton, MA ABSTRACT Random vibration is becoming increasingly recognized as the most realistic method of simulating the dynamic environment of military

Available from Deakin Research Online:

This is the authors final peered reviewed (post print) version of the item published as: Adibi,S 2014, A low overhead scaled equalized harmonic-based voice authentication system, Telematics and informatics,

Myanmar Continuous Speech Recognition System Based on DTW and HMM

Myanmar Continuous Speech Recognition System Based on DTW and HMM Ingyin Khaing Department of Information and Technology University of Technology (Yatanarpon Cyber City),near Pyin Oo Lwin, Myanmar Abstract-

Fundamentals Series Analog vs. Digital Polycom, Inc. All rights reserved. Fundamentals Series Signals H.323 Analog vs. Digital SIP Defining Quality Standards Network Communication I Network Communication

T = 1 f. Phase. Measure of relative position in time within a single period of a signal For a periodic signal f(t), phase is fractional part t p

Data Transmission Concepts and terminology Transmission terminology Transmission from transmitter to receiver goes over some transmission medium using electromagnetic waves Guided media. Waves are guided

Markov chains and Markov Random Fields (MRFs)

Markov chains and Markov Random Fields (MRFs) 1 Why Markov Models We discuss Markov models now. This is the simplest statistical model in which we don t assume that all variables are independent; we assume

EE4512 Analog and Digital Communications Chapter 2. Chapter 2 Frequency Domain Analysis

Chapter 2 Frequency Domain Analysis Chapter 2 Frequency Domain Analysis Why Study Frequency Domain Analysis? Pages 6-136 Why frequency domain analysis? Allows simple algebra rather than time-domain differential

Markov Chains, part I

Markov Chains, part I December 8, 2010 1 Introduction A Markov Chain is a sequence of random variables X 0, X 1,, where each X i S, such that P(X i+1 = s i+1 X i = s i, X i 1 = s i 1,, X 0 = s 0 ) = P(X

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.

Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet

(Refer Slide Time: 00:01:23 min)

Computer Aided Design Prof. Anoop Chalwa Department of Mechanical Engineering Indian Institute of Technology, Delhi Lecture No. # 02 Input Output Devices, Raster Graphics Today we will be talking about

Fractals 367. Self-similarity A shape is self-similar when it looks essentially the same from a distance as it does closer up.

Fractals 367 Fractals Fractals are mathematical sets, usually obtained through recursion, that exhibit interesting dimensional properties. We ll explore what that sentence means through the rest of the

Fast Fourier Transforms and Power Spectra in LabVIEW

Application Note 4 Introduction Fast Fourier Transforms and Power Spectra in LabVIEW K. Fahy, E. Pérez Ph.D. The Fourier transform is one of the most powerful signal analysis tools, applicable to a wide

The Taxman Game. Robert K. Moniot September 5, 2003

The Taxman Game Robert K. Moniot September 5, 2003 1 Introduction Want to know how to beat the taxman? Legally, that is? Read on, and we will explore this cute little mathematical game. The taxman game

Motion Graphs. It is said that a picture is worth a thousand words. The same can be said for a graph.

Motion Graphs It is said that a picture is worth a thousand words. The same can be said for a graph. Once you learn to read the graphs of the motion of objects, you can tell at a glance if the object in

Frequency Domain Characterization of Signals. Yao Wang Polytechnic University, Brooklyn, NY11201 http: //eeweb.poly.edu/~yao

Frequency Domain Characterization of Signals Yao Wang Polytechnic University, Brooklyn, NY1121 http: //eeweb.poly.edu/~yao Signal Representation What is a signal Time-domain description Waveform representation

The Optimal Sample Rate for Quality Audio

The Optimal Sample Rate for Quality Audio By Dan Lavry, Lavry Engineering Inc. May 3, 2012 Imagine that you and a friend were standing at the base of a lone hill in an otherwise flat plain, and you start

Purpose of Time Series Analysis. Autocovariance Function. Autocorrelation Function. Part 3: Time Series I

Part 3: Time Series I Purpose of Time Series Analysis (Figure from Panofsky and Brier 1968) Autocorrelation Function Harmonic Analysis Spectrum Analysis Data Window Significance Tests Some major purposes

Graphical Presentation of Data

Graphical Presentation of Data Guidelines for Making Graphs Titles should tell the reader exactly what is graphed Remove stray lines, legends, points, and any other unintended additions by the computer

Speech Signal Analysis Using Windowing Techniques

Speech Signal Analysis Using Windowing Techniques ISSN 2347-3983 P.Suresh Babu 1 Dr.D.Srinivasulu Reddy 2 Dr.P.V.N.Reddy 3 1 Assistant Professor, Dept of ECE, SVCE, Tirupati. Email: sureshbabu476@gmail.com

Sound absorption and acoustic surface impedance

Sound absorption and acoustic surface impedance CHRISTER HEED SD2165 Stockholm October 2008 Marcus Wallenberg Laboratoriet för Ljud- och Vibrationsforskning Sound absorption and acoustic surface impedance