L9: Cepstral analysis




- The cepstrum
- Homomorphic filtering
- The cepstrum and voicing/pitch detection
- Linear prediction cepstral coefficients
- Mel frequency cepstral coefficients

This lecture is based on [Taylor, 2009, ch. 12; Rabiner and Schafer, 2007, ch. 5; Rabiner and Schafer, 1978, ch. 7]

Introduction to Speech Processing, Ricardo Gutierrez-Osuna, CSE@TAMU

Definition: the cepstrum

The cepstrum is defined as the inverse DFT of the log magnitude of the DFT of a signal:

    c[n] = \mathcal{F}^{-1}\{ \log |\mathcal{F}\{x[n]\}| \}

where \mathcal{F} is the DFT and \mathcal{F}^{-1} is the IDFT. For a windowed frame of speech x[n] of length N, the cepstrum is

    c[n] = \frac{1}{N} \sum_{k=0}^{N-1} \log \left| \sum_{m=0}^{N-1} x[m]\, e^{-j\frac{2\pi}{N}km} \right| e^{j\frac{2\pi}{N}kn}

Block diagram: x[n] -> DFT -> X[k] -> log|X[k]| -> IDFT -> c[n]
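The definition above maps directly onto a few lines of NumPy. This is a minimal sketch of the real cepstrum, not the lecture's ex9p1.m; the small floor inside the log is an implementation choice to avoid log(0) at spectral nulls:

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: IDFT of the log magnitude of the DFT,
    c[n] = F^-1{ log|F{x[n]}| }."""
    X = np.fft.fft(x)
    log_mag = np.log(np.abs(X) + 1e-12)  # floor avoids log(0) at spectral nulls
    return np.real(np.fft.ifft(log_mag))

# sanity check: a scaled impulse 2*delta[n] has a flat spectrum of
# magnitude 2, so its cepstrum is log(2) at n = 0 and zero elsewhere
x = np.zeros(8)
x[0] = 2.0
c = real_cepstrum(x)
```

In practice the speech frame is windowed (e.g., with a Hamming window) before the DFT.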

[Figure: a periodic signal's magnitude spectrum |F{x[n]}|, its log spectrum log|F{x[n]}|, and the resulting cepstrum F^{-1}{log|F{x[n]}|}; Taylor, 2009]

Motivation

Consider the magnitude spectrum |F{x[n]}| of the periodic signal in the previous slide. This spectrum contains harmonics at evenly spaced intervals, whose magnitude decreases quite quickly as frequency increases. By taking the log spectrum, however, we can compress the dynamic range and reduce the amplitude differences between harmonics.

Now imagine we were told that the log spectrum was a waveform. In that case, we would describe it as quasi-periodic with some form of amplitude modulation. To separate both components, we could employ the DFT, and we would expect the DFT to show

- a large spike around the period of the signal, and
- some low-frequency components due to the amplitude modulation

A simple filter would then allow us to separate both components (see next slide).

Liftering in the cepstral domain [Rabiner & Schafer, 2007]

Cepstral analysis as deconvolution

As you recall, the speech signal can be modeled as the convolution of the glottal source u[n], vocal tract v[n], and radiation r[n]:

    y[n] = u[n] * v[n] * r[n]

Because these signals are convolved, they cannot be easily separated in the time domain. We can, however, perform the separation as follows. For convenience, we fold the radiation into the vocal tract term, v[n] <- v[n] * r[n], which leads to

    y[n] = u[n] * v[n]

Taking the Fourier transform,

    Y(e^{j\omega}) = U(e^{j\omega}) V(e^{j\omega})

If we now take the log of the magnitude, we obtain

    \log |Y(e^{j\omega})| = \log |U(e^{j\omega})| + \log |V(e^{j\omega})|

which shows that source and filter are now simply added together. We can then return to the time domain through the inverse FT:

    c[n] = c_u[n] + c_v[n]
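The additive decomposition c[n] = c_u[n] + c_v[n] can be checked numerically. The sketch below uses random stand-in signals (not a real source/filter pair) and circular convolution implemented through the DFT, so the identity log|X1 X2| = log|X1| + log|X2| holds exactly for the frame:

```python
import numpy as np

def rceps(x):
    # real cepstrum: IDFT of the log magnitude of the DFT
    return np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(x)))))

rng = np.random.default_rng(0)
u = rng.standard_normal(64)   # stand-in for the excitation
v = rng.standard_normal(64)   # stand-in for the vocal tract response
# circular convolution via multiplication of DFTs
y = np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)))
# in the cepstral domain the convolution becomes a sum
lhs, rhs = rceps(y), rceps(u) + rceps(v)
```

With linear (non-circular) convolution and windowing, the identity holds only approximately, which is the usual situation with real speech frames.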

Where does the term "cepstrum" come from?

The crucial observation leading to the cepstrum terminology is that the log spectrum can be treated as a waveform and subjected to further Fourier analysis. The independent variable of the cepstrum is nominally time, since it is the IDFT of a log spectrum, but it is interpreted as a frequency because we are treating the log spectrum as a waveform. To emphasize this interchanging of domains, Bogert, Healy and Tukey (1963) coined the term "cepstrum" by reversing the first letters of the word "spectrum". Likewise, the independent variable of the cepstrum is known as "quefrency", and the linear filtering operation in the previous slide is known as "liftering".

Discussion

If we take the DFT of a signal and then take the inverse DFT of that, we of course get back the original signal (assuming it is stationary). The cepstrum calculation differs in two ways:

- First, we use only magnitude information and throw away the phase.
- Second, we take the IDFT of the log magnitude, which is already very different, since the log operation emphasizes the periodicity of the harmonics.

The cepstrum is useful because it separates source and filter:

- If we are interested in the glottal excitation, we keep the high coefficients.
- If we are interested in the vocal tract, we keep the low coefficients.

Truncating the cepstrum at different quefrency values allows us to preserve different amounts of spectral detail (see next slide).
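Keeping the low coefficients can be sketched as a low-quefrency lifter: zero all cepstral coefficients at index K and above (together with their mirror-image counterparts) and return to the log-spectral domain. This is an illustrative implementation, not the lecture's code:

```python
import numpy as np

def cepstral_envelope(x, K):
    """Smooth log-magnitude spectral envelope obtained by keeping only
    the first K cepstral coefficients (low quefrencies)."""
    N = len(x)
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-12)  # floor avoids log(0)
    c = np.real(np.fft.ifft(log_mag))
    lifter = np.zeros(N)
    lifter[:K] = 1.0
    if K > 1:
        lifter[-(K - 1):] = 1.0   # keep the symmetric negative quefrencies
    return np.real(np.fft.fft(c * lifter))  # smoothed log-magnitude spectrum

rng = np.random.default_rng(1)
x = rng.standard_normal(256)
env = cepstral_envelope(x, 10)    # small K -> smooth envelope
```

Smaller K preserves less spectral detail, as in the K = 50, 30, 20, 10 comparison from [Taylor, 2009].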

[Figure: spectral envelopes obtained by truncating the cepstrum at K = 50, 30, 20 and 10 coefficients, plotted against frequency (Hz); Taylor, 2009]

Note from the previous slide that cepstral coefficients are a very compact representation of the spectral envelope. It also turns out that cepstral coefficients are (to a large extent) uncorrelated. This is very useful for statistical modeling, because we then only need to store their means and variances, not their covariances. For these reasons, cepstral coefficients are widely used in speech recognition, generally combined with a perceptual auditory scale, as we see next.

ex9p1.m
- Computing the cepstrum
- Liftering vs. linear prediction
- Show uncorrelatedness of the cepstrum

Homomorphic filtering

Cepstral analysis is a special case of homomorphic filtering. Homomorphic filtering is a generalized technique involving (1) a nonlinear mapping to a different domain where (2) linear filters are applied, followed by (3) a mapping back to the original domain.

Consider the transformation defined by y[n] = L{x[n]}. If L is a linear system, it satisfies the principle of superposition:

    L\{x_1[n] + x_2[n]\} = L\{x_1[n]\} + L\{x_2[n]\}

By analogy, we define a class of systems that obey a generalized principle of superposition, where addition is replaced by convolution:

    H\{x[n]\} = H\{x_1[n] * x_2[n]\} = H\{x_1[n]\} * H\{x_2[n]\}

Systems having this property are known as homomorphic systems for convolution, and can be depicted as shown below. [Rabiner & Schafer, 1978]

An important property of homomorphic systems is that they can be represented as a cascade of three homomorphic systems. The first system takes inputs combined by convolution and transforms them into an additive combination of the corresponding outputs:

    D\{x[n]\} = D\{x_1[n] * x_2[n]\} = D\{x_1[n]\} + D\{x_2[n]\} = \hat{x}_1[n] + \hat{x}_2[n] = \hat{x}[n]

The second system is a conventional linear system that obeys the principle of superposition. The third system is the inverse of the first: it transforms signals combined by addition back into signals combined by convolution:

    D^{-1}\{\hat{y}[n]\} = D^{-1}\{\hat{y}_1[n] + \hat{y}_2[n]\} = D^{-1}\{\hat{y}_1[n]\} * D^{-1}\{\hat{y}_2[n]\} = y_1[n] * y_2[n] = y[n]

This is important because the design of such a system reduces to the design of a linear system. [Rabiner & Schafer, 1978]

Noting that the z-transform of two convolved signals is the product of their z-transforms,

    x[n] = x_1[n] * x_2[n] \iff X(z) = X_1(z)\, X_2(z)

we can then take logs to obtain

    \hat{X}(z) = \log X(z) = \log X_1(z) + \log X_2(z)

Thus, the frequency-domain representation of a homomorphic system for deconvolution can be represented as shown. [Rabiner & Schafer, 1978]

If we wish to represent signals as sequences rather than in the frequency domain, then the systems D and D^{-1} can be represented as shown below, where you can recognize the similarity between the cepstrum and the system D. Strictly speaking, D defines the complex cepstrum, whereas in speech processing we generally use the real cepstrum. Can you find the equivalent system for the liftering stage? [Rabiner & Schafer, 1978]

Voicing/pitch detection

Cepstral analysis can be applied to detect local periodicity. The figure in the next slide shows the STFT and the corresponding cepstra for a sequence of analysis windows in a speech signal (50-ms window, 12.5-ms shift). The STFT shows a clear difference in harmonic structure:

- Frames 1-5 correspond to unvoiced speech
- Frames 8-15 correspond to voiced speech
- Frames 6-7 contain a mixture of voiced and unvoiced excitation

These differences are perhaps more apparent in the cepstrum, which shows a strong peak at a quefrency of about 11-12 ms for frames 8-15. Therefore, the presence of a strong peak (in the 3-20 ms range) is a very strong indication that the speech segment is voiced. Lack of a peak, however, is not an indication of unvoiced speech, since the strength or existence of a peak depends on various factors, e.g., the length of the analysis window.
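A minimal cepstral pitch detector along these lines searches for a peak in the 3-20 ms quefrency range; the peak threshold below is an illustrative value, not one given in the lecture:

```python
import numpy as np

def cepstral_pitch(frame, fs, qmin_ms=3.0, qmax_ms=20.0, threshold=0.1):
    """Return an f0 estimate (Hz) if a strong cepstral peak exists in
    the given quefrency range, else None. Sketch implementation."""
    c = np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(frame)) + 1e-12)))
    lo = int(fs * qmin_ms / 1000.0)          # 3 ms in samples
    hi = int(fs * qmax_ms / 1000.0)          # 20 ms in samples
    peak = lo + int(np.argmax(c[lo:hi]))
    if c[peak] > threshold:                  # strong peak -> voiced
        return fs / peak                     # quefrency of peak -> f0
    return None  # no strong peak: NOT necessarily unvoiced (see above)

# 50-ms frame of a synthetic voiced sound with f0 = 125 Hz
fs = 8000
t = np.arange(int(0.05 * fs)) / fs
frame = sum(np.sin(2 * np.pi * 125 * k * t) / k for k in range(1, 8))
f0 = cepstral_pitch(frame * np.hamming(len(frame)), fs)
```

A real detector would also smooth decisions across frames, since single-frame peaks can fail for short windows or mixed excitation.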

[Figure: short-time log spectra and cepstra for frames 1-15; Rabiner & Schafer, 2007]

Cepstral-based parameterizations: linear prediction cepstral coefficients

As we saw, the cepstrum has a number of advantages (source-filter separation, compactness, orthogonality), whereas the LP coefficients are too sensitive to numerical precision. Thus, it is often desirable to transform LP coefficients a[n] into cepstral coefficients c[n]. This can be achieved as [Huang, Acero & Hon, 2001]

    c_n = \begin{cases} \ln G & n = 0 \\ a_n + \frac{1}{n} \sum_{k=1}^{n-1} k\, c_k\, a_{n-k} & 1 \le n \le p \end{cases}
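The recursion is straightforward to implement. The sketch below also continues past n = p by treating a[n] = 0 beyond the model order (the standard extension); note that the sign convention for the a[n] varies across texts:

```python
import numpy as np

def lpc_to_cepstrum(a, G, n_ceps):
    """Cepstral coefficients from LP coefficients a[1..p] and gain G,
    via c_n = a_n + (1/n) sum_{k=1}^{n-1} k c_k a_{n-k}."""
    p = len(a)
    c = np.zeros(n_ceps)
    c[0] = np.log(G)
    for n in range(1, n_ceps):
        acc = a[n - 1] if n <= p else 0.0     # a_n, zero beyond order p
        for k in range(1, n):
            if 1 <= n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c

# single-pole check: for H(z) = G / (1 - a1 z^-1) the cepstrum is known
# in closed form: c_0 = ln G, c_n = a1^n / n for n >= 1
c = lpc_to_cepstrum([0.5], 2.0, 6)
```

The closed-form check follows from ln H(z) = ln G - ln(1 - a1 z^-1) and the power series of the logarithm.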

Mel Frequency Cepstral Coefficients (MFCC)

MFCCs are probably the most common parameterization in speech recognition. They combine the advantages of the cepstrum with a frequency scale based on critical bands. Computing MFCCs:

- First, the speech signal is analyzed with the STFT.
- Then, DFT values are grouped together in critical bands and weighted by a triangular weighting function. These bandwidths are constant for center frequencies below 1 kHz and increase exponentially up to half the sampling rate. [Rabiner & Schafer, 2007]

The mel-frequency spectrum at analysis time n is defined as

    MF_r = \frac{1}{A_r} \sum_{k=L_r}^{U_r} \left| V_r[k]\, X[n,k] \right|^2

where V_r[k] is the triangular weighting function for the r-th filter, ranging from DFT index L_r to U_r, and

    A_r = \sum_{k=L_r}^{U_r} \left| V_r[k] \right|^2

is a normalization factor for the r-th filter, so that a perfectly flat Fourier spectrum will also produce a flat mel spectrum. For each frame, a discrete cosine transform (DCT) of the log magnitude of the filter outputs is then computed to obtain the MFCCs:

    MFCC[m] = \frac{1}{R} \sum_{r=1}^{R} \log\left( MF_r \right) \cos\left[ \frac{2\pi}{R} \left( r + \frac{1}{2} \right) m \right]

where MFCC[m] is typically evaluated for a number of coefficients N_{MFCC} that is smaller than the number of mel filters R. For F_s = 8 kHz, typical values are N_{MFCC} = 13 and R = 22.
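The pipeline can be sketched end to end in NumPy. The mel-scale formula (2595 log10(1 + f/700)) and the placement of filter edges are common choices rather than values fixed by the lecture, and the cosine transform is written as on this slide:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, R=22, n_mfcc=13, nfft=512):
    """MFCC sketch: R normalized triangular mel filters, log filter
    energies, then the cosine transform from the slide."""
    power = np.abs(np.fft.rfft(frame, nfft)) ** 2
    # filter edge frequencies equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), R + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    V = np.zeros((R, nfft // 2 + 1))          # triangular weights V_r[k]
    for r in range(R):
        lo, ctr, hi = bins[r], bins[r + 1], bins[r + 2]
        for k in range(lo, ctr):
            V[r, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            V[r, k] = (hi - k) / max(hi - ctr, 1)
    A = np.sum(V ** 2, axis=1)                # A_r = sum_k |V_r[k]|^2
    MF = (V ** 2 @ power) / np.maximum(A, 1e-12)
    log_MF = np.log(np.maximum(MF, 1e-12))
    m = np.arange(n_mfcc)[:, None]
    r = np.arange(R)[None, :]
    C = np.cos(2.0 * np.pi / R * (r + 0.5) * m)   # cos[(2pi/R)(r + 1/2)m]
    return (C @ log_MF) / R

# the A_r normalization means a flat spectrum gives a flat mel spectrum,
# so an impulse (|X[k]| = 1 for all k) yields all-zero MFCCs
impulse = np.zeros(400)
impulse[0] = 1.0
coeffs = mfcc(impulse, 8000)
```

The impulse check exercises exactly the normalization property stated above.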

Comparison of smoothing techniques: LPC, cepstral and mel-cepstral [Rabiner & Schafer, 2007]

Notes

The MFCC is no longer a homomorphic transformation. It would be if the order of the summation and the logarithm were reversed, in other words, if we computed

    \frac{1}{A_r} \sum_{k=L_r}^{U_r} \log \left| V_r[k]\, X[n,k] \right|^2

instead of

    \log \left( \frac{1}{A_r} \sum_{k=L_r}^{U_r} \left| V_r[k]\, X[n,k] \right|^2 \right)

In practice, however, the MFCC representation is approximately homomorphic for filters that have a smooth transfer function. The advantage of the second expression is that the filter energies are more robust to noise and spectral estimation errors.

MFCCs employ the DCT instead of the IDFT. The DCT turns out to be closely related to the Karhunen-Loève transform. The KL transform is the basis for PCA, a technique that can be used to find orthogonal (uncorrelated) projections of high-dimensional data. As a result, the DCT tends to decorrelate the mel-scale frequency log-energies.

Relationship with the DFT: the DCT (the DCT-II, to be exact) is defined as

    X_{DCT}[k] = \sum_{n=0}^{N-1} x[n] \cos\left[ \frac{\pi}{N} \left( n + \frac{1}{2} \right) k \right]

whereas the DFT is defined as

    X_{DFT}[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j\frac{2\pi}{N} kn}

Under some conditions, the DFT and the DCT-II are exactly equivalent. For our purposes, you can think of the DCT as a real version of the DFT (i.e., no imaginary part) that also has better energy compaction properties than the DFT.
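The DCT-II definition above can be implemented directly. The checks in this sketch use two facts: the k >= 1 basis vectors are orthogonal to the constant vector, and a smooth ramp concentrates its energy in the lowest coefficients (energy compaction):

```python
import numpy as np

def dct2(x):
    """DCT-II exactly as defined on this slide:
    X[k] = sum_n x[n] cos[(pi/N)(n + 1/2) k]."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)[None, :]
    k = np.arange(N)[:, None]
    return np.cos(np.pi / N * (n + 0.5) * k) @ x

# a constant input excites only the k = 0 coefficient ...
flat = dct2(np.ones(16))
# ... while a smooth ramp packs almost all energy into the first few
ramp = dct2(np.linspace(0.0, 1.0, 64))
```

The same ramp fed to the DFT spreads energy over many bins because of the implicit periodic extension's discontinuity, which is one way to see the compaction advantage.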

ex9p2.m
- Computing MFCCs