Speech Signal Analysis for ASR


Speech Signal Analysis
Hiroshi Shimodaira and Steve Renals
Automatic Speech Recognition — ASR Lectures 2&3, 17/21 January 2013

Overview: speech signal analysis for ASR
- Features for ASR
- Spectral analysis
- Cepstral analysis
- Standard features for ASR: MFCCs and PLP analysis
- Dynamic features
Reading: Jurafsky & Martin, sec 9.3; P. Taylor, Text-to-Speech Synthesis, chapter 12 (signal processing background chapter).

Speech production model
[Figure: ASR system architecture — recorded speech passes through signal analysis into the search space, which combines an acoustic model (trained on data), a lexicon and a language model to produce the decoded text (transcription).]

Vocal Organs & Vocal Tract
[Figure: the vocal organs — lungs, larynx (vocal folds), pharynx, oral cavity (tongue, teeth, lips) and nasal cavity. The periodic glottal source v(t), with period T and fundamental frequency F0 = 1/T, is filtered by the vocal tract H(ω); its resonances produce the formants F1, F2, F3 visible as peaks in the output spectrum X(ω), superimposed on the overall spectral tilt.]

Sampling
Sound pressure wave → microphone → sampled signal x(n) → ASR front end → acoustic feature vectors o_t → acoustic model.
Speech signal analysis turns the discrete digital samples into a sequence of acoustic feature vectors.

Acoustic Features for ASR
Desirable characteristics of acoustic features used for ASR:
- Features should contain sufficient information to distinguish between phones:
  - good time resolution (10 ms)
  - good frequency resolution (~20 channels)
- Be separated from F0 and its harmonics
- Be robust against speaker variation
- Be robust against noise or channel distortions
- Have good pattern recognition characteristics:
  - low feature dimension
  - features are independent of each other

MFCC-based front end for ASR
Sampled signal x(n) → preemphasis → x'(n) → window → x_t(n) → DFT → X_t(k) → mel filterbank → Y_t(m) → log(|·|²) → IDFT → y_t(j) and energy e_t → dynamic features → y_t(j), e_t, Δy_t(j), Δe_t, ΔΔy_t(j), ΔΔe_t → feature transform → o_t(k) → acoustic model.
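The sampling step above can be sketched as follows — a minimal illustration in which the function name `sample` and the 16 kHz rate are chosen for the example, not taken from the slides:

```python
import math

def sample(f, sample_rate, duration):
    """Discrete digital samples x[n] = f(n / sample_rate) of a
    continuous sound pressure wave f(t)."""
    n_samples = int(sample_rate * duration)
    return [f(n / sample_rate) for n in range(n_samples)]

# A 100 Hz tone sampled at 16 kHz for 10 ms gives 160 samples.
x = sample(lambda t: math.sin(2 * math.pi * 100 * t), 16000, 0.01)
```

The ASR front end then operates entirely on such sample sequences.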

Pre-emphasis and spectral tilt
Spectral tilt: the speech signal has more energy at low frequencies (for voiced speech); this is due to the glottal source.
Pre-emphasis increases the magnitude of higher frequencies in the speech signal compared with lower frequencies. A first-order pre-emphasis filter boosts higher frequencies:
    x'[n] = x[n] - α x[n-1],   0.95 < α < 0.99
Example: a spectral slice (time slice of the spectrum) from the vowel /aa/, before and after pre-emphasis (Jurafsky & Martin, fig. 9.9).

Windowing
The speech signal is constantly changing (non-stationary), but signal processing algorithms usually assume that the signal is stationary.
Piecewise stationarity: model the speech signal as a sequence of frames, each assumed to be stationary.
Windowing: multiply the full waveform s(n) by a window w(n) in the time domain:
    x[n] = w[n] s[n]
Simply cutting out a short segment (frame) from s(n) amounts to a rectangular window, which causes discontinuities at the edges of the segment. Instead, a tapered window is usually used, e.g. a Hamming (α = 0.46) or Hanning (α = 0.5) window:
    w[l] = (1 - α) - α cos(2πl / (N - 1)),   N: window width
[Figure (Taylor): effect of windowing in the time domain — rectangular, Hamming and Hanning windows applied to the same samples. The rectangular window's spectrum shows a spike at the sinusoid frequency, but also prominent energy on either side.]
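A minimal sketch of the pre-emphasis filter and the tapered-window formula above, assuming the generalized-Hamming form w[l] = (1 − α) − α cos(2πl/(N − 1)); the function names are illustrative:

```python
import math

def preemphasize(x, alpha=0.97):
    """First-order pre-emphasis: x'[n] = x[n] - alpha * x[n-1]."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def window(N, alpha):
    """Generalized Hamming window: w[l] = (1 - alpha) - alpha * cos(2*pi*l/(N-1)).
    alpha = 0.46 gives a Hamming window, alpha = 0.5 a Hanning window."""
    return [(1 - alpha) - alpha * math.cos(2 * math.pi * l / (N - 1))
            for l in range(N)]

# Tapered windows go to (near) zero at the edges; Hanning is exactly zero there.
hamming = window(400, 0.46)
hanning = window(400, 0.5)
```

Note that the Hamming window only tapers to 0.08 at the edges, while the Hanning window reaches zero — which is why their spectral side lobes differ.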

Effect of windowing in the frequency domain
[Figure (Taylor, fig 12.2): effect of windowing in the frequency domain, with magnitude spectra on the left-hand side and log spectra on the right-hand side, for rectangular, Hanning and Hamming windows.]

Discrete Fourier Transform (DFT)
Purpose: extracts spectral information from a windowed signal (i.e. how much energy there is at each frequency band).
Input: windowed signal x[0], ..., x[N-1] (time domain).
Output: a complex number X[k] for each of N frequency bands, representing magnitude and phase for the kth frequency component (frequency domain).
Discrete Fourier Transform (DFT):
    X[k] = Σ_{n=0}^{N-1} x[n] exp(-j (2π/N) kn)
Fast Fourier Transform (FFT): an efficient algorithm for computing the DFT when N is a power of 2.

Windowing and spectral analysis
Taylor's figures show some typical power spectra for voiced and unvoiced speech. For the voiced sounds, we can clearly see the harmonics as a series of evenly spaced spikes; the fundamental frequency of the source, the glottis, can be estimated by taking the reciprocal of the distance between the harmonics. In addition, the formants can clearly be seen as the more general peaks in the spectrum. The general shape of the spectrum, ignoring the harmonics, is often referred to as the envelope.
Window the signal x(n) into frames x_t(n) and apply the Fourier transform to each segment.
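The DFT definition above, written out directly — a naive O(N²) version for illustration only; a real front end would use the FFT:

```python
import cmath
import math

def dft(x):
    """X[k] = sum_{n=0}^{N-1} x[n] * exp(-j * (2*pi/N) * k * n)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

# A cosine at exactly bin 5 of a 64-point frame concentrates its energy
# in bins 5 and N-5 (= 59) of the magnitude spectrum.
N = 64
x = [math.cos(2 * math.pi * 5 * n / N) for n in range(N)]
mag = [abs(X) for X in dft(x)]
```

For a real input the magnitude spectrum is symmetric, which is why only the first N/2 + 1 bins are kept in practice.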
Short frame width: wide-band — high time resolution, low frequency resolution.
Long frame width: narrow-band — low time resolution, high frequency resolution.
For ASR: frame width ~20 ms, frame shift ~10 ms.
[Diagram: windowed frame → Discrete Fourier Transform → short-time power spectrum (intensity vs frequency).]

Wide-band and narrow-band spectrograms
[Figures (Taylor): wide-band spectrogram, window width = 2.5 ms; narrow-band spectrogram, window width = 25 ms.]
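Slicing the waveform into overlapping frames with the widths quoted above (~20 ms frames, one every ~10 ms) can be sketched as follows; the function name and defaults are illustrative:

```python
def frames(signal, sample_rate, width_ms=20, shift_ms=10):
    """Cut a waveform into overlapping frames: width ~20 ms, shift ~10 ms."""
    width = int(sample_rate * width_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    return [signal[i:i + width]
            for i in range((0), len(signal) - width + 1, shift)]

# One second of audio at 16 kHz -> 320-sample frames, one every 160 samples.
f = frames([0.0] * 16000, 16000)
```

Each frame would then be windowed and passed through the DFT, yielding one column of the spectrogram.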

Short-time spectral analysis
A 25 ms Hamming-windowed signal from the vowel [iy], and its spectrum as computed by the DFT (plus other smoothing) (Jurafsky & Martin, fig 9.12).
[Diagram: windowing with a fixed frame shift → Discrete Fourier Transform per frame → short-time power spectrum (intensity vs frequency) as a function of time (frame index).]

DFT Spectrum Features for ASR
Problems with using the DFT power spectrum directly as a feature:
- Equally-spaced frequency bands — but human hearing is less sensitive at higher frequencies (above 1000 Hz).
- The estimated power spectrum contains harmonics of F0, which makes it difficult to estimate the envelope of the spectrum.
- Frequency bins of the STFT are highly correlated with each other, i.e. the power spectrum representation is highly redundant.
[Figure: log |X(ω)| spectrum of a voiced frame, showing F0 harmonics superimposed on the spectral envelope.]

Human hearing: physical quality — perceptual quality
- Intensity — Loudness
- Fundamental frequency — Pitch
- Spectral shape — Timbre
- Onset/offset time — Timing
- Phase difference in binaural hearing — Location
Technical terms: equal-loudness contours, masking, auditory filters (critical-band filters), critical bandwidth.

Equal loudness contours and nonlinear frequency scaling
Human hearing is less sensitive to higher frequencies — thus human perception of frequency is nonlinear. Two common perceptual frequency scales:
- Bark scale: b(f) = 13 arctan(0.00076 f) + 3.5 arctan((f / 7500)²)
- Mel scale: M(f) = 1127 ln(1 + f / 700)
[Figure: Bark frequency (Bark) vs linear frequency (Hz) — approximately linear at low frequencies, compressive (logarithmic) at high frequencies.]

Mel Filterbank
Apply a mel-scale filter bank to the DFT power spectrum to obtain a mel-scale power spectrum.
- Each triangular band-pass filter collects energy from a number of frequency bands in the DFT.
- Filters are linearly spaced below 1000 Hz and logarithmically spaced above 1000 Hz.

Log Energy
Take the log of the squared magnitude of each mel filter bank output.
- Taking the log compresses the dynamic range.
- Human sensitivity to signal energy is logarithmic — i.e. humans are less sensitive to small changes in energy at high energy than to small changes at low energy.
- The log makes features less variable to acoustic coupling variations.
- It removes phase information — not important for speech recognition (not everyone agrees with this).
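The two frequency warpings and a triangular mel filter bank can be sketched as below. The filter-bank construction (filter centres equally spaced on the mel scale, converted back to DFT bins) is a common recipe and an assumption here — toolkits differ in details such as the number of filters and normalization:

```python
import math

def hz_to_bark(f):
    """Bark scale: b(f) = 13*arctan(0.00076 f) + 3.5*arctan((f/7500)^2)."""
    return 13 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def hz_to_mel(f):
    """Mel scale: M(f) = 1127 * ln(1 + f/700)."""
    return 1127.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale
    (hence roughly linearly spaced below ~1000 Hz, logarithmically above)."""
    top = hz_to_mel(sample_rate / 2.0)
    centres_mel = [i * top / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in centres_mel]
    fbank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_filters)]
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):                 # rising edge
            fbank[i - 1][k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):                 # falling edge (peak 1.0 at k = c)
            fbank[i - 1][k] = (hi - k) / max(hi - c, 1)
    return fbank

fb = mel_filterbank(20, 512, 16000)
```

The constant 1127 is chosen so that 1000 Hz maps to (almost exactly) 1000 mel, which is where the scale switches from roughly linear to roughly logarithmic.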

Cepstral Analysis
Source-filter model of speech production:
- Source: vocal fold vibrations create a glottal source waveform.
- Filter: the source waveform is passed through the vocal tract; the position of the tongue, jaw, etc. give it a particular shape and hence a particular filtering characteristic.
Source characteristics (F0, dynamics of the glottal pulse) do not help to discriminate between phones. The filter specifies the position of the articulators, and hence is directly related to phone discrimination. Cepstral analysis enables us to separate source and filter.

Cepstral analysis splits the power spectrum into the spectral envelope and the F0 harmonics:
[Figure: log |X(ω)| spectrum; its cepstrum; the liftered envelope; and the residue (harmonics).]
1. Log spectrum (frequency domain).
2. Inverse Fourier transform → cepstrum (time domain; we speak of "quefrency").
3. Liftering to separate the low and high parts (a lifter is a filter used in the cepstral domain).
4. Fourier transform of the low part of the cepstrum → smoothed spectrum (the envelope, in the frequency domain); the high part of the cepstrum carries the F0 harmonics.

The Cepstrum
The cepstrum is obtained by applying the inverse DFT to the log spectrum (which may be mel-scaled); the cepstrum lives in the time domain. Inverse DFT:
    y_t[k] = Σ_{m=1}^{M} log(|Y_t(m)|) cos(π k (m - 0.5) / M),   k = 0, ..., J
Since the log power spectrum is real and symmetric, the inverse DFT is equivalent to a discrete cosine transform (DCT).
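The inverse-DFT-as-DCT step and the liftering step can be sketched as follows (with 0-based indexing, so the slide's m − 0.5 for m = 1…M becomes m + 0.5 for m = 0…M−1; the function names are illustrative):

```python
import math

def cepstrum(log_spec):
    """DCT of the (real, symmetric) log spectrum:
    y[k] = sum_m log|Y(m)| * cos(pi * k * (m + 0.5) / M)."""
    M = len(log_spec)
    return [sum(log_spec[m] * math.cos(math.pi * k * (m + 0.5) / M)
                for m in range(M))
            for k in range(M)]

def lifter_low(ceps, n=12):
    """Keep the low quefrencies (the spectral envelope);
    the high quefrencies carry F0/harmonic detail."""
    return ceps[:n]

# A flat log spectrum has no envelope structure: all its energy lands in y[0].
c = cepstrum([1.0] * 8)
```

Truncating to the first few coefficients and transforming back would give the smoothed spectrum of the previous slide.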

MFCCs
Smoothed spectrum: transform to the cepstral domain, truncate, transform back to the spectral domain.
Mel-frequency cepstral coefficients (MFCCs): use the cepstral coefficients directly.
- Widely used as acoustic features in HMM-based ASR.
- The first 12 MFCCs are often used as the feature vector (this removes F0 information).
- Less correlated than spectral features — easier to model than spectral features.
- Very compact representation — 12 features describe a 20 ms frame of data.
- For standard HMM-based systems, MFCCs result in better ASR performance than filter bank or spectrogram features.
- MFCCs are not robust against noise.

PLP — Perceptual Linear Prediction
PLP (Hermansky, JASA 1990):
- Uses equal-loudness pre-emphasis and cube-root compression (motivated by perceptual results) rather than log compression.
- Uses linear predictive auto-regressive modelling to obtain cepstral coefficients.
- PLP has been shown to lead to slightly better ASR accuracy, and slightly better noise robustness, compared with MFCCs.

Estimating dynamic features
Speech is not constant frame-to-frame, so we can add features describing how the cepstral coefficients c(t) change over time: delta and delta-delta features (dynamic features / time derivatives).
A simple calculation of the delta features d(t) at time t for a cepstral feature c(t):
    d(t) = (c(t+1) - c(t-1)) / 2
A more sophisticated approach estimates the temporal derivative by using regression to estimate the slope (typically using several frames on each side).
Standard ASR features have 39 dimensions:
- 12 MFCCs and energy
- 12 delta MFCCs and delta energy
- 12 delta-delta MFCCs and delta-delta energy
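The simple delta formula above, applied to a sequence of scalar coefficients; replicating the first and last frames at the edges is an assumption here, since the slides do not specify edge handling:

```python
def deltas(c):
    """d(t) = (c(t+1) - c(t-1)) / 2, with the first/last frame
    replicated at the sequence edges."""
    p = [c[0]] + list(c) + [c[-1]]
    return [(p[t + 2] - p[t]) / 2 for t in range(len(c))]

d = deltas([0.0, 1.0, 2.0, 3.0])
```

For a linearly increasing coefficient the interior deltas are constant (the slope), while the replicated edges halve it.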

Feature Transforms
Orthogonal transformations (orthogonal bases):
- DCT (discrete cosine transform)
- PCA (principal component analysis)
Transformations based on bases that maximise the separability between classes:
- LDA (linear discriminant analysis) / Fisher's linear discriminant
- HLDA (heteroscedastic linear discriminant analysis)

Summary: Speech Signal Analysis for ASR
- Good characteristics of ASR features.
- MFCCs — mel-frequency cepstral coefficients:
  - short-time DFT analysis
  - mel filter bank
  - log magnitude squared
  - inverse DFT (DCT)
  - use the first few (12) coefficients
- Delta features.
- 39-dimension feature vector: MFCC-12 + energy; + deltas; + delta-deltas.
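Putting the summary together: stacking 12 MFCCs + energy with their deltas and delta-deltas gives the standard 39-dimensional vector. A sketch over per-frame lists, reusing the simple difference formula with replicated edges (an assumption, as before):

```python
def frame_deltas(seq):
    """Element-wise d(t) = (c(t+1) - c(t-1)) / 2 over a list of frames,
    with edge frames replicated."""
    p = [seq[0]] + list(seq) + [seq[-1]]
    return [[(p[t + 2][j] - p[t][j]) / 2 for j in range(len(seq[0]))]
            for t in range(len(seq))]

def make_39dim(static):
    """static: one 13-dim vector (12 MFCCs + energy) per frame.
    Returns 13 static + 13 delta + 13 delta-delta = 39 values per frame."""
    d = frame_deltas(static)
    dd = frame_deltas(d)
    return [s + v + a for s, v, a in zip(static, d, dd)]

feats = make_39dim([[float(t)] * 13 for t in range(5)])
```

A feature transform such as the DCT, PCA or (H)LDA would then be applied to these vectors, or to several concatenated neighbouring frames.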