A MATLAB Software Tool for the Introduction of Speech Coding Fundamentals in a DSP Course

Edward Painter and Andreas Spanias
Department of Electrical Engineering, Telecommunications Research Center
Arizona State University, Tempe, Arizona 85287-7206
spanias@asu.edu, painter@asu.edu

Abstract

An educational software tool on speech coding is presented. Portions of this program are used in our senior-level DSP class at Arizona State University to expose undergraduate students to speech coding and present speech analysis/synthesis as an application paradigm for many fundamental DSP concepts. The simulation software provides an interactive environment that allows students to investigate and understand speech coding algorithms for a variety of input speech records. Time- and frequency-domain representations of input and reconstructed speech can be graphically displayed and played back on a PC equipped with a standard 16-bit sound card. The program has been developed for use in the MATLAB environment and includes implementations of the FS-1015 LPC-10e, the FS-1016 CELP, the ETSI GSM, the IS-54 VSELP, the G.721 ADPCM, the G.722 subband, and the G.728 LD-CELP speech coding algorithms, integrated under a common graphical interface.

1. Introduction

Speech coding is an application area in signal processing concerned with obtaining compact representations of speech signals for efficient transmission or storage. This requires analysis and modeling of digital speech signals, which are usually represented by a compact set of quantized filter, excitation, and spectrum parameters. As such, speech coding uses many fundamental signal processing tools and concepts which are taught in undergraduate DSP classes.
It can therefore be used as an application paradigm for demonstrating the utility of DSP tools such as digital filtering, random signal processing, autocorrelation and PSD estimation, handling of nonstationarities, windowing, quantization of filter coefficients, estimation of periodicity, and time-varying signal modeling. Exposure to speech coding in an undergraduate DSP course is also motivated by the emergence of new computer and mobile communication applications that require young electrical engineers to have some fundamental speech processing knowledge in the context of DSP.

We recently started to introduce speech coding towards the end of the senior-level four-credit DSP course by devoting four lectures, two homework assignments, and one computer project to this important application area. As part of this effort, we developed an educational simulation program in MATLAB (MATLAB is a trademark of The MathWorks, Inc.) that can be used to provide knowledge on speech coding algorithms and demonstrate the utility of several important DSP concepts. The codecs which have been implemented include the FS-1015 LPC-10e, the FS-1016 CELP, the IS-54 VSELP, the ETSI GSM, the G.728 LD-CELP, the G.721 ADPCM, and the G.722 subband coder. These programs provide a unified exposition of the algorithms by bringing them together into a common simulation framework under MATLAB. In addition, a unified user-friendly interface has been developed which enables users to experiment with a variety of input signals, examine graphical representations of analysis/synthesis parameters, play back reconstructed output speech, and compare the quality of output speech associated with the different coding standards. Graphical outputs may provide information to the user about underlying algorithm mechanisms.
Simulations have been coded in an expository style to serve as template programs which supply working examples and important details often omitted from the general literature.

The MATLAB environment offers several advantages. First, users are able to generate a variety of signal and parameter plots, experiment with the effects of channel noise and network tandeming, and modify algorithm parameters in an environment where algorithms are easily manipulated. Second, MATLAB code is compact, thereby simplifying algorithm understanding. Third, MATLAB is widely used in academic institutions to support linear systems and DSP courses. Finally, the MATLAB codecs will run on a variety of computers, e.g., DOS, Mac, and UNIX. This paper describes the educational software tool and gives sample simulations that can be used to assist undergraduates in the understanding of speech coding algorithms.

2. MATLAB Codec Simulations

Simulations accept input samples from .wav input files, run analysis at the transmitter, transmit parameters through a simulated channel, run synthesis at the receiver, and then generate .wav output files. Speech files contain 16-bit linear PCM data, sampled at 8 kHz.

A. Time- and Frequency-Domain Viewing Windows

A time-domain viewing window allows comparisons between input and reconstructed output waveforms (Fig. 1a). One is able to see the differences in waveform matching behavior between a hybrid algorithm (e.g., CELP) and a vocoder (e.g., LPC-10e). Comparisons are enhanced by a facility which allows examination of the reconstruction error. Users can also observe the bit-rate/performance tradeoff; higher bit-rate algorithms generate small errors, while low bit-rate algorithms produce larger errors. A frequency-domain viewing window is also available (Fig. 1b), allowing comparison of magnitude spectra between input and reconstructed output speech. Magnitude spectral estimates are generated using a 1024-point FFT.
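As a rough illustration of the magnitude spectral estimate mentioned above, the following Python sketch zero-pads one frame of 8 kHz speech to 1024 points and computes its magnitude spectrum in dB. It is not the authors' MATLAB code: the Hamming window, the 240-sample frame length, the test tone, and all function names are assumptions made for this example.

```python
import cmath
import math

NFFT = 1024  # FFT length quoted in the text
FS = 8000    # sampling rate of the speech files, in Hz


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def magnitude_spectrum_db(frame, nfft=NFFT):
    """Hamming-window a frame, zero-pad to nfft, return dB magnitudes for bins 0..nfft/2."""
    n = len(frame)
    # The window choice is an assumption; the paper does not name one.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    padded = windowed + [0.0] * (nfft - n)
    spectrum = fft(padded)
    # A small offset avoids log10(0) for bins that are exactly zero.
    return [20.0 * math.log10(abs(c) + 1e-12) for c in spectrum[:nfft // 2 + 1]]


# Hypothetical test input: a 1 kHz tone in a 30 ms (240-sample) frame.
FRAME_LEN = 240
frame = [math.sin(2 * math.pi * 1000.0 * i / FS) for i in range(FRAME_LEN)]
spec_db = magnitude_spectrum_db(frame)

# Locate the spectral peak and convert its bin index to Hz.
peak_bin = max(range(len(spec_db)), key=spec_db.__getitem__)
peak_hz = peak_bin * FS / NFFT
```

With an 8 kHz sampling rate, each of the 1024 bins spans about 7.8 Hz, which is why a long FFT gives a smooth-looking spectrum even for short analysis frames.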
The LPC envelope, corresponding to quantized predictor coefficients received by the decoder, is
superimposed on both plots. One can observe spectral matching properties, e.g., a vocoder such as LPC-10e exhibits reasonable spectral matching despite low SNR. A spectral error display is also available.

In all LPC coding methods, short-term spectral characteristics are captured in an all-pole synthesis filter. It is the excitation models that differ among these algorithms in terms of complexity, performance, and bit rate. Our excitation viewing window allows observation of excitation sequences in time and frequency (Fig. 1c). Comparisons help users to understand different excitation methodologies. After observing voiced LPC-10e excitations, for example, a glottal pulse shape invariance becomes evident; excitation changes between voiced frames occur only in the number of pulse repetitions (pitch) and the added noise. Observing GSM excitations clarifies the concept of regular pulse excitation (RPE), in which each frame of regularly spaced pulses has an amplitude pattern distinctly different from that of its predecessor. Users can observe that RPE excitations achieve performance gains relative to the simplistic two-state model used in LPC-10e. In CELP (Fig. 1c), random vectors are combined with lag search vectors to obtain an optimal excitation.

We have elected to present pole locations of the decoder's LPC synthesis filter through a Z-plane view (Fig. 1d). This window also allows pole trajectory tracking and animated playback, and provides information about formant locations.

Fig. 1. Viewing Windows (CELP): (a) Time-Domain, (b) Frequency-Domain, (c) Excitation, (d) Z-Plane.

B. Quality Measures and Speech File View Utility

Many objective quality measures have been proposed to quantify coding distortion [1]. Our simulations incorporate spectral and temporal distortion measures in a quality display. Furthermore, there is a frame-by-frame speech file viewing utility which generates 3-D spectrograms using FFT or LP-based spectral analysis (Fig. 2).

Fig. 2. File Viewer 3-D LPC Spectrogram.

3. MATLAB Simulation Exercises

A. CELP Codebook Search

Excitation optimization in CELP involves (in most cases) exhaustively searching two vector codebooks. Codebooks are searched sequentially, adaptive first and then stochastic. During the search process, candidate excitations are used to synthesize speech and generate error signals. Excitation vectors (gain-shape VQ) are chosen to minimize a perceptually weighted error measure. Our software enables users to examine codebook (CB) search procedures. We show candidate adaptive CB vectors corresponding to the minimum and maximum match scores obtained from a 256-vector search space (Figs. 3a and 3c). Using these excitation sequences, we can synthesize and evaluate speech waveforms as shown in Figs. 3b and 3d, respectively. Output records are plotted with input speech to allow comparisons. SNRs are provided to give an objective performance measure. From Fig. 3, we observe that higher match scores correspond to higher quality excitations and higher SNR. By developing plots like Fig. 3, students are able to observe the correspondence between match scores and excitation quality. Furthermore, they gain knowledge on the nature of VQ excitations.

Fig. 3. Adaptive Excitation Vectors Associated with (a) Min. and (c) Max. Match Scores; Output Speech (b, d).

B. CELP Perceptual Weighting Filter

CELP CB search procedures minimize a perceptually weighted error. Weighting is achieved through an IIR filter which shapes the error spectrum to exploit masking properties of the ear. In particular, CELP algorithms exploit the fact that humans have a limited ability to detect small errors in frequency bands where the speech signal has high energy, such as the formant regions. Therefore, the CELP weighting filter de-emphasizes formant regions in the error spectrum. The transfer function of the weighting filter is of the form

\[ W(z) = \frac{A(z)}{A(z/\gamma)} = \frac{1 - \sum_{i=1}^{p} a_i z^{-i}}{1 - \sum_{i=1}^{p} a_i \gamma^{i} z^{-i}}, \quad \gamma = 0.8 \qquad (1) \]

where 1/A(z) is the short-term LP synthesis filter and a_i are the predictor coefficients. The parameter \gamma expands formant bandwidths by moving poles radially inward towards the center of the unit circle. Our software enables users to examine pole/zero and frequency response plots for the perceptual weighting filter (PWF) (Fig. 4).

Fig. 4. CELP Perceptual Weighting Filter: (a) Poles/Zeros, (b) Magnitude Response and LP Envelope.
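The bandwidth expansion behind the weighting filter can be sketched in a few lines of Python. This is an illustrative stand-in rather than the authors' MATLAB code, and it assumes the convention A(z) = 1 - sum_i a_i z^{-i} with the weighting filter W(z) = A(z)/A(z/gamma) and gamma = 0.8. The second-order predictor (a single pole pair at radius 0.95) and all function names are hypothetical choices for the example.

```python
import cmath
import math

GAMMA = 0.8  # bandwidth expansion factor quoted in the text


def expand_bandwidth(a, gamma=GAMMA):
    """Scale predictor coefficient a_i by gamma**i (i = 1..p), giving A(z/gamma)."""
    return [a_i * gamma ** (i + 1) for i, a_i in enumerate(a)]


def eval_inverse_filter(a, omega):
    """Evaluate A(e^{j*omega}) = 1 - sum_i a_i * e^{-j*omega*i}."""
    return 1.0 - sum(a_i * cmath.exp(-1j * omega * (i + 1))
                     for i, a_i in enumerate(a))


def weighting_gain_db(a, omega, gamma=GAMMA):
    """Magnitude of W = A(z)/A(z/gamma) on the unit circle, in dB."""
    num = eval_inverse_filter(a, omega)
    den = eval_inverse_filter(expand_bandwidth(a, gamma), omega)
    return 20.0 * math.log10(abs(num) / abs(den))


# Hypothetical predictor with one pole pair (one "formant") at radius r and
# angle theta: A(z) = 1 - 2 r cos(theta) z^-1 + r^2 z^-2.
r, theta = 0.95, math.pi / 4
a = [2 * r * math.cos(theta), -r * r]

a_w = expand_bandwidth(a)

# For a second-order section with complex poles, the pole radius is sqrt(-a_2).
radius_before = math.sqrt(-a[1])
radius_after = math.sqrt(-a_w[1])

# Weighting gain at the formant frequency and well away from it.
gain_at_formant = weighting_gain_db(a, theta)
gain_off_formant = weighting_gain_db(a, 3 * math.pi / 4)
```

In this toy case the pole radius shrinks by the factor gamma, and the weighting gain is lower at the formant frequency than away from it, which matches the de-emphasis of formant regions described in the text.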
Users may also process speech records with and without the weighting filter. Comparing output records provides insight on the net effects of the PWF. One can observe that subjective speech quality improves with the filter, despite the drop in SNR. This exercise demonstrates both weighting filter behavior and the poor subjective quality measurement inherent in SNR.

C. LPC-10e Voicing Detection

The voicing detection scheme in LPC-10e uses a sophisticated linear discriminant analysis procedure in which several signal parameters are linearly combined and then smoothed to generate a voicing decision for each half-frame. Our software enables students to examine the evolution of these parameters with time (Fig. 5).

Fig. 5. LPC-10e Voicing Decision Discriminant Analysis: "Mable stood on the rock."

D. Robustness to Channel Errors and Tandeming

Codec bit streams in wireless applications are subjected to channel errors which are characterized in terms of bit error rate (BER). Coding algorithms should tolerate bit errors with minimal perceptual degradation. Our software is equipped with BER and tandeming controls (Fig. 6) that enable students to contrast error tolerances between the different algorithms. As illustrated in the CELP segmental signal-to-noise ratio (SSNR) penalty plots of Fig. 7, one can investigate algorithmic performance in the presence of channel errors. Fig. 7a shows individual bit sensitivities for the standard CELP frame bits. The reference level at 5.7 dB corresponds to the SSNR achieved over a clear channel. Vertical penalty lines for each bit indicate the mean SSNR penalty incurred when the corresponding bit is inverted with unity probability. The family of curves in Fig. 7b illustrates parametric error sensitivities measured at BERs of 0.1%, 0.5%, 1%, 5%, and 10%. For each curve, bits for the specified parameter are randomly corrupted, while the remaining parameters are left undisturbed. Users can also employ our tools to perform subjective evaluations.

Fig. 6. BER and Tandeming Controls.

In addition to channel errors, a robust coding algorithm must also tolerate tandem encoding without excessive compromises in output quality. Our simulations enable users to examine algorithmic responses to multiple synchronous tandems. The software allows, e.g., five-stage configurations (T0-T5). Objective figures of merit are reported in terms of SNR, SSNR, and cepstral distance (CD). Example scores reported here reflect mean results after processing frames at each of six tandem nodes (T0-T5). For Mean Opinion Score (MOS) trials, trained listeners could be asked to judge test sentences on a five-point MOS scale.

Fig. 8. Penalty Associated with CELP Tandem Encoding: (a) SSNR, (b) MOS.

Tandeming scores for CELP are shown in Figs. 8a and 8b. In our example, experimentally obtained MOS values for CELP are biased by an average of -0.2 with respect to MOS_BK [2]. MOS_BK is a biased version of the MOS predictor proposed by Kitawaki et al., which is evaluated by our simulation tools:

\[ \mathrm{MOS}_{BK} = 0.04\,\mathrm{CD}^{2} - 0.80\,\mathrm{CD} + 4.86 \qquad (2) \]

The preceding exercises represent the testing capabilities of our software.
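The objective measures used in these robustness exercises can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' MATLAB implementation: Eq. (2) is read as 0.04 CD^2 - 0.80 CD + 4.86, SSNR is taken to be the mean of per-frame SNRs in dB, the 160-sample frame length is a hypothetical choice, and all function names are invented for the example.

```python
import math
import random


def mos_bk(cd):
    """MOS predictor of Eq. (2), read as 0.04*CD^2 - 0.80*CD + 4.86 (assumed signs)."""
    return 0.04 * cd ** 2 - 0.80 * cd + 4.86


def segmental_snr_db(reference, decoded, frame_len=160):
    """Mean of per-frame SNRs in dB; frames with zero signal or zero error are skipped."""
    snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        dec = decoded[start:start + frame_len]
        sig = sum(s * s for s in ref)
        err = sum((s - d) ** 2 for s, d in zip(ref, dec))
        if sig > 0.0 and err > 0.0:
            snrs.append(10.0 * math.log10(sig / err))
    return sum(snrs) / len(snrs) if snrs else float("nan")


def corrupt_bits(bits, ber, rng):
    """Invert each bit independently with probability ber (a simple random-error channel)."""
    return [b ^ 1 if rng.random() < ber else b for b in bits]


# Hypothetical usage: a tone "decoded" with a 10% amplitude error in every frame.
reference = [math.sin(2 * math.pi * 500.0 * i / 8000.0) for i in range(320)]
decoded = [0.9 * s for s in reference]
ssnr = segmental_snr_db(reference, decoded)

# Bit inversion with unity probability flips every bit of a (hypothetical) frame.
rng = random.Random(0)
frame_bits = [0, 1, 1, 0, 1]
all_flipped = corrupt_bits(frame_bits, 1.0, rng)
untouched = corrupt_bits(frame_bits, 0.0, rng)

# Predicted MOS for an example CD value.
predicted_mos = mos_bk(3.0)
```

Running the codec on a corrupted bit stream and subtracting the resulting SSNR from the clear-channel SSNR would give a penalty of the kind plotted in Fig. 7; the codec itself is outside the scope of this sketch.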
Other beneficial topics of investigation include comparisons of performance with different input signals/speakers, examination of parametric variations and performance tradeoffs, and evaluations of algorithmic robustness to acoustic background noise.

Fig. 7. Penalty Associated with CELP Channel Errors: (a) Single Bit, (b) Parametric.

4. Conclusion

We have presented new educational speech coding simulation software developed to supplement our speech coding and DSP lecture courses with hands-on experiments. We have also described a laboratory hardware/software environment and outlined simulation exercises. In future work, we will incorporate additional coding algorithms, including a sinusoidal transform coder.

5. References

1. A. Gray and J. Markel, "Distance Measures for Speech Processing," IEEE Trans. ASSP-24, Oct. 1976.
2. N. Kitawaki et al., "Objective Quality Evaluation for Low-Bit-Rate Speech Coding Systems," IEEE J. on Sel. Areas in Comm., pp. 242-248, Feb. 1988.