Speech, Audio, Image, and Video Coding

Speech, Audio, Image, and Video Coding Douglas L. Jones ECE 497 Spring 2000 4/11/00 Source Coding D.L. Jones 1

Outline
  Speech Coding
  Speech Recognition
  Audio Coding
  Image Coding
  Video Coding

Goals for this Lecture
  Learn basic principles underlying speech, audio, image, and video coding
  Understand why coding is important
  Understand some roles of media processing in embedded system design

Speech Coding
  Speech coding is important for:
    reduced data rate in digital communications, such as cell-phones
    reduced memory and storage requirements for digital answering machines, speech databases, etc.

Speech Processing Parameters
  Typical sampling rate: 8 kHz
  Typical quantization: 8-bit mu-law or A-law
  Base rate: 64,000 bits per second (64 kbps)
  Compressed rates range from 2.4 to 32 kbps
  Modern compression methods are based on ADPCM or LPC
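
The base-rate arithmetic and the idea behind mu-law quantization can be sketched in a few lines of Python. This is a minimal illustration, assuming the continuous mu-law curve with mu = 255 and a plain uniform 8-bit quantizer; deployed telephony codecs use a piecewise-linear approximation of this curve, and all function names here are made up for the example.

```python
import math

MU = 255  # assumed mu-law parameter (the value commonly used for 8-bit telephony)

def mulaw_compress(x, mu=MU):
    """Map a sample in [-1, 1] onto the logarithmic mu-law scale in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

def quantize(y, bits=8):
    """Uniformly quantize a value in [-1, 1] to 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(levels - 1, int((y + 1.0) / step))
    return -1.0 + (index + 0.5) * step

def codec(x, bits=8):
    """Compress, quantize, and expand one sample."""
    return mulaw_expand(quantize(mulaw_compress(x), bits))

# Base rate from the slide: 8 kHz sampling times 8 bits per sample
sample_rate_hz = 8000
bits_per_sample = 8
base_rate_bps = sample_rate_hz * bits_per_sample  # 64000 bits per second
```

For a quiet sample such as 0.01, `codec(0.01)` lands closer to the input than `quantize(0.01)` does, which is the reason companding is used: small amplitudes, which dominate speech, get finer quantization steps.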

ADPCM
  Adaptive Differential Pulse Code Modulation (ADPCM) is an old compression standard that remains the best method for half-rate (32 kbps) compression
  ADPCM is a waveform compression method that tries to preserve the actual speech signal waveform as much as possible.

  ADPCM codes the difference signal; that is, d(n) = x(n) - x(n-1)
  Since the speech waveform is primarily a low-frequency signal, the difference is usually small, so it requires fewer bits to represent
  Adaptive DPCM adjusts the quantization step size to track the difference amplitude

  A short adaptive prediction filter further reduces the size of the difference
  ADPCM at half rate (32 kbps) sounds almost indistinguishable from 64 kbps speech
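
The difference-coding and step-adaptation ideas can be sketched as follows. This is a toy coder written for illustration, not a standardized ADPCM algorithm: the predictor is simply the previous reconstructed sample, and the adaptation rule (grow the step by 1.6 after a large code, shrink it by 0.9 otherwise) and all names are assumptions. With 4-bit codes at 8 kHz the rate is 32 kbps.

```python
import math

def _adapt(step, code, min_step=1e-4, max_step=0.5):
    """Grow the step after a large code, shrink it after a small one."""
    factor = 1.6 if abs(code) >= 5 else 0.9
    return min(max_step, max(min_step, step * factor))

def adpcm_encode(samples, bits=4, step=0.02):
    """Code each sample as a quantized difference from the running prediction."""
    max_code = 2 ** (bits - 1) - 1  # codes lie in [-7, 7] for 4 bits
    pred = 0.0
    codes = []
    for x in samples:
        d = x - pred  # difference signal
        code = max(-max_code, min(max_code, round(d / step)))
        codes.append(code)
        pred += code * step  # mirror what the decoder will reconstruct
        step = _adapt(step, code)
    return codes

def adpcm_decode(codes, step=0.02):
    """Rebuild the waveform by accumulating the dequantized differences."""
    pred = 0.0
    out = []
    for code in codes:
        pred += code * step
        out.append(pred)
        step = _adapt(step, code)
    return out

# Example input: a 200 Hz tone sampled at 8 kHz
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
decoded = adpcm_decode(adpcm_encode(tone))
```

The encoder predicts from its own reconstruction rather than from the clean input, so encoder and decoder stay in step and quantization errors do not accumulate.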

Linear Prediction Coding
  Linear Prediction Coding (LPC) is a fundamentally different, model-based approach to speech coding
  Based on an acoustic tube model of human speech production
  Models short speech segments as either white noise (unvoiced) or an impulse train (voiced) input to an all-pole (IIR) filter

LPC-10
  Input amplitude, voiced/unvoiced decision, pitch period, and 10th-order filter coefficients are computed for 20-30 ms blocks
  Instead of transmitting speech, send only the filter coefficients and other parameters!
  Rerun the filter at the receiving end to reconstruct speech

  Produces artificial-sounding but understandable speech reconstructions at rates as low as 2400 bits/sec
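
A rough sketch of the analysis and synthesis steps, assuming the autocorrelation method with the Levinson-Durbin recursion (one common way to fit the all-pole filter; the slides do not say which method LPC-10 uses). The function names and the sign convention A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p are choices made for this example.

```python
import random

def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of one block of samples."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for prediction-error filter coefficients a (a[0] == 1) and error power."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this stage
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def excitation(voiced, pitch_period, length):
    """Impulse train for voiced blocks, white noise for unvoiced blocks."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]
    return [random.gauss(0.0, 1.0) for _ in range(length)]

def synthesize(a, excite, gain=1.0):
    """Run the excitation through the all-pole filter 1/A(z)."""
    out = []
    for n, e in enumerate(excite):
        y = gain * e - sum(a[k] * out[n - k] for k in range(1, len(a)) if n - k >= 0)
        out.append(y)
    return out

def bits_per_frame(rate_bps, frame_seconds):
    """Bit budget available for all parameters of one block."""
    return round(rate_bps * frame_seconds)
```

At 2400 bits/sec, `bits_per_frame(2400, 0.020)` is 48 and `bits_per_frame(2400, 0.030)` is 72, so the gain, voicing flag, pitch period, and all ten coefficients must fit in a few dozen bits per block.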

Enhanced LPC Methods
  LPC-10 achieves excellent compression, but with insufficient quality for most telephony applications
  Enhanced LPC methods have been developed with higher rates and better performance
  LPC-based approaches dominate speech coding for rates at and below 16 kbps

RELP
  Residual Excited Linear Prediction (RELP) retains and sends the residual (prediction error) as well
  Sending the residual back through the prediction model reconstructs the original waveform (in the absence of quantization)
  Rate remains fairly high, since the residual requires many bits
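
The reconstruction property is easy to check numerically. The sketch below assumes a fixed second-order predictor with made-up coefficients and no quantization; the function names are illustrative only.

```python
def lpc_residual(x, a):
    """Prediction error e[n] = x[n] + sum over k >= 1 of a[k] * x[n-k], with a[0] == 1."""
    return [x[n] + sum(a[k] * x[n - k] for k in range(1, len(a)) if n - k >= 0)
            for n in range(len(x))]

def lpc_reconstruct(e, a):
    """Feed the residual back through the all-pole model 1/A(z)."""
    out = []
    for n in range(len(e)):
        y = e[n] - sum(a[k] * out[n - k] for k in range(1, len(a)) if n - k >= 0)
        out.append(y)
    return out

# Hypothetical predictor coefficients and a short test signal
a = [1.0, -1.3, 0.6]
signal = [0.0, 0.5, 0.9, 1.0, 0.7, 0.2, -0.3, -0.8, -1.0, -0.6]
residual = lpc_residual(signal, a)
rebuilt = lpc_reconstruct(residual, a)
```

Here `rebuilt` matches `signal` up to floating-point rounding. Once the residual is quantized to save bits, the match is only approximate, which is the trade-off RELP makes.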

CELP
  Code Excited Linear Prediction (CELP) selects an excitation sequence from a codebook of possible choices
  Transmit a code indicating the selection, rather than the residual
  Greatly reduced rate, with only a modest loss in performance

  There are many flavors of CELP; the better lower-rate methods are based on this concept
  Cell-phones tend to use rates from about 4.8 to 9.6 kbps
  Quality is noticeably inferior to telephone, but deemed acceptable
  Allows 3-6 times as many users in a cell!
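
The codebook selection step can be sketched as an analysis-by-synthesis search: each candidate excitation is run through the synthesis filter, scaled by its best gain, and compared with the target block. This is a simplified illustration with an assumed random codebook and no perceptual weighting; practical CELP coders add several refinements, and the names below are hypothetical.

```python
import random

def synth(a, excite):
    """All-pole synthesis filter 1/A(z), with a[0] == 1."""
    out = []
    for n, e in enumerate(excite):
        out.append(e - sum(a[k] * out[n - k] for k in range(1, len(a)) if n - k >= 0))
    return out

def celp_search(target, codebook, a):
    """Return (index, gain, squared_error) of the best-matching codebook entry."""
    best = None
    for index, code in enumerate(codebook):
        y = synth(a, code)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[2]:
            best = (index, gain, err)
    return best

def index_bits(codebook):
    """Bits needed to transmit one codebook index."""
    return (len(codebook) - 1).bit_length()

# Hypothetical 64-entry codebook of 40-sample excitation vectors
random.seed(1)
codebook = [[random.gauss(0.0, 1.0) for _ in range(40)] for _ in range(64)]
```

With 64 entries, `index_bits(codebook)` is 6, so the excitation for a 40-sample block costs 6 bits plus a gain instead of 40 quantized residual samples.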

Hardware note...
  Speech coding/decoding is the primary reason for a DSP µP in digital cell-phones!
  A DSP µP is nearly ideal for speech coding algorithms (an ASIC wouldn't be better)
  Since it's there anyway, the DSP µP is also used for many other functions

Speech Recognition
  Speech recognition is expected to become a very important component of many future embedded systems
  Convenient, natural user interface for:
    very small embedded systems (e.g., wristwatch cell-phone, Palm-Pilot X)
    non-critical systems (e.g., car radio, windshield wipers)

Speech Recognition Methods
  Modern speech recognition is based on short-time spectral analysis
  Spectral estimates are usually constructed from linear prediction followed by further processing
  Hidden Markov Models (HMMs) perform a statistical comparison with a database of words and language models

System Requirements
  Memory and computational requirements:
    Small vocabulary, isolated word recognition: a few MIPS and KBs
    Large vocabulary, continuous speech: 100s of MIPS, 100s of MBs

Audio Coding
  Quality expectations are considerably higher than with speech
  16-bit, 44.1 kHz stereo is the CD standard
  Modern audio coding methods (e.g., MP3) are based on perceptual coding tricks
  Exploit limitations of human hearing to reduce rate while minimizing audible artifacts

  Split the signal into different frequency bands according to the sensitivities of human hearing
  Exploit masking to remove data from bands made inaudible by loud neighbors
  Shape quantization noise to lie in masked regions
  Obtain near-CD quality at 128-256 kbps
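
For scale, the uncompressed CD rate and the compression ratios implied by the quoted coded rates work out as follows. This is plain arithmetic from the numbers on the slides; the variable and function names are illustrative.

```python
# CD audio parameters from the slide
bits_per_sample = 16
sample_rate_hz = 44100
channels = 2

cd_rate_bps = bits_per_sample * sample_rate_hz * channels  # 1,411,200 bits per second

def compression_ratio(coded_rate_bps):
    """Ratio of the uncompressed CD rate to a coded rate."""
    return cd_rate_bps / coded_rate_bps

ratio_at_128k = compression_ratio(128_000)  # about 11 to 1
ratio_at_256k = compression_ratio(256_000)  # about 5.5 to 1
```

So near-CD quality at 128-256 kbps corresponds to roughly a 5.5:1 to 11:1 reduction from the 1.41 Mbps CD rate.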

Image Coding
  Many emerging embedded system applications:
    Digital cameras
    Security (e.g., fingerprint ID)
    Medical record storage
  Image usually acquired with a CCD imaging sensor

Requirements
  Typical image: ~512x512 pixels, 3 colors each at 8 bits
    Or binary black-and-white
  Two types of compression:
    Lossless: maximum compression ratios of 2-3
    Lossy: high quality with compression ratios of 10-30
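
The storage implied by these figures can be worked out directly. The helper names below are made up for this example; the numbers come from the slide.

```python
def raw_image_bytes(width, height, colors, bits_per_color):
    """Uncompressed size of an image in bytes."""
    return width * height * colors * bits_per_color // 8

def compressed_bytes(raw_bytes, ratio):
    """Approximate size after compression by the given ratio."""
    return raw_bytes / ratio

raw = raw_image_bytes(512, 512, 3, 8)  # 786,432 bytes (768 KB)

lossless_range = (compressed_bytes(raw, 3), compressed_bytes(raw, 2))
lossy_range = (compressed_bytes(raw, 30), compressed_bytes(raw, 10))
```

A 768 KB color image therefore shrinks to roughly 256-384 KB losslessly, or to roughly 26-77 KB with high-quality lossy coding.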

Image Compression Standards
  Binary images: JBIG/FAX standards
    Primarily based on run-length coding (i.e., number of black or white pixels in succession)
  8-bit images: JPEG standard
    DCT-based
  Emerging standards are wavelet-based (EZW, SPIHT, JPEG-2000)
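
Run-length coding of one row of a binary image can be sketched as below. This shows only the run-counting idea; the fax standards additionally entropy-code the run lengths, which is not shown, and the function names are illustrative.

```python
def rle_encode(row):
    """Encode a row of 0/1 pixels as (first_pixel_value, list_of_run_lengths)."""
    if not row:
        return (0, [])
    runs = []
    current, count = row[0], 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current, count = pixel, 1
    runs.append(count)
    return (row[0], runs)

def rle_decode(first, runs):
    """Expand run lengths back into pixels, alternating between 0 and 1."""
    out = []
    value = first
    for length in runs:
        out.extend([value] * length)
        value = 1 - value
    return out

# Example row: 10 white pixels, 3 black, 7 white
row = [0] * 10 + [1] * 3 + [0] * 7
encoded = rle_encode(row)  # (0, [10, 3, 7])
```

Twenty pixels become three small integers here; the gain is largest for document-like images with long uniform runs.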

Principles of JPEG
  Image segmented into 8x8 blocks of pixels
  2-D Discrete Cosine Transform (DCT) computed for each block
  Most of these frequency components are typically very small and can be coarsely quantized or discarded
  Quantized data is entropy-coded
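
The transform and quantization steps can be sketched for a single 8x8 block. This is a simplified illustration: it assumes an orthonormal DCT and one uniform quantization step for every coefficient, whereas JPEG uses a table of per-frequency step sizes, and the entropy-coding stage is omitted. Function names are hypothetical.

```python
import math

N = 8  # JPEG block size

def _scale(k):
    """Normalization that makes the DCT orthonormal."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def _basis(n, k):
    """Cosine basis value for sample index n and frequency index k."""
    return math.cos((2 * n + 1) * k * math.pi / (2 * N))

def dct2(block):
    """2-D DCT of an 8x8 block (direct, unoptimized evaluation)."""
    return [[_scale(u) * _scale(v) *
             sum(block[x][y] * _basis(x, u) * _basis(y, v)
                 for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

def idct2(coeffs):
    """Inverse 2-D DCT."""
    return [[sum(_scale(u) * _scale(v) * coeffs[u][v] * _basis(x, u) * _basis(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]

def quantize(coeffs, step):
    """Round each coefficient to the nearest multiple of step (stored as an integer)."""
    return [[round(c / step) for c in row] for row in coeffs]

def dequantize(levels, step):
    """Map integer levels back to coefficient values."""
    return [[v * step for v in row] for row in levels]

# Example block: a smooth gradient, level-shifted to be centered on zero
block = [[5 * x + 3 * y - 28 for y in range(N)] for x in range(N)]
levels = quantize(dct2(block), 16)
nonzero = sum(1 for row in levels for v in row if v != 0)
approx = idct2(dequantize(levels, 16))
```

For a smooth block like this one, only a handful of the 64 quantized coefficients are nonzero, which is what the entropy coder then exploits.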

JPEG Characteristics
  At compression rates of 1 bit per pixel (bpp), quality loss is usually small
  Below about 0.5 bpp, blocking artifacts begin to appear; much below this is usually unacceptable

Emerging Methods
  New methods based on wavelets are emerging
  Frequency decomposition by successive subband filtering
  Small coefficients discarded
  Artifacts generally less objectionable

  Exploitation of tree structure and dependencies yields further compression
  JPEG-2000 standard will be based on these methods
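
Successive subband filtering and coefficient discarding can be illustrated with the simplest wavelet, the Haar wavelet, on a 1-D signal. This is only a sketch of the principle: the image coders named above use longer filters in two dimensions plus tree-structured coding, none of which is shown, and the function names are made up.

```python
import math

ROOT2 = math.sqrt(2.0)

def haar_forward(signal):
    """Full Haar decomposition; len(signal) must be a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        low = [(coeffs[2 * i] + coeffs[2 * i + 1]) / ROOT2 for i in range(half)]
        high = [(coeffs[2 * i] - coeffs[2 * i + 1]) / ROOT2 for i in range(half)]
        coeffs[:n] = low + high  # keep splitting the low-pass half
        n = half
    return coeffs

def haar_inverse(coeffs):
    """Invert haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for low, high in zip(out[:n], out[n:2 * n]):
            merged += [(low + high) / ROOT2, (low - high) / ROOT2]
        out[:2 * n] = merged
        n *= 2
    return out

def discard_small(coeffs, threshold):
    """Zero every coefficient whose magnitude is below the threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]

# Example: a piecewise-constant signal needs only two nonzero coefficients
signal = [4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0]
coeffs = haar_forward(signal)
significant = sum(1 for c in coeffs if abs(c) > 1e-9)
```

Because each stage splits only the low-pass half again, a coefficient at a coarse scale has "children" at finer scales covering the same region, which is the tree structure that coders such as EZW and SPIHT exploit.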

Video Coding
  Embedded system examples:
    HDTV
    Satellite TV
    Set-top boxes
    Security systems
    Multimedia devices

Motion-Based Coding Methods
  Modern video coding methods exploit frame-to-frame similarities to further compress video
  Similar to JPEG, except that motion-compensated difference frames are coded with the DCT
  Motion vectors encode the change in location of blocks
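
Finding a motion vector for one block can be sketched as an exhaustive block-matching search that minimizes the sum of absolute differences (SAD). The block size, search range, cost measure, and function names here are assumptions for illustration; the standards leave the search strategy to the encoder.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block in cur and a shifted block in ref."""
    return sum(abs(cur[by + j][bx + i] - ref[by + j + dy][bx + i + dx])
               for j in range(size) for i in range(size))

def best_motion_vector(cur, ref, bx, by, size=4, search=2):
    """Return (cost, dx, dy) for the best match within +/- search pixels."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame
            if by + dy < 0 or by + dy + size > height:
                continue
            if bx + dx < 0 or bx + dx + size > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best

# Synthetic 12x12 reference frame, and a current frame whose content
# has moved 2 pixels right and 1 pixel down
SIZE = 12
ref = [[10 * y + x * x for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(SIZE)]
       for y in range(SIZE)]
vector = best_motion_vector(cur, ref, bx=4, by=4)  # (0, -2, -1)
```

The vector points from the current block back to where its content sat in the reference frame. The encoder sends the vector and then DCT-codes only the (ideally small) difference between the block and its motion-compensated prediction.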

Video Coding Standards
  MPEG-2 and MPEG-4 are the leading standards for high (television) quality video coding
  H.263 is the primary standard for low-rate video coding (video-phones)
  Compression ratios of 30-50 with good quality are usually obtained

Summary
  Source coding is essential to reduce the memory requirements and bandwidth of multimedia data
  Complex DSP algorithms obtain great data reductions with little loss in quality
  Coding algorithms have characteristics common to other DSP computations
  Source coding is likely to play an increasingly important role in many embedded systems