The Essence of Image and Video Compression
1E8: Introduction to Engineering. Introduction to Image and Video Processing



Dr. Anil C. Kokaram, Electronic and Electrical Engineering Dept., Trinity College, Dublin, Ireland, anil.kokaram@tcd.ie

Overview

This handout covers the basics of image and video compression as follows.

1. What is compression and why is it needed?
2. The simplest possible compression scheme: Run Length Encoding
3. Representing signals by sums of sines and cosines [the Fourier Transform]
4. Transform compression and JPEG
5. Motion estimation and predicting pictures in a sequence
6. Video Compression

2 The need for compression

Consider a typical television image. It consists of 720 pixels in each row, and there are 576 rows. A 4:2:2 (broadcast standard) video frame (as you would get from your Digital Set Top box, or DVD) represents colour with, on average, one luminance byte and one chrominance byte per pixel. In one frame there are therefore 720 x 576 x 2 = 829,440 bytes. At 25 frames/sec this means a bandwidth of about 20.7 MB/sec is required to transmit the VIDEO ALONE! This means about 75 GB to store one hour of movie. This is the RAW DATA bandwidth.

The available bandwidth for a single Digital television channel is at best 6 Mbits/sec. This is about 30 times smaller than the 166 Mbits/sec (20.7 MB/sec) needed. DVD can store at most 4 GB, so how does one fit a whole movie on a DVD? Your digital mobile phone can handle a few Mbit/sec absolute TOPS. That is orders of magnitude smaller than required for video.

Imagine you are a film and TV archive (like www.ina.fr or the BBC or RTE). You need to keep a record of 24 hours of programming on 10s of channels daily, going back decades in the case of the BBC. Hmm... there is not enough space in a town to stack up the CDs needed to store that!

So a mechanism is needed to represent images with fewer bytes than the raw data.
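The arithmetic above is worth checking for yourself. Here is a quick Python sketch; the 720 x 576 resolution, 2 bytes per pixel for 4:2:2 and 25 frames/sec are the broadcast-standard assumptions used above:

```python
# Raw data rate of uncompressed broadcast video (CCIR 601 PAL-style numbers).
WIDTH, HEIGHT = 720, 576   # pixels per row, rows per frame
BYTES_PER_PIXEL = 2        # 4:2:2 -> one luma byte + one chroma byte per pixel on average
FPS = 25                   # frames per second

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL   # ~0.83 MB per frame
rate_bytes = frame_bytes * FPS                   # ~20.7 MB/sec of raw video
hour_bytes = rate_bytes * 3600                   # ~75 GB per hour of movie

print(f"{frame_bytes / 1e6:.2f} MB/frame")
print(f"{rate_bytes / 1e6:.1f} MB/sec = {8 * rate_bytes / 1e6:.0f} Mbit/sec")
print(f"{hour_bytes / 1e9:.0f} GB/hour")
print(f"{8 * rate_bytes / 6e6:.0f}x more than a 6 Mbit/sec channel")
```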

3 Towards compression

I don't really need 720 x 576 pixels for my 2-inch mobile screen, do I? So I can throw away 3 out of every 4 pixels and 3 out of every 4 lines (subsampling) for instance, and yield a 180 x 144 picture instead. So now I can show the same picture for 1/16 the storage. Not good enough. Besides, pictures that size look really crap on a TV set.

Format   Total Resolution   Active Resolution   MB/sec
CCIR 601, 30 frames/sec, 4:3 Aspect Ratio, 4:2:2
  QCIF   176 x 144          176 x 144           1.5
  CIF    352 x 288          352 x 288           6.1
  Full   858 x 525          720 x 486           21.0
CCIR 601, 25 frames/sec, 4:3 Aspect Ratio, 4:2:2
  QCIF   176 x 144          176 x 144           1.3
  CIF    352 x 288          352 x 288           5.1
  Full   864 x 625          720 x 576           20.7

What if I start to think about mathematical models for pictures? Then I can send/store the parameters of my model instead of the actual pictures, and if my model is simple, I can store fewer parameters than pixels and get some compression. Hmmm. But pictures look pretty complicated. In fact most interesting pictures tend to be different from other pictures. Otherwise why look? It turns out, though, that you can make some generic statements about images and image sequences.

1. In small, local regions, pixel intensity and colour tend to be the same, or at least slowly varying. For small, think blocks of 8 x 8 pixels.
2. You can construct any picture by adding together a weighted set of pre-defined primitive pictures. These primitive pictures are in fact the 2D equivalent of sines and cosines.
3. In a video sequence, consecutive pictures tend to look the same except for the moving bits.

We'll use these ideas now.

4 Run Length Encoding

Consider that you want to transmit a fax as an image. There are just 2 colours: 0 = black and 1 = white. Let's say your image is as below (the letter H in a binary image).

[Figure: a binary image of the letter H.]

Instead of sending every single pixel, since there tend to be long lengths of consecutive repeated pixels (i.e. long runs), we could send a 0 (for instance) followed by the number of times it is repeated. So instead of sending or storing 128 zeros one by one, you would store 0 128, the first number being the colour, and the second being the number of times that colour occurred consecutively. Instead of storing 128 bytes, we have stored just 2. We have encoded some raw data of 128 zeros as just 2 bytes. We have achieved a compression factor of 64!

In typical RLE schemes, you do not account for all possible runs. Instead you only allow for runs of length say 1 to 32, for instance. Then a run of length 64 would need to be encoded as 2 runs of length 32. Let's say for our RLE scheme we allow a maximum run length of 8, and the data is either 0 or 1. The image example can then be represented by a short table of (colour, run length) pairs.

But what about a real/grayscale image? Hmm. RLE might get inefficient if the data is not mostly flat!
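To make this concrete, here is a minimal run-length encoder/decoder in Python. It is an illustration along the lines above, not code from the handout; max_run plays the role of the maximum run length just discussed:

```python
def rle_encode(pixels, max_run=255):
    """Encode a sequence of pixel values as (value, run_length) pairs.
    Runs longer than max_run are split, as in the scheme described above."""
    out = []
    i = 0
    while i < len(pixels):
        j = i
        while j < len(pixels) and pixels[j] == pixels[i] and j - i < max_run:
            j += 1
        out.append((pixels[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    """Invert rle_encode: expand (value, run_length) pairs back to pixels."""
    out = []
    for value, run in pairs:
        out.extend([value] * run)
    return out

row = [0] * 128                        # 128 consecutive zeros
print(rle_encode(row))                 # [(0, 128)]: 2 numbers instead of 128
print(rle_encode([0] * 64, max_run=32))  # capped runs: [(0, 32), (0, 32)]
assert rle_decode(rle_encode(row)) == row  # completely reversible
```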

5 Signal Transforms

What if it were possible to change the image in some reversible process, so that we created a result that was easier to compress? In other words, we take our data and transform it in some clever way to make RLE work better.

This is related to another idea. Suppose I had a photoalbum/dictionary of all the possible images in the world, ever made in the past and ever possible in the future. And suppose I gave you a copy of this dictionary in which each image was assigned a number. Then instead of having to send you the raw data, I would just send you the number of the image in the dictionary, and you could look it up and you'd have the picture! This dictionary would be very large, since pictures come in many flavours. To make a smaller dictionary, you can instead choose images which, when added together, make up the picture you want to send or store. So now to send a picture, the transmitting end has to work out which set of images could be added together to give the picture. Then the transmitter sends the indexes of those images to the receiver. The receiver then looks up the pictures and adds them together to give the received image.

About 200 years ago, a guy called Fourier spotted that you could actually do this with any signal. (No electricity then, no computers, no cinema, no television, no hot baths, no baths, no showers. Lice in your hair all the time, no soap, no nylon, no jeans, no flushing toilets, no sewage system...) He was working on 1D signals, but the same applies to 2D ones.

5.1 Representing signals with waves

The brilliant discovery of Fourier was that any 1D signal can be represented by a weighted sum of sines and cosines. So to make a triangle wave for instance, all you need to do is to add a bunch of sines and cosines together, of different frequencies and different amplitudes.

[Figure: a triangle wave built up step by step from sinusoids, plotted against time (seconds), together with the amplitudes of the component frequencies.]

And he came up with a mathematical formula that says which frequencies and which amplitudes were needed to synthesise a particular signal.

Since we all know what sines and cosines look like, we can summarise this signal decomposition with a graph of Amplitude versus Frequency. That graph will tell us how much of each frequency should be added together. This is the Frequency Spectrum of a signal. Given this graph, Fourier also worked out how to reconstruct the original signal. He discovered a completely reversible transform: the Fourier Transform. It converts, or transforms, a signal from the time domain into a frequency domain. For audio signals like music this sorta makes intuitive sense; for images and other signals it's less intuitive but no less useful.

150 years later (in the 1960s), people worked out how to use this for digital signals and how it could be automated with computers. (People were sorting out the showers, baths, electricity and lice in the meantime. One of the guys responsible had the funny name of Tukey.) Then Fourier's idea really became super-useful. You see: we can think of the sines and cosines at different frequencies as our dictionary, and the amplitudes as a weight attached to each one. So to transmit some data, all you need to do is to work out the frequencies and amplitudes and send those instead of the actual raw data. The signals in this special dictionary are called basis functions and the corresponding amplitudes needed are called coefficients.

So it's a bit like saying: instead of sending the triangle wave (in the example above), send the graph of amplitude versus frequency. That graph is a whole lot smaller, but it contains all the same information. Think of this. Suppose I have a music signal which is a pure sine wave lasting 1 sec at a single frequency (say 500 Hertz), represented by a digital signal sampled at 44.1 kHz. This means that my data record is 44,100 samples long. Say we're using 16-bit audio; that's 88,200 bytes. Instead of transmitting all 88,200 bytes, how could I send the same signal with just 3 bytes?
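Here is the idea in code: the spectrum of a pure sine is a single spike, so its frequency and amplitude (two or three numbers) describe all 44,100 samples. A small numpy sketch, assuming the 500 Hz tone from above:

```python
import numpy as np

fs = 44100                                # sampling rate (samples/sec)
t = np.arange(fs) / fs                    # 1 second of time stamps
x = 0.7 * np.sin(2 * np.pi * 500 * t)     # pure 500 Hz tone, amplitude 0.7

X = np.fft.rfft(x) / len(x)               # one-sided frequency spectrum
k = np.argmax(np.abs(X))                  # the single dominant frequency bin
print(k * fs / len(x), 2 * np.abs(X[k]))  # -> 500.0 (Hz) and ~0.7 (amplitude)
```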

5.2 Image Transforms

With 2D signals things are a bit trickier. 2D sines and cosines look a bit like a wave in a wave tank, or a wave in your bath, or a wave in the sea, except that the wave is a wave in intensity or brightness. The equation for working out how much of each wave you need to make a picture is also a bit tricky. Furthermore, each wave is represented by a complex number. Urgh?

[Figure: a 2D sinusoidal image patch. The wave is directed at an angle off horizontal, with a frequency measured in cycles per pel in that direction, and a phase lag.]

Instead, electrical/signal processing engineers have come up with a simpler (well, not really) transform that uses only cosine waves. This transform, known as the Discrete Cosine Transform (DCT), results in only real numbers. It is the basis of JPEG.

5.3 JPEG for First Year Undergraduates

JPEG is based on transforming 8 x 8 blocks of pixels using the 2D DCT. For a signal of 8 samples, the 8 possible DCT basis functions (the dictionary) are as below.

[Figure: the eight 8-point DCT basis functions, rows 1 to 4 and rows 5 to 8.]

The 64 2D DCT basis functions and the 2D DCT of a block in Lenna are shown below.

[Figure: the 64 basis functions of the 8 x 8 2D DCT, and the 2D DCT of an 8 x 8 block taken from Lenna.]

Now we can see that the effect of transforming a block of pixels is to compact its energy into a few coefficients: it's flatter in the DCT space. This means that we have less information to transmit. Here is what happens if we take every 8 x 8 block in Lenna and transform it with the 2D DCT.
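You can verify the energy compaction yourself. The sketch below builds the orthonormal 8-point DCT-II matrix and transforms a smooth, made-up 8 x 8 block (a stand-in for a flat patch of Lenna); nearly all the energy lands in the top-left coefficients:

```python
import numpy as np

N = 8
k = np.arange(N)
# Orthonormal DCT-II matrix: C[k, n] = a_k * cos(pi * (2n + 1) * k / (2N))
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
C[0, :] /= np.sqrt(2.0)

x, y = np.meshgrid(k, k)
block = 100.0 + 2 * x + 3 * y        # a slowly varying block of intensities

coeffs = C @ block @ C.T             # 2D DCT: transform rows, then columns
print(np.round(coeffs, 1))           # big values crowd into the top-left corner
print(np.allclose((block**2).sum(), (coeffs**2).sum()))  # True: energy preserved
# The transform is completely reversible: C.T @ coeffs @ C reconstructs `block`.
```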

Now we're almost there... You can see that in the transformed images there are many coefficients that are almost zero. So why transmit or store them at all? If we wanted to reconstruct the image exactly, we would need all these tiny values, but because we know that the Human Visual System can tolerate defects in pictures, we know that maybe we can throw away the small coefficients and keep the big ones and still have a reasonable looking picture.

In fact, in JPEG what is done is to quantise the coefficients with varying degrees of accuracy. So the top left hand corner coefficient is quantised with many levels (fine steps), say, while the bottom right hand corner is quantised with only a few levels (coarse steps). This is because low frequency information is more important than high frequency for visual perception. When you set the Quality setting for JPEG in Adobe Photoshop, you are changing the quantisation levels. For low quality you throw away more information, i.e. you quantise more coarsely. For high quality you keep more information, so you quantise finely. After that step, JPEG uses RLE to encode each block of coefficients in a zig-zag scan.
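A sketch of the quantise-and-scan step is below. The quantisation matrix here is an illustrative ramp (step sizes growing towards high frequencies), not the actual JPEG table, and quality just scales the step sizes the way the Photoshop slider does; coeffs comes from the DCT sketch above and rle_encode from the RLE section:

```python
import numpy as np

def zigzag(N=8):
    """Indices of an N x N block in JPEG zig-zag order (low to high frequency)."""
    return sorted(((i, j) for i in range(N) for j in range(N)),
                  key=lambda p: (p[0] + p[1], p[1] if (p[0] + p[1]) % 2 == 0 else p[0]))

def quantise(coeffs, quality=1.0):
    """Quantise DCT coefficients: fine steps at low frequency, coarse at high."""
    i, j = np.indices(coeffs.shape)
    Q = (1 + i + j) * 8.0 / quality      # illustrative matrix, NOT the JPEG table
    return np.round(coeffs / Q).astype(int)

q = quantise(coeffs)                     # `coeffs` from the DCT sketch above
scan = [q[i, j] for i, j in zigzag()]    # low-frequency coefficients first...
print(rle_encode(scan))                  # ...so the zeros bunch into long runs
```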

Problems: blocking artefacts and mosquito noise.

[Figure: JPEG-compressed images illustrating blocking artefacts and mosquito noise.]

6 Video Compression

All the best codecs for media are based on transforming the data in some way. JPEG 2000 is based on a new kind of transform, the Wavelet Transform, discovered only in the late 1980s. Compression of audio (.mp3) is based on a 1D DCT. MPEG (Motion Picture Experts Group) is used for compression of video for DVD or DTV [MPEG 1, 2, 4]. Ireland was a major player in establishing the MPEG 4 standard. Intel Indeo, Apple Quicktime, DivX are all based on MPEGgy ideas.

MPEG is based again on the 8-point DCT just like JPEG, except... in video most consecutive pictures look the same. So if I knew what one picture looked like, then in theory I could build all the others by slightly adjusting that one. This is called prediction. But things move around in video, so we have to estimate that motion to work out how to shift the pixels around in order to create the next image.

6.1 On Motion Compensated Prediction

To understand how prediction can help with video compression, the top row of figure 1 shows a sequence of images of the Suzie sequence. It is QCIF (176 x 144) resolution, at a frame rate of 30 frames/sec. We have already seen that Transform coding of images yields significant levels of compression, e.g. JPEG. Therefore a first step at compressing a sequence of data is to consider each picture separately. Consider using the 2D DCT of 8 x 8 blocks. The DCT coefficients for each frame of Suzie are shown in the second row of figure 1. The use of the DCT on the raw image data yields a compression of the original 8 bits/pel data to about 1.8 bits/pel on each frame. Note that the DCT coefficients have NOT been quantised using the standard JPEG Quantisation matrix, for demonstration purposes.

We know that most images in a sequence are mostly the same as the frames nearby, except with different object locations. Thus we can propose that the image sequence obeys a simple predictive model (discussed in previous lectures) as follows:

    $I_n(\vec{x}) = I_{n-1}(\vec{x} + \vec{d}(\vec{x})) + e(\vec{x})$        (1)

where $I_n(\vec{x})$ is the intensity at pixel site $\vec{x}$ in frame $n$, $\vec{d}(\vec{x})$ is the motion vector at that site, and $e(\vec{x})$ is some small prediction error that is due to a combination of noise and model mismatch. Thus we can measure the prediction error at each pixel in a frame as

    $\mathrm{DFD}(\vec{x}) = I_n(\vec{x}) - I_{n-1}(\vec{x} + \vec{d}(\vec{x}))$        (2)

This is the motion compensated prediction error, sometimes referred to as the Displaced Frame Difference (DFD). The only model parameter required to be estimated is the motion vector $\vec{d}(\vec{x})$. Assume for the moment that we use some process to estimate these vectors. We will look at that later. Figure 2 illustrates how motion compensation can be applied to predict any frame from any previous frame using motion estimation.
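Equations (1) and (2) translate directly into code. Below is a sketch of building a motion compensated prediction and its DFD, given a block-based motion field; the array names, the one-vector-per-block layout and integer-accurate vectors are assumptions for illustration:

```python
import numpy as np

def motion_compensate(prev, vectors, B=16):
    """Predict the current frame by shifting B x B blocks of the previous
    frame `prev` along their motion vectors (equation 1, integer accuracy)."""
    H, W = prev.shape
    pred = np.empty_like(prev)
    for by in range(0, H, B):
        for bx in range(0, W, B):
            dy, dx = vectors[by // B, bx // B]      # one vector per block
            sy = int(np.clip(by + dy, 0, H - B))    # keep the displaced block
            sx = int(np.clip(bx + dx, 0, W - B))    # inside the frame
            pred[by:by + B, bx:bx + B] = prev[sy:sy + B, sx:sx + B]
    return pred

# The Displaced Frame Difference (equation 2) is all that needs transmitting:
# dfd = cur.astype(int) - motion_compensate(prev, vectors)
```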

[Figure 2: Explaining how motion compensation works. A block in frame n is matched by a shifted block in frame n-1; the shift between the location of the block in frame n and the matching block in frame n-1 is the motion vector.]

The figure shows block based motion vectors being used to match every block in frame $n$ with the block that is most similar in frame $n-1$. The difference between the corresponding pixels in these blocks, according to equation (2), is the prediction error. In MPEG, the situation shown in figure 2 (where frame $n$ is predicted by a motion compensated version of frame $n-1$) is called Forward Prediction. The block that is to be constructed, i.e. in frame $n$, is called the Target Block. The frame that is supplying the prediction is called the Reference Picture, and the resulting data used for the motion compensation (i.e. the displaced block in frame $n-1$) is the Prediction Block.

6.2 Image prediction

The fourth row of figure 1 shows the prediction error of each frame of the Suzie sequence, starting from the first frame as a reference. A three-level Block Matcher was used, with a motion threshold for motion detection applied at the highest resolution level and fractional-pel search accuracy. Each DFD frame is the difference between frame $n$ and a motion compensated frame $n-1$, given the original frame $n-1$. Again, we can compress this sequence of transformed images (including the first I frame) using the DCT of 8 x 8 blocks. Now the amount of data needed per frame is about 0.4 bits/pel. Substantial compression has been achieved over attempting to compress each image separately. Of course, you will have deduced that this was going to be the case because there is much less information content in the DFD frames than in the original picture data.

[Figure 1: Frames 50-53 of the Suzie sequence processed by various means. From top to bottom row: original frames; DCT of top row; non-motion compensated DFD; motion compensated DFD with backward prediction; DCT of previous row.]

[Figure 3: A typical Group of Pictures (GOP) in MPEG: I B B P B B P B B I.]

To confirm that it is indeed motion compensated prediction that is contributing most of the benefit, the 3rd row of figure 1 shows the non-motion compensated frame difference (FD) between the frames of Suzie. There is substantially more energy in these FD frames than in the DFD frames, hence the higher bit rate.

6.3 Problems with occlusion

A closer look at the DFD frame sequence in row 4 of figure 1 shows that in the last two frames in particular there are some areas that show very high DFD. This is explained by observing the behaviour of Suzie in the top row. In those frames her head moves such that she uncovers or occludes some area of the background. The phone handset also uncovers a portion of her swinging hair. In the situation of uncovering, the data in some parts of frame $n$ simply does not exist in frame $n-1$. Thus the DFD must be high. However, the data that is uncovered in frame $n$ typically is also exposed in frame $n+1$. Therefore, if we could look into the next frame as well as the previous frame, we probably will be able to find a good match for any block, whether it is occluded or uncovered. Using such Bi-directional prediction gives much better image fidelity. This idea is used in MPEG-2. It uses forward prediction for some frames (P frames) and bidirectional prediction for others (B frames). The sequencing is shown in figure 3. Typically MPEG encodes images in the order IBBPBBPBBPBB, then I again, and so on.

I-frames (Intra-coded frames) are encoded just like JPEG, i.e. without any motion compensation. This allows the codec to cope with varying image content... think what would happen if you tried to predict every image in a movie from the first frame. It's not going to work, is it? So I-frames are slipped in every 12 frames or so to give a new reference frame for prediction of the next group of frames.
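The GOP pattern itself is trivial to generate. A tiny sketch, where gop and m (the I-to-I and P-to-P spacings) are the usual parameter names, chosen here to reproduce a typical MPEG ordering:

```python
def gop_types(n_frames, gop=12, m=3):
    """Frame types in display order: an I every `gop` frames, a P every `m`,
    and B frames (bidirectionally predicted) everywhere in between."""
    return ["I" if i % gop == 0 else "P" if i % m == 0 else "B"
            for i in range(n_frames)]

print("".join(gop_types(13)))   # -> IBBPBBPBBPBBI
```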

6.4 Sledgehammer motion estimation: Block Matching

The most popular, and to some extent the most robust, technique to date for motion estimation is Block Matching (BM). Two basic assumptions are made in this technique.

1. Constant translational motion over small blocks (say 8 x 8 or 16 x 16) in the image. This is the same as saying that there is a minimum object size that is larger than the chosen block size.
2. There is a maximum (pre-determined) range for the horizontal and vertical components of the motion vector at each pixel site. This is the same as assuming a maximum velocity for the objects in the sequence. This restricts the range of vectors to be considered and thus reduces the cost of the algorithm.

The image in frame $n$ is divided into blocks, usually of the same size, $N \times N$. Each block is considered in turn and a motion vector is assigned to each. The motion vector is chosen by matching the block in frame $n$ with a set of blocks of the same size at locations defined by some search pattern in the previous frame.

Given a possible vector $\vec{v}$, we can define the DFD between a pixel in the current frame and its motion compensated pixel in the previous frame as

    $\mathrm{DFD}(\vec{x}, \vec{v}) = I_n(\vec{x}) - I_{n-1}(\vec{x} + \vec{v})$        (3)

Define the Mean Absolute Error of the DFD between the block in the current frame and that in the previous frame as

    $\mathrm{MAE}(\vec{v}) = \frac{1}{N^2} \sum_{\vec{x} \in \mathrm{Block}} |\mathrm{DFD}(\vec{x}, \vec{v})|$        (4)

We could use Mean Squared Error (MSE) as well, but MAE is more robust to noise. The block matching algorithm then proceeds as follows at each image block.

1. Pre-determine a set of candidate vectors $\vec{v}$ to be tested as the motion vector for the current block.
2. For each $\vec{v}$, calculate the MAE.
3. Choose the motion vector for the block as the $\vec{v}$ which yields the minimum MAE.

The set of vectors in effect yields a set of candidate motion compensated blocks in the previous frame for evaluation. The separation of the candidate blocks in the search space determines the smallest vector that can be estimated. For integer accurate motion estimation, the position of each block coincides with the image grid. For fractional accuracy, blocks need to be extracted between locations on the image grid. This requires some interpolation. In most cases Bilinear interpolation is sufficient.

Figure 4 shows the search space used in a full motion search technique. The current block is compared to every block of the same size in an area of size $(N + 2w) \times (N + 2w)$. The search space is chosen by deciding on the maximum displacement allowed: in Figure 4 the maximum displacement estimated is $\pm w$ pels for both horizontal and vertical components, giving $(2w+1)^2$ searched locations. The technique arises from a direct solution of equation (1): the BM solution can be seen to minimise the Mean Absolute DFD (or Mean Square DFD) with respect to $\vec{v}$ over the block. The chosen displacement satisfies the model equation in some average sense.
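A direct transcription of this full-search algorithm (equations (3) and (4)), with integer-accurate vectors and the MAE criterion, might look like this:

```python
import numpy as np

def full_search(cur, prev, by, bx, N=16, w=7):
    """Full-search block matching for the N x N block at (by, bx) in `cur`:
    evaluate all (2w+1)^2 displacements in [-w, w]^2 and return the one
    minimising the Mean Absolute Error of the DFD (equation 4)."""
    H, W = prev.shape
    block = cur[by:by + N, bx:bx + N].astype(int)
    best_mae, best_v = np.inf, (0, 0)
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            sy, sx = by + dy, bx + dx
            if not (0 <= sy <= H - N and 0 <= sx <= W - N):
                continue                              # candidate outside the frame
            cand = prev[sy:sy + N, sx:sx + N].astype(int)
            mae = np.abs(block - cand).mean()         # equation (4)
            if mae < best_mae:
                best_mae, best_v = mae, (dy, dx)
    return best_v, best_mae
```

Bilinear interpolation of prev between grid points would extend this sketch to fractional accuracy, as described above.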

[Figure 4: Motion estimation via Block Matching. The block to be matched in frame n is compared with candidate blocks in frame n-1, inside a search area of size (N + 2w) x (N + 2w) centred on the block's position. One candidate block, at some displacement, is shaded.]

6.4.1 Computation

The Full Motion Search is computationally demanding. Given a maximum expected displacement of $\pm w$ pels, there are $(2w+1)^2$ searched blocks (assuming integer displacements only). Each block considered requires on the order of $N^2$ operations to calculate the MAE. This implies on the order of $(2w+1)^2 N^2$ operations per block for an integer accurate motion estimate. Several reduced search techniques have been introduced which lessen this burden. They attempt to reduce the operations required either by reducing the locations searched or by reducing the number of pixels sampled in each block. However, reduced searches may find local minima in the DFD function and yield spurious matches.

6.4.2 Three-step search

The simplest mechanism for reducing the computational burden of Full Search BM is to reduce the number of motion vectors that are evaluated. The Three-step search is a hierarchical search strategy that evaluates first 9, then 8, and finally again 8 motion vectors to refine the motion estimate in three successive steps. At each step the distance between the evaluated blocks is reduced. The next search is centred on the position of the best matching block in the previous search. It can be generalised to more steps to refine the motion estimate further. Figure 5 shows the searched blocks in frame $n-1$ for this process.

6.4.3 Cross search

The cross search is another variant on the subsampled motion vector visiting strategy. It changes the geometry of the search pattern to a '+' or 'x' pattern. Figure 5 shows the searched blocks in frame $n-1$ for this process. If the best match is found at the centre of the search pattern, or at the boundary of the search window, then the search step is reduced.
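A sketch of the three-step idea: evaluate a 3 x 3 grid of candidates, recentre on the best, halve the spacing, and repeat. The mae helper recomputes equation (4) for one candidate; clamping at the frame border is an implementation assumption:

```python
import numpy as np

def mae(cur, prev, by, bx, dy, dx, N=16):
    """Equation (4) for a single candidate displacement (dy, dx)."""
    H, W = prev.shape
    sy = min(max(by + dy, 0), H - N)        # clamp the candidate block
    sx = min(max(bx + dx, 0), W - N)        # to lie inside the frame
    a = cur[by:by + N, bx:bx + N].astype(int)
    b = prev[sy:sy + N, sx:sx + N].astype(int)
    return np.abs(a - b).mean()

def three_step(cur, prev, by, bx, N=16, step=4):
    """Three-step search: a 3 x 3 pattern at spacings 4, 2, 1, recentred on
    the best match each time (9 + 8 + 8 evaluations, as the centre repeats)."""
    vy = vx = 0
    while step >= 1:
        cands = [(vy + step * dy, vx + step * dx)
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        vy, vx = min(cands, key=lambda v: mae(cur, prev, by, bx, *v))
        step //= 2
    return vy, vx
```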

[Figure 5: Illustration of searched locations (the central pixel of each searched block is shown) in Three-step BM (left) and Cross-search BM (right). The search window extent is shown in red for Cross-search. The best matches at each search level are circled in blue.]

6.4.4 Problems

The BM algorithm is noted for being a robust estimator of motion, since noise effects tend to be averaged out over the block operations. However, if there is no textural information in the two blocks compared, then noise dominates the matching process and causes spurious motion estimates. This problem can be isolated by comparing the best match found, $\mathrm{MAE}(\hat{\vec{v}})$, to the no-motion match, $\mathrm{MAE}(\vec{0})$. If these matches are sufficiently different, then the motion estimate is accepted; otherwise no motion is assumed. A threshold acts on the ratio of the two error measures: if $\mathrm{MAE}(\vec{0}) / \mathrm{MAE}(\hat{\vec{v}}) < t$, where $t$ is some threshold chosen according to the noise level suspected, then no motion is assumed. This algorithm verifies the validity of the motion estimate once motion is detected.

The main disadvantages of Block Matching are the heavy computation involved (although these are byte-wise manipulations) and the motion averaging effect of the blocks. If the blocks chosen are too large, then many differently moving objects may be enclosed by one block, and the chosen motion vector is unlikely to match the motion of any of the objects. The advantages are that it is very simple to implement (it has been implemented in silicon for video coding applications) and that it is robust to noise due to the averaging over the blocks. There are many more useful motion estimators than this, which give you motion better matched to what is actually going on in the scene. But we will not look at those here.

6.5 Video codec issues

DVD and DTV both use MPEG-2, and the core is exactly as described here. MPEG-2 became a standard around 1994, and just 4 years later Digital Television was a reality. This is quite amazing considering that the advances in research in video compression that made this possible were only really about 5 years old at the time. Compare that to the 150 years or so it took Fourier to be really appreciated!

Mobile phone video communications will use MPEG-4 (established around 1998). Unfortunately, that is going through some teething troubles at the moment. Sadly, the creation of MPEG standards is not as simple as motion estimation, DFD, DCT, quantisation and transmission. When you actually start to think about putting together codecs, the following issues arise.

Compression: There are at least three fundamentally different types of multimedia data source: pictures, audio and text. Different compression techniques are needed for each data type, and each piece of data has to be identified with unique codewords for transmission.

Sequencing: The compressed data from each source is scanned into a sequence of bits. This sequence is then packetised for transport. The problem here is to identify each different part of the bitstream uniquely to the decoder, e.g. header information, DCT coefficient information.

Multiplexing: The audio and video data (for instance) have to be decoded at the same time (or approximately the same time) to create a coherent signal at the receiver. This implies that the transmitted elementary data streams should be somehow combined so that they arrive at the correct time at the decoder. The challenge is therefore to allow for identifying the different parts of the multiplexed stream, and to insert information about the timing of each elementary data stream.

Media: The compressed and multiplexed data has to be stored on some Digital Storage Media (DSM) and then later (or live) broadcast to receivers across air or other links. Access to different media channels (including DSM) is governed by different constraints, and this must somehow be allowed for in the standards description.

Errors: Errors in the received bitstream invariably occur. The receiver must cope with errors such that the system performance is robust to them, or degrades in some graceful way.

Bandwidth: The bandwidth available for the multimedia transmission is limited. The transmission system must ensure that the bandwidth of the bitstream does not exceed these limits. This problem is called Rate Control, and it applies both to the control of the bitrate of the elementary data streams and to the multiplexed stream.

Multiplatform: The coded bitstream may need to be decoded on many different types of device with varying processor speeds and storage resources. It would be interesting if the transmission system could provide a bitstream which could be decoded to varying extents by different devices. Thus a low capacity device could receive a lower quality picture than a high capacity device, which would receive further features and higher picture quality. This concept, applied to the construction of a suitable bitstream format, is called Scalability.

What we have covered here is the core of the standards used for image and video compression. This just says how the data itself is compressed. If you open up an .avi or .mpg file, you will not see this data in that same form. It has to be encoded into symbols, and timing and copyright information embedded at the very least. This makes the design of codecs a tricky business. But it is certainly true that without standards, there would be no business in video communications.

Finally, note that none of the compression standards actually describes how you do the things you have to do. A standard just describes how to represent bits and package them. So you can use cleverer DCTs or cleverer motion estimators to get better speed and performance.
That is why one manufacturer's codec could be better than another's, even though they both create compressed video according to the same standard.