COMPARISON OF VIDEO CODECS AND QUALITY MEASUREMENT
ZACHOS KONSTANTINOS
Master of Science in Networking and Data Communications
THESIS
Thesis Title: COMPARISON OF VIDEO CODECS AND QUALITY MEASUREMENT
Dissertation submitted for the Degree of Master of Science in Networking and Data Communications
By ZACHOS KONSTANTINOS
Supervisor: DR. PAPADAKIS ANDREAS
KINGSTON UNIVERSITY, SCHOOL OF COMPUTING AND INFORMATION SYSTEMS
TEI OF PIRAEUS, DEPARTMENTS OF ELECTRONICS AND AUTOMATION
JULY 2010
Table of Contents

List of Figures and Tables
Abstract
1 Introduction
2 Video codecs overview
   2.1 MPEG-1
   2.2 MPEG-2
   2.3 MPEG-4 Visual
   2.4 MPEG-4 AVC
   2.5 VC-1
3 Quality metrics
   General
   Objective
      Peak Signal to Noise Ratio (PSNR)
      Mean Square Error (MSE)
      Structural Similarity (SSIM)
      Video Quality Metric (VQM)
   Subjective
      Double Stimulus Continuous Quality Scale (DSCQS)
      Double Stimulus Impairment Scale (DSIS)
4 Comparison
   Video sequences
   Quality measurement tool
   Comparison procedure
5 Results
   5.1 Objective metrics
   5.2 Subjective metrics
   5.3 Comparison
6 Conclusion
References
Appendix
   A Figures and Tables
      A.1 Objective metrics
      A.2 Subjective metrics
   B Encoding times and File sizes
      B.1 Encoding times
      B.2 File sizes
List of Figures and Tables

Figures
Figure 2.1 A simplified MPEG-1 video encoder (Ghanbari, 2003)
Figure 3.1 Double Stimulus Continuous Quality Scale (DSCQS) presentation procedure (ITU-R, 2002)
Figure 3.2 Double Stimulus Continuous Quality Scale (DSCQS) score sheet (ITU-R, 2002)
Figure 3.3 Double Stimulus Impairment Scale (DSIS) variant I (ITU-R, 2002)
Figure 3.4 Double Stimulus Impairment Scale (DSIS) variant II (ITU-R, 2002)
Figure 4.1 Test sequences screen shots (frame 0)
Figure 4.2 MSU Video Quality Measurement Tool main screen
Figure 4.3 MSU Perceptual Video Quality Tool main screen
Figure 4.4 YUV to AVI Converter main screen
Figure 4.5 CREW CIF 768 Kbps SSIM graph
Figure 4.6 Comparison procedure
Figure 5.1 CITY CIF PSNR-Y
Figure 5.2 CITY CIF PSNR-V
Figure 5.3 CITY CIF MSE-Y
Figure 5.4 CITY CIF SSIM
Figure 5.5 CITY CIF VQM
Figure 5.6 CITY 4CIF PSNR-U
Figure 5.7 CITY 4CIF MSE-V
Figure 5.8 CITY 4CIF SSIM
Figure 5.9 CITY 4CIF VQM
Figure 5.10 CREW CIF PSNR-Y
Figure 5.11 CREW CIF MSE-Y
Figure 5.12 CREW CIF SSIM
Figure 5.13 CREW CIF VQM
Figure 5.14 CREW 4CIF PSNR-Y
Figure 5.15 CREW 4CIF MSE-Y
Figure 5.16 CREW 4CIF SSIM
Figure 5.17 CREW 4CIF VQM
Figure 5.18 SOCCER CIF PSNR-Y
Figure 5.19 SOCCER CIF MSE-Y
Figure 5.20 SOCCER CIF SSIM
Figure 5.21 SOCCER CIF VQM
Figure 5.22 SOCCER 4CIF PSNR-Y
Figure 5.23 SOCCER 4CIF MSE-Y
Figure 5.24 SOCCER 4CIF SSIM
Figure 5.25 SOCCER 4CIF VQM
Figure 5.26 CITY CIF DSCQS
Figure 5.27 CITY CIF DSIS
Figure 5.28 CITY 4CIF DSCQS
Figure 5.29 CITY 4CIF DSIS
Figure 5.30 CREW CIF DSCQS
Figure 5.31 CREW CIF DSIS
Figure 5.32 CREW 4CIF DSCQS
Figure 5.33 CREW 4CIF DSIS
Figure 5.34 SOCCER CIF DSCQS
Figure 5.35 SOCCER CIF DSIS
Figure 5.36 SOCCER 4CIF DSCQS
Figure 5.37 SOCCER 4CIF DSIS
Figure 5.38 CREW CIF 192 Kbps frame 124 original
Figure 5.39 CREW CIF 192 Kbps frame
Figure 5.40 CREW CIF 192 Kbps frame
Figure 5.41 CREW CIF 192 Kbps frame
Figure 5.42 CREW CIF 192 Kbps frame

Tables
Table 2.1 MPEG-2 profiles and levels (Bock, 2009)
Table 2.2 MPEG-4 Visual profiles (Richardson, 2003)
Table 2.3 MPEG-4 Visual levels of the simple based profiles (Richardson, 2003)
Table 2.4 MPEG-4 AVC profiles and tools (Richardson, 2003)
Table 2.5 MPEG-4 AVC levels (Jack, 2007)
Table 2.6 VC-1 profiles and levels (SMPTE 421M, 2006)
Table 4.1 Video sequences format (Richardson, 2003)
Table 4.2 Video sequences format and bitrate
Table 4.3 Objective metrics values interpretation
Table 4.4 Monitor specifications
Table 4.5 CREW CIF 768 Kbps score sample (observer f2)
Table 5.1 CREW files encoding times (sec)
Table 5.2 CREW file sizes (KB)
Abstract

Digital video plays a primary role in home entertainment. A variety of home video devices can reproduce video of various resolutions and codecs, and DVB and IPTV are set to become the main media delivering digital video to our homes. Moreover, advances in technology allow the construction of smaller and more capable mobile devices that can receive and reproduce many kinds of digital video. One of the main advances is the design of more efficient video codecs. The purpose of this research is the comparison of the performance of the most commonly used codecs by evaluating their quality. It also presents the most common objective and subjective video quality metrics, and it traces the evolution of video codecs by examining the features and functions of the selected codecs.
1 Introduction

Nowadays, digital technology has a dominant role in our houses. All forms of home entertainment are now digital. There are various home devices capable of reproducing digital video, such as home media players, video game consoles and of course Blu-ray players. All of them can reproduce video of various resolutions and codecs, and they can also receive video streams over the internet. With the cease of analogue television broadcasting, DVB-T will be the main medium delivering digital video to our homes, while IPTV is taking its own place in digital television. Furthermore, modern mobile devices are capable of receiving and reproducing digital video of a quality that was unimaginable a few years ago: mobile phones, portable video players and DVB-H capable devices flood the market.
The purpose of this research is the comparison of the performance of the most commonly used codecs by evaluating their quality. The aims of this study are multiple. The first is the comparison and evaluation of the selected codecs. The second is the comparison of the results between the objective and the subjective metrics. Next is the validation of the results of other studies, and last is the evaluation of the tools used. Furthermore, the results of this research could be used by other researchers in their studies.
The first step of the process is the selection of the codecs. The DVB television standard supports the following codecs: MPEG-2, MPEG-4 AVC and VC-1. Only the handheld version of digital television (DVB-H) does not support MPEG-2 (Jack, 2007). Furthermore, Blu-ray also supports all the previous codecs (Blu-ray Disk Association, 2005). It is obvious that MPEG-4 AVC and VC-1 are the encoding standards for all the modern video reproduction systems. The MPEG-2 codec is used mostly for compatibility reasons: it was the standard video codec during the last decade, so a vast archive of already encoded material exists.
Furthermore, many of the digital video devices that we already own are able to reproduce MPEG-2 video.
So, the selection of the codecs was simple: MPEG-4 AVC and VC-1 for the modern media players, and MPEG-2 as a standard for the evaluation of the other codecs. It is worth noting that the average viewer is already familiar with the quality of a DVD video. The last selected codec is MPEG-4 Visual. This codec is implemented by various well-known encoders, such as Xvid and DivX, and the majority of the videos found on the internet have been encoded with them.
The quality of a video sequence can be assessed using two major methods: the subjective and the objective. The subjective method is based on human observation. A group of observers watch the video sequence under specific conditions and evaluate it according to the specifications of each metric. There are many methods, but the most commonly used are the double stimulus metrics: the Double Stimulus Continuous Quality Scale (DSCQS) and the Double Stimulus Impairment Scale (DSIS). These metrics require specific viewing conditions; the ITU-R BT.500 and ITU-T P.910 recommendations describe the whole procedure in detail. This research is based on these recommendations.
Objective methods are usually algorithms based on mathematical models. Their aim is to assess the quality of a video under specific conditions, and their results should be easily reproducible under the same conditions. The most commonly used objective metrics are: Peak Signal to Noise Ratio (PSNR), Mean Square Error (MSE), Structural Similarity (SSIM) and Video Quality Metric (VQM). The first two are based on error detection between the two sequences. SSIM measures structural distortion; this method is more reliable because it is closer to the Human Vision System (HVS). Last is VQM, which is based on subjective observations. It is more complex, but it is reliable and its results are very close to those of the subjective metrics.
Next is the selection of the video sequences.
The performance of the codecs is influenced by the content of the video sequence, so the sequences were selected primarily for their content. High motion has proved to be very important for the
performance of a codec. Furthermore, frames with human faces strongly influence the opinion of the observers. So, the selected sequences must include those characteristics. The duration of the sequences is also important because the subjective metrics have strict specifications.
The resolutions were decided to be CIF and 4CIF. CIF was selected for its use on mobile devices, such as mobile phones and DVB-H receivers. 4CIF was selected because it is the standard of SDTV, and DVB-T and IPTV follow the SDTV standard. Three bitrates were selected for each resolution: 192 Kbps, 384 Kbps and 768 Kbps for the CIF format, and 2 Mbps, 4 Mbps and 8 Mbps for the 4CIF format. The selection of these bitrates is based primarily on each codec's profiles and levels and then on the needs of modern multimedia devices.
The tools used are the MSU Video Quality Measurement Tool and the MSU Perceptual Video Quality Tool. The first is used for the objective metrics and the second for the subjective metrics. Both tools support the selected metrics and are fully compliant with the ITU-R BT.500 recommendation.
The initial intention was to use more video sequences and more than three bitrates for each format, but after the beginning of the testing procedure it was obvious that the required time would be enormous. Relevant studies use more video sequences and more bitrates, but they use fewer codecs and only one or two metrics, so their overall procedure is not as time consuming. This research effectively uses 8 objective metrics (PSNR and MSE on each of the Y, U and V planes, SSIM and VQM) and 2 subjective metrics. As a result, the objective metrics alone require 576 measurements (3 files x 2 formats x 3 bitrates x 4 codecs x 8 metrics). For the subjective metrics, the evaluation of the sequences by one observer required nearly two hours, so the overall process took more than one month, excluding the data processing.
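The arithmetic behind the 576 objective measurements can be checked with a quick enumeration. The sequence names come from the study itself; the codec and metric labels are written out here only to make the counting explicit:

```python
from itertools import product

files = ["CITY", "CREW", "SOCCER"]                        # 3 test sequences
formats = ["CIF", "4CIF"]                                 # 2 resolutions
bitrates_per_format = 3                                   # 3 bitrates each
codecs = ["MPEG-2", "MPEG-4 Visual", "MPEG-4 AVC", "VC-1"]
metrics = ["PSNR-Y", "PSNR-U", "PSNR-V",
           "MSE-Y", "MSE-U", "MSE-V", "SSIM", "VQM"]      # 8 objective metrics

measurements = list(product(files, formats, range(bitrates_per_format),
                            codecs, metrics))
print(len(measurements))  # 576
```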
2 Video codecs overview

2.1 MPEG-1
MPEG-1 is a video compression algorithm developed by the International Organization for Standardization (ISO). The main goal of the algorithm is to encode a video sequence with audio (such as a movie) and compress it to a size able to fit on a CD. The resolution used to achieve that is 352x288 at 25 fps or 352x240 at 30 fps (SIF resolution). The bit rate was set at about 1.15 Mbps (Golston and Rao, 2006).
The I frame coding is based on the JPEG standard. It uses an 8 x 8 DCT (Discrete Cosine Transform) on chrominance and on luminance. In the 4:2:0 format there are four luminance blocks and one of each chrominance block (4Y, 1Cb and 1Cr). These blocks form a macroblock (Bock, 2009).
Motion compensation in MPEG-1 is based on macroblocks. It uses one motion vector per macroblock, which means that the same vector is used for all six luminance and chrominance blocks. The accuracy of the motion vector is 0.5 pixels; this half-pixel accuracy allows the natural movement in the video sequence to be represented smoothly (Bock, 2009).
Compared to previous compression algorithms, MPEG-1 introduces the use of B frames and adaptive perceptual quantization. B frames are used for bi-directional prediction: they depend on both previous and following frames. Although the video quality is increased, so is the computational power that is needed. The computations are more complex and this adds latency to the result. Each application has different latency requirements; that is why some applications skip the decoding of the B frames in order to perform better, which is common in applications with low bit rate requirements. Adaptive perceptual quantization is a method used for improving the video quality in terms of human visual perception. This is achieved by applying a quantization factor to each frequency (Golston, 2004). A simplified MPEG-1 encoder is shown in Figure 2.1.
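The 8 x 8 DCT stage described above can be sketched in a few lines. This is an illustrative orthonormal DCT-II, not reference code from any MPEG-1 implementation:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix, as used on 8x8 pixel blocks
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0, :] *= 1 / np.sqrt(2)
    return C * np.sqrt(2 / n)

def dct2(block):
    # 2-D DCT of one square block: C . X . C^T
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T

# A flat (constant) luminance block: all energy ends up in the DC coefficient
block = np.full((8, 8), 128.0)
coeffs = dct2(block)
print(int(round(coeffs[0, 0])))            # 1024 (DC term = 8 * 128)

ac = coeffs.copy()
ac[0, 0] = 0.0
print(bool(np.abs(ac).max() < 1e-9))       # True: all AC terms are ~0
```

The concentration of energy into a few low-frequency coefficients is exactly what makes the subsequent quantization and entropy coding effective.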
Figure 2.1: A simplified MPEG-1 video encoder (Ghanbari, 2003)

Apart from the I, P and B frames, D frames are also introduced. These frames contain only low frequency information and they are used only for searching: using them, a user is able to find a specific part of a video. MPEG-2 supports D frames for compatibility reasons, although they are no longer used (Bock, 2009).
Originally MPEG-1 was used only on Video CD and supported only progressive video signals. Later a new version of MPEG-1 was introduced in order to be used on standard television (SDTV). That version supports interlaced video and bit rates up to 10 Mbps. It is referred to as MPEG-1.5, but it is not widely used due to the arrival of the MPEG-2 encoder (Bock, 2009).

2.2 MPEG-2
MPEG-2 is probably the most popular video compression codec. Initially it was designed for use in digital television, but its use soon spread to almost any application that uses digital video and compression. It supports all the standards of SDTV, such as interlaced and progressive video and the corresponding resolutions: 720 x 576 at 50 fields per second for PAL television systems and 720 x 480 at 60 fields per second for NTSC television systems (Golston, 2004). MPEG-2 was improved significantly in many areas in comparison to MPEG-1. Firstly it supports interlaced video and secondly it supports motion compensation
with search ranges that are much wider than in previous codecs. This is necessary in order to support much higher resolutions; consequently the complexity and the computational power needs of an encoder are much higher (Golston and Rao, 2006).
The usual compression ratio for adequate performance of the codec is 30:1, and the bit rate has to be 4 to 8 Mbps in order to sustain good picture quality. The most common consumer applications that use MPEG-2 are standard and high-definition television, DVD video and satellite video (Golston, 2004).
The codec is able to support many new features with the use of the right tools. The new features are: multiple layer coding, data partitioning, SNR scalability, spatial scalability and temporal scalability (Golston and Rao, 2006).
The MPEG-2 profiles are six: simple, main, 4:2:2, SNR, spatial and high. The main features of each profile are the following (Richardson, 2002):
- Simple: supports only I, P frame coding, 4:2:0 subsampling, low complexity
- Main: supports interlaced video and B frames, with 4:2:0 subsampling
- 4:2:2: uses 4:2:2 subsampling (four luminance and two of each Cr and Cb)
- SNR: same as Main with the addition of an enhancement layer for better quality
- Spatial: same as SNR with the use of spatial scalability for better quality
- High: same as Spatial with support for 4:2:2 subsampling
The combination of a profile with one of the levels defines the use of the codec in each case. The profile that is commonly used is the Main profile, and the use of each level with the Main profile is the following:
- Main profile / Low level: basically the same as MPEG-1
- Main profile / Main level: for digital television
- Main profile / High level: for high-definition television
The use of MPEG-2 for high definition practically canceled the plans for the implementation of the MPEG-3 codec that was originally intended to be the standard for high definition video (Richardson, 2002).
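The cost of the wider search ranges mentioned above is easy to see: an exhaustive block-matching search evaluates (2r+1)^2 candidate positions per macroblock. A minimal full-search sketch (illustrative code, not taken from any codec's reference implementation):

```python
import numpy as np

def full_search(ref, cur_block, top, left, search_range):
    """Exhaustive block matching: find the motion vector (dy, dx) whose
    reference block minimises the sum of absolute differences (SAD)."""
    n = cur_block.shape[0]
    best, best_sad = (0, 0), np.inf
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            sad = np.abs(ref[y:y+n, x:x+n].astype(int)
                         - cur_block.astype(int)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

# Toy frames: the current 8x8 block is the reference content shifted by (2, 3)
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (64, 64), dtype=np.uint8)
cur_block = ref[18:26, 19:27]            # block at (16, 16), moved by dy=2, dx=3
mv, sad = full_search(ref, cur_block, 16, 16, 7)
print(mv, sad)
```

Doubling the search range roughly quadruples the number of SAD evaluations, which is why the wider ranges raise encoder complexity so sharply.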
The following table shows the MPEG-2 profiles and levels.

Profile:        Simple   Main     4:2:2    SNR      Spatial  High
Picture type:   I, P     I, B, P  I, B, P  I, B, P  I, B, P  I, B, P
Chroma format:  4:2:0    4:2:0    4:2:2    4:2:0    4:2:2    4:2:2

Level (maximum constraints):
- High:      1920 samples/line, 1152 lines/frame, 60 frames/s, 80 Mbps
- High 1440: 1440 samples/line, 1152 lines/frame, 60 frames/s, 60 Mbps
- Main:      720 samples/line, 576 lines/frame, 30 frames/s, 15 Mbps
- Low:       352 samples/line, 288 lines/frame, 30 frames/s, 4 Mbps

Table 2.1: MPEG-2 profiles and levels (Bock, 2009)

Initially MPEG-2 decoders were too expensive and their computational power needs were very high. Eventually, these decoders became cheaper and simpler as new techniques were implemented in decoder construction. Also, the rapid expansion of the use of the codec led to the mass production of compatible devices (Golston and Rao, 2006).

2.3 MPEG-4 Visual
MPEG-4 Visual (Part 2) is one of the two MPEG-4 codecs that have been standardized; the other is MPEG-4 AVC (Part 10) (Bock, 2009). Originally MPEG-4 focused on supporting video for applications that require low bit rates, because MPEG-1 and MPEG-2 are not efficient enough. The rapid expansion of the internet and the use of video streaming increased the need for producing quality video at low bit rates. Therefore MPEG-4 Visual was designed to compress video efficiently at low bit rates. Of course, it is now widely used in various applications that require a wide range of bit rates.
Another innovation is the introduction of object based coding. In contrast to previous codecs, a video sequence can now be managed as a set of individual objects. This technique opens a whole new way of processing the video sequence. The main concepts are the video object (VO) and the video scene (VS). A video scene is a group of video frames that comprise a scene; a video object is a single object that can be defined in a video scene. Of course, a video scene may contain multiple video objects. MPEG-4 also introduces the concept of the toolkit: new tools can be added to the MPEG-4 standard to create new versions (Richardson, 2002).
The new tools introduced in MPEG-4 Visual are the following (Golston and Rao, 2006):
- Unrestricted Motion Vectors: predicts the movement of objects that move outside of the frame
- Variable Block Size Motion Compensation: supports motion compensation for 8 x 8 and 16 x 16 blocks
- Intra DCT DC/AC Prediction: predicts DC/AC coefficients using blocks that are above or to the left of a specific block
- Quantized AC coefficients with extended dynamic range: supports AC coefficients with extended dynamic range in order to improve video quality
Furthermore, in order to support packet loss recovery it introduces the following features (Golston and Rao, 2006):
- Slice Resynchronization: creates slices inside the images, so the decoder can resynchronize much more quickly after the occurrence of an error
- Data Partitioning: divides the data of a video packet into a DCT part and a motion part. The checks on the motion vectors are stricter and more accurate, so when an error occurs not all the data of the specific packet are discarded
- Reversible Variable Length Codes: allows backward decoding with the use of the VLC tables, so decoding can resume much faster
- New prediction (NEWPRED): used in real time applications to request additional data when a packet is lost
MPEG-4 Visual has many different profiles.
The following table shows the profiles and their main features.
- Simple: low-complexity coding of rectangular video frames
- Advanced Simple: coding rectangular frames with improved efficiency and support for interlaced video
- Advanced Real-Time Simple: coding rectangular frames for real-time streaming
- Core: basic coding of arbitrary-shaped video objects
- Main: feature-rich coding of video objects
- Advanced Coding Efficiency: highly efficient coding of video objects
- N-Bit: coding of video objects with sample resolutions other than 8 bits
- Simple Scalable: scalable coding of rectangular video frames
- Fine Granular Scalability: advanced scalable coding of rectangular frames
- Core Scalable: scalable coding of video objects
- Scalable Texture: scalable coding of still texture
- Advanced Scalable Texture: scalable still texture with improved efficiency and object-based features
- Advanced Core: combines the features of the Simple, Core and Advanced Scalable Texture profiles
- Simple Studio: object-based coding of high-quality video sequences
- Core Studio: object-based coding of high-quality video with improved compression efficiency

Table 2.2: MPEG-4 Visual profiles (Richardson, 2003)

Many codecs are based on the MPEG-4 Visual algorithm. The most popular are DivX, Xvid and QuickTime. Initially Xvid used the Simple profile, but later it introduced the Advanced Simple profile; DivX also implements the Advanced Simple profile (Golston and Rao, 2006), (Ma and Tucker, 2008). The next table shows the levels of the simple based profiles.
Table 2.3 lists the levels of the simple based profiles: for each level of the Simple, Advanced Simple (AS) and Advanced Real-Time Simple (ARTS) profiles it specifies a typical resolution, a maximum bitrate and a maximum number of video objects (1, 4 or 16).

Table 2.3: MPEG-4 Visual levels of the simple based profiles (Richardson, 2003)

2.4 MPEG-4 AVC
MPEG-4 AVC (Part 10) is the second MPEG-4 compression algorithm that has been standardized. The International Organization for Standardization (ISO) approved the
standard in 2003, and the International Telecommunication Union (ITU) approved it under the name H.264 (Bock, 2009), (Golston and Rao, 2006). AVC stands for Advanced Video Coding.
The codec's range of application is broad: it can be used for anything from mobile device video to high definition video, so the bit rate capabilities of the codec range from very low to very high (Ali, 2008). The efficiency of the codec is significantly improved: it reduces the bit rate by up to 2x in comparison to MPEG-2 and earlier MPEG-4 codecs. Therefore, MPEG-4 AVC can be used to provide new services, such as video over ADSL lines, video equivalent to VHS quality at 600 Kbps, and storage and distribution of high definition video on common DVD disks (Golston, 2004).
MPEG-4 AVC is based on the same principles as the previous compression algorithms. However, it presents many new features that make the codec more efficient. The new features introduced in MPEG-4 AVC are the following (Golston and Rao, 2006):
- Adaptive Loop De-blocking Filter: removes artifacts caused by block prediction errors
- Context-Adaptive Binary Arithmetic Coding (CABAC): a probability model is used to encode and decode syntax elements such as motion vectors and transform coefficients
- Entropy Coding: uses a single Universal VLC for all the symbols and a Context-Adaptive VLC for the transform coefficients
- Integer Transform: uses an integer 4x4 spatial transform (an approximation of the DCT) in order to eliminate the quality loss caused by IDCT mismatches
- Intra and Inter Prediction and Coding: uses spatial domain intra prediction and inter frame coding
- Multiple Reference Frame Prediction: uses up to 16 different reference frames
- Quantization and Transform Coefficient Scanning: uses scalar quantization for the transform coefficients
- Quarter-Pel Motion Estimation: allows quarter-pel and half-pel motion vector resolution
- Variable Vector Block Sizes: uses different block sizes for motion compensation
- Weighted Prediction: uses the weighted sum of backward and forward predictions
The MPEG-4 AVC codec initially had three profiles: Baseline, Main and Extended. The more important profiles are the Baseline and the Main profile. The Baseline profile is appropriate for mobile devices and generally for applications with low bit rate demands. The Main profile can provide high quality video using high compression; of course, the computational needs are also very high. Finally, the Extended profile is used for video streaming (Richardson, 2003). The next table shows the three profiles and their main tools.

Tool                    Baseline  Main  Extended
SP and SI slices                        X
Data Partitioning                       X
B slices                          X     X
Weighted Prediction               X     X
I, P slices             X         X     X
CAVLC                   X         X     X
Slice Groups and ASO    X               X
Redundant Slices        X               X
CABAC                             X
Interlace                         X

Table 2.4: MPEG-4 AVC profiles and tools (Richardson, 2003)

Later a new set of profiles was added. These are called the high profiles and they are four: High, High 10, High 4:2:2 and High 4:4:4. These profiles add new tools that improve the efficiency of the codec. The main additions of each high profile are the following (Jack, 2007):
- High (HP): supports encoder-specified frequency-dependent scaling matrices and adaptive selection between 4x4 and 8x8 block sizes
- High 10 (Hi10P): supports 9 or 10 bit 4:2:0 YCbCr
- High 4:2:2 (Hi422P): supports 4:2:2 YCbCr
- High 4:4:4 (Hi444P): supports 4:4:4 YCbCr or RGB, 11 or 12 bit samples, predictive lossless coding and a residual color transform
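The Integer Transform listed among the AVC features can be illustrated with the standard's 4x4 forward core matrix. The sketch below shows only the core transform; in the standard, the scaling divisions used here for the inverse are folded into the quantisation stage:

```python
import numpy as np

# H.264/AVC 4x4 forward core transform matrix (integer approximation of the DCT)
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

X = np.arange(16).reshape(4, 4)   # a toy 4x4 residual block
Y = Cf @ X @ Cf.T                 # forward transform: integer arithmetic only

# Rows of Cf are orthogonal with squared norms (4, 10, 4, 10), so the
# transform can be inverted exactly by dividing those norms back out
norms = np.array([4, 10, 4, 10])
X_rec = Cf.T @ (Y / np.outer(norms, norms)) @ Cf

print(Y.dtype.kind)                    # 'i': the forward stage never leaves integers
print(bool(np.allclose(X_rec, X)))     # True: inversion is exact, no IDCT mismatch
```

Because every decoder computes the same integer results, the encoder/decoder drift that plagued floating-point IDCT implementations cannot occur.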
The High profile (HP) also introduces the following features (Golston and Rao, 2006):
- 8x8 Luminance Intra Prediction: adds eight more modes for intra prediction
- Adaptive Residual Block Size and Integer 8x8 Transform: adds a new 16 bit integer transform for 8x8 blocks
- Monochrome: supports black & white video coding
- Quantization Weighting: adds new quantization weighting matrices
The various levels of the MPEG-4 AVC codec are described in Table 2.5, which specifies for each level the maximum number of macroblocks per second, the maximum frame size (in macroblocks), a typical frame resolution and frame rate, the maximum number of motion vectors per two consecutive macroblocks, the maximum number of reference frames and the maximum bit rate.

Table 2.5: MPEG-4 AVC levels (Jack, 2007)

2.5 VC-1
VC-1 is a video compression algorithm that is able to produce high quality video. The range of the produced bit rates extends from very low to very high. Although the bit rate of a high definition video can be very high, the need for computational power is kept at a reasonable level (Regunathan and Srinivasan, 2005).
The VC-1 codec contains the knowledge of more than 75 companies. The codec was originally implemented by Microsoft as Windows Media Video 9 (WMV9). Later, the Society of Motion Picture and Television Engineers (SMPTE) standardized the codec as VC-1. The function of the VC-1 codec is analyzed in three documents: SMPTE 421M describes the main functions of the codec, while the SMPTE RP227 and SMPTE RP228 documents describe the specifications of the bitstream transport and the bitstream conformance (Loomis and Wasson, 2007).
The range of the supported bit rates is broad. VC-1 can produce high definition video at 1080p with a 6 to 30 Mbps bit rate. Furthermore, it can produce video with a resolution of 2048 x 1536 at a 135 Mbps bit rate, which is the highest possible resolution; the lowest is 160 x 120 with a bit rate of 10 Kbps.
The profiles of VC-1 are three: Simple, Main and Advanced. There are also various levels that, combined with the proper profile, result in the proper resolution and bit rate for each application. The combination of the two determines the complexity of the encoder and the decoder (Loomis and Wasson, 2007). The following table shows the profiles and the levels of the VC-1 codec.

Profile    Level    Max Bit Rate
Simple     Low      96 Kbps
Simple     Medium   384 Kbps
Main       Low      2 Mbps
Main       Medium   10 Mbps
Main       High     20 Mbps
Advanced   L0       2 Mbps
Advanced   L1       10 Mbps
Advanced   L2       20 Mbps
Advanced   L3       45 Mbps
Advanced   L4       135 Mbps

Table 2.6: VC-1 profiles and levels (SMPTE 421M, 2006)
Like all the previous codecs, VC-1 is based on schemes such as spatial transformation and motion compensation; the basic principles are the same. However, VC-1 introduces a new set of innovative techniques that make the codec more efficient, and the increased efficiency is accompanied by the ability to produce high definition video of high quality. These innovations are the following (Loomis and Wasson, 2007), (Regunathan and Srinivasan, 2005):
- 16-bit transform implementation: transforms are constrained to 16 bits in order to keep the decoder's computational complexity low
- Adaptive block size transform: uses various combinations of the 8 x 8 transform to better fit the needs of each case
- Advanced B-frame coding: B frames are not referenced by other frames, so they can be sent separately or even be omitted
- Differential quantization: supports quantization at several levels
- Fading compensation: adds fading parameters to the encoding procedure
- Interlace coding: adds new characteristics for interlaced frames
- Loop filtering: uses a filter to eliminate discontinuities at block boundaries
- Motion compensation: uses four modes in order to select the most suitable one for each case
VC-1 performs far better than MPEG-2 and the MPEG-4 Visual Simple profile. Various comparisons also show that its video quality is even better than MPEG-4 AVC. The compression ratio is similar, but the complexity is kept low, so the computational needs of the decoder are much lower and, as a result, the hardware requirements of the decoding devices are lower too. This is a great advantage against the competing codecs (Golston and Rao, 2006). Furthermore, VC-1, along with MPEG-4 AVC, is one of the two codecs used for high definition video on Blu-ray players (Blu-ray Disk Association, 2005).
3 Quality metrics

3.1 General
Video quality assessment is a difficult and complicated task. Digital pictures are distorted during processing, compression, transmission and reproduction. Any of these factors can result in the degradation of the quality of the video. Sometimes this is acceptable in order to reduce the overall size of the file for transmission and storage purposes. Thus, it is very important to know how the degradation of quality affects the resulting video sequence and whether the outcome is satisfactory for the viewers.
Therefore the assessment of video quality is a process that involves human beings evaluating the picture quality. This is called subjective evaluation. In this case, a group of viewers watch and evaluate the quality of a video sequence. Although this procedure is preferable, it is not convenient: it is expensive and takes a lot of time. So, researchers develop procedures that can assess and evaluate the quality of a video sequence without the need for observers. The results of these procedures are objective and they can be reproduced easily using the same parameters. This is called objective evaluation (Ghanbari, 2003), (Bovic et al., 2004).
Quality metrics can be categorized according to the type of reference and the amount of information that they require in order to assess a video sequence. The categories are the following three:
- Full Reference (FR)
- Reduced Reference (RR)
- No Reference (NR)
Full reference metrics compare the compressed video with the original video sequence, which is used as a reference for the evaluation. The original video has to be in its original uncompressed form. In order to compare the two video sequences, the color and the luminance have to be calibrated and the temporal and spatial alignment has to be precise, so that the two related pixels can be compared easily.
Reduced reference metrics use only some features of the reference video sequence, and the evaluation is based only on those features. The comparison procedure is thus faster and easier due to the reduced number of comparison factors. It also avoids the assumptions made by the no reference metrics.
No reference metrics evaluate the video sequence without the need for the original. The real challenge is to distinguish between the distortion and the real video content. This type of metric makes assumptions about the video type and the distortion, due to the lack of the original video sequence.
Each type of metric is used in different situations. Full reference metrics are used for offline video quality assessment; they are usually used for codec evaluation in a lab environment. The other two types are used for online video quality assessment at different stages of the transmission system: they can monitor and evaluate the video sequence at every stage. However, reduced reference metrics must have access to the original sequence (Winkler, 2009).
As mentioned earlier, subjective metrics involve the participation of a group of observers in order to evaluate a video sequence. Each observer gives his or her opinion using a specific quality scale. Of course, a number of matters have to be settled beforehand: the viewing conditions have to be strictly controlled, the observers have to match a certain profile, the test material has to follow certain parameters and the data analysis has to adhere to a specific procedure (Winkler, 2005). Although they are expensive and more time consuming, subjective metrics are widely used because they are based on the human vision system. The subjective quality metrics are the following:
- Double Stimulus Continuous Quality Scale (DSCQS): observers evaluate short video sequences of the original and the test sequence
- Double Stimulus Impairment Scale (DSIS): observers evaluate the test sequence in comparison with the original sequence
- Single Stimulus Continuous Quality Evaluation (SSCQE): observers watch a 20 to 30 minute video sequence and rate it continuously while watching.
- Absolute Category Rating (ACR): observers evaluate each video sequence only once, without reference to the original sequence.
- Absolute Category Rating with hidden reference (ACR-HR): the same as ACR, except that the original version of each sequence is also included among the rated material.
- Degradation Category Rating (DCR): identical to the Double Stimulus Impairment Scale.
- Pair Comparison (PC): observers evaluate test sequences of the same scene, produced under dissimilar conditions, in various combinations.

The observers evaluate each video sequence and the results are averaged into a single score. This results in the Mean Opinion Score (MOS), which is unique for each sequence. The number of observers has to be at least fifteen (Winkler, 2005). In general, each metric has a different application. The metrics that are widely used today are the first three, and the most popular are the double stimulus metrics (Winkler, 2005), (Bock, 2009).

Objective metrics are algorithms based on mathematical models that are capable of assessing the quality of a video sequence; they try to imitate the human vision system (HVS) and match the observers' opinions (Ghanbari, 2003). The classification of the objective quality metrics is the following (Winkler, 2009):

- Data metrics
- Picture metrics
- Packet or bitstream based metrics
- Hybrid metrics

Data metrics evaluate the fidelity of the video signal without taking the content of the signal into account. The two main metrics are the Peak Signal to Noise Ratio (PSNR) and the Mean Square Error (MSE). In this respect they are similar to the bit error rate and the packet loss rate that are used for transmission errors, since none of them take the content of the signal into account.

Picture metrics evaluate the signal as a video sequence. They take into account the image distortion and the overall quality of the sequence. They are based on the human vision system (HVS) and on the analysis of specific artifacts and features of the video.
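Returning to the Mean Opinion Score mentioned above, the averaging step itself is simple; a minimal sketch in Python (the ratings below are invented for illustration, and a real study would also screen observers and report confidence intervals):

```python
def mean_opinion_score(scores: list[float]) -> float:
    """Average the individual observer ratings of one sequence into a MOS.

    The recommendations call for at least fifteen observers; this sketch
    only checks that the list is not empty.
    """
    if not scores:
        raise ValueError("at least one observer score is required")
    return sum(scores) / len(scores)

# Hypothetical ratings from fifteen observers on a 5-point quality scale
ratings = [4, 5, 4, 3, 4, 5, 4, 4, 3, 4, 5, 4, 4, 4, 3]
print(mean_opinion_score(ratings))  # -> 4.0
```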
Packet or bitstream based metrics are used on packet networks for the evaluation of compressed video sequences. They do not decode the video; instead, they assess it by examining the encoded bitstream and the packet headers. Their main advantage is that they require much lower processing power, which is why they are able to process multiple video streams. However, they can only be used with specific network protocols and video codecs.

Hybrid metrics use a combination of the previous quality metrics.

Besides the Peak Signal to Noise Ratio and the Mean Square Error, two other commonly used objective metrics are the Structural Similarity (SSIM) and the Video Quality Metric (VQM). The Video Quality Metric is an objective full reference metric that is based on subjective observations. These two metrics give the most reliable results (Wang, 2006).

The standards for objective quality measurement are based on the following criteria (Winkler, 2009):

- Define the Mean Opinion Score for a specific application: a given Mean Opinion Score value has to mean the same level of quality for every similar video sequence.
- Define a reliable Mean Opinion Score prediction: the tool that is used for quality measurement has to produce results similar to the observers' scores.
- Define a reproducible Mean Opinion Score prediction: the tool has to produce the same result for the same video comparison every time.

These criteria are only partially achieved by the existing standards; so far, no standard is able to fully satisfy all three. Video quality assessment is performed under specific standards that have been released by various groups and forums. The most active groups working on the release of standards for video quality measurement are the following:

- Video Quality Experts Group (VQEG)
- ITU-T
- ATIS IIF
These groups have released several recommendations regarding the video quality assessment procedure, covering both subjective and objective assessment. The most common recommendations are the ITU-R BT and the ITU-T P.901 for subjective video assessment, and the VQEG Final Report From the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment (Phase I and II) for objective video assessment. There are also several other recommendations for various situations (Winkler, 2009). The most reliable and commonly used metrics are presented in more detail in the following sections (Hewage, 2009), (Winkler, 2009), (Wang, 2006).

3.2 Objective

Peak Signal to Noise Ratio (PSNR)

Peak Signal to Noise Ratio (PSNR) is one of the most widely used metrics and also one of the simplest. It is calculated on a logarithmic scale and requires the calculation of the Mean Square Error (MSE). The PSNR is calculated as follows:

PSNR = 10 * log10( (2^n - 1)^2 / MSE )

where (2^n - 1)^2 is the square of the highest possible signal value in the image (255 for 8-bit images) (Richardson, 2002), (Winkler, 2005).

PSNR is easy to calculate and is therefore the most popular quality metric. It is usually used for the comparison between compressed and uncompressed video sequences. However, PSNR has a number of disadvantages. Firstly, it is a full reference metric, so it requires the original video sequence. More importantly, the results of the metric are not always in accordance with the results of the subjective metrics: it has been observed that in some cases PSNR scores an image higher even though it has a low subjective Mean Opinion Score.
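The PSNR formula above, together with the MSE it depends on, can be sketched in a few lines; a minimal illustration in Python with NumPy, assuming 8-bit grayscale frames stored as 2-D arrays (the frame names are made up for the example):

```python
import numpy as np

def mse(ref: np.ndarray, test: np.ndarray) -> float:
    """Mean squared pixel difference between two equally sized frames."""
    diff = ref.astype(np.float64) - test.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(ref: np.ndarray, test: np.ndarray, bit_depth: int = 8) -> float:
    """PSNR in dB: 10 * log10((2^n - 1)^2 / MSE). Infinite for identical frames."""
    error = mse(ref, test)
    if error == 0:
        return float("inf")
    peak = (2 ** bit_depth) - 1
    return float(10.0 * np.log10(peak ** 2 / error))

# Example: a flat gray CIF-sized luminance plane vs. a copy with one changed pixel
ref = np.full((288, 352), 128, dtype=np.uint8)
noisy = ref.copy()
noisy[0, 0] = 138                       # perturb a single pixel
print(psnr(ref, noisy))                 # high PSNR => small distortion
```

In practice the per-frame values would be computed separately for each color component (Y, U, V) and averaged or plotted over the whole sequence.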
An example of such a contrast in the results is when the compared image contains a human face. If the face is clear and bright, most observers will rate the image highly even if the background is blurred. PSNR, on the other hand, is calculated over all the pixels of the image, so its rating in this case will be lower (Richardson, 2002). After extended use of the PSNR metric, the following has been observed (Bock, 2009):

- The DC levels of the two video sequences have to be the same. If they differ, the PSNR value is lower even if the distortion of the sequence is not noticeable.
- PSNR should only be used to compare distortions of similar types. Such a comparison is valuable, whereas comparing different types of distortion is meaningless.
- The two video sequences must have similar error signal distributions. If not, PSNR is not able to provide accurate ratings for the two sequences.
- The PSNR metric gives more reliable results when the error signal is low and the quality of the image is high. It should not be used on low quality images, because it is not able to decide whether the loss of quality is acceptable and under which conditions.
- The size of the picture is related to the accuracy of the PSNR metric. It has been noticed that smaller image sizes result in PSNR values that agree more closely with the subjective scores, while larger sizes result in values that differ considerably. This happens because on a large image the observers concentrate on only a portion of it, whereas on smaller sizes they take in the whole image.

Despite its limitations, the Peak Signal to Noise Ratio (PSNR) metric is widely used in video quality measurement because it is very simple and easy to use.

Mean Square Error (MSE)

Along with PSNR, Mean Square Error (MSE) is one of the most popular quality metrics in video quality assessment.
Assuming two sequences I and Ī of size X x Y with T frames, MSE calculates the squared differences between the gray level values of the two sequences and then takes their mean value:

MSE = (1 / (T * X * Y)) * Σ_{t=1..T} Σ_{x=1..X} Σ_{y=1..Y} [ I(x, y, t) - Ī(x, y, t) ]^2

Mean Square Error is very fast and easy to use. It measures the differences between the two video sequences by comparing them pixel by pixel. Mean Square Error suffers from the same problems as PSNR; this is to be expected, because the calculation of PSNR depends on the calculation of MSE. The main problem is that MSE can rate poorly an image that has a high subjective mean opinion score. This happens because observers can rate an image highly by concentrating on just a specific part of it, whereas MSE and PSNR calculate the distortion over the whole image (Winkler, 2005). Even though Mean Square Error suffers from the same problems as Peak Signal to Noise Ratio, both metrics are widely used because they have low cost and are very simple to use. Moreover, their results in a controlled environment, such as a video testing lab, are trustworthy and reliable.

Structural Similarity (SSIM)

Structural Similarity (SSIM) introduced a new approach to the assessment of video quality. All the previous methods are based on error detection between the two sequences; SSIM, in contrast, measures the structural distortion. This method is more reliable than the previous ones, because its operation is closer to the Human Vision System (HVS): human vision is more sensitive to the structural information of an image than to the extraction of visual errors. As a result, the scores of this metric are more accurate and closer to those of the subjective metrics (Wang, 2006). SSIM is a full reference metric and needs access to the original sequence in order to perform the comparison. Firstly, it assumes that the original sequence has perfect quality. Then similarity is used in order to compare it with the second sequence. The measurement of
similarity is divided into three separate factors: contrast, luminance and structure. A comparison is then performed for each factor. The three factors are independent; thus, the structure is not affected even if the contrast or the luminance changes (Bovic et al., 2004). These three comparisons lead to the following three equations, for luminance, contrast and structure respectively:

l(x, y) = (2 μx μy + C1) / (μx^2 + μy^2 + C1)
c(x, y) = (2 σx σy + C2) / (σx^2 + σy^2 + C2)
s(x, y) = (σxy + C3) / (σx σy + C3)

The SSIM equation is created by combining the previous comparisons, so that SSIM(x, y) is the measured similarity between signals x and y:

SSIM(x, y) = [l(x, y)]^α * [c(x, y)]^β * [s(x, y)]^γ

To simplify the previous equation, it is set that α = β = γ = 1 and C3 = C2 / 2. The simplified equation is the following:

SSIM(x, y) = ( (2 μx μy + C1) (2 σxy + C2) ) / ( (μx^2 + μy^2 + C1) (σx^2 + σy^2 + C2) )

where μx, μy are the averages of x and y, σx^2, σy^2 are the variances of x and y, σxy is the covariance of x and y, and C1, C2 are constants that avoid instability when μx^2 + μy^2 is very close to zero. The resulting values are between -1 and 1, where -1 represents the worst quality and 1 the highest quality. The quality of the entire image is expressed by the mean structural similarity (MSSIM):

MSSIM(X, Y) = (1 / M) * Σ_{j=1..M} SSIM(xj, yj)
where X is the reference image, Y is the distorted image, xj and yj are the image contents at the jth local window, and M is the number of local windows. In testing, the Structural Similarity (SSIM) metric has proved to be more reliable than PSNR and MSE: its results are much closer to the results of the subjective metrics. It does, however, have a major disadvantage: the calculations are complex and require more time and computational power (Bovic et al., 2004), (Wang, 2006).

Video Quality Metric (VQM)

Video Quality Metric was developed by the Institute for Telecommunication Sciences (ITS) (Wang, 2006). The aim of the metric is to offer an objective assessment based on subjective observations. It is a full reference metric that measures various factors and combines them into a single result. These factors are:

- block distortion
- blurring
- color distortion
- global noise
- unnatural motion

VQM shows a high degree of correlation with the scores of the subjective metrics. The Video Quality Experts Group (VQEG) has evaluated the metric and rated it very highly, and ANSI has approved the Video Quality Metric as a standard for objective video quality measurement (Hewage, 2009), (Wang, 2006). The steps of the Video Quality Metric procedure are the following:

- Calibration: calibrates the video sequence for feature extraction. It estimates the correct temporal and spatial shift and makes the appropriate corrections, and then adjusts the brightness and the contrast according to the original sequence.
- Quality Features Extraction: using a mathematical function, it extracts a set of features regarding picture quality. These features describe changes in the chrominance, spatial and temporal properties of the video sequence.
- Quality Parameters Calculation: comparing the two video sequences, it spots the quality differences and calculates the appropriate parameters.
- VQM Calculation: combining all the previous parameters, it calculates the VQM result.

Video Quality Metric can be calculated using different criteria according to the use of the video sequence. These models are: Developer, General, PSNR, Television and Videoconference (Wang, 2006). The General model uses seven parameters: five are based on the luminance component and the other two on the chrominance components. The parameters are the following (Hewage, 2009):

- chroma extreme: detects color problems, such as impairments created by transmission errors.
- chroma spread: detects variations in the spread of the color distribution.
- ct ati gain: combines temporal and contrast information.
- hv gain: detects a shift in the orientation of edges (from diagonal to vertical and horizontal).
- hv loss: detects a shift in the orientation of edges (from vertical and horizontal to diagonal).
- si gain: measures the quality improvements produced by enhancements such as edge sharpening.
- si loss: detects loss of information in the spatial properties.

In general, the Video Quality Metric (VQM) is one of the most reliable metrics, and its results are very close to the scores of the subjective metrics.

3.3 Subjective

Double Stimulus Continuous Quality Scale (DSCQS)

Double Stimulus Continuous Quality Scale (DSCQS) is one of the most widely used subjective metrics. It was introduced by the International Telecommunication Union. As with every subjective methodology, it is first necessary to set the correct parameters for the test: the room environment and the number and profile of the observers have to be defined, and the test material has to be chosen and prepared accordingly.
In general, double stimulus methods use a repetitive presentation of the material: the original and the test sequences are shown consecutively with small intervals, and the observers then vote with both sequences in mind. DSCQS has two presentation variants, Variant I and Variant II.

In Variant I there is only one observer, who is permitted to switch between the two sequences as many times as desired, until a clear opinion is formed. A typical observer watches each set of sequences two or three times, and each sequence lasts approximately 10 seconds.

In Variant II there is more than one observer. The two sequences are shown consecutively, one or more times, in order to help the observers establish a clear view of their quality. After that, the sequences are shown again and each observer votes. The duration of each video sequence is 10 seconds and there are usually two repetitions. The next figure shows the structure of the presentation.

Figure 3.1: Double Stimulus Continuous Quality Scale (DSCQS) presentation procedure (ITU-R, 2002)

The presentation phases are the following:

- T1 is test sequence A (10 s)
- T2 is a mid-gray interval (3 s)
- T3 is test sequence B (10 s)
- T4 is a mid-gray sequence (5-11 s)

One of the video sequences is the test sequence and the other is the original sequence. The sequences are not shown in a particular order, so the observers do not know which sequence is the original. Also it is possible that some sets do not