H.265 (HEVC) BITSTREAM TO H.264 (MPEG 4 AVC) BITSTREAM TRANSCODER DEEPAK HINGOLE. Presented to the Faculty of the Graduate School of


H.265 (HEVC) BITSTREAM TO H.264 (MPEG 4 AVC) BITSTREAM TRANSCODER

by DEEPAK HINGOLE

Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

THE UNIVERSITY OF TEXAS AT ARLINGTON
December 2015

ACKNOWLEDGEMENTS

I would like to express my heartfelt gratitude to my advisor, Dr. K. R. Rao, for his unwavering support, encouragement, supervision and valuable inputs throughout this research work. He has been a constant source of inspiration for me to pursue this research work.

I would also like to extend my gratitude to my colleagues at Adobe Systems Incorporated for their invaluable industrial insight and experience, which helped me understand and grow in the field of digital video processing.

Additionally, I would like to thank Dr. Schizas and Dr. Dillon for serving as members of my graduate committee. A big thank you to Vasavee, Rohith, Maitri, Shwetha, Srikanth and Uma, my Multimedia Processing Lab mates, for providing valuable suggestions during the course of my research work.

Last but not least, I would like to thank my parents, my siblings and my close friends for believing in me and supporting me in this undertaking. I wish for your continued support in the future.

November 25, 2015

ABSTRACT

H.265 (HEVC) BITSTREAM TO H.264 (MPEG 4 AVC) BITSTREAM TRANSCODER

Deepak Hingole, MS
The University of Texas at Arlington, 2015
Supervising Professor: K. R. Rao

With every new video coding standard, the general rule of thumb has been to maintain the same video quality at a bit rate reduced by about 50% compared to the previous standard. H.265 is the latest video coding standard, with support for encoding video at a wide range of resolutions, from low resolution to beyond High Definition, i.e. 4k or 8k. H.265, also known as HEVC, was preceded by H.264, which is a well-established and widely used standard in industry and finds applications in broadcast, storage and multimedia telephony. Currently almost all devices, including low-power handheld mobile devices, are capable of decoding an H.264 encoded bitstream. HEVC achieves high coding efficiency at the cost of increased implementation complexity, and not all devices have hardware powerful enough to decode an HEVC bitstream. For HEVC coded content to be played on devices that support only H.264, transcoding of the HEVC bitstream to an H.264 bitstream is necessary. Different transcoding architectures are investigated and an easy-to-implement scheme is studied as part of this research.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES
Chapter 1 Introduction
  Basics of Video Compression
  Need for Video Compression
  Video Coding Standards
  Thesis Outline
Chapter 2 Overview Of H.264
  Introduction
  Profiles and Levels in H.264
  Profiles in H.264
  Baseline profile
  Main profile
  Extended profile
  High profile
  Levels in H.264
  H.264 Encoder
  H.264 Decoder
Chapter 3 Overview Of HEVC
  Introduction
  Profiles and levels in H.265
  H.265 Encoder and Decoder
  Coding Tree Units (CTU) and Coding Tree Block (CTB)
  Coding Units (CU) and Coding Blocks (CB)
  Prediction Units (PU) and Prediction Blocks (PB)
  Transform Units (TU) and Transform Blocks (TB)
  Motion Vector Signaling
  Motion Compensation
  Intra-picture prediction
  Quantization Control
  Entropy Coding
  In-loop Deblocking Filtering
  Sample Adaptive Offset (SAO)
  High-Level Syntax Architecture
  Parameter Set Structure
  NAL unit syntax structure
  Slices
  Supplemental Enhancement Information (SEI) and Video Usability Information (VUI) metadata
  Parallel Processing Features
  Tiles
  Wavefront Parallel Processing (WPP)
  Dependent slices
Chapter 4 Transcoding
  Introduction
  Transcoding Architectures
  Open Loop Transcoding Architecture
  Closed-loop Transcoding Architecture
  Cascaded Pixel-domain architecture
  Motion Compensation in the DCT Domain
  Choice of Transcoding Architecture
Chapter 5 Results
  Quality Metrics For Cascaded Implementation
  Peak-Signal-To-Noise-Ratio (PSNR) versus Quantization Parameter (QP)
  Bitrate versus Quantization Parameter
  Rate Distortion (R-D) Plot
Chapter 6 Conclusion and Future Work
APPENDIX A Test Sequences
APPENDIX B Test Conditions
  Test Environment
APPENDIX C Acronyms
REFERENCES
BIOGRAPHICAL INFORMATION

LIST OF ILLUSTRATIONS

Figure 1-1: I-, P- and B- frames
Figure 1-2: Evolution of video coding standards
Figure 2-1: Different profiles in H.264 with distribution of various coding tools among the profiles
Figure 2-2: H.264 Encoder block diagram
Figure 2-3: Nine prediction modes for 4×4 Luma block
Figure 2-4: H.264 Decoder block diagram
Figure 3-1: 4:2:0 Subsampling
Figure 3-2: Typical HEVC video encoder (with decoder modeling elements shaded in light gray)
Figure 3-3: HEVC Decoder block diagram
Figure 3-4: Modes and directional orientations for intra-picture prediction
Figure 3-5: Subdivision of a picture into slices
Figure 3-6: Subdivision of a picture into tiles
Figure 3-7: Illustration of Wavefront Parallel Processing
Figure 4-1: Open Loop, partial decoding to DCT coefficients then requantize
Figure 4-2: Closed-loop, drift compensation for requantized data
Figure 4-3: Cascade decoder encoder architecture
Figure 4-4: Frame based comparison of open loop, closed loop and cascaded pixel domain architecture
Figure 4-5: General block diagram for proposed transcoding scheme
Figure 5-1: PSNR (dB) versus QP for akiyo_cif.y4m
Figure 5-2: PSNR (dB) versus QP for city_cif.y4m
Figure 5-3: PSNR (dB) versus QP for crew_cif.y4m
Figure 5-4: PSNR (dB) versus QP for flower_cif.y4m
Figure 5-5: PSNR (dB) versus QP for football_cif.y4m
Figure 5-6: Bitrate (kbps) versus QP for akiyo_cif.y4m
Figure 5-7: Bitrate (kbps) versus QP for city_cif.y4m
Figure 5-8: Bitrate (kbps) versus QP for crew_cif.y4m
Figure 5-9: Bitrate (kbps) versus QP for flower_cif.y4m
Figure 5-10: Bitrate (kbps) versus QP for football_cif.y4m
Figure 5-11: R-D plot for akiyo_cif.y4m
Figure 5-12: R-D plot for city_cif.y4m
Figure 5-13: R-D plot for crew_cif.y4m
Figure 5-14: R-D plot for flower_cif.y4m
Figure 5-15: R-D plot for football_cif.y4m

LIST OF TABLES

Table 2-1 Levels in H.264
Table 3-1 Levels limits for Main profile in HEVC
Table 5-1 akiyo_cif.y4m sequence quality metrics
Table 5-2 city_cif.y4m sequence quality metrics
Table 5-3 crew_cif.y4m sequence quality metrics
Table 5-4 flower_cif.y4m sequence quality metrics
Table 5-5 football_cif.y4m sequence quality metrics

Chapter 1 Introduction

1.1. Basics of Video Compression

Like many other recent technological developments, the emergence of video and image coding in the mass market is due to the convergence of a number of areas. Cheap and powerful processors, fast network access, the ubiquitous Internet and a large-scale research and standardization effort have all contributed to the development of image and video coding technologies [1].

Video can be thought of as a series of images displayed at a constant interval. This interval, expressed as the frame rate in frames per second (FPS), is an important factor in video technology [2].

The objective of any compression scheme is to represent the data in a compact form. Representing data in a reduced number of bits is achieved by exploiting the various redundancies present in the data. In the case of video, there are spatial and temporal redundancies in addition to statistical and perceptual redundancies. Spatial redundancy arises when a block of pixels in a video frame is similar to its neighboring blocks. Temporal redundancy arises when a frame is similar to the frames that precede and/or follow it.

A picture or frame belongs to one of three categories: I-picture, P-picture or B-picture. An I-picture, or intra-predicted frame, is predicted without reference to any other frame. P-pictures and B-pictures are inter-coded using motion-compensated prediction from reference frames. A P-picture uses one reference frame (the P-picture or I-picture preceding the current P-picture), whereas a B-picture uses two reference frames (the P- and/or I-pictures before and after the current frame). The difference between the predicted frame and the actual frame carries less information than the frame itself and is coded to achieve compression. The three types of frames are shown in Figure 1-1 [2].

Figure 1-1: I-, P- and B- frames

1.2. Need for Video Compression

Is compression really necessary once transmission and storage capacities have increased to a level sufficient to cope with uncompressed video? It is true that both storage and transmission capacities continue to increase. However, an efficient and well-designed video compression system gives very significant performance advantages for visual communication at both low and high transmission bandwidths.

1.3. Video Coding Standards

Several video coding standards have been introduced by organizations such as the International Telecommunication Union - Telecommunication Standardization Sector (ITU-T), the Moving Picture Experts Group (MPEG) and the Joint Collaborative Team on Video Coding (JCT-VC). Each standard is an improvement over the previous one. With every standard, the general rule of thumb has been to retain the same video quality while reducing the bit rate by 50%. Figure 1-2 shows the evolution of video coding standards over the years.

Figure 1-2: Evolution of video coding standards

1.4. Thesis Outline

Chapter 2 gives an overview of H.264, also known as MPEG-4 Part 10/AVC. Chapter 3 gives a similar overview of H.265, also known as High Efficiency Video Coding (HEVC). Chapter 4 explains the need for transcoding, explores different transcoding architectures and selects one of them as the preferred transcoding scheme. Chapter 5 summarizes the results of the proposed scheme. Chapter 6 discusses how well the proposed scheme performed, the conclusions that can be drawn from it, and future areas of research in the same direction.

Chapter 2 Overview Of H.264

2.1. Introduction

H.264/MPEG-4 Part 10 advanced video coding (AVC) was introduced in 2003. It was developed by the Joint Video Team (JVT), consisting of the Video Coding Experts Group (VCEG) of the International Telecommunication Union - Telecommunication Standardization Sector (ITU-T) and the Moving Picture Experts Group (MPEG) of the International Organization for Standardization / International Electrotechnical Commission (ISO/IEC) [4]. H.264 supports various interactive (video telephony) and non-interactive (broadcast, streaming, storage, video on demand) applications, as it provides a network-friendly video representation [7]. It builds on previous coding standards such as MPEG-1, MPEG-2, MPEG-4 Part 2, H.261, H.262 and H.263 [6] [8] and adds many other coding tools and techniques that give it superior quality and compression efficiency.

Like previous motion-based codecs, it uses the following basic principles of video compression [5]:

• Transform for reduction of spatial correlation
• Quantization for control of bitrate
• Motion compensated prediction for reduction of temporal correlation
• Entropy coding for reduction in statistical correlation

The improved coding efficiency of H.264 can be attributed to additional coding tools and new features. Listed below are some of the new and improved techniques used in H.264 [7]:

• Adaptive intra-picture prediction
• Small block size transform with integer precision
• Multiple reference pictures and generalized B-frames
• Variable block sizes
• Quarter-pel precision for motion compensation
• Content adaptive in-loop deblocking filter
• Improved entropy coding by introduction of context adaptive binary arithmetic coding (CABAC) and context adaptive variable length coding (CAVLC)

The increase in coding efficiency and compression ratio results in greater complexity of the H.264 encoder and decoder algorithms compared to previous coding standards. To provide error resilience for transmission of information over a network, H.264 supports the following techniques [7]:

• Flexible macroblock (MB) ordering
• Switched slice
• Arbitrary slice order
• Redundant slice
• Data partitioning
• Parameter setting

2.2. Profiles and Levels in H.264

Profiles and levels specify conformance points for implementing the standard in an interoperable way across various applications that have similar functional requirements. A profile defines a set of coding tools, whereas a level places constraints on certain key parameters of the bitstream, corresponding to decoder processing load and memory capabilities [13].

2.2.1 Profiles in H.264

A profile defines a set of coding tools or algorithms that can be used in generating a conforming bitstream [13]. The profiles defined for H.264 can be listed as follows [10]:

1. Baseline profile
2. Main profile
3. Extended profile
4. High Profiles defined in the FRExts amendment

Figure 2-1 illustrates the coding tools for the various profiles of H.264.

Figure 2-1: Different profiles in H.264 with distribution of various coding tools among the profiles

Baseline profile

The tools included in the baseline profile are I (intra coded) and P (predictive coded) slice coding and the enhanced error resilience tools of flexible MB ordering, arbitrary slices and redundant slices. It also supports CAVLC. The baseline profile is intended for low-delay applications, applications demanding low processing power, and high packet loss environments. This profile has the lowest coding efficiency among the three profiles.

Main profile

The coding tools included in the main profile are I, P, and B (bi-directionally predictive coded) slices, interlace coding, CAVLC and CABAC. The tools not supported by the main profile are the error resilience tools, data partitioning, and switched intra (SI) coded and switched predictive (SP) coded slices. This profile aims to achieve the highest possible coding efficiency.

Extended profile

This profile has all the tools included in the baseline profile. As illustrated in Figure 2-1, this profile also includes B, SP and SI slices, data partitioning, interlace frame and field coding, picture adaptive frame/field coding and MB adaptive frame/field coding. This profile provides better coding efficiency than the baseline profile. The additional tools result in increased complexity.

High profile

In September 2004 the first amendment of the H.264/MPEG-4 AVC video coding standard was released [10]. A new set of coding tools, termed the Fidelity Range Extensions (FRExts), was introduced as part of this amendment. The aim of FRExts is to achieve a significant improvement in coding efficiency for higher fidelity material. The application areas for the FRExts tools are professional film production, video production and high-definition (HD) TV/DVD. The FRExts amendment defines four new profiles. A discussion of those profiles is beyond the scope of this document.

2.2.2 Levels in H.264

Level restrictions are established in terms of maximum sample rate, maximum picture size, maximum bit rate, minimum compression ratio, and the capacities of the decoded picture buffer (DPB) and the coded picture buffer (CPB), which holds compressed data prior to its decoding for data flow management purposes [13]. In H.264/AVC, 16 levels are specified. The levels defined in H.264 are listed in Table 2-1. Level 1b was added in the FRExts amendment.

Table 2-1 Levels in H.264

2.3. H.264 Encoder

Figure 2-2 illustrates the block diagram of the H.264 encoder. Like most previous-generation codecs, the H.264 encoder is based on MBs and motion compensation. Video is formed by a series of picture frames. Each picture frame is an image that is split into blocks. The block sizes can vary in H.264.

Figure 2-2: H.264 Encoder block diagram

The encoder may perform intra-coding or inter-coding for the MBs of a given picture. Intra-coded frames are encoded and decoded independently. They do not need any reference frames, and hence they provide access points in the coded sequence where decoding can start. Figure 2-3 illustrates the nine prediction modes for a 4×4 luma block [12]. There are a total of nine optional prediction modes for each 4×4 luma block, four modes for a 16×16 luma block and four modes for the chroma components.

Figure 2-3: Nine prediction modes for 4×4 Luma block

Inter-coding predicts a given block from previously decoded pictures. The aim of inter-coding is to reduce temporal redundancy by making use of motion vectors. A motion vector gives the displacement of a particular block between the current frame and a reference frame. The prediction residuals are then transformed to remove spatial correlation within the block, and the resulting transform coefficients are quantized. The motion vectors obtained from inter-prediction are combined with the quantized transform coefficient information and entropy encoded using schemes such as CAVLC or CABAC to reduce statistical redundancies [6].

There is a local decoder within the H.264 encoder. This local decoder performs inverse quantization and inverse transform to obtain the residual signal in the spatial domain. The prediction signal is added to the residual signal to reconstruct the input frame. The reconstructed frame is fed into the deblocking filter to remove blocking artifacts at the block boundaries. The output of the deblocking filter is then fed to the inter/intra prediction blocks to generate prediction signals.
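The quantize, inverse-quantize and reconstruct steps of the local decoder can be illustrated with a minimal Python sketch. It is not the standard's integer arithmetic: the transform stage is left out for brevity (the residual is quantized directly), the step size uses the commonly quoted approximation that it doubles for every increase of 6 in QP, and the function names are hypothetical.

```python
def qstep(qp):
    """Approximate quantizer step size (assumption: doubles every 6 QP, equals 1 at QP 4)."""
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coeffs, qp):
    """Map each coefficient to an integer level (encoder side)."""
    step = qstep(qp)
    return [[int(round(c / step)) for c in row] for row in coeffs]


def dequantize(levels, qp):
    """Scale integer levels back to coefficient values (local decoder side)."""
    step = qstep(qp)
    return [[lv * step for lv in row] for row in levels]


def reconstruct(prediction, residual):
    """Add the decoded residual to the prediction and clip to the 8-bit sample range."""
    return [
        [max(0, min(255, int(round(p + r)))) for p, r in zip(prow, rrow)]
        for prow, rrow in zip(prediction, residual)
    ]


# Illustrative 2x2 residual block (transform omitted in this sketch).
residual = [[10.0, -6.0], [4.0, 0.0]]
prediction = [[100, 100], [100, 100]]
levels = quantize(residual, qp=10)          # step size is 2 at QP 10
decoded_residual = dequantize(levels, qp=10)
frame_block = reconstruct(prediction, decoded_residual)
```

Because the encoder predicts from `frame_block` rather than from the original samples, encoder and decoder stay in step even though quantization is lossy.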

2.4. H.264 Decoder

The H.264 decoder operates similarly to the local decoder of the H.264 encoder. Figure 2-4 illustrates the H.264 decoder block diagram [12]. An encoded bitstream is the input to the decoder. Entropy decoding (CABAC or CAVLC) is applied to the bitstream to obtain the transform coefficients. These coefficients are then inverse scanned and inverse quantized, which gives residual block data in the transform domain. An inverse transform is performed to obtain the data in the spatial domain. The resulting output is 4×4 blocks of residual signal. Depending on whether the block is inter-predicted or intra-predicted, an appropriate prediction signal is added to the residual signal. For an inter-coded block, a prediction block is constructed from the motion vectors, reference frames and previously decoded pictures. This prediction block is added to the residual block to reconstruct the video frames. The reconstructed frames then undergo deblocking before they are stored for future prediction or displayed.

Figure 2-4: H.264 Decoder block diagram

Chapter 3 Overview Of HEVC

3.1 Introduction

H.264 is widely used for many applications, including broadcast of high definition (HD) TV signals over satellite, cable, and terrestrial transmission systems, video content acquisition and editing systems, camcorders, security applications, Internet and mobile network video, Blu-ray Discs, and real-time conversational applications such as video chat, video conferencing, and telepresence systems. However, an increasing diversity of services, the growing popularity of HD video, and the emergence of beyond-HD formats (e.g., 4k×2k or 8k×4k resolution) are creating even stronger needs for coding efficiency superior to H.264/MPEG-4 AVC's capabilities. The need is even stronger when higher resolution is accompanied by stereo or multiview capture and display [13].

High Efficiency Video Coding (HEVC) is the latest video coding format. It challenges the established H.264/AVC video coding standard by reducing the bit rate by about 50% while retaining the same video quality. HEVC is designed to address existing applications of H.264/MPEG-4 AVC and to focus on two key issues: increased video resolution and increased use of parallel processing architectures. It primarily targets consumer applications, as pixel formats are limited to 4:2:0 8-bit and 4:2:0 10-bit.

3.2 Profiles and levels in H.265

Only three profiles targeting different application requirements, called the Main, Main 10, and Main Still Picture profiles, were finalized by January 2013. Minimizing the number of profiles provides a maximum amount of interoperability between devices, and is further justified by the fact that traditionally separate services, such as broadcast, mobile and streaming, are converging to the point where most devices should be able to support all of them. The three profiles consist of the coding tools and high-layer syntax described in the following sections, while imposing the following restrictions [13]:

1) Only 4:2:0 chroma sampling is supported, as shown in Figure 3-1.
2) When an encoder encodes a picture using multiple tiles, it cannot also use wavefront parallel processing, and each tile must be at least 256 luma samples wide and 64 luma samples tall.
3) In the Main and Main Still Picture profiles, only a video precision of 8 bits per sample is supported, while the Main 10 profile supports up to 10 bits per sample.
4) In the Main Still Picture profile, the entire bitstream must contain only one coded picture (and thus inter-picture prediction is not supported).

Figure 3-1: 4:2:0 Subsampling

The first version of the standard defines 13 levels, as shown in Table 3-1. They range from levels that support only relatively small picture sizes, such as a luma picture size of 176×144 (sometimes called quarter common intermediate format (QCIF)), to picture sizes as large as 7680×4320 (often called 8k×4k). The picture width and height are each required to be less than or equal to $\sqrt{8 \cdot \mathrm{MaxLumaPS}}$, where MaxLumaPS is the maximum luma picture size as shown in Table 3-1 (to avoid the problems for decoders that could be involved with extreme picture shapes) [13].
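The picture-size and picture-shape constraints above can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the MaxLumaPS value of 36,864 used in the example is an assumed value for the lowest level, since the contents of Table 3-1 are not reproduced here.

```python
def fits_level(width, height, max_luma_ps):
    """Return True if a width x height luma picture satisfies the size and shape limits."""
    if width <= 0 or height <= 0:
        return False
    # Total number of luma samples must not exceed MaxLumaPS.
    size_ok = width * height <= max_luma_ps
    # width, height <= sqrt(8 * MaxLumaPS), compared in integers to avoid rounding issues.
    shape_ok = (width * width <= 8 * max_luma_ps
                and height * height <= 8 * max_luma_ps)
    return size_ok and shape_ok


ASSUMED_MAX_LUMA_PS = 36864  # assumption: illustrative limit for the lowest level

qcif_ok = fits_level(176, 144, ASSUMED_MAX_LUMA_PS)      # ordinary QCIF picture
extreme_ok = fits_level(1024, 32, ASSUMED_MAX_LUMA_PS)   # very wide, very short picture
```

The second example has fewer luma samples than the limit yet fails the check, which is exactly the kind of extreme picture shape the square-root rule is meant to exclude.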

Table 3-1 Levels limits for Main profile in HEVC

3.3 H.265 Encoder and Decoder

The video coding layer of HEVC employs the same hybrid approach (inter-/intra-picture prediction and 2-D transform coding) used in all video compression standards since H.261. Figure 3-2 depicts the block diagram of a hybrid video encoder, which could create a bitstream conforming to the HEVC standard [13], whereas Figure 3-3 depicts the block diagram of the HEVC decoder.

Figure 3-2: Typical HEVC video encoder (with decoder modeling elements shaded in light gray)

Figure 3-3: HEVC Decoder block diagram

Various features involved in hybrid video coding using HEVC are discussed in the sub-sections that follow.

3.3.1 Coding Tree Units (CTU) and Coding Tree Block (CTB)

The core of the coding layer in H.264 is the MB, containing a 16×16 block of luma samples and, in the usual case of 4:2:0 color sampling, two corresponding 8×8 blocks of chroma samples. The analogous structure in HEVC is the coding tree unit (CTU), which has a size selected by the encoder and can be larger than a traditional MB [13]. The CTU consists of a luma CTB and the corresponding chroma CTBs and syntax elements. The size L×L of a luma CTB can be chosen as L = 16, 32, or 64 samples, with the larger sizes typically enabling better compression. HEVC then supports a partitioning of the CTBs into smaller blocks using a tree structure and quadtree-like signaling [13][14].

3.3.2 Coding Units (CU) and Coding Blocks (CB)

The quadtree syntax of the CTU specifies the size and positions of its luma and chroma CBs. The root of the quadtree is associated with the CTU. Hence, the size of the luma CTB is the largest supported size for a luma CB. The splitting of a CTU into luma and chroma CBs is signaled jointly. One luma CB and ordinarily two chroma CBs, together with associated syntax, form a coding unit (CU). A CTB may contain only one CU or may be split to form multiple CUs, and each CU has an associated partitioning into prediction units (PUs) and a tree of transform units (TUs) [13].
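The quadtree partitioning of a luma CTB into CBs can be sketched as a short recursive function. This is only an illustration: the split decision is passed in as a hypothetical callback (a real encoder would typically base it on rate-distortion cost), and the minimum CB size of 8 samples is an assumption of this sketch.

```python
def split_ctb(x, y, size, min_size, should_split):
    """Recursively split a square block; return leaf blocks as (x, y, size) tuples."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        blocks = []
        # Visit the four quadrants in z-scan order: top-left, top-right,
        # bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                blocks.extend(
                    split_ctb(x + dx, y + dy, half, min_size, should_split)
                )
        return blocks
    return [(x, y, size)]


# Hypothetical decision rule: keep refining only the block at the top-left corner.
def refine_top_left(x, y, size):
    return x == 0 and y == 0


coding_blocks = split_ctb(0, 0, 64, min_size=8, should_split=refine_top_left)
```

With this rule a 64×64 CTB yields ten CBs: four 8×8, three 16×16 and three 32×32 blocks, which together still cover the whole CTB.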

3.3.3 Prediction Units (PU) and Prediction Blocks (PB)

The decision whether to code a picture area using inter-picture or intra-picture prediction is made at the CU level. A PU partitioning structure has its root at the CU level. Depending on the basic prediction-type decision, the luma and chroma CBs can then be further split in size and predicted from luma and chroma prediction blocks (PBs). HEVC supports variable PB sizes from 64×64 down to 4×4 samples [13].

3.3.4 Transform Units (TU) and Transform Blocks (TB)

The prediction residual is coded using block transforms. A TU tree structure has its root at the CU level. The luma CB residual may be identical to the luma transform block (TB) or may be further split into smaller luma TBs. The same applies to the chroma TBs. Integer basis functions similar to those of a discrete cosine transform (DCT) are defined for the square TB sizes 4×4, 8×8, 16×16, and 32×32. For the 4×4 transform of luma intra-picture prediction residuals, an integer transform derived from a form of discrete sine transform (DST) is alternatively specified. The alternative 4×4 transform matrix is

$$H = \begin{bmatrix} 29 & 55 & 74 & 84 \\ 74 & 74 & 0 & -74 \\ 84 & -29 & -74 & 55 \\ 55 & -84 & 74 & -29 \end{bmatrix}$$

3.3.5 Motion Vector Signaling

Advanced motion vector prediction (AMVP) is used, including derivation of several most probable candidates based on data from adjacent PBs and the reference picture. A merge mode for motion vector (MV) coding can also be used, allowing the inheritance of MVs from temporally or spatially neighboring PBs. Moreover, compared to H.264/MPEG-4 AVC, improved skipped and direct motion inference are also specified [13].

3.3.6 Motion Compensation

Quarter-sample precision is used for the MVs, and 7-tap or 8-tap filters are used for interpolation of fractional-sample positions (compared to six-tap filtering of half-sample positions followed by linear interpolation for quarter-sample positions in H.264/MPEG-4 AVC). Similar to H.264/MPEG-4 AVC, multiple reference pictures are used. For each PB, either one or two motion vectors can be transmitted, resulting in uni-predictive or bi-predictive coding, respectively. As in H.264/MPEG-4 AVC, a scaling and offset operation may be applied to the prediction signal(s) in a manner known as weighted prediction [13].

3.3.7 Intra-picture prediction

The decoded boundary samples of adjacent blocks are used as reference data for spatial prediction in regions where inter-picture prediction is not performed. Intra-picture prediction supports 33 directional modes (compared to eight such modes in H.264/MPEG-4 AVC), plus planar (surface fitting) and DC (flat) prediction modes. The selected intra-picture prediction modes are encoded by deriving most probable modes (e.g., prediction directions) based on those of previously decoded neighboring PBs. Figure 3-4 shows the various modes used in intra-picture prediction [13].
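Of the intra-picture prediction modes mentioned above, the DC (flat) mode is the simplest to illustrate. The following Python sketch fills an N×N block with the rounded average of the decoded boundary samples above and to the left of it. It is a simplified illustration with a hypothetical function name; it omits any smoothing of the block edges and assumes all neighboring samples are available.

```python
def dc_predict(top, left):
    """Predict an N x N block as the rounded mean of its top and left neighbors."""
    n = len(top)
    if n == 0 or n != len(left) or (n & (n - 1)) != 0:
        raise ValueError("top and left must have the same power-of-two length")
    # Dividing by 2N is a right shift by log2(N) + 1; adding N first rounds to nearest.
    shift = n.bit_length()  # equals log2(N) + 1 when N is a power of two
    dc = (sum(top) + sum(left) + n) >> shift
    return [[dc] * n for _ in range(n)]


# Illustrative 4x4 example: neighbors above are 100, neighbors to the left are 104.
predicted_block = dc_predict([100, 100, 100, 100], [104, 104, 104, 104])
```

Every sample of `predicted_block` takes the same value, which is why this mode suits smooth image regions; the directional modes instead propagate the boundary samples along a chosen angle.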

Figure 3-4: Modes and directional orientations for intra-picture prediction

3.3.8 Quantization Control

As in H.264/MPEG-4 AVC, uniform reconstruction quantization (URQ) is used in HEVC, with quantization scaling matrices supported for the various transform block sizes [13].

3.3.9 Entropy Coding

Unlike H.264, which offers both CAVLC and CABAC, HEVC uses only CABAC as its entropy coding scheme. The CABAC scheme used in H.265 is similar to that in H.264/MPEG-4 AVC, but has undergone several improvements to increase its throughput (especially for parallel-processing architectures) and its compression performance, and to reduce its context memory requirements [13].

3.3.10 In-loop Deblocking Filtering

A deblocking filter similar to the one used in H.264/MPEG-4 AVC is operated within the inter-picture prediction loop. However, the design is simplified in regard to its decision-making and filtering processes, and is made more friendly to parallel processing [13].

3.3.11 Sample Adaptive Offset (SAO)

A nonlinear amplitude mapping is introduced within the inter-picture prediction loop after the deblocking filter. Its goal is to better reconstruct the original signal amplitudes by using a look-up table that is described by a few additional parameters that can be determined by histogram analysis at the encoder side [13].

3.4 High-Level Syntax Architecture

A number of design aspects new to the HEVC standard improve flexibility for operation over a variety of applications and network environments and improve robustness to data losses. However, the high-level syntax architecture used in the H.264/MPEG-4 AVC standard has generally been retained, including the following features [13].

3.4.1 Parameter Set Structure

Parameter sets contain information that can be shared for the decoding of several regions of the decoded video. The parameter set structure provides a robust mechanism for conveying data that are essential to the decoding process. The concepts of sequence and picture parameter sets from H.264/MPEG-4 AVC are augmented by a new video parameter set (VPS) structure [13].

3.4.2 NAL unit syntax structure

Each syntax structure is placed into a logical data packet called a network abstraction layer (NAL) unit. Using the content of a two-byte NAL unit header, it is possible to readily identify the purpose of the associated payload data [13].

3.4.3 Slices

A slice is a data structure that can be decoded independently from other slices of the same picture, in terms of entropy coding, signal prediction, and residual signal reconstruction. A slice can either be an entire picture or a region of a picture. One of the main purposes of slices is resynchronization in the event of data losses. In the case of packetized transmission, the maximum number of payload bits within a slice is typically restricted, and the number of CTUs in the slice is often varied to minimize the packetization overhead while keeping the size of each packet within this bound [13]. Figure 3-5 depicts the subdivision of a picture into slices.

Figure 3-5: Subdivision of a picture into slices
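As an illustration of how the two-byte NAL unit header described in Section 3.4.2 identifies the payload, the following Python sketch unpacks its fields. The field layout (1 + 6 + 6 + 3 bits) follows the HEVC specification; the function name and the returned dictionary are choices made for this sketch only.

```python
def parse_nal_header(byte0, byte1):
    """Unpack the four fields of a two-byte HEVC NAL unit header."""
    if not (0 <= byte0 <= 255 and 0 <= byte1 <= 255):
        raise ValueError("header bytes must be in the range 0..255")
    return {
        # 1 bit: must be zero in a conforming bitstream
        "forbidden_zero_bit": (byte0 >> 7) & 0x1,
        # 6 bits: identifies the type of payload carried by the NAL unit
        "nal_unit_type": (byte0 >> 1) & 0x3F,
        # 6 bits: spans the last bit of byte 0 and the top five bits of byte 1
        "nuh_layer_id": ((byte0 & 0x1) << 5) | (byte1 >> 3),
        # 3 bits: temporal sub-layer identifier plus one
        "nuh_temporal_id_plus1": byte1 & 0x7,
    }


# Example header bytes 0x40 0x01.
header = parse_nal_header(0x40, 0x01)
```

For these two bytes the type field evaluates to 32 and the layer identifier to 0, so a transcoder can route or discard the unit without parsing its payload.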

3.4.4 Supplemental Enhancement Information (SEI) and Video Usability Information (VUI) metadata

The syntax includes support for various types of metadata known as SEI and VUI. Such data provide information about the timing of the video pictures, the proper interpretation of the color space used in the video signal, 3-D stereoscopic frame packing information, other display hint information, and so on [13].

3.5 Parallel Processing Features

Three new features are introduced in the HEVC standard to enhance the parallel processing capability or modify the structuring of slice data for packetization purposes. Each of them may have benefits in particular application contexts, and it is generally up to the implementer of an encoder or decoder to determine whether and how to take advantage of these features [2] [13].

3.5.1 Tiles

HEVC has the option of partitioning a picture into rectangular, independently decodable regions called tiles. Their main purpose is parallel processing. Tiles can also be used for random access to local regions in video pictures. Tiles provide parallelism at a coarser level of granularity (picture/sub-picture), and no sophisticated synchronization of threads is necessary for their use [2]. Figure 3-6 depicts the subdivision of a picture into tiles.
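The partitioning of a picture into a rectangular tile grid can be sketched as follows. The sketch assumes tile boundaries are spaced as uniformly as possible in units of CTBs, which is only one possible choice (the text above does not say how boundaries are selected), and the function names are hypothetical.

```python
def uniform_tile_sizes(pic_size_in_ctbs, num_tiles):
    """Split a picture dimension (in CTBs) into num_tiles nearly equal parts."""
    if num_tiles <= 0 or num_tiles > pic_size_in_ctbs:
        raise ValueError("need 1 <= num_tiles <= pic_size_in_ctbs")
    return [
        ((i + 1) * pic_size_in_ctbs) // num_tiles - (i * pic_size_in_ctbs) // num_tiles
        for i in range(num_tiles)
    ]


def tile_grid(width_in_ctbs, height_in_ctbs, num_cols, num_rows):
    """Return tiles in raster order as (x, y, width, height), all in CTB units."""
    col_widths = uniform_tile_sizes(width_in_ctbs, num_cols)
    row_heights = uniform_tile_sizes(height_in_ctbs, num_rows)
    tiles = []
    y = 0
    for h in row_heights:
        x = 0
        for w in col_widths:
            tiles.append((x, y, w, h))
            x += w
        y += h
    return tiles


# Illustrative picture of 10 x 6 CTBs split into 3 tile columns and 2 tile rows.
tiles = tile_grid(10, 6, num_cols=3, num_rows=2)
```

Each tuple describes a rectangle that could be handed to a separate worker thread, reflecting the coarse-grained parallelism that tiles are intended to provide.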

Figure 3-6: Subdivision of a picture into tiles

3.5.2 Wavefront Parallel Processing (WPP)

This is a new feature in HEVC which, when enabled, allows a slice to be divided into rows of CTUs. The processing of each row can be started only after certain decisions in the previous row have been made. WPP provides parallelism within slices. Figure 3-7 shows how WPP works.

Figure 3-7: Illustration of Wavefront Parallel Processing
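The row dependency in WPP can be illustrated by computing the earliest time step at which each CTU could start. This Python sketch rests on two assumptions that are not stated above: every CTU takes one time step, and a CTU may start once its left neighbor and the CTU above and to the right of it (the second CTU of the previous row, for the first CTU of a row) have finished. The function name is hypothetical.

```python
def wpp_start_times(rows, cols):
    """Earliest start step of each CTU with one thread per CTU row."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ready = 0
            if c > 0:
                # Must wait for the left neighbor in the same row.
                ready = max(ready, start[r][c - 1] + 1)
            if r > 0:
                # Must wait for the above-right CTU (clamped at the row end).
                dep = min(c + 1, cols - 1)
                ready = max(ready, start[r - 1][dep] + 1)
            start[r][c] = ready
    return start


# Illustrative slice of 3 CTU rows by 4 CTU columns.
schedule = wpp_start_times(3, 4)
total_steps = schedule[-1][-1] + 1
```

Under these assumptions each row starts two steps after the row above it, so the 12 CTUs of this example finish in 8 steps instead of 12, giving the diagonal wavefront pattern of Figure 3-7.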

3.5.3 Dependent slices

Dependent slices allow data associated with a particular wavefront entry point or tile to be carried in a separate NAL unit. They also allow fragmented packetization of the data with lower latency than if it were all coded in one slice [2][13].

Chapter 4 Transcoding

4.1 Introduction

Video transcoding [15][16] is the process of converting video from one format to another. A format is defined by characteristics such as bit rate, frame rate and spatial resolution. One of the earliest applications of transcoding was to adapt the bit rate of a precompressed bitstream to the available channel bandwidth. Hence transcoding is undertaken to meet the demands of constrained bandwidths and terminal capabilities [15]. Transcoding also enables interoperability between different networks, devices and content representation formats.

Transcoding can be of various types [15]. Some of them are bit rate transcoding to facilitate more efficient transport of video, spatial and temporal resolution reduction transcoding for use in mobile devices with limited display and processing power, and error-resilience transcoding to achieve higher resilience of the original bitstream to transmission errors.

To achieve optimum results by transcoding, the following criteria have to be fulfilled:

1) The quality of the transcoded bitstream should be comparable to the one obtained by direct decoding and re-encoding of the output stream.
2) The information contained in the input stream should be used as much as possible to avoid multigenerational deterioration.
3) The process should be cost efficient, low in complexity and achieve the highest quality possible.

4.2 Transcoding Architectures

There are different standard transcoding architectures for changing bit rate, spatial resolution and format. A few of them are discussed in the sections that follow.

Open Loop Transcoding Architecture

Figure 4-1 shows an open-loop system. In the open-loop system, the bitstream is variable-length decoded (VLD) to extract the variable-length code words corresponding to the quantized DCT coefficients, as well as MB data corresponding to the motion vectors and other MB-level information. In this scheme, the quantized coefficients are inverse quantized and then simply requantized to satisfy the new output bit rate. Finally, the requantized coefficients and stored MB-level information are variable-length coded (VLC). Regardless of the techniques used to achieve the reduced rate, open-loop systems are relatively simple since a frame memory is not required and there is no need for an IDCT. In terms of quality, better coding efficiency can be obtained by the requantization approach since the variable-length codes that are used for the requantized data will be more efficient. However, open-loop architectures are subject to drift [16].

Figure 4-1: Open loop, partial decoding to DCT coefficients followed by requantization
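The requantization step of the open-loop scheme can be sketched in a few lines of Python. This is a simplified illustration that assumes a plain uniform quantizer with integer step sizes; real codecs derive the step from the QP and use scaling matrices and dead zones. The names requantize, q_in and q_out are hypothetical.

```python
def requantize(levels, q_in, q_out):
    """Inverse-quantize levels with step q_in, then requantize with step q_out."""
    out = []
    for level in levels:
        coeff = level * q_in                       # inverse quantization
        sign = -1 if coeff < 0 else 1
        # Requantize by rounding the magnitude to the nearest multiple of q_out.
        out.append(sign * ((abs(coeff) + q_out // 2) // q_out))
    return out


# Quantized DCT levels of one block in scan order (made-up values).
levels_in = [12, -7, 3, 1, 0, -1, 0, 0]
levels_out = requantize(levels_in, q_in=4, q_out=10)
```

With the coarser step, small coefficients become zero and the remaining levels shrink, which is what lowers the bit rate after variable-length coding. The motion vectors and MB modes are passed through unchanged, so no frame memory is involved.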

Closed-loop Transcoding Architecture

In general, drift is caused by the loss of high-frequency information. Figure 4-2 shows a closed-loop system. The closed-loop system aims to eliminate the mismatch between predictive and residual components by approximating the cascaded decoder-encoder architecture [17].

Figure 4-2: Closed loop, drift compensation for requantized data

This simplified scheme requires only one reconstruction loop with one DCT and one IDCT. Apart from the slight inaccuracy introduced by this approximation, the architecture is mathematically equivalent to a cascaded decoder-encoder approach.
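The difference between open-loop and closed-loop behavior can be illustrated with a scalar toy model in Python. It is a deliberately reduced sketch: each "frame" is a single prediction residual, motion compensation is replaced by simple accumulation, and the quantizer is uniform. The function names and the residual values are made up for illustration and do not come from the cited architectures.

```python
def quantize(value, step):
    """Uniform quantizer: round value to the nearest multiple of step."""
    sign = -1 if value < 0 else 1
    return sign * ((abs(value) + step // 2) // step) * step


def open_loop_drift(residuals, step):
    """Decoder error per frame when residuals are requantized independently."""
    drift, history = 0, []
    for r in residuals:
        drift += quantize(r, step) - r      # errors accumulate through prediction
        history.append(drift)
    return history


def closed_loop_drift(residuals, step):
    """Decoder error per frame when the requantization error is fed back."""
    carry, history = 0, []
    for r in residuals:
        target = r + carry                  # add the error left over so far
        sent = quantize(target, step)
        carry = target - sent               # error still to be compensated
        history.append(-carry)              # decoder output minus reference
    return history


residuals = [6] * 10                        # ten predicted frames
open_history = open_loop_drift(residuals, step=5)
closed_history = closed_loop_drift(residuals, step=5)
```

In this toy model the open-loop error grows by one unit per frame, whereas the closed-loop error never exceeds half a quantization step, which mirrors the drift behavior described for the two architectures.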

Cascaded Pixel-domain Architecture

Figure 4-3 shows the cascaded decoder-encoder architecture. The main structural difference between the cascaded decoder-encoder architecture, also known as the cascaded pixel-domain architecture, and the closed-loop scheme is that reconstruction in the cascaded pixel-domain architecture is performed in the spatial domain, thereby requiring two reconstruction loops with one DCT and two IDCTs.

Figure 4-3: Cascaded decoder-encoder architecture

Motion Compensation in the DCT Domain

The closed-loop architecture described above provides an effective transcoding structure in which the MB reconstruction is performed in the DCT domain. However, since the memory stores spatial-domain pixels, the additional DCT/IDCT is still needed. This can be avoided by utilizing the compressed-domain methods for MC proposed by Chang and Messerschmitt [23]. In this way, it is possible to reconstruct reference frames without decoding to the spatial domain; several architectures describing this reconstruction process in the compressed domain have been proposed [24]-[26]. It was found that decoding completely in the compressed domain could yield quality equivalent to spatial-domain decoding [24]. However, this was achieved with floating-point matrix multiplication and proved to be quite costly.

Different transcoding architectures for spatial and temporal resolution reduction, such as motion vector mapping, DCT-domain down-conversion, conversion of MB type, motion vector re-estimation and residual re-estimation, are discussed in [16].

Choice of Transcoding Architecture

The cascaded pixel-domain transcoding architecture gives optimum results in terms of complexity, quality and cost. The cascaded pixel-domain transcoder offers greater flexibility in the sense that it can be used for bit rate transcoding, spatial/temporal resolution downscaling and other coding parameter changes as well. Since standards transcoding has to take into account the different coding characteristics of H.265 and H.264, flexibility is a key issue.

Figure 4-4: Frame-based comparison of open-loop, closed-loop and cascaded pixel-domain architectures

It is evident from Figure 4-4 that the open-loop architecture suffers from severe drift, and the quality of the simplified closed-loop architecture is very close to that of the cascaded pixel-domain architecture [16].

According to [17], the cascaded pixel-domain scheme is considered the ideal transcoder since it comprises one full decoder and one full encoder. Another benefit of this approach is that decoding is usually fast, since it does not involve motion estimation and predictions can be made for frames based on variable-length decoding (VLD) of motion vectors from the encoded bitstream. The quality of the transcoded video in turn depends on the input to the encoder stage: the better the input to the encoding stage of the transcoder, the better the final video quality. This satisfies the criteria for the optimum transcoder discussed in section 4.1. Figure 4-5 shows the general block diagram for the proposed transcoding scheme.

Figure 4-5: General block diagram for proposed transcoding scheme

Chapter 5 Results

5.1 Quality Metrics For Cascaded Implementation

Tables 5-1 to 5-5 share the same layout. For each QP they list the original (HM) bitrate in kbps, the transcoded bitrate in kbps (transcoder wrt HM output), and the Y-PSNR, U-PSNR, V-PSNR and YUV-PSNR for four comparisons: HM encoder wrt original, JM encoder wrt original, transcoder wrt HM output, and transcoder wrt original.

Table 5-1: akiyo_cif.y4m sequence quality metrics
Table 5-2: city_cif.y4m sequence quality metrics
Table 5-3: crew_cif.y4m sequence quality metrics
Table 5-4: flower_cif.y4m sequence quality metrics
Table 5-5: football_cif.y4m sequence quality metrics
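The per-plane PSNR values reported in these tables follow the standard definition, PSNR = 10 log10(peak^2 / MSE). The short Python sketch below applies it to one 8-bit plane given as a flat list of samples. It is an illustrative sketch, not the measurement code of HM or JM, and it does not attempt to reproduce how the reference software combines the planes into a YUV-PSNR figure; the function name psnr is an assumption.

```python
import math


def psnr(reference, test, peak=255):
    """PSNR in dB between two equally sized planes of samples."""
    if len(reference) != len(test) or not reference:
        raise ValueError("planes must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf                     # identical planes
    return 10 * math.log10(peak ** 2 / mse)


original_plane = [16, 32, 64, 128]
decoded_plane = [17, 31, 65, 127]           # every sample off by one, MSE = 1
y_psnr = psnr(original_plane, decoded_plane)
```

Comparing the transcoder output against both the original sequence and the HM reconstruction, as the table columns do, only changes which plane is passed as the reference.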

5.2 Peak Signal-to-Noise Ratio (PSNR) versus Quantization Parameter (QP)

Each plot shows PSNR (dB) against QP for two curves: JM encoder wrt original and transcoder wrt original.

Figure 5-1: PSNR (dB) versus QP for akiyo_cif.y4m
Figure 5-2: PSNR (dB) versus QP for city_cif.y4m
Figure 5-3: PSNR (dB) versus QP for crew_cif.y4m
Figure 5-4: PSNR (dB) versus QP for flower_cif.y4m
Figure 5-5: PSNR (dB) versus QP for football_cif.y4m

5.3 Bitrate versus Quantization Parameter

Each plot shows bitrate (kbps) against QP for two curves: JM encoder wrt original and transcoder wrt HM reconstructed.

Figure 5-6: Bitrate (kbps) versus QP for akiyo_cif.y4m
Figure 5-7: Bitrate (kbps) versus QP for city_cif.y4m
Figure 5-8: Bitrate (kbps) versus QP for crew_cif.y4m
Figure 5-9: Bitrate (kbps) versus QP for flower_cif.y4m
Figure 5-10: Bitrate (kbps) versus QP for football_cif.y4m

5.4 Rate Distortion (R-D) Plot

Each plot shows PSNR (dB) against bitrate (kbps) for two curves: JM encoder wrt original and transcoder wrt original.

Figure 5-11: R-D plot for akiyo_cif.y4m
Figure 5-12: R-D plot for city_cif.y4m
Figure 5-13: R-D plot for crew_cif.y4m
Figure 5-14: R-D plot for flower_cif.y4m
Figure 5-15: R-D plot for football_cif.y4m

Chapter 6 Conclusion and Future Work

The objective of this thesis is to implement a transcoding scheme that makes it possible for a device with H.264 support to play H.265 encoded bitstreams. The results in Chapter 5 show that the main goal of a transcoder, converting the bitstream while preserving similar video quality, has been met. As expected, PSNR has decreased while bitrate has increased for the transcoder when compared with the JM encoder working directly on the original raw video. This is due to the fact that re-encoding was performed on a reconstructed video which already deviated from the original video, since it had been processed through the HM encoder and decoder. The quality of the video depended on how well the HM encoded video was decoded before it was given as input to the transcoder.

The time complexity of this implementation is high since full decoding followed by full encoding is performed. Motion estimation accounts for most of the time spent in the re-encoding phase. Various optimization techniques can be implemented to address this constraint.

This thesis was based on format conversion from one standard to another. In the coming years there will be devices capable of HEVC playback, so another area of transcoding that can be explored is spatial resolution reduction of HEVC bitstreams for use in mobile displays. HEVC supports 35 intra prediction modes; these can be mapped to one of the 9 intra prediction modes in AVC, thereby avoiding the need to decode the video all the way down to the spatial domain and re-encode it.
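The intra mode mapping suggested above can be sketched as a nearest-direction lookup. The Python sketch below is a hypothetical mapping, not a method evaluated in this thesis: HEVC planar and DC are both sent to the AVC DC mode, and each HEVC angular mode (2 to 34) is sent to the AVC 4x4 mode whose direction is approximately closest. The anchor positions in ANCHORS are rough assumptions about which HEVC angular mode best matches each AVC direction; only the AVC mode numbers themselves are taken from the H.264 standard.

```python
# H.264 Intra4x4PredMode numbers.
H264_VERTICAL, H264_HORIZONTAL, H264_DC = 0, 1, 2
H264_DIAG_DOWN_LEFT, H264_DIAG_DOWN_RIGHT = 3, 4
H264_VERTICAL_RIGHT, H264_HORIZONTAL_DOWN = 5, 6
H264_VERTICAL_LEFT, H264_HORIZONTAL_UP = 7, 8

# Hypothetical anchors: the HEVC angular mode assumed to be closest
# in direction to each directional H.264 mode.
ANCHORS = {
    6: H264_HORIZONTAL_UP,
    10: H264_HORIZONTAL,
    14: H264_HORIZONTAL_DOWN,
    18: H264_DIAG_DOWN_RIGHT,
    22: H264_VERTICAL_RIGHT,
    26: H264_VERTICAL,
    30: H264_VERTICAL_LEFT,
    34: H264_DIAG_DOWN_LEFT,
}


def map_hevc_to_h264_intra4x4(hevc_mode):
    """Map an HEVC intra mode (0..34) to an H.264 4x4 intra mode (0..8)."""
    if not 0 <= hevc_mode <= 34:
        raise ValueError("HEVC intra modes range from 0 to 34")
    if hevc_mode <= 1:
        return H264_DC                      # planar (0) and DC (1)
    # Nearest anchor by mode index; ties resolve to the lower anchor.
    nearest = min(ANCHORS, key=lambda anchor: abs(anchor - hevc_mode))
    return ANCHORS[nearest]


mode_table = [map_hevc_to_h264_intra4x4(m) for m in range(35)]
```

A practical transcoder would probably treat such a table as a starting candidate and still test a few neighboring AVC modes, since block sizes and reference sample filtering differ between the two standards.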

APPENDIX A
Test Sequences

A 1: akiyo_cif.y4m
A 2: city_cif.y4m
A 3: crew_cif.y4m
A 4: flower_cif.y4m
A 5: football_cif.y4m

APPENDIX B
Test Conditions

The code revision of the reference software for the HEVC encoder and decoder, i.e. HM, used for this research is HM 16.7 [41].

The code revision of the reference software for the H.264 encoder and decoder, i.e. JM, used for this research is JM 19.0 [42].

All the work was done on a system with the following configuration:

Operating System: Windows 10 Home Edition
Processor: Intel(R) Core(TM) 2.00GHz 2.50GHz
RAM: 8.00 GB
System type: 64-bit Operating System, x64-based processor

Test Environment

H.265 encoded streams were generated using the reference HEVC encoder configured to work with the main profile and in intra mode. The quantization parameter was incremented in steps of 5, ranging from 22 to 37, for all the test sequences, and measurements such as PSNR, encoding time and bitrate were recorded. Similarly, the H.264 encoder was used with the high profile. A total of 60 frames at 30 frames per second were encoded in both cases.
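The test conditions above translate into a small set of parameters. The Python sketch below lists the QP values implied by "steps of 5 ranging from 22 to 37" and shows how a bitrate in kbps follows from the size of a coded stream for 60 frames at 30 frames per second. The helper name bitrate_kbps and the example stream size are illustrative assumptions; the reference encoders report bitrate themselves.

```python
QPS = list(range(22, 38, 5))    # QP from 22 to 37 in steps of 5
FRAMES = 60                     # frames encoded per sequence
FPS = 30                        # frames per second


def bitrate_kbps(stream_bytes, frames=FRAMES, fps=FPS):
    """Average bitrate in kbps of a coded stream of the given size."""
    duration_s = frames / fps               # 60 frames at 30 fps last 2 s
    return stream_bytes * 8 / duration_s / 1000


# A hypothetical 250,000-byte bitstream for one 60-frame sequence.
example_rate = bitrate_kbps(250_000)
```

Each of the five test sequences is therefore encoded at four QP points, which gives the four points per curve in the plots of Chapter 5.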