GPU Compute accelerated HEVC decoder on ARM Mali TM -T600 GPUs
Ittiam Systems Introduction DSP Systems IP Company Multimedia + Communication Systems Multimedia Components, Systems, Hardware Focus on Broadcast, Video Communication, Video Security, Mobile IP Licensing Business Model Founded in 2001 Venture funded Flexible mix of one time fees and royalties for licensing 300+ licensees Worldwide Fortune 100 companies, Tier 1 OEMs Consistently rated as Most Preferred DSP IP Supplier 250 strong Engineering Team World Class Talent Deep Multimedia and end application Expertise 29 patents issued 30+ patents filed World s most preferred DSP IP supplier 2004 2005 2006 DSP Professionals Survey by Forward Concepts 2
Ittiam Multimedia Overview Multimedia Components Audio Codecs Video Codecs/Image Codecs Algorithms for Audio Effects, Acoustics, Imaging ARM CPU, NEON Optimized DSP+HW Accelerators + GPU expertise and capabilities Middleware + SDKs System components Parsers, Creators, Stacks, Subtitles Multimedia Integration Android, Other Frameworks Use Case validation Enhancements to existing Middleware Application Specific SDKs OEM Applications Complete Multimedia Applications Covers major Multimedia Use Cases Camera, Gallery, Editor, Players, Video Editor Production tested Customizable to requirements 4x 3
Ittiam Multimedia Solutions and ARM Strategic Platform Focus on Mobile, Home, Portable segments ARM Connected Community Member Strong Portfolio of IP Expertise in ARM architecture and optimizations for ARM Long Investment Many years of development on ARM Platforms Covering ARM9E, ARM11, Cortex -A8, A9, A15, A5, A7 and NEON TM In house developed reference C models for all IP Efficient, targeted for ARM, validated across multiple generations Partnership Joint Benchmarking of implementations Early Access to Mali/OpenCL information Early involvement on new platforms 4
Ittiam Media Processing Elements Audio Codes Stereo and Multichannel MP12, AAC- LC/HE v1&v2, AC3, DD+ High Quality Resampler Post Processing and Audio Effects Field Proven Acoustics Voice Quality Enhancements with Echo Cancellation/ AEC), Noise Reduction/ANR Equalizer for Microphone Sin & Speaker AGC, AVC, Audio De-Reverb Mic Beam Forming Video Codecs MPEG2, MPEG-4, H.264, HEVC / H.265 Scalable across Multiple ARM Cores Optimized for bandwidth and CPU + NEON Error Resilience for Streaming Use cases In Production Image Processing De-noise, Face detection, Red-eye correction Panorama, HDR, Low Light, 3D B&W, Sepia, Cross Process Exposure, Colours, Geometric, Filters 5
HEVC Overview
HEVC / H.265 Sandard HEVC aka H.265 is a video compression standard, jointly developed by ISO/IEC MPEG and ITU-T VCEG MPEG and VCEG have established a Joint Collaborative Team on Video Coding (JCT-VC) to develop the HEVC standard HEVC is a successor to H.264 standard HEVC can support ultra high resolutions upto 8192 x 4320 pixels HEVC offers substantially higher video compression ratio compared to existing standards
H.265 vs H.264 Tool H.264 H.265 Coding unit 16x16 macroblocks Block coding Structure Coding tree blocks (64x64) Quadtree coding structure Transforms 4x4 and 8x8 4x4, 8x8, 16x16 and 32x32 Inter Prediction 4x4 to 16x16 Symmetric partitions Intra Prediction 9 Modes 35 Modes 4x4 to 64x64 Asymmetric partitions Motion Prediction Spatial Median Advanced Motion Vection Prediction (Spatial + Temporal) Luma motion compensation Chroma motion compensation 6 taps for half-pel positions+ Bilinear filter for qpel positions 2 taps 4 taps 8 taps for half-pixel positions + 7 tap filter for quarter-pel positions Slices Slices for parallel parsing Wavefront parallel processing Tiles and slices for parallel parsing In-loop filters Deblocking Deblocking and SAO
BitRate HEVC compression About 50% compression over H264 for video resolutions of 1080p and above. 30-40% compression over H264 for lower resolutions MPEG-2 35% reduction in bitrate for same PSNR output when compared to H.264 H264/AVC Perceptual video quality is subjective and cannot be measured with PSNR values 1990 2000 2010 H265/HEVC Subjective tests have shown around 50% reduction in bitrate for similar perceptual video quality when compared to H.264
HEVC Applications Near Term Over-the-top(OTT) video services market is growing at a rapid pace, thanks to Netflix, Hulu, YouTube etc., Smarter Phones and Tablets contribute significantly to OTT growth with consumers opting to view videos on-the-go OTT video services are popularly used with in TVs/set-top boxes as well Rapid growth in OTT market chokes the network bandwidth One in five Consumers abandon viewing due to slow feeds, poor quality viewing experience HEVC will enable superior viewing experience with OTT video service
HEVC Applications Long Term Higher quality video in the traditional terrestrial and satellite broadcasts Video recording in cameras and mobile phones, for saving storage space or higher quality Broadcasting 1080p video at 50 or 60 frames per second for the same bandwidth as 1080i (25 or 30 fps) 4K and 8K Ultra-HD broadcasts for theatre-like quality
Need for Software HEVC Decoder HEVC is a newly ratified standard and there is no hardware support in the current generation of Processors (Embedded / Mobile / Applications SoCs) Dedicated HW accelerators for HEVC increases the silicon area and hence the cost significantly Lack of HEVC content makes the early HW implementation risky Software Decoding is simpler and economically viable option for HEVC deployment NOW Handling the HEVC decoder complexity on a wide range of processors with constraints on the power consumption is key challenge for the Software Decoder
Why use GPUs for Video Processing? Decoding of high resolution videos in software involves high computational complexity and will load the CPU enormously GPUs are highly compute capable and power efficient devices Sin CPU Core(s) ARM Cortex with NEON GPUs are generally idle during video playout GPU acceleration will free up the CPU to perform other (system) tasks MALI T600 / OpenCL compliant GPU
HEVC Decoding on Capable GPUs GPUs are massively multithreaded devices capable of handling hundreds or thousands of threads in parallel at any given time Only highly data parallel algorithms of video codec can be efficiently offloaded to the GPU for processing Parsing & Entropy Decode Motion Compensa tion Inverse Quant Intra Prediction Inverse Transform Recon Deblocking & SAO Not suitable for GPU execution Data parallel execution,suitable for GPU execution
Motion Compensation The current picture/frame pixels is predicted from the reference frame s pixels The reference picture can be from past or future The prediction happens on a block-by-block basis Sin And there can be multiple reference frames for each block
Motion Compensation The most compute intensive part of Motion compensation is sub-pixel interpolation Luma 8 or 7 tap filter Chroma 4 tap filter Sub pixel interpolation is data parallel, i.e., interpolation of each block within a frame can happen in parallel and hence suited for GPU computing Sin
Inverse Quantization and Transform Inverse Quantization & Transform The residue value need to be Inverse quantized 2-D Inverse DCT transformations should be performed over the inverse quantized data Recon & InLoop Filters Reconstruction : The output from the Motion compensation and intra prediction should be added with the output from Inverse transform In loop filtering such as Deblocking and SAO filters are applied over reconstructed samples Parsing & Entropy Decode Motion Compensati on Inverse Quant Intra Prediction Sin Inverse Transform Recon Deblocking & SAO
Challenges in CPU+GPU Implementation Efficient Partitioning of work between CPU and GPU The effective FPS of decoder will be the minimum of the FPS achieved by the CPU and GPU for their respective work So the partitioning needs to be efficient so that both of them perform their respective work at almost the same speed(fps) Efficient pipelining data between CPU and GPU The algorithms running on CPU will depend on the output of algorithms from GPU and/or vice versa A good design should make sure neither the CPU nor the GPU spend any time waiting for the output of the other Cache coherency Cache coherency between CPU and GPU data need to ensured.
Benefits of Mali T600 GPU The 128-bit vector processing Presence of GPU cache instead of Local memory Flexible OpenCL workgroup size No divergent threads Unified memory Suits DSP algorithms like Video processing No requirement for data transfers from/to global memory. Can be understood just like a CPU. Works optimizally for a large range of OpenCL workgroup sizes. Multiple block sizes in a Video frame can be handled efficiently. Similar to CPU code, conditional code can be used in OpenCL kernels as well. Different kinds of filter types, filter lengths etc., in video decode can be handled efficiently. CPU and GPU share the same memory. Video YUV buffers are pretty big. There is no need of costly memory transfers of those buffers. MALI GPUs are well suited for Video Acceleration with significant power/performance benefits
Thank You For more information visit www.ittiam.com or contact us at mkt@ittiam.com