Using Intel Graphics Performance Analyzer (GPA) to Analyze Intel Media SDK-enabled Applications

The 2nd Generation Intel Core family of processors provides hardware-accelerated media encode, decode, and preprocessing capabilities called Intel Quick Sync Video technology. The Intel Media Software Development Kit (Intel Media SDK) provides developers with a standard application programming interface (API) to create high-performance video solutions for consumer and professional uses based on Intel Quick Sync Video technology.

The Intel Media SDK provides developers with:
o Highly optimized routines for delivering maximum video performance on 2nd Generation Intel Core processors.
o Built-in support for upcoming video capabilities in future Intel platforms.
o Faster time to market with easy-to-access APIs and reduced development time.

Intel Graphics Performance Analyzers (Intel GPA)

Intel GPA is a suite of software tools that provides platform-level graphics performance analysis to help developers optimize application performance. Intel GPA has the following major components:
o Intel GPA Frame Analyzer is a powerful, intuitive, best-in-class single-frame analysis and optimization tool.
o Intel GPA System Analyzer Heads-up Display (HUD) and Standalone provide straightforward initial analysis and interactive Microsoft Direct3D* pipeline state overrides.
o Intel GPA Platform Analyzer provides a timeline view for analysis of tasks, threads, Microsoft DirectX*, OpenCL*, and GPU-accelerated media applications in context.
o Intel GPA Media Performance Analyzer shows how efficiently your code utilizes hardware acceleration on Intel Core processor-based PCs with Intel HD Graphics, and runs real-time analysis of encode and decode metrics for in-depth media performance analysis.

The Intel GPA Media Performance Analyzer helps the developer understand and fine-tune performance gaps resulting from suboptimal use of the Intel Media SDK. The rest of this white paper provides detailed usage examples and tips on using the Intel GPA Media Performance Analyzer to understand performance issues in Intel Media SDK-enabled applications. We will use an example that encodes a YUV file to H.264/AVC using the Intel Media SDK.

If Intel GPA is installed on the system, there is usually an Intel GPA icon on the taskbar. If there is no icon, start Intel GPA from the Microsoft Windows* Start menu; it will place the icon on the taskbar. Right-click on the Intel GPA icon; Media Performance will appear as one of the options (Figure 1). Select the Media Performance option to open the Intel GPA Media Performance Analyzer window as shown below.

Figure 1: Intel GPA
Figure 2: Intel GPA Media Performance Analyzer
The Media Performance Analyzer window provides real-time Intel HD Graphics usage information. The GPU General tab shows the overall usage percentage of Intel HD Graphics. The GPU Execution Unit Engine (EU) tab provides the total real-time usage of the execution units, and the table under it breaks down the usage of the various components of the GPU execution units. The right-hand side shows the total usage of the Multi-Format Codec Engine (MFX Engine) and its components.

To demonstrate how to use the Intel GPA Media Performance Analyzer to optimize an Intel Media SDK-enabled application, we will use the sample encode application shipped with Intel Media SDK distributions. The Intel Media SDK provides two APIs:

Submit a frame to Intel HD Graphics for encoding:
MFXVideoENCODE_EncodeFrameAsync(mfxSession session, mfxEncodeCtrl *ctrl, mfxFrameSurface1 *surface, mfxBitstream *bs, mfxSyncPoint *syncp);

Receive the encoded frame:
MFXVideoCORE_SyncOperation(mfxSession session, mfxSyncPoint syncp, mfxU32 wait);

MFXVideoENCODE_EncodeFrameAsync() is an asynchronous (non-blocking) API call used by the application to submit an uncompressed (NV12) frame to the Intel Media SDK for encoding. This API takes the session handle, encode settings (mfxEncodeCtrl), input frame (mfxFrameSurface1), output buffer (mfxBitstream), and sync point (mfxSyncPoint) as input. The sync point returned by MFXVideoENCODE_EncodeFrameAsync is later passed to MFXVideoCORE_SyncOperation to retrieve the encoded frame from the Intel Media SDK.

Consider the following two scenarios to understand how data flows in Intel HD Graphics between the different encode stages, and how the Intel GPA Media Performance Analyzer helps to optimize the application.

Scenario A: we will use the simple encode application to
o Read a raw frame from the input file;
o Submit the frame to the Intel Media SDK using the EncodeFrameAsync API;
o Retrieve the encoded frame from the Intel Media SDK;
o Write the encoded frame to the file.
Scenario B: we will use the simple encode application to
o Read 3 raw frames from the input file;
o Submit these frames to the Intel Media SDK using the EncodeFrameAsync API;
o Retrieve the encoded frames from the Intel Media SDK;
o Write the encoded frames to the file.

Scenario A works on one frame at a time, and Scenario B works on multiple frames (3) at a time. There is no change in the architecture of the simple encode application. The difference is achieved by the AsyncDepth parameter in the Intel Media SDK, which tells the Intel Media SDK how many frames can be submitted
asynchronously before the application starts draining the output from the Intel Media SDK. It is exposed as a parameter to the application. The input YUV file is 1920x1080 with 300 frames and is encoded to H.264/AVC at 8 Mbps with Constant Bit Rate (CBR) settings. The important encoder configuration parameters for the Intel Media SDK are set as follows:

// set mfx parameters
mfxEncParams.mfx.CodecId = MFX_CODEC_AVC;
mfxEncParams.IOPattern = MFX_IOPATTERN_IN_SYSTEM_MEMORY;
mfxEncParams.mfx.FrameInfo.ChromaFormat = MFX_CHROMAFORMAT_YUV420;
mfxEncParams.mfx.FrameInfo.FourCC = MFX_FOURCC_NV12;
mfxEncParams.mfx.FrameInfo.PicStruct = MFX_PICSTRUCT_PROGRESSIVE;
mfxEncParams.mfx.TargetUsage = MFX_TARGETUSAGE_BALANCED;
mfxEncParams.mfx.RateControlMethod = MFX_RATECONTROL_CBR;
mfxEncParams.mfx.NumThread = 0;
mfxEncParams.mfx.EncodedOrder = 0;
mfxOption.CAVLC = MFX_CODINGOPTION_OFF; // use CABAC

Figure 3 shows the changes in the Intel GPA Media Performance Analyzer window when we run the simple encode application for Scenario A:

Figure 3: Scenario A - Intel GPA Media Performance Analyzer View

Figure 4 shows the Intel GPA Media Performance Analyzer output when Scenario B is run:
Figure 4: Scenario B - Intel GPA Media Performance Analyzer View

In Scenario B, GPU usage is 98%, while in Scenario A, GPU usage is 62%. Similarly, execution unit and MFX unit usage are higher in Scenario B.

Table 1: Performance Comparison of Scenario A and Scenario B

               Scenario A                     Scenario B
               (single synchronized          (multiple asynchronous
               frame encoding)               frame encoding)
GPU Usage      62%                           98%
EU Usage       51%                           83%
MFX Usage      10%                           25%
Frames/sec     41* fps                       108* fps

Note: * the above performance numbers in no way indicate the minimum or maximum performance that can be achieved with Intel Quick Sync Video. These numbers were obtained with a sample application in order to illustrate the Intel GPA tool.

It is easy to imagine that multiple asynchronous frame encoding will yield better performance than single synchronized frame encoding. However, what is the optimal value for AsyncDepth (an Intel Media SDK parameter) to get the best performance? Often the developer is running multiple asynchronous encoding sessions but is not getting any performance benefit, or it is actually hurting the
performance. The Intel GPA Media Performance Analyzer helps by visualizing concurrency between hardware processing elements and logical (API) calls. If you click on the Capture button while the application is running, the Intel GPA Media Performance Analyzer will capture a detailed execution trace of the application.

Let us run Scenario A of the sample encode application and capture the trace to understand its execution inside Intel HD Graphics. Tracing Duration specifies the length of the trace to be captured in milliseconds; 1000 ms or 2000 ms is more than enough to understand the performance issues of the application. Click on the Capture button and start the simple encode application. After 1000 ms, the trace will stop and open in a separate window. Figure 5 shows the trace from the simple encode application for Scenario A:

Figure 5: Scenario A - Trace View

The trace provides a system-wide view of how the application code works with the Intel Media SDK and how media-related workloads execute on Intel HD Graphics. The trace is organized in horizontal tracks; a track can be a processing thread (running on a CPU core), activity within an Intel HD Graphics hardware block, or the duration of a blocking API call. The user can easily zoom in or out horizontally with the mouse wheel or the - and = keys. The bottom right corner provides the list of panels that can give more details about the trace when the user selects part of it. For example, in Figure 5, the Statistics panel is selected and shown in the top right panel.
Here we click inside a track and zoom in (Figure 6 below). This expands the details of the tracks.

Figure 6: Scenario A - Expanded Track View

The Statistics and Summary panels provide more details about the selected region. The Summary panel tells us that MFX_SyncOperation is selected. This can easily be confirmed by zooming in further on the Track panel.

Figure 7: Scenario A - Expanded Frame Info

Figure 7 shows the encoding of one H.264/AVC frame. The first track, msdk_sample.exe (let us call it the Application Track), runs the MFX_SyncOperation API. The corresponding tasks in the Intel Media SDK are labeled as the MSDK Track. Two GPU Encode tracks show the movement of frames inside the GPU and the execution of different functions. The first GPU Encode track is labeled the Motion Estimation Track, and the other is labeled the Coding Track. Let us examine these tracks in more detail.

Application Track

Zooming in a little more, we can see that the Application Track has two function calls, MFX_EncodeFrameAsync and MFX_SyncOperation.
MFX_EncodeFrameAsync takes only 0.0237 ms, while MFX_SyncOperation takes 20.75 ms.

Figure 8: Scenario A - Application Track

The MFX_EncodeFrameAsync call is asynchronous and returns immediately, but the MFX_SyncOperation call waits for the encode to finish for the frame submitted by MFX_EncodeFrameAsync. This is one of the issues in Scenario A: the application waits for each frame to complete before submitting the next frame for encoding. The encode itself does not take that much time, but there is additional overhead associated with the operation, e.g., copying the frame from CPU memory to graphics unit memory and getting the encoded frame back from the graphics unit.

MSDK Track

The MSDK Track has a main task called Encode Submit, which has multiple subtasks (see Figure 9). Encode Submit first locks the frame, then copies it to the graphics unit, then unlocks it. The time for this copy step depends on the frame size; in our case it is a 1920x1080 YUV buffer. The next step issues DXVA commands to the graphics unit to encode the frame; that is where the graphics unit starts encoding.

Figure 9: Scenario A - MSDK Track
The DXVA2_Execute time may differ for every frame type (I-frame, B-frame, or P-frame), as the driver may attach other information to the command (e.g., information needed to manage reference frames).

GPU Encode Track

There are two GPU Encode tracks (Figure 10) because two different hardware blocks are used in the encoding process to perform two separate tasks.

Figure 10: Scenario A - GPU Encode Track

The first GPU Encode track is responsible for motion estimation, which includes motion detection and mode decision. The second is mainly responsible for bitstream coding based on the information sent from the Motion Estimation Track, which includes CABAC. The application cannot control the execution time of these tracks, but understanding them is helpful for performance work.

Motion Estimation Track

The Motion Estimation Track is actually kernel software that runs on the Execution Units (EUs) of the graphics unit. This kernel executes while adaptively invoking the motion estimation acceleration hardware; the actual behavior depends on the TargetUsage set by the application. Here we can see 4 stages inside the Motion Estimation Track with MFX_TARGETUSAGE_BALANCED mode. The performance of this track depends on the usage parameter, but it is also affected by whether Intel Turbo Boost Technology on the graphics unit is on or off.

Coding Track

The Coding Track executes on independent coding acceleration hardware, which is separate from the EUs. Because motion estimation and coding run on
independent hardware, the Motion Estimation Track and Coding Track can work in parallel. In the Intel GPA Platform Analyzer, these two processes appear serialized, but this is only because of a logical dependency: the Coding Track needs the motion vectors and macroblock types from the Motion Estimation Track. In other words, this serialization is imposed by the driver software, not by hardware logic. If there is no dependency, the hardware can work in parallel. This is a very important point to consider when optimizing encoding performance.

The Motion Estimation Track is separated into multiple stages to improve performance by breaking the entire motion estimation process into several pieces. So, multiple frames that are in the encoding stage (Motion Estimation Track or Coding Track) can be encoded in parallel; motion estimation for the current frame can run in parallel with coding of the previous frame. The Intel GPA Media Performance Analyzer helps the developer understand whether the encode hardware inside the graphics unit is used optimally.

Encode_Query

Encode_Query messages are issued by the Intel Media SDK to check whether encoding is complete. Looking at Figure 10, after Encode Submit there are tasks at regular intervals in the MSDK Track; these are Encode_Query tasks. Figure 11 shows the expanded form of the MSDK Track near Encode_Query.

Figure 11: Scenario A - Encode Query

If the encoded frame is ready, the graphics driver locks the frame, copies the data from the graphics unit to the destination frame, and unlocks the frame. This data copy (graphics unit to CPU) is much faster than the copy of the uncompressed frame to the graphics unit, for two reasons: first, the compressed frame is much smaller than the uncompressed frame, and second, the Intel Media SDK uses an optimized data copy based on a combination of MOVNTDQA and MFENCE instructions.
An application can get the best performance from Intel Quick Sync Video technology by fully utilizing the hardware acceleration capabilities in Intel HD Graphics. For example, in Scenario A:
1. There is a gap of ~4.8 ms between the end of the first frame's encoding and the start of the second frame's encoding (Figure 12). The application should submit more work to the graphics unit, as the graphics unit is idle during this gap.

Figure 12: Scenario A - Idle Time in Intel HD Graphics

2. The graphics unit's encode capabilities are not overlapped optimally. As mentioned earlier, parts of motion estimation and coding can run in parallel, since they use different hardware units. Figure 12 shows that GPU Encode Track 1 and GPU Encode Track 2 are mostly working serially; about ~12 ms could be recovered if these were overlapped.

There are some obvious ways to accomplish this (see Figure 13): first, by submitting two frames in parallel, we can eliminate the CPU bottleneck; second, by using more than two frames in parallel, we can fully overlap the two Encode tracks.

Figure 13: Scenario A - Add Parallelism

Let us look at the Intel GPA Media Performance Analyzer trace for Scenario B (Figure 14).
Figure 14: Scenario B - Expanded View

In Scenario B, the application uses the Motion Estimation Track and the Coding Track in parallel, with no idle time between them. Multiple frames are submitted for encoding to the graphics unit, which also limits the idle time between the different hardware encoding units. It is clear that Scenario B is able to hide the idle time on the graphics unit.

The performance of Scenario A is 41 fps, and the performance of Scenario B is 108 fps. However, Scenario B only saves 7-16 ms of time on the hardware, which does not account for the more-than-double performance boost. Where is the extra performance coming from? To find out, let us compare the traces of the two scenarios. Figure 15 shows the combined traces from both scenarios.

Figure 15: Scenario A and Scenario B Comparison
Using the Intel GPA Media Performance Analyzer to show the time spent in each task, it appears that the Motion Estimation task in the GPU Encode Track spends roughly 7-8 ms in Scenario B, compared to 12-14 ms in Scenario A. How is that possible? It is the same amount of work; we did not change the frame size or any other encoding parameters.

This performance gain is due to Intel Turbo Boost Technology in Intel HD Graphics, which is similar to Intel Turbo Boost Technology in the CPU. In the 2nd Generation Intel Core i5-2540M processor, the base graphics frequency is 650 MHz, and the maximum graphics frequency is 1.3 GHz. If an application is not using the graphics unit heavily enough, the graphics unit remains at the base frequency. But as graphics unit usage increases, Intel Turbo Boost Technology kicks in and raises the graphics unit frequency. So, the extra performance achieved by Scenario B, beyond removing the idle time between hardware units, comes from the graphics unit's use of Intel Turbo Boost Technology.

Summary

Let me summarize the key points that we learned from analyzing Intel GPA Media Performance Analyzer traces to get the best encode performance from Intel Quick Sync Video:

1. The Encode Submit in the MSDK Track should finish before the Motion Estimation for the previous frame finishes (Figure 14). If the application can do that, it ensures the Motion Estimation unit has another frame to work on as soon as the current frame is finished.

2. If the requirement in point 1 is hard to achieve due to application complexity, try to complete Encode Submit before the Coding Track for the previous frame completes. This enables the application to achieve nearly 100% GPU utilization.

3. If even the second point is not feasible, try to achieve 90% GPU utilization; then Intel Turbo Boost Technology on the graphics unit can still benefit application performance.
An application can meet one of the above requirements by using the AsyncDepth parameter in the Intel Media SDK to submit multiple frames before synchronizing them. How many frames need to be submitted? The answer is the number that meets at least one of the above requirements. The ideal number of frames to submit depends on the execution time inside the graphics unit: if it takes 20 ms, two frames may be enough, but if it is less than 10 ms, it may be better to submit three or more frames. In other words, the ideal number depends on the acceleration architecture/platform and the application. Application developers should experiment with the AsyncDepth parameter in mfxVideoParam and use Intel GPA to understand graphics unit usage. This way the developer can achieve the best performance using Intel Quick Sync Video. Even though this example is only for the encode process, the
same analyzing principles hold true for decode and video pre-processing using Intel Quick Sync Video. *OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804