FLIX: Fast Relief for Performance-Hungry Embedded Applications


Tensilica, Inc.
February


Contents
FLIX: Fast Relief for Performance-Hungry Embedded Applications
Four Applications in Search of Acceleration
Works on Large Code Blocks Too
Conclusion

Figures
Figure 1: Designer-defined FLIX instructions for the Xtensa LX processor can be either 32 or 64 bits wide and can encode several independent operations in one instruction word
Figure 2: Bit Manipulator application performance versus gate-count
Figure 3: H.264 deblocking filter application performance versus gate-count
Figure 4: MPEG-4 decoder application performance versus gate-count
Figure 5: SAD (sum of absolute differences) application performance versus gate-count

Tables
Table 1: Results for Bit Manipulator application
Table 2: Results for H.264 deblocking filter
Table 3: Results for MPEG-4 decoder
Table 4: Results for SAD (sum of absolute differences) engine


FLIX: Fast Relief for Performance-Hungry Embedded Applications
By Steven Leibson and John Massingham, Tensilica, Inc.

Microprocessors are great building blocks for all types of embedded systems because they're so flexible. Compile some code for them and they can decode and play digital audio, route IP network packets, or decompress video (just to name a very few applications). If microprocessors were infinitely fast, there'd never be a need to design any other hardware. However, microprocessors aren't infinitely fast. Often, they're not even fast enough to meet project goals.

One of the bottlenecks that prevents general-purpose microprocessor designs from meeting performance goals is their insistence on executing one operation at a time. Modern RISC processor designs solve this problem somewhat through pipelining, which allows several instructions to be in various pipeline stages simultaneously. However, most RISC designs remain single-instruction-issue machines.

To combat this bottleneck, processor designers sometimes develop designs that issue and execute multiple independent operations simultaneously. These processors are often called VLIW (very long instruction word) machines because they encode multiple independent operations into one long instruction word. Many classes of programs benefit from the increased instruction parallelism provided by VLIW processor designs. However, VLIW instruction words must often be hundreds of bits long to allow the encoding of many simultaneous independent operations. As a result, VLIW programs tend to be large, which is the usual price for encoding multiple, independent operations.

Tensilica has developed a VLIW-like technology called FLIX (flexible-length instruction extensions) for its Xtensa processor core family. This technology offers developers a way to realize the performance of VLIW instructions without the usual VLIW code bloat. And with Tensilica's XPRES Compiler, SOC designers don't have to become processor designers to employ this technology: the XPRES Compiler exploits this capability when it automatically generates processor configurations.

FLIX instruction formats can be either 32 or 64 bits wide and can encode many independent operations in designer-defined operation slots within the FLIX instruction word, as shown in Figure 1. Note that as the number of independent operations encoded in each FLIX instruction increases, the number of bits available in each operation slot decreases because the number of bits in the instruction is constant. With fewer available encoding bits, general-purpose instructions become more specialized because there are fewer bits available to specify source and destination operands and immediate values. This fact will be important to remember when analyzing the results of the tests made for this white paper.
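The trade-off between slot count and encoding bits can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name and the `format_bits` parameter are hypothetical, and it reports an average, since real FLIX formats let the designer give slots different widths and the white paper does not state how many bits a format reserves for overhead.

```python
def average_slot_bits(instruction_bits: int, num_slots: int, format_bits: int = 0) -> float:
    """Average encoding bits per operation slot in a fixed-width instruction word.

    format_bits is a hypothetical allowance for bits that identify the format;
    the white paper gives no figure for it, so it defaults to 0.
    """
    if instruction_bits not in (32, 64):
        raise ValueError("FLIX instruction words are 32 or 64 bits wide")
    if num_slots < 1:
        raise ValueError("an instruction needs at least one operation slot")
    if not 0 <= format_bits < instruction_bits:
        raise ValueError("format_bits must leave room for the slots")
    return (instruction_bits - format_bits) / num_slots


# The three example formats from Figure 1.
for width, slots in [(64, 3), (64, 5), (32, 4)]:
    bits = average_slot_bits(width, slots)
    print(f"{slots} slots in {width} bits: about {bits:.1f} bits per slot")
```

Under these assumptions, going from three to five slots in a 64-bit word cuts the average slot from roughly 21 bits to roughly 13, which leaves less room for operand specifiers and immediate values.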

Figure 1: Designer-defined FLIX instructions for the Xtensa LX processor can be either 32 or 64 bits wide and can encode several independent operations in one instruction word. The three example formats shown are (from top to bottom) a 3-operation, 64-bit FLIX instruction; a 5-operation, 64-bit FLIX instruction; and a 4-operation, 32-bit FLIX instruction. Note that as the number of operations in a FLIX instruction increases, the number of bits available to encode each operation decreases.

The Xtensa LX C/C++ compiler that is generated along with a FLIX-enhanced Xtensa LX processor core can exploit the operational parallelism provided by FLIX instructions. Thus FLIX instructions can be used selectively to improve application performance where needed, while the processor's native 24- and 16-bit instructions can be used in other sections of the code where parallelism isn't needed. This flexibility allows the compiler to generate compact code in sections of the application where the high performance of multiple operations per clock isn't required.

Four Applications in Search of Acceleration

To demonstrate the ability of FLIX instructions to accelerate code performance, Tensilica used the XPRES Compiler to automatically analyze the C code of four different applications. The XPRES Compiler is an ideal tool for this sort of architectural exploration. In much less than an hour, the XPRES Compiler analyzes the instruction flow within a C program and then generates hundreds of thousands or millions of candidate processor architectures, all based on Tensilica's Xtensa LX processor core. It then selects the best candidates based on silicon area (cost) and performance criteria, and presents the final candidates to the system designer, who selects a final architecture based on project goals.

For each of the four applications considered in this white paper, we used a baseline single-instruction-issue processor configuration with one load/store unit that executes the full Xtensa LX instruction set to generate baseline performance numbers. We then allowed the XPRES Compiler to generate eight additional processor configurations (four with one load/store unit and four with two load/store units) with FLIX enhancements. In each group, the XPRES Compiler generated versions of the Xtensa LX processor with 2, 3, 4, and 5 operation slots in the FLIX instruction word. The addition of a second load/store unit allows the Xtensa LX processor to emulate XY memory operation, a popular performance-enhancing feature found in many DSP processors. Adding the second load/store unit requires the use of FLIX technology because each load/store unit requires its own operation field.

For these experiments, we restricted the XPRES Compiler to only one of its three optimization methods: the addition of FLIX instructions. The XPRES Compiler can also create new kinds of instructions using operator fusion and SIMD vectorization techniques, and these additional optimizations are discussed in another Tensilica white paper: The XPRES Compiler triple-threat solution to code performance challenges. However, for this white paper, we constrained the XPRES Compiler to use only FLIX optimizations, and no new operations were created. The XPRES Compiler was only allowed to replicate baseline processor instructions in the additional FLIX operation slots if the additional parallelism could increase the application's performance. The results from these experiments show which of the four application programs benefit from the addition of an extra load/store unit and which benefit from the additional operation slots.

The four test applications include:
- Bit Manipulator, a simple multi-operation algorithm that takes two numbers, masks each, shifts each, and then adds them together in a loop
- An H.264 (video) deblocking filter
- A SAD (sum of absolute differences) algorithm for video motion estimation
- An MPEG-4 video decoder algorithm

Cycle counts for these four algorithms running on the baseline single-operation/clock Xtensa LX processor range from a few tens of thousands to hundreds of millions of cycles. Performance improvements from FLIX extensions range from cycle-count reductions of as much as 63% (the code runs nearly three times faster) to about 6%, which shows that not all code benefits from the availability of multiple simultaneous operations. Some code is stubbornly serial and cannot be accelerated through the operational parallelism of SIMD units or even big VLIW architectures.

It's very important to note that the use of the XPRES Compiler allowed this design-space exploration to occur very quickly. XPRES can examine a block of code and generate multiple processor designs in less than an hour. Even a 6% performance improvement could help many projects meet performance goals. However, tripling an algorithm's speed in the time it takes to go out for a meal is truly a remarkable result.
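Cycle-count reductions and speedup factors are easy to confuse, so it may help to see the conversion written out. The short sketch below (the function name is ours, not Tensilica's) applies the standard relation speedup = 1 / (1 - reduction) to the two percentages quoted above.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional cycle-count reduction into a speedup factor.

    A reduction of 0.63 means the new cycle count is 37% of the old one,
    so the code runs 1 / 0.37 times as fast.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


for pct in (63, 6):
    factor = speedup_from_reduction(pct / 100)
    print(f"{pct}% fewer cycles -> {factor:.2f}x faster")
```

A 63% reduction corresponds to a speedup of about 2.7x, which is the "nearly three times faster" figure, while a 6% reduction is a speedup of about 1.06x.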
Tables 1 through 4 (which appear at the end of this white paper) list the raw performance numbers for the four applications listed above. These tables show the performance of the unaugmented Xtensa LX processor core, a very competent 32-bit embedded RISC processor even without application-specific extensions, and they list the performance numbers for the enhanced 2-, 3-, 4-, and 5-slot Xtensa LX processors created by the XPRES Compiler.

Figure 2: Bit Manipulator application performance versus gate-count. The plot shows cycle count versus incremental gate count for an Xtensa LX base processor and for processors with 2-, 3-, 4-, and 5-slot FLIX extensions and one or two load/store units.

The Bit Manipulator application results appear in Figure 2 and are perhaps the easiest to understand. The dark line in Figure 2 shows performance results for the baseline Xtensa LX processor and for processors that have been enhanced with 2-, 3-, 4-, and 5-slot FLIX instructions. All of the operation slots in these processors are filled with instances of Xtensa LX baseline instructions, but the processors with FLIX enhancements can execute multiple operations during each clock cycle. All of the processors represented by the dark line in the graph have one load/store unit. The lighter line in Figure 2 plots the same results, but all of the processors on that line (except for the baseline processor) have two load/store units. The graph plots the cycle count required to execute the application versus the number of additional gates required to add the multiple execution units and the second load/store unit.

As you can see from this graph, adding the ability to execute multiple simultaneous operations greatly accelerates the Bit Manipulator application. This application places load, mask, shift, and store operations within a loop, and the Xtensa C/C++ compiler is able to profitably use the additional parallel execution resources to accelerate loop performance. A processor with 3-slot FLIX instructions requires only about 37% of the execution cycles needed by the baseline Xtensa LX processor to execute this application code, so it is nearly three times faster for this application.
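To see why a loop like this offers parallelism, consider a minimal sketch of a mask-shift-add kernel. This is not the Bit Manipulator source, which the white paper does not reproduce: the function name, the mask and shift parameters, the right-shift direction, and the 32-bit wraparound are all assumptions made for illustration.

```python
def bit_manipulate(a, b, mask_a, mask_b, shift_a, shift_b):
    """Mask, shift, and add pairs of values (hypothetical stand-in kernel)."""
    out = []
    for x, y in zip(a, b):            # two loads per iteration
        xm = x & mask_a               # chain 1: mask
        ym = y & mask_b               # chain 2: mask (independent of chain 1)
        xs = xm >> shift_a            # chain 1: shift
        ys = ym >> shift_b            # chain 2: shift (independent of chain 1)
        out.append((xs + ys) & 0xFFFFFFFF)  # the chains meet here; one store
    return out


print(bit_manipulate([0xF0, 0x12], [0x0F, 0x34], 0xF0, 0x0F, 4, 0))
```

The two mask-and-shift chains do not depend on each other until the final add, so a multi-slot instruction format gives the compiler room to pair operations from one chain with operations from the other, or with the loads and stores of neighboring iterations.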
The addition of a FLIX instruction format with three operation slots is a general sort of extension and can be usefully employed to accelerate a wide range of application code.

There are three additional factors to note with respect to Figure 2:

1. Essentially all of the benefit from parallel operations is realized in the processor with 3-slot FLIX instructions. More operation slots add more hardware parallelism, but the Xtensa C/C++ compiler is unable to exploit the additional available instruction-level parallelism for this particular application program.

2. Processors with a second load/store unit (results shown by the lighter line in Figure 2) are no faster than the same processor configuration with one load/store unit. This result indicates that the Bit Manipulator application is compute intensive and that the load/store unit is not a bottleneck in this instance, so the additional cost (in terms of silicon area) of the second load/store unit is not merited in this case.

3. The 5-slot version of the Xtensa LX processor actually exhibits slightly degraded performance compared to the 3- and 4-slot versions. This result shows that forcing the XPRES Compiler to add more than the required number of operation slots can result in a loss of operation efficiency by reducing the number of operation-encoding bits available to each operation slot. Having more encoding bits available per operation allows the XPRES Compiler to create more comprehensive operations so that the C compiler needs fewer of these operations to execute a task.

Works on Large Code Blocks Too

Figure 3 illustrates these same trends but for a much larger application: an H.264 deblocking filter. This application program requires nearly 200 million cycles to run on an unaugmented Xtensa LX processor. The XPRES Compiler achieves about a 6% performance improvement (eliminating more than 10 million execution cycles from the application in a few minutes) by adding a FLIX instruction format with two operation slots. Based on the results, this particular application appears to be more limited by data movement than by a lack of computational resources, because additional operation slots do not appear to improve performance. In fact, adding more than two operation slots appears to slightly degrade performance compared to the 2-slot result. However, the addition of a second load/store unit nearly doubles the achieved performance improvement, as shown by the lighter line in the graph in Figure 3. This result differs from the one observed for the Bit Manipulator application, demonstrating that different applications really do benefit from different processor optimizations.
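The contrast between the two applications can be illustrated with a back-of-the-envelope resource bound. The sketch below is not how the XPRES Compiler or the Xtensa compiler schedules code. It is a simplified lower bound that ignores data dependences and operation latencies, and the operation counts in the example are hypothetical rather than measured from either application.

```python
from math import ceil


def min_cycles_per_iteration(mem_ops: int, alu_ops: int,
                             slots: int, load_store_units: int) -> int:
    """Lower bound on cycles per loop iteration from issue resources alone.

    Assumes one operation per slot per cycle and one memory operation per
    load/store unit per cycle. Dependences and latencies are ignored.
    """
    if slots < 1 or not 1 <= load_store_units <= slots:
        # Each load/store unit needs its own operation slot.
        raise ValueError("need at least one slot per load/store unit")
    issue_bound = ceil((mem_ops + alu_ops) / slots)
    memory_bound = ceil(mem_ops / load_store_units)
    return max(issue_bound, memory_bound)


# Hypothetical compute-heavy loop: 2 memory ops, 7 ALU ops per iteration.
print(min_cycles_per_iteration(2, 7, slots=3, load_store_units=1))  # 3
print(min_cycles_per_iteration(2, 7, slots=3, load_store_units=2))  # 3

# Hypothetical data-movement-heavy loop: 6 memory ops, 2 ALU ops.
print(min_cycles_per_iteration(6, 2, slots=2, load_store_units=1))  # 6
print(min_cycles_per_iteration(6, 2, slots=3, load_store_units=1))  # 6
print(min_cycles_per_iteration(6, 2, slots=2, load_store_units=2))  # 4
```

In this toy model the compute-heavy loop gains nothing from a second load/store unit, while the data-movement-heavy loop gains nothing from a third slot but does improve with a second load/store unit. That is the same qualitative pattern the white paper reports for the Bit Manipulator and the H.264 deblocking filter.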

Figure 3: H.264 deblocking filter application performance versus gate-count. The plot shows cycle count versus incremental gate count for an Xtensa LX base processor and for processors with 2-, 3-, 4-, and 5-slot FLIX extensions and one or two load/store units.

Another video application, an MPEG-4 video decoder, exhibits the same sort of results as the H.264 deblocking filter. Results for this application appear in Figure 4. The pattern of results for the processors with one load/store unit is very similar to the pattern obtained for the H.264 deblocking filter, but the MPEG-4 video decoder code benefits more from the optimizations of the XPRES Compiler, which achieves a 23% reduction in cycle count by adding a second operation slot to the Xtensa LX processor. In addition, a second load/store unit further increases performance, and the processor with two load/store units can gainfully exploit three operation slots for yet more performance.

Again, it's important to remember that all of this design-space exploration consumes only a few minutes because it's automated by the XPRES Compiler. Normally, a design team would not be able to conduct this sort of extensive architectural research because hand-designed processor variants require months of design time, not minutes.
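The white paper says the XPRES Compiler selects the best candidates on silicon area and performance, but it does not describe the selection algorithm. One plausible way to think about that step is as a Pareto filter over (gate count, cycle count) pairs, sketched below. The `Candidate` type, the function name, and all of the numbers are made up for illustration and are not taken from the tables in this paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    gates: int    # incremental gate count (hypothetical)
    cycles: int   # application cycle count (hypothetical)


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate beats on both gates and cycles."""
    front = []
    for c in candidates:
        dominated = any(
            o.gates <= c.gates and o.cycles <= c.cycles
            and (o.gates < c.gates or o.cycles < c.cycles)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.gates)


# Made-up design points, loosely shaped like the curves in the figures.
candidates = [
    Candidate("base", 0, 1000),
    Candidate("2-slot", 12000, 700),
    Candidate("3-slot", 24000, 400),
    Candidate("4-slot", 25000, 400),
    Candidate("5-slot", 26000, 410),
]

for c in pareto_front(candidates):
    print(c.name, c.gates, c.cycles)
```

With these invented numbers the 4- and 5-slot points drop out because the 3-slot point is at least as fast with fewer gates. The designer would then choose among the surviving points according to project goals.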

Figure 4: MPEG-4 decoder application performance versus gate-count. The plot shows cycle count versus incremental gate count for an Xtensa LX base processor and for processors with 2-, 3-, 4-, and 5-slot FLIX extensions and one or two load/store units.

The fourth application, which is also from the video domain, is a SAD (sum of absolute differences) engine used for video motion estimation. Results achieved for this application (shown in Figure 5) are very similar to those achieved for the Bit Manipulator application, although the SAD application consumes about 20 times the number of cycles. The addition of 3-slot FLIX instructions cuts the cycle count by about 63%, and the addition of a second load/store unit provides no benefit to the SAD engine code.
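For readers unfamiliar with the kernel, a sum of absolute differences compares two pixel blocks by adding up the absolute difference of each pair of corresponding pixels; motion estimation searches for the candidate block with the smallest sum. The sketch below shows only the arithmetic. The function name and the list-of-rows block layout are assumptions, and the white paper's actual C code is not reproduced here.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks.

    Blocks are given as lists of rows of integer pixel values
    (an illustrative layout, not the one used in the benchmark).
    """
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of rows")
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        for p, q in zip(row_a, row_b):
            total += abs(p - q)   # each difference is independent of the others
    return total


current = [[1, 2], [3, 4]]
reference = [[1, 0], [5, 4]]
print(sad(current, reference))
```

Each per-pixel subtract and absolute value is independent of the others, and only the running total ties them together, which is consistent with the paper's observation that the SAD engine behaves like the compute-intensive Bit Manipulator.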

Figure 5: SAD (sum of absolute differences) application performance versus gate-count. The plot shows cycle count versus incremental gate count for an Xtensa LX base processor and for processors with 2-, 3-, 4-, and 5-slot FLIX extensions and one or two load/store units.

Conclusion

Use of the XPRES Compiler in this white paper was artificially constrained, but the results are real. They demonstrate both the ability of FLIX multi-operation instructions to accelerate various applications and the ability of the XPRES Compiler to rapidly explore processor configurations and extensions that can accelerate the execution of critical code blocks in a system. Each XPRES Compiler run for the applications discussed in this white paper took far less than an hour, and the resulting performance enhancements range from a cycle-count reduction of about 6% to a near tripling of code execution speed.

The automated XPRES Compiler allows the developer to discover how much performance benefit FLIX instruction extensions can provide to an application in the time it takes to eat lunch. Some applications benefit only mildly from FLIX-type processor enhancement and others benefit substantially. Examination of results from experiments conducted with the XPRES Compiler by Tensilica customers has prompted design teams to restructure some application code. This code restructuring, taking only a day or two, has substantially boosted application performance in some cases.

Table 1: Results for the Bit Manipulator application

Number of            One Load/Store Unit        Two Load/Store Units
instruction slots    Cycles         Gates       Cycles         Gates
1 slot               3,746                      NA             NA
2-slot FLIX          5,384          2,263       5,384          35,284
3-slot FLIX          ,29            24,662      ,287           48,46
4-slot FLIX          ,29            24,73       ,287           49,43
5-slot FLIX          ,287           26,366      2,34           49,946

Table 2: Results for the H.264 deblocking filter

Number of            One Load/Store Unit        Two Load/Store Units
instruction slots    Cycles         Gates       Cycles         Gates
1 slot               93,788,7                   NA             NA
2-slot FLIX          8,4,93         2,          74,638,399     36,92
3-slot FLIX          82,552,788     2,55        74,9,533       49,26
4-slot FLIX          82,679,72      2,684       77,978,299     48,75
5-slot FLIX          82,988,82      2,765       79,85,489      48,837

Table 3: Results for the MPEG-4 decoder

Number of            One Load/Store Unit        Two Load/Store Units
instruction slots    Cycles         Gates       Cycles         Gates
1 slot               22,8,59                    NA             NA
2-slot FLIX          93,684,776     4,74        9,69,824       38,833
3-slot FLIX          ,669,84        27,992      85,259,477     5,56
4-slot FLIX          ,7,48          28,9        9,647,42       5,549
5-slot FLIX          ,36,58         28,29       87,847,53      52,439

Table 4: Results for the SAD (sum of absolute differences) engine

Number of            One Load/Store Unit        Two Load/Store Units
instruction slots    Cycles         Gates       Cycles         Gates
1 slot               729,322                    NA             NA
2-slot FLIX          369,37         2,2         369,37         35,23
3-slot FLIX          287,943        22,         ,943           45,76
4-slot FLIX          27,43          22,747      27,43          45,862
5-slot FLIX          287,993        24,76       287,993        47,89