Application Note: 202
MDK-ARM Compiler Optimizations
Getting the Best Optimized Code for your Embedded Application

Abstract

This document examines the ARM Compilation Tools, as used inside the Keil MDK-ARM (Microcontroller Development Kit), and how to use them to optimize your code for best performance or smallest code size.

Contents

ARM Compilation Tools
Compiler Options for Embedded Applications
Optimizing for Smallest Code Size
  Compile the Measure example without any optimizations
  Optimize the Measure example for Size
Optimizing for Best Performance
  Run the Dhrystone benchmark without any optimizations
  Optimize the Dhrystone example for Performance
Summary

Revision History

August 2009: Initial Version

Information in this file, the accompanying manuals, and software is Copyright (c) ARM. All rights reserved.

ARM Compilation Tools

The ARM Compilation Tools are the only compilation tools co-developed with the ARM processors, and specifically designed to optimally support the ARM architecture. They are the result of 20 years of development, and are recognized as the industry-leading C and C++ compilation tools for the ARM, Thumb, and Thumb-2 instruction sets.

The ARM Compilation Tools consist of:
- The ARM Compiler, which enables you to compile C and C++ code. It is an optimizing compiler, and features command-line options that let you control the level of optimization
- Linker and Utilities, which assign addresses and lay out sections of code to form a final image
- A selection of libraries, including the ISO standard C libraries, and the MicroLIB C library which is optimized for embedded applications
- Assembler, which generates machine code instructions from ARM, Thumb or Thumb-2 assembly-level source code

Compiler Options for Embedded Applications

The ARM Compilation Tools include a number of compiler optimizations to help you best target your code for your chosen microcontroller device and application area. They can be accessed from within µVision by clicking on Project -- Options for Target. The options described in this document can be found on the Target and C/C++ tabs of the Options for Target dialog.

Cross-Module Optimization takes information from a prior build and uses it to place unused functions into their own ELF section in the corresponding object file. This option is also known as Linker Feedback, and requires you to build your application twice to take advantage of it for reduced code size. Cross-Module Optimization has been shown to reduce code size, by removing unused functions from your application. It can also improve the performance of your application, by allowing modules to share inline code.

The MicroLIB C library has been optimized to reduce the size of embedded applications. It is a subset of the ISO standard C runtime library, and offers a tradeoff between functionality and code size. Some of the standard C library functions such as memcpy() are slower, while some features of the default library are not supported. Unsupported features include:
- Operating system functions, e.g. abort(), exit(), time(), system(), getenv()
- Wide character and multi-byte support, e.g. mbtowc(), wctomb()
- The stdio file I/O functions, with the exception of stdin, stdout and stderr
- Position-independent and thread-safe code

Use the MicroLIB C library for applications where overall performance can be traded off against the need to reduce code size and memory cost.

Link-Time Code Generation instructs the compiler to create objects in an intermediate format so that the linker can perform further code optimizations. This gives the code generator visibility into cross-file dependencies of all objects simultaneously, allowing it to apply a higher level of optimizations. Link-time code generation can reduce code size, and allow your application to run faster.

Optimization Levels can also be adjusted. The different levels of optimization allow you to trade off between the level of debug information available in the compiled code, and the performance of the code. The following optimization levels are available:

-O0 applies minimum optimizations. Most optimizations are switched off, and the code generated has the best debug view.

-O1 applies restricted optimization.
For example, unused inline functions and unused static functions are removed. At this level of optimization, the compiler also applies automatic optimizations such as removing redundant code and re-ordering instructions so as to avoid an interlock situation. The code generated is reasonably optimized, with a good debug view.

-O2 applies high optimization (this is the default setting). Optimizations applied at this level take advantage of ARM's in-depth knowledge of the processor architecture, to exploit processor-specific behavior of the given target. It generates well optimized code, but with limited debug view.

-O3 applies the most aggressive optimization. The optimization is in accordance with the user's -Ospace/-Otime choice. By default, multi-file compilation is enabled, which leads to a longer compile time, but gives the highest levels of optimization.
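
To make the -O1 description more concrete, the following is a minimal sketch of the kind of source the text refers to: an unused static inline function and a redundant computation. The names legacy_scale and average_of_two are hypothetical and do not come from the Measure or Dhrystone examples, and what a particular compiler version actually removes at each level may differ.

```c
#include <stdint.h>

/* Unused static inline helper: nothing in this file calls it, so the
 * compiler does not need to emit any code for it once unused inline and
 * static functions are removed. (__inline is used here on the assumption
 * that the file is built in C90 mode, where plain 'inline' is not a keyword.) */
static __inline uint32_t legacy_scale(uint32_t raw)
{
    return raw * 3u;
}

/* Contains a redundant computation that an optimizer is free to drop. */
uint32_t average_of_two(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    uint32_t unused_copy = a + b;   /* redundant: same value as sum, never read */
    (void)unused_copy;              /* silences "unused variable" diagnostics */
    return sum / 2u;
}
```

At -O0 the redundant addition is likely to remain visible when single-stepping in the debugger, which is part of what "best debug view" means in practice.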

The Optimize for Time checkbox causes the compiler to optimize with a greater focus on achieving the best performance when checked (-Otime) or the smallest code size when unchecked (-Ospace).

Unchecking Optimize for Time selects the -Ospace option, which instructs the compiler to perform optimizations to reduce the image size at the expense of a possible increase in execution time. For example, it uses out-of-line function calls instead of inline code for large structure copies. This is the default option. When running the compiler from the command line, this option is invoked using -Ospace.

Checking Optimize for Time selects the -Otime option, which instructs the compiler to optimize the code for the fastest execution time, at the risk of an increase in the image size. It is recommended that you compile the time-critical parts of your code with -Otime, and the rest using the -Ospace directive.

Split Load and Store Multiples instructs the compiler to split LDM and STM instructions involving a large number of registers into a series of loads/stores of fewer multiple registers. This means that an LDM of 16 registers can be split into 4 separate LDMs of 4 registers each. This option helps to reduce the interrupt latency on ARM systems which do not have a cache or write buffer, and systems which use zero-wait-state 32-bit memory.

For example, the ARM7 and ARM9 processors can only take an exception on an instruction boundary. If an exception occurs at the start of an LDM of 16 registers in a cacheless ARM7/ARM9 system, the system will finish making 16 accesses to memory before taking the exception. Depending on the memory arbitration system, this can result in a very high interrupt latency. Breaking the LDM into 4 individual LDMs of 4 registers means that the processor will take the exception after loading a maximum of 4 registers, thereby greatly reducing the interrupt latency. Selecting this option improves the overall performance of the system.

The One ELF Section per Function option tells the compiler to put all functions into their own individual ELF sections.
This allows the linker to remove unused functions. An ELF code section typically contains the code for a number of functions. The linker is normally only able to remove unused ELF sections, not unused functions. An ELF section can only be removed if all its contents are unused. Therefore, splitting each function into its own ELF section allows the linker to easily identify which ones are unused, and remove them. Selecting this option increases the time required to compile your code, but results in improved performance.

The combination of options applied will depend on your optimization goal: whether you are optimizing for smallest code size, or best performance. The next section illustrates the best optimization options for each of these goals.
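
The options above act on ordinary C source, so a small sketch may help show which code each one is aimed at. The record type and function names below are hypothetical. The `#pragma Otime`, `#pragma push` and `#pragma pop` lines assume the per-function optimization pragmas of the ARM Compiler and are guarded by `__CC_ARM` so that other compilers skip them; check the compiler reference for your tool version before relying on them.

```c
#include <stdint.h>

/* Hypothetical data-logger record, large enough that copying it is not trivial. */
typedef struct {
    uint32_t timestamp;
    uint16_t analog[16];
    uint8_t  digital[8];
} LogRecord;

#if defined(__CC_ARM)
#pragma push        /* assumption: saves the current pragma state */
#pragma Otime       /* assumption: favour speed for the functions that follow */
#endif

/* Time-critical path: a candidate for -Otime. */
uint32_t checksum_record(const LogRecord *rec)
{
    uint32_t sum = rec->timestamp;
    unsigned i;

    for (i = 0; i < 16; i++) {
        sum += rec->analog[i];
    }
    for (i = 0; i < 8; i++) {
        sum += rec->digital[i];
    }
    return sum;
}

#if defined(__CC_ARM)
#pragma pop         /* assumption: restores the project-wide setting */
#endif

/* Not time-critical: under -Ospace the compiler may implement this
 * structure copy as an out-of-line call rather than inline code. */
void snapshot_record(LogRecord *dst, const LogRecord *src)
{
    *dst = *src;
}

/* Never called anywhere: with One ELF Section per Function enabled,
 * this function sits in its own section, which the linker can discard. */
void dump_record_unused(const LogRecord *rec)
{
    (void)rec;
}
```

Keeping time-critical functions together in one file, or between one push/pop pair, makes it easier to see which parts of the image are built for speed and which for size.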

Optimizing for Smallest Code Size

To optimize your code for the smallest size, the best options to apply are:
- The MicroLIB C library
- Cross-module optimization
- Optimization level 2 (-O2)

Compile the Measure example without any optimizations

The Measure example uses analog and digital inputs to simulate a data logger.

- File -- Open Project: C:\Keil\ARM\Boards\Keil\MCBSTM32\Measure\Measure.uv2
- Click the Options for Target button
- In the Target tab:
  - Uncheck Cross-Module Optimization
  - Uncheck Use MicroLIB
  - Uncheck Use Link-Time Code Generation
- In the C/C++ tab:
  - Set Optimization Level to Zero
- Then click OK to save your changes
- Project -- Build target

Without any compiler optimizations applied, the initial code size is 13,656 Bytes.

Optimize the Measure example for Size

Apply the compiler optimizations in turn, and re-compile each time to see their effect in reducing the code size for the example.

- Options for Target -- Target tab: Use the MicroLIB C library
- Options for Target -- Target tab: Use cross-module optimization (remember to compile twice)
- Options for Target -- C/C++ tab: Enable Optimization level 2 (-O2)

Optimization Applied         Compile Size    Size Reduction   Improvement
MicroLIB C library            8,960 Bytes    4,696 Bytes      34% smaller
Cross-Module Optimization    13,500 Bytes      156 Bytes      1.1% smaller
Optimization level -O2       12,936 Bytes      720 Bytes      5.3% smaller
All 3 optimization options    8,116 Bytes    5,540 Bytes      40.6% smaller

Applying all the optimizations reduces the code size to 8,116 Bytes. The fully optimized code is 5,540 Bytes smaller, a total code size reduction of 40.6%.
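
The Size Reduction and Improvement columns are both measured against the unoptimized build of 13,656 Bytes. As a minimal sketch of that arithmetic, the two helper functions below (hypothetical names, not part of MDK-ARM) can be used to check your own measurements after each rebuild:

```c
#include <stdint.h>

/* Bytes saved relative to the unoptimized (baseline) build.
 * Negative if the "optimized" build turned out larger. */
int32_t bytes_saved(uint32_t baseline_bytes, uint32_t optimized_bytes)
{
    return (int32_t)baseline_bytes - (int32_t)optimized_bytes;
}

/* Size reduction as a percentage of the baseline.
 * Returns 0.0 for an empty baseline to avoid dividing by zero. */
double percent_smaller(uint32_t baseline_bytes, uint32_t optimized_bytes)
{
    if (baseline_bytes == 0u) {
        return 0.0;
    }
    return 100.0 * (double)bytes_saved(baseline_bytes, optimized_bytes)
                 / (double)baseline_bytes;
}
```

For the last row of the table, bytes_saved(13656, 8116) gives 5540, and percent_smaller(13656, 8116) gives a value that rounds to 40.6.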

Optimizing for Best Performance

To optimize your code for performance, the best options to apply are:
- Cross-module optimization
- Optimization level 3 (-O3)
- Optimize for time

Run the Dhrystone benchmark without any optimizations

The Dhrystone benchmark is used to measure and compare the performance of different computers, or the efficiency of the code generated for the same computer by different compilers.

- File -- Open Project: C:\Keil\ARM\Examples\DHRY\DHRY.uv2
- Click the Options for Target button
- Turn off optimization settings in the Target and C/C++ tabs, then click OK
- Project -- Build target
- Enter Debug mode
- View -- Serial Windows -- UART #1 (open the UART #1 window)
- View -- Analysis Windows -- Performance Analyzer (open the Performance Analyzer)
- Debug -- Run (start running the application)
- When prompted: enter 50000 in the UART #1 window and press Enter

In the Performance Analyzer window, note that:
- The dhry_1 loop took 2.829s
- The dhry_2 loop took 2.014s

In the UART #1 window, note that:
- It took 138.0 microseconds for 1 run through Dhrystone
- The application is executing 7,246.4 Dhrystones per second

Optimize the Dhrystone example for Performance

Re-compile the example with all three of the following optimizations applied:
- Options for Target -- Target tab: Cross-module optimization (remember to compile twice)
- Options for Target -- C/C++ tab: Optimization level 3 (-O3)
- Options for Target -- C/C++ tab: Optimize for Time

Re-run the application, and examine the performance.

Measurement                                Without optimizations   With optimizations   Improvement
dhry_1                                     2.829s                  1.695s               40.1% faster
dhry_2                                     2.014s                  1.011s               49.8% faster
Microseconds for 1 run through Dhrystone   138.0                   70                   49.3% faster
Dhrystones per second                      7,246.4                 14,285.7             97.1% more

The fully optimized code achieves approximately 2x the performance of the unoptimized code.
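
The two UART figures are related: Dhrystones per second is one million divided by the microseconds taken for one run. The sketch below shows that conversion and the two percentage calculations behind the Improvement column. The function names are hypothetical and are not part of the Dhrystone source.

```c
/* Runs per second from the time for one run, in microseconds.
 * Returns 0.0 for a non-positive time to avoid dividing by zero. */
double dhrystones_per_second(double microseconds_per_run)
{
    if (microseconds_per_run <= 0.0) {
        return 0.0;
    }
    return 1.0e6 / microseconds_per_run;
}

/* "x% faster": reduction in elapsed time relative to the original time. */
double percent_faster(double time_before, double time_after)
{
    if (time_before <= 0.0) {
        return 0.0;
    }
    return 100.0 * (time_before - time_after) / time_before;
}

/* "x% more": increase in throughput relative to the original rate. */
double percent_more(double rate_before, double rate_after)
{
    if (rate_before <= 0.0) {
        return 0.0;
    }
    return 100.0 * (rate_after - rate_before) / rate_before;
}
```

With the measurements above, dhrystones_per_second(138.0) is about 7,246.4 and dhrystones_per_second(70.0) is about 14,285.7, which is why halving the run time shows up as roughly 97% more Dhrystones per second.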

Summary

The ARM Compilation Tools offer a range of options to apply when compiling your code. These options can be combined to optimize your code for best performance, for smallest code size, or for any performance point between these two extremes, to best suit your targeted microcontroller device and market.

When optimizing your code, MDK-ARM makes it easy and convenient to measure the effect of the different optimization settings on your application. The code size is clearly displayed after compilation, and a range of analysis tools such as the Performance Analyzer enable you to measure performance.

The optimization options in the ARM Compilation Tools, together with the easy-to-use analysis tools in MDK-ARM, help you to easily optimize your application to meet your specific requirements.