Case Study: Improving FPGA Design Speed with Floorplanning
An introduction to Xilinx PlanAhead 10.1

by Consultant Kent Salomonsen (kent.salomonsen@teknologisk.dk)

Picture this: the RTL is simulating perfectly, prototype hardware is waiting on the desk, and the entire project team is excited to see some on-board action, but the FPGA tool suite can't make the RTL meet timing closure. What does the skilled engineer do now? Maybe more time should be invested in hunting down a better tool setup that will provide the lacking nanoseconds or picoseconds, although this approach is often a dead end. Or maybe time is better spent tweaking the RTL, knowing that correct functionality, code clarity and design performance are at stake. At worst the design will never run in the FPGA mounted in the prototype hardware.

A third approach is floorplanning. If the targeted FPGA is a larger sub-90 nm type and the RTL design utilizes a significant amount of FPGA resources, chances are good that the old tool suite can be guided to make a failing design meet timing closure.

In this article some of the principles applied in floorplanning are explained, with emphasis on Xilinx PlanAhead 10.1. Based on a real-world case provided by the Danish company TC Electronic A/S, the article shows that the principles and tooling behind FPGA floorplanning are not hot air and unfulfilled promises, but are available and easily adoptable. The activities and techniques described in this article are all made available through ADD-Lab (Accelerated Digital Design Lab) at Danish Technological Institute.

Floorplanning with PlanAhead

Floorplanning is basically about establishing guidelines for a place and route tool on how to place a design structure inside an FPGA. Motivations for floorplanning can be the desire for a faster running FPGA design, a design meeting timing closure more consistently, or a reduction in tool run time.
During the floorplanning process the designer can lock parts of the RTL structure or physical parts (e.g. RAM, LUTs or flip-flops) to certain locations in the FPGA fabric. Two structural modules in a design should probably be located adjacent to each other if they are closely attached, i.e. if they share many interconnections and timing on those interconnections is critical. To enable such an approach the synthesized netlist must describe a structural hierarchy, at least for the RTL structures targeted for floorplanning. Other floorplanning attempts focus on I/O interconnections or on merging timing critical parts of one structural module with another structural module. Some floorplans need to be quite elaborate to make an FPGA design perform satisfactorily, while others can be very rudimentary yet highly effective. Establishing a good floorplan is more of an iterative trial and error process, driven by experience and qualified guesses, than a well structured linear process.

Figure 1 illustrates three RTL modules floorplanned for a Xilinx Virtex-2 FPGA. The blocks pblock_receiver and pblock_channel are populated with the structural modules receiver and channel, respectively. The structural modules have 53 interconnections (thick orange line), hence they can be considered closely attached and the floorplanning blocks should therefore be located right next to each other.

Copyright 2008 Danish Technological Institute

Figure 1 Example floorplanning for a design structure with three modules (purple blocks). Green lines illustrate I/O interconnections; white lines and the orange line (bundle) are module interconnections.

The delay time reductions achievable by floorplanning are strictly due to reductions in routing delay. The netlist logic is by no means optimized! Floorplanning only establishes a rational allocation and placement for structural and physical modules, thereby preventing timing critical logic from being scattered over unnecessarily large areas of FPGA fabric. In effect such guidelines, when carefully defined, enable significantly better performance from the place and route tool [1].

Floorplanning for Xilinx FPGAs should be conducted with the Xilinx PlanAhead tool; the Xilinx Floorplanner tool bundled with all ISE distributions is not recommended, in particular because Xilinx ISE 10.1i includes PlanAhead Lite. The Lite version provides all the functionality addressed in this article. PlanAhead can be used as a project IDE or it can be inserted in the Xilinx tool chain as illustrated in Figure 2.

[1] Put in other words: good floorplanning can to some extent make up for average performing place and route tools.
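In the ISE flow, a floorplanning block such as the ones in Figure 1 is typically expressed as a pair of area group constraints in a UCF file. The Python sketch below shows roughly what that could look like for pblock_receiver and pblock_channel. It is an illustration under assumptions, not the constraints of any real design: the slice coordinates are invented, the instance paths assume that receiver and channel sit directly below the top level, and the side_by_side helper is a hypothetical check added here only to express "located right next to each other".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pblock:
    """A rectangular floorplanning block in slice coordinates (inclusive)."""
    name: str       # area group name
    instance: str   # hierarchical instance the block is populated with
    x0: int
    y0: int
    x1: int
    y1: int


def to_ucf(pb: Pblock) -> str:
    """Render one block as two UCF area group constraint lines."""
    return (
        f'INST "{pb.instance}" AREA_GROUP = "{pb.name}";\n'
        f'AREA_GROUP "{pb.name}" RANGE = '
        f'SLICE_X{pb.x0}Y{pb.y0}:SLICE_X{pb.x1}Y{pb.y1};'
    )


def side_by_side(a: Pblock, b: Pblock) -> bool:
    """True if the blocks touch along a vertical edge and overlap in Y."""
    touching = a.x1 + 1 == b.x0 or b.x1 + 1 == a.x0
    overlap_y = a.y0 <= b.y1 and b.y0 <= a.y1
    return touching and overlap_y


# Hypothetical placement: all coordinates are made up for illustration.
receiver = Pblock("pblock_receiver", "receiver", 0, 0, 15, 31)
channel = Pblock("pblock_channel", "channel", 16, 0, 31, 31)

print(to_ucf(receiver))
print(to_ucf(channel))
print("adjacent:", side_by_side(receiver, channel))
```

In practice PlanAhead writes such constraints itself when a floorplan is exported, so a script like this is mainly useful for understanding or reviewing what the tool produces.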
Figure 2 PlanAhead inserted in the well known ISE design flow. The purple blocks all relate to PlanAhead activities and file outputs.

The Wasabi Case

The project case used to evaluate the PlanAhead flow is provided by TC Electronic A/S, see the presentation below. TC Electronic holds an IP known as Wasabi, which is continuously improved and evaluated on a platform hosting a Virtex-4 FPGA (XC4VLX60-10FF672). The Wasabi ISE project has a persistent problem: it cannot meet timing closure consistently when targeting the above FPGA type, and even small updates in logic unrelated to the timing critical paths cause timing closure failures. Extensive tuning of tool parameters has been attempted, but no robust setting has been found. In this case it is not desirable to change timing critical RTL, nor is it desirable to replace the Virtex-4 FPGA with a faster speed grade.

TC Electronic A/S is a leading manufacturer of audio processing equipment for the pro audio business. The TC Electronic portfolio is based on high performance audio processing algorithms executing in DSPs, FPGAs or proprietary ASICs.

The Wasabi ISE project appears to be a good candidate for floorplanning. Three criteria define success for such an attempt:
1. A significant performance improvement on the timing critical paths. In Xilinx authored literature, metrics such as a 15 % performance gain and a two speed-grade advantage have been used. Recognizing that these figures are probably based on sunshine scenario conditions that the Wasabi case will not provide, a realistic yet ambitious goal is an improvement of 10 % in the timing critical regions compared to the best result achievable by the current Wasabi ISE project.
2. The Wasabi ISE project must become more consistent in meeting timing closure. Evidence for this is established by a Wasabi variant with a small RTL change unrelated to the timing critical area. The RTL change must make the current Wasabi ISE project fail timing closure while the new floorplanned Wasabi ISE project maintains its timing performance.
3. A floorplan satisfying the above criteria must be fairly simple and highly reusable. As numerous updates of the Wasabi RTL are expected, the floorplan cannot address physical parts, since they may be removed or renamed in coming Wasabi RTL versions. Hence only structural module names can be addressed.

Wasabi is a multi-clock design utilizing nearly 5000 Virtex-4 slices, two BRAM blocks and two DSP48 blocks. These numbers translate to an FPGA resource utilization in the vicinity of 20 % for the XC4VLX60 type. The Wasabi ISE project used so far flattens the structural design hierarchy, optimizes for speed and enables just about every feature that can improve speed. The timing critical paths in Wasabi are designed for a clock speed of 98.3 MHz. When using the most recent Wasabi RTL, the Wasabi ISE project can meet a clock speed of 104.2 MHz on the timing critical clock network.
The peak operating frequency of 104.2 MHz, like all peak operating frequencies quoted below, is the result of a tool run using the peak frequency as target frequency; it is not deduced from the slack reported by a tool run targeting the nominal 98.3 MHz.

Step 1

First the Wasabi source files are re-synthesized in order to maintain the structural hierarchy. This step is necessary in order to floorplan at a structural level. An obvious downside is an inevitable degradation in timing performance, because logic paths traversing more than one module can't be optimized in their entirety. The new Wasabi ISE project now comes out with a performance degradation, allowing for a clock speed of only 89.2 MHz.

Step 2

The timing report was analyzed and the paths violating timing constraints were identified. The timing report from the above run and the accompanying xdl file (an ASCII description of the placed and routed design) are loaded into PlanAhead. Paths failing to meet timing closure can now be studied in multiple views such as a device view (FPGA fabric), a design hierarchy view and a schematic view. The views are quite useful for the floorplanning engineer who needs to quickly comprehend which design areas are candidates for floorplanning. Figure 3 shows the design hierarchy view for the Wasabi design [2]. The boxes highlighted in yellow contain the timing critical paths.

[2] Unfortunately PlanAhead fails to read the timing critical paths from the xdl file, while other paths are read nicely. Hence Figure 3 is constructed for illustration purposes only; the modules selected will be highlighted differently when the tool is brought back to order.
Figure 3 Hierarchy view showing the structures containing the timing critical paths.

Step 3

A floorplan that makes the new Wasabi ISE project meet timing closure should be defined now. Figure 3 suggests either a floorplan covering only g2_pwm, a floorplan covering g2_datapath and g2_sequencer, or an even finer grained floorplan covering only the modules highlighted in Figure 3. Ad hoc experiments have shown that floorplanning for g2_pwm only (see Figure 4) produces the best timing result. In fact the Wasabi design meets the application target frequency of 98.3 MHz when applying this floorplan, but only with a very small margin: targeting 99 MHz fails.

A short status on the progress so far: timing constraints are still met, but now with a smaller timing margin than initially. It was requested that the floorplanned Wasabi ISE project should meet timing closure with a performance gain of 10 % compared to the initial Wasabi ISE project, and this hasn't been accomplished yet.

The PlanAhead Methodology Guide advises that timing critical paths shouldn't span multiple modules, and that all outputs should be registered for better timing performance. The critical and failing timing paths in Wasabi happen to span multiple modules because not all outputs are registered! Synthesizing Wasabi with the structural hierarchy maintained only worsens this, because the synthesis tool is then not allowed to perform logic optimization on these paths in their entirety. Fortunately the XST synthesis tool in ISE 9.2i has introduced a new switch, -netlist_hierarchy rebuilt, that allows a structural hierarchy to be dissolved and then rebuilt after synthesis and logic optimization. From the PlanAhead perspective this switch combines the best of the flattened netlist and the hierarchical netlist!
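As a rough illustration of the switch discussed above, the Python sketch below assembles the text of an XST script that requests a rebuilt netlist hierarchy. It is a sketch under assumptions: the file names, the top-level name and the part string are hypothetical placeholders, and the other options shown (-ifn, -ofn, -top, -p, -opt_mode) are ordinary XST run options, not the actual settings of the Wasabi project.

```python
def xst_script(project_file: str, top: str, part: str,
               output: str = "design.ngc",
               hierarchy: str = "rebuilt") -> str:
    """Return the text of an XST script with the chosen netlist hierarchy mode."""
    # XST documents two values for -netlist_hierarchy.
    if hierarchy not in ("as_optimized", "rebuilt"):
        raise ValueError(f"unsupported -netlist_hierarchy value: {hierarchy}")
    options = [
        ("-ifn", project_file),            # list of HDL source files
        ("-ofn", output),                  # synthesized netlist to write
        ("-top", top),                     # top-level module or entity
        ("-p", part),                      # target device
        ("-opt_mode", "Speed"),            # optimize for speed, as in the case
        ("-netlist_hierarchy", hierarchy),
    ]
    return "run\n" + "\n".join(f"{key} {value}" for key, value in options) + "\n"


# Hypothetical names; check the exact part string against your ISE installation.
script = xst_script("wasabi.prj", top="wasabi_top", part="xc4vlx60-10-ff672")
print(script)
```

The generated text could be saved to a file and handed to XST on the command line (for example with xst -ifn followed by the script file name), or the same option can be set through the synthesis properties in the ISE project.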
Figure 4 Floorplan applied for the g2_pwm module in Step 3.

Step 4

The Wasabi RTL code is re-synthesized with -netlist_hierarchy rebuilt applied and processed using the floorplan in Figure 4. The application timing requirement of 98.3 MHz is easily met; in fact the new netlist and the floorplan applied can meet a target of 112.4 MHz. Comparing this figure with the initial Wasabi ISE project, which met 104.2 MHz, reveals a performance gain of 7.9 %. This result is below the 10 % goal set by the initial success criteria, but the very simple floorplan suggests that the potential for significantly better performance is definitely present if more time is put into elaboration.

Multiple floorplanning iterations have resulted in the floorplan seen in Figure 5. This floorplan includes only g2_datapath, g2_pulse_generator_ch0 and g2_pulse_generator_ch1; the latter two modules are found right below g2_pwm in Figure 3. An implementation with the new floorplan results in a clock speed of 117.6 MHz on the timing critical clock network. Comparison with the initial Wasabi ISE project now reveals a performance gain of 12.9 %, which is well within the initial requirements.

Step 5

Evidence for criterion 2 is established by introducing a change in the RTL unrelated to the timing critical g2_pwm module. A Wasabi top level input pin is driven by a noisy switch and a pull-up resistor at board level. In the old implementation a 16-entry delay line is used for resampling and filtering the input; the delay line is now extended to 48 entries. When running the old Wasabi ISE project with the updated RTL and a target frequency of 98.3 MHz, timing closure is missed by 1.6 MHz. Applying the updated RTL to the floorplanned Wasabi project results in 117.6 MHz on the critical clock network. Hence the new floorplanned Wasabi project is also considered to conform to criterion 2.
Figure 5 The final floorplan as used in Step 4 and Step 5. This floorplan covers the modules g2_datapath, g2_pulse_generator_ch0 and g2_pulse_generator_ch1.

Wasabi Case Summary

PlanAhead and two months of learning and investigation turned a shaky Xilinx ISE 9.2i project into a project consistently meeting timing closure with a margin of about 19.6 % on the timing critical clock network (117.6 MHz against the required 98.3 MHz). Three criteria defining success in migrating to the PlanAhead flow were established and later shown to be met.

Table 1 summarizes the builds and the timing results achieved during the Wasabi case. Note that peak frequencies differ between PlanAhead runs and ISE runs, even though the netlist and the tool parameters applied are the same. The difference is a side effect of PlanAhead translating the Xilinx XST netlist from NGC file format into EDIF file format before processing. This translation doesn't impose any optimizations, but it probably gives the place and route tool a different starting point, which causes it to produce a different FPGA layout and hence a different peak operating frequency.

RTL              Netlist            Floorplan   Peak Frequency     Peak Frequency
                                                PlanAhead 10.1i    ISE 9.2i
Wasabi           Flattened          -           104.2 MHz          104.2 MHz
Wasabi           Hierarchical       Figure 4    98.3 MHz           -
Wasabi           Hierarchy rebuilt  -           103.1 MHz          -
Wasabi           Hierarchy rebuilt  Figure 4    115 MHz            114.2 MHz
Wasabi           Hierarchy rebuilt  Figure 5    117.6 MHz          116.3 MHz
Modified Wasabi  Flattened          -           99 MHz             97.1 MHz
Modified Wasabi  Hierarchical       Figure 5    117.6 MHz          117.6 MHz

Table 1 Builds and timing results from the Wasabi case. Peak frequency scores refer to the timing critical clock network only.
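The percentages quoted in the case are simple ratios of peak frequencies. The short Python sketch below shows the arithmetic, using only frequencies stated in the article (the 104.2 MHz baseline, the 98.3 MHz requirement, and the 112.4 MHz and 117.6 MHz results); the function names are illustrative choices, not part of any tool.

```python
def gain_percent(baseline_mhz: float, new_mhz: float) -> float:
    """Relative frequency gain of new_mhz over baseline_mhz, in percent."""
    return (new_mhz / baseline_mhz - 1.0) * 100.0


def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


REQUIRED_MHZ = 98.3    # application target frequency
BASELINE_MHZ = 104.2   # flattened netlist, no floorplan
RESULTS_MHZ = {
    "hierarchy rebuilt, Figure 4 floorplan (Step 4)": 112.4,
    "hierarchy rebuilt, Figure 5 floorplan": 117.6,
}

for label, mhz in RESULTS_MHZ.items():
    print(f"{label}: {mhz} MHz, period {period_ns(mhz):.2f} ns, "
          f"gain over baseline {gain_percent(BASELINE_MHZ, mhz):.1f} %, "
          f"margin over requirement {gain_percent(REQUIRED_MHZ, mhz):.1f} %")
```

Expressing the results as clock periods as well makes it easier to relate them to the slack values in a timing report, which are given in nanoseconds.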
Conclusion

Days and weeks are easily spent turning an FPGA project that fails timing closure into a success. Often this stage of a project is the dark horse in the project time schedule. Facing timing closure problems late in a project leaves the project team with the choice of rewriting verified RTL or trying to tune the tool suite parameters. The first choice is obviously not attractive, unless hopelessly long logic delays prove to be the problem. The second is often the preferred one, at least for a start. However, much time can be spent here in vain, especially if the RTL is the true cause of the problem.

There is a growing gap between FPGA performance and tool suite performance, especially when considering more complex designs for sub-90 nm FPGA types. The best suggestion at the moment for bridging at least part of this gap is floorplanning. And apparently floorplanning can be the life jacket that saves valuable time for the FPGA project team struggling with timing problems. The downside is that some expertise must go along with the floorplanning tool; starting from scratch, it can take weeks to get acquainted with the tool and to find the floorplan that suits the project requirements.

PlanAhead was the floorplanning tool applied in this article. Actually PlanAhead is both a floorplanning and a design analysis tool, allowing the user to get to know a new design more quickly and to analyze the consequences of taking different floorplanning approaches. More importantly, floorplanning made a significant difference to the Wasabi project, or rather it can, if one remembers to synthesize with -netlist_hierarchy rebuilt set. The case showed that a performance gain of 12.9 % on the timing critical clock network was achievable.

A final remark is a suggestion for designers using Xilinx products: learn the PlanAhead Lite tool before the next FPGA project deadline.
Methods for incorporating PlanAhead (Lite) early in the design process are described in Xilinx literature, enabling the project team to expose and even avoid timing pitfalls early in the project.

ADD-Lab at the Danish Technological Institute offers the facilities and expertise to prepare FPGA project teams for their next project using PlanAhead Lite, or to provide the guidance needed to make a failing project meet timing closure.