ARM Cortex-A9 MPCore Multicore Processor: Hierarchical Implementation with IC Compiler
DAC 2008
Philip Watson, Implementation Environment Program Manager, ARM Ltd

Background - Who Are We?
- Processor Division, Cores Implementation, ARM-India.
- This team is actively involved in processor development benchmarking.
- The team has been working alongside the development of the microarchitecture of the ARM Cortex-A9 processor since early development and test.
- The outcome of this effort is to showcase:
  - Power consumption
  - Performance
  - Area
- The effort is focused on making the Cortex-A9 processor core a deployable embedded solution.

Partnership Through the Design Chain
- The RM (Reference Methodology) ties all this together, piloting the route from RTL to silicon.
- The CPU is at the heart of the system-on-chip.
- We work with major EDA companies to ensure our IP works seamlessly.
- We partner with silicon foundries to provide diversity of SoC implementation and manufacturing choice.
- EDA tools provide the environment to exploit this IP.
- SoCs require high-performance fabric and quality physical IP.
[Diagram labels: Processors; Reference Methodology; Fabric & EDA Tools; Physical IP; Mutual Customers]

Cortex-A9 MPCore Multicore Solutions
The relative performance and power range of an ARM processor enabled by its ARM Physical IP.
[Chart: axes in MHz and mW, three platforms]
- Mainstream Platform
- Performance Platform: 15% CPU performance boost
- Density Optimized Platform: 15% lower power, higher density

Challenges with Cortex-A9 MPCore
- Implementation run time with all EDA tools is a key challenge for design closure, particularly with scalable-performance processor designs.
- Iteration time increases as the design size increases.
- Iteration time determines how quickly we can turn around floorplan changes, tailor optimizations, debug constraints, and act on design feedback. This is key to converging results.
[Chart: gate count and run time for A9 MP 1x, 2x, and 4x with Neon; vertical axis 0.0 to 6.0]

Challenges with Cortex-A9 MPCore
Implementation of 1 CPU vs 4 CPU Cortex-A9 with flat flow

Configuration          | 1CPU, 1 Neon, 32K D$, 32K I$, 32 interrupts | 4CPU, 4 Neon, 32K D$, 32K I$, 32 interrupts
Process Technology     | TSMC CLN65LP                                | TSMC CLN65LP
Standard Cell Library  | 12Track Nominal VT                          | 12Track Nominal VT
Memory Library         | Optimized fast cache instances              | Optimized fast cache instances

The 4 CPU solution gives:
- A significant increase in run time
- Potentially some drop in performance (frequency) compared to a 1 CPU implementation

Hierarchical Implementation with IC Compiler
For faster TTR

Cortex-A9 MPCore blocks:
- Cortex-A9 cpu0: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)
- Cortex-A9 cpu1: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)
- Cortex-A9 cpu2: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)
- Cortex-A9 cpu3: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)
- Cortex-A9 top only: Placement (A Hrs), CTS (B Hrs), Routing (C Hrs)

Total Run Time = X + Y + Z + C Hrs

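The run time expression on this slide can be sketched as a small model. This is an illustrative sketch, not part of the ARM or Synopsys flow. It assumes the four CPU blocks are implemented in parallel, that top-level placement and CTS overlap with the block runs, and that top-level routing starts only once the slowest of these has finished. The function name and the hour values are hypothetical.

```python
def hierarchical_run_time(block_stage_hours, top_place_cts_hours, top_route_hours):
    """Estimate wall-clock hours for a hierarchical run.

    block_stage_hours: one (placement, cts, routing) tuple per CPU block.
    top_place_cts_hours: (A, B) for top-level placement and CTS.
    top_route_hours: C, top-level routing.
    Assumption: blocks run in parallel, top-level placement and CTS overlap
    with them, and top-level routing waits for the slowest of these.
    """
    slowest_block = max(sum(stages) for stages in block_stage_hours)
    top_prep = sum(top_place_cts_hours)
    return max(slowest_block, top_prep) + top_route_hours


# Hypothetical values for X, Y, Z, A, B, C (hours), for illustration only.
X, Y, Z, A, B, C = 4.0, 2.0, 6.0, 1.0, 0.5, 3.0
blocks = [(X, Y, Z)] * 4  # cpu0..cpu3 are identical
total = hierarchical_run_time(blocks, (A, B), C)
assert total == X + Y + Z + C  # holds whenever A + B <= X + Y + Z
```

Under these assumptions the total is independent of the number of CPU blocks, which is the point of the slide: adding cores adds parallel block runs rather than serial run time.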
Hierarchical Implementation with IC Compiler
Steps involved:
- SDC & ScanDef
- Floorplanning
- Create Physical Partition
- Partition Aware Place
- Power Network Synthesis
- Power Network Analysis
- In-Place Optimization
- Clock Planning
- Pin Assignment
- Budgeting
- Commit Blocks

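One way to picture the design-planning sequence is as an ordered list driven by a small runner. The sketch below is hypothetical and does not call IC Compiler. It assumes the slide lists the steps in flow order, and the `run_flow` function and the handler callables are illustrative names only.

```python
# Step names are taken from the slide; the order is assumed to be flow order.
PLANNING_STEPS = [
    "SDC & ScanDef",
    "Floorplanning",
    "Create Physical Partition",
    "Partition Aware Place",
    "Power Network Synthesis",
    "Power Network Analysis",
    "In-Place Optimization",
    "Clock Planning",
    "Pin Assignment",
    "Budgeting",
    "Commit Blocks",
]


def run_flow(steps, handlers):
    """Run each step's handler in order and stop at the first failure.

    handlers maps a step name to a callable that returns True on success.
    Returns the list of steps that completed.
    """
    completed = []
    for step in steps:
        handler = handlers.get(step)
        if handler is None or not handler():
            break
        completed.append(step)
    return completed


# Usage: every step succeeds except a simulated pin-assignment failure.
handlers = {step: (lambda: True) for step in PLANNING_STEPS}
handlers["Pin Assignment"] = lambda: False
done = run_flow(PLANNING_STEPS, handlers)
print(done[-1])  # last step that completed before the failure
```

Stopping at the first failed step keeps an iteration short: a planning problem is reported before budgeting and block commit are attempted.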
Cortex-A9 MPCore Multicore Solutions
The relative performance and power range of an ARM processor enabled by its Artisan physical IP.
[Chart: axes in MHz and mW, three platforms, with the Cortex-A9 Hierarchical Flow (with IC Compiler) marked]
- Mainstream Platform
- Performance Platform: 15% CPU performance boost
- Density Optimized Platform: 15% lower power, higher density

Hierarchical Implementation with IC Compiler
Results: Implementation of 1 CPU Cortex-A9 flat vs 4 CPU Cortex-A9 hierarchical flow

Configuration          | 4CPU, 4 Neon, 32K D$, 32K I$, 32 interrupts | 4CPU, 4 Neon, 32K D$, 32K I$, 32 interrupts
Process Technology     | TSMC CLN65LP                                | TSMC CLN65LP
Standard Cell Library  | 12Track Nominal VT                          | 12Track Nominal VT
Memory Library         | Optimized fast cache instances              | Optimized fast cache instances
Implementation flow    | Flat                                        | Hierarchical

The 4 CPU implemented with a hierarchical flow gives:
- Comparable QoR in performance (frequency)
- 25% additional run time when compared to a 1CPU flat implementation
[Chart: gate count and hierarchical run time for A9 MP 1x, 2x, and 4x with Neon; vertical axis 0.0 to 4.5]

Next Steps
- Efficient handling of Multiply Instantiated Modules (MIM) for symmetric cores

Summary
- Hierarchical flow delivers much faster iteration time with no loss of QoR
- Simple and effective strategy to implement a multicore processor
- Reduction in high memory cluster requirements
- Lends itself very well to low power partitioning
  - Advanced low power management such as State Retention Power Gating
  - Leakage mitigation by power shutdown if the hardware is not being utilized
- Easily deployable for the partner base (estimated by end of 2008)
  - In an ARM-Synopsys iRM (implementation Reference Methodology) with:
    - Floorplan
    - Tcl Scripts (complete flow from RTL to GDSII)
    - Physical IP Libraries
    - ARM Documentation - Core Signoff Guide
  - Providing an out-of-box solution from ARM