Run Blazingly Fast Algorithms with Cortex-M7 Tightly Coupled Memories Authors: Jacko Wilbrink, Senior Director, Product Marketing, Atmel Lionel Perdigon, Product Marketing Manager, Atmel The ARM Cortex -M series is undisputedly the dominant processor family for embedded systems that require low power and cost effectiveness with moderate performance. The earliest implementations of the Cortex-M processor were on the small end of the spectrum: the Cortex-M0 for lowest cost, the Cortex-M0+ for best energy efficiency, the Cortex-M3 for best power/performance balance, and the Cortex-M4 for applications requiring digital signal processing (DSP) capabilities. Now the first implementations of the highest-performance member the Cortex-M7 are emerging. This processor includes capabilities required for more complex algorithms running at higher speeds while still maintaining modest energy consumption and cost profiles. In a 65nm embedded Flash process device, the Cortex-M7 can achieve a 1500 CoreMark score while running at 300 MHz, and its DSP performance is double that available in the Cortex-M4. A double-precision floating-point unit and a double-issue instruction pipeline further position the Cortex-M7 for speed.
Thanks to these features, the Cortex-M7 becomes a natural choice when serious algorithms for audio and video are needed for rich audio and visual capabilities in low-power, low-cost applications. The Cortex-M7 has found a home in the image processing of drones, and the audio, voice control, object recognition, and complex sensor fusion of automotive and higher-end Internet of Things (IoT) sensing. Tightly coupled Memories for Fast, Deterministic Code Execution It is often the case that such algorithms as FIR, FFT or Biquad need to run as quickly and deterministically as possible for real-time response or seamless audio and video performance. To meet system performance and latency goals, such routines should operate at the full clock rate, unencumbered by cache misses, interrupts, context swaps, and other execution surprises that work against deterministic timing. This means that you cannot simply run such code out of standard memory, and you cannot rely on standard memory for storing the data being processed. Typical memories like Flash are too slow to keep up with the Cortex-M7 core, and they require caching, causing long delays when a cache miss occurs. So the Cortex-M7 architecture provides a way to bypass the standard execution mechanism using tightly coupled memories, or TCM. The ARM core provides the TCM interface, and it is intended for single-cycle random TCM access running at the full processor speed. A 64-bit instruction memory port (ITCM) supports the dual-issue processor architecture, and two data ports (DTCM) enable two simultaneous, parallel 32-bit data accesses. The architecture does not, however, specify what type of memory or how much memory should be provided. Those decisions are left to designers implementing the Cortex-M7 core in a microcontroller (MCU) as opportunities for innovation and differentiation. Figure 1. The TCM interface provides a single 64-bit instruction port and two 32-bit data ports. 2
So, for instance, it is possible to attach embedded Flash memory to the TCM interface. But Flash cannot run at the processor clock rate, so caching would be required threatening the determinism that the TCM is intended to provide. Code shadowing in DRAM is theoretically possible, but it would be cost-prohibitive. The only type of memory that is fast enough for direct, uncached access is. s provide a number of benefits: an block can be connected directly to the TCM interface, technology is easily embedded on-chip, and s permit random accesses at the speed of the processor. The only drawback of is the cost per bit, which is higher than that of Flash and DRAM. So it is critical to keep the size of the TCM limited, even if 65- or 40-nanometer technology allows a fair amount of integration at acceptable cost. The TCM might be loaded from a number of sources, and these are also not specified by the architecture. So MCU implementations may vary in how these memories can be loaded. A direct-memory access (DMA) engine would certainly be involved, but whether it is a single DMA or one of several loading data from various streams like video or USB is left to the MCU designer. The code that will be executed out of the TCM is identified by the system programmer. When preparing a software build, the programmer will need to identify which code segments and data blocks should be allocated to the TCM. This is typically done by embedding pragmas into the software and by applying linker settings so that the build places code and memory appropriately. Multiple Ports for Faster Access While TCM provides a straightforward mechanism for quick, tight execution of key routines, it is often useful to have some amount of system as a general-purpose, high-speed memory for the processor and peripherals to use by means of a DMA. While such memory would logically be separate from the TCM, it is possible that a single block could share TCM and general-purpose duties while offering the possibility to tune the split between TCM and system RAM for each use case. Peripheral data buffers implemented in general-purpose system are typically loaded by DMA transfers from system peripherals. The ability to load from a number of possible sources, however, raises the possibility of unnecessary delays and conflicts by multiple DMAs trying to access the memory at the same time. In a typical example, we might have three different entities vying for DMA access to the : the processor (64-bit access, requesting 128 bits for this example) and two separate peripheral DMA requests (DMA0 and, 32-bit access each). Let us assume that the processor has priority over the DMAs and that DMA0 has priority over. This example is illustrated in Figure 2. The left shows the performance of a 64-bit wide single-banked memory; the right shows the same transactions in the same memory organized as four 32-bit banks. 3
One 64-bit bank Four 32-bit banks Cycle 1 Processor Processor Cycle 2 Processor Processor DMA0 Cycle 3 DMA0 DMA0 Cycle 4 DMA0 DMA0 Cycle 5 Processor DMA0 Processor Cycle 6 DMA0 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Cycle 11 DMA0 Figure 2. By organizing the into banks, multiple DMA bursts can occur simultaneously with minimal latency. With the single-bank memory, the processor would complete its access in two cycles, and then, in the next cycle, the DMA0 burst would start. would be blocked until DMA0 was finished. Any higher-priority processor requests would interrupt the DMA loads, adding further delays. As shown above, an instruction fetch in Cycle 5 adds another cycle of delay to both DMA operations. More efficient operation can be achieved by organizing the memory into banks with interleaved addresses, each of which can be independently accessed. In this manner, the DMA0 burst would begin when the processor was done with the first bank simultaneous to the processor accessing the higher banks. would no longer have to wait until the entire DMA0 transfer was complete; it could start one cycle after DMA0, when DMA0 was finished with the first bank. Having higher priority, the processor sees no latency difference between the two arrangements. But DMA0 s latency improves from two cycles to one, and s latency goes from seven cycles down to two. An instruction fetch during the DMA operations may even occur in parallel with the DMA operations, resulting in no additional latency. Lower latency improves performance, but it may also mean that you can make peripheral FIFOs smaller. Given these possible useful memory arrangements, it is left to the MCU designer to figure out how Flash memory, system, and TCM work together. Are they all separate? Do they occupy similar space? On- or off-chip? These decisions will differentiate competing Cortex-M7 implementations. 4
A Specific Cortex- M7 TCM Example One example of a Cortex-M7 MCU is implemented in the Atmel SMART SAM S70, SAM E70, and SAM V70/1 families. These MCUs have an organized into four banks that can be used as general (which Atmel calls System ) and for TCM. In the portion configured as TCM, instructions can run with deterministic, single-cycle access at the full core speed of 300 MHz, with no delays for fetches or caching. Similarly, data located in TCM can be accessed without caching penalties. Figure 3. The Atmel SMART SAM S70/E70 family uses that can serve as TCM and/or as System for high flexibility and utilization. The Instruction TCM (ITCM) and data TCM (DTCM), which must be sized alike, can be 32, 64, or 128 KB each. This sizing is a build-time setting that cannot be changed on the fly. The full block can be up to 384 KB in size. That means that both System and TCM can be used at the same time. The amount of memory configured as TCM reduces the amount of memory available for use as System RAM; for example, if configured for 128 KB of ITCM and 128 KB of DTCM, then the amount of System RAM is reduced to 128 KB. The System runs up to 150 MHz, or half the processor speed, and it is organized into four banks, reducing DMA latency. DMA access priorities, which can be fixed or round robin, are the same for system as they are for the internal bus matrix. 5
A register bit activates or deactivates the TCM, so the target of a given access will depend on TCM activation. A device is always booted with ITCM deactivated, allowing code to be loaded from the boot memory. After booting, the TCM can be activated as needed. The ITCM address space overlaps the boot memory. If TCM is deactivated, then instruction accesses go to the boot memory. If TCM is activated, then accesses within the TCM address space come from the TCM; accesses above the TCM space come out of Flash. It is possible to protect the various spaces with the MPU to ensure that there are no inadvertent accesses to the wrong memory. Cortex-M7 Is the Next ARM Wave The Cortex-M7 is coming into its own, taking on much more aggressive algorithms while still remaining gentle with power and cost. Tightly coupled memories, one of the new Cortex-M7 features, make possible the fast execution of tight code with deterministic execution at the full processor clock rate. Exactly how this performance will be realized on a given MCU will depend on system choices made by the designers integrating the Cortex-M7 into the MCU. The highest speeds will be afforded by using to implement the TCM. This can be shared with a more general system capability, and, by organizing the in banks, DMA latency can be minimized. These architectural features, while relatively simple, provide a dramatic boost to the performance available for implementing critical embedded functions in personal, industrial, and automotive devices. Atmel Corporation 1600 Technology Drive, San Jose, CA 95110 USA T: (+1)(408) 441.0311 F: (+1)(408) 436.4200 www.atmel.com 2015 Atmel Corporation. / Rev.: Atmel-45151A-Fast-Algorithms-with-Cortex-M7_Article_US_072015 Atmel, Atmel logo and combinations thereof, Enabling Unlimited Possibilities, and others are registered trademarks or trademarks of Atmel Corporation or its subsidiaries. ARM, ARM Connected logo and others are the registered trademarks or trademarks of ARM Ltd. Other terms and product names may be the trademarks of others. Disclaimer: The information in this document is provided in connection with Atmel products. No license, express or implied, by estoppel or otherwise, to any intellectual property right is granted by this document or in connection with the sale of Atmel products. EXCEPT AS SET FORTH IN THE ATMEL TERMS AND CONDITIONS OF SALES LOCATED ON THE ATMEL WEBSITE, ATMEL ASSUMES NO LIABILITY WHATSOEVER AND DISCLAIMS ANY EXPRESS, IMPLIED OR STATUTORY WARRANTY RELATING TO ITS PRODUCTS INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. IN NO EVENT SHALL ATMEL BE LIABLE FOR ANY DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE, SPECIAL OR INCIDENTAL DAMAGES (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS AND PROFITS, BUSINESS INTERRUPTION, OR LOSS OF INFORMATION) ARISING OUT OF THE USE OR INABILITY TO USE THIS DOCUMENT, EVEN IF ATMEL HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Atmel makes no representations or warranties with respect to the accuracy or completeness of the contents of this document and reserves the right to make changes to specifications and products descriptions at any time without notice. Atmel does not make any commitment to update the information contained herein. Unless specifically provided otherwise, Atmel products are not suitable for, and shall not be used in, automotive applications. Atmel products are not intended, authorized, or warranted for use as components in applications intended to support or sustain life. 6