Run Blazingly Fast Algorithms with Cortex-M7 Tightly Coupled Memories

Similar documents
USER GUIDE EDBG. Description

AVR151: Setup and Use of the SPI. Introduction. Features. Atmel AVR 8-bit Microcontroller APPLICATION NOTE

AT88CK490 Evaluation Kit

APPLICATION NOTE. Secure Personalization with Transport Key Authentication. ATSHA204A, ATECC108A, and ATECC508A. Introduction.

Atmel AVR4921: ASF - USB Device Stack Differences between ASF V1 and V2. 8-bit Atmel Microcontrollers. Application Note. Features.

Designing Feature-Rich User Interfaces for Home and Industrial Controllers

AVR1309: Using the XMEGA SPI. 8-bit Microcontrollers. Application Note. Features. 1 Introduction SCK MOSI MISO SS

APPLICATION NOTE. AT07175: SAM-BA Bootloader for SAM D21. Atmel SAM D21. Introduction. Features

SMARTCARD XPRO. Preface. SMART ARM-based Microcontrollers USER GUIDE

Atmel AVR4920: ASF - USB Device Stack - Compliance and Performance Figures. Atmel Microcontrollers. Application Note. Features.

APPLICATION NOTE. Atmel AVR134: Real Time Clock (RTC) Using the Asynchronous Timer. Atmel AVR 8-bit Microcontroller. Introduction.

Application Note. Atmel CryptoAuthentication Product Uses. Atmel ATSHA204. Abstract. Overview

CryptoAuth Xplained Pro

More Secure, Less Costly IoT Edge Node Security Provisioning

How To Use An Atmel Atmel Avr32848 Demo For Android (32Bit) With A Microcontroller (32B) And An Android Accessory (32D) On A Microcontroller (32Gb) On An Android Phone Or

Atmel Power Line Communications. Solutions for the Smart Grid

AVR32701: AVR32AP7 USB Performance. 32-bit Microcontrollers. Application Note. Features. 1 Introduction

APPLICATION NOTE. AT16268: JD Smart Cloud Based Smart Plug Getting. Started Guide ATSAMW25. Introduction. Features

AVR106: C Functions for Reading and Writing to Flash Memory. Introduction. Features. AVR 8-bit Microcontrollers APPLICATION NOTE

AVR353: Voltage Reference Calibration and Voltage ADC Usage. 8-bit Microcontrollers. Application Note. Features. 1 Introduction

AVR115: Data Logging with Atmel File System on ATmega32U4. Microcontrollers. Application Note. 1 Introduction. Atmel

Atmel SMART ARM Core-based Embedded Microprocessors

AVR1301: Using the XMEGA DAC. 8-bit Microcontrollers. Application Note. Features. 1 Introduction

APPLICATION NOTE. Authentication Counting. Atmel CryptoAuthentication. Features. Introduction

AVR1900: Getting started with ATxmega128A1 on STK bit Microcontrollers. Application Note. 1 Introduction

APPLICATION NOTE. AT12405: Low Power Sensor Design with PTC. Atmel MCU Integrated Touch. Introduction

AVR32138: How to optimize the ADC usage on AT32UC3A0/1, AT32UC3A3 and AT32UC3B0/1 series. 32-bit Microcontrollers. Application Note.

Using CryptoMemory in Full I 2 C Compliant Mode. Using CryptoMemory in Full I 2 C Compliant Mode AT88SC0104CA AT88SC0204CA AT88SC0404CA AT88SC0808CA

Atmel AVR4903: ASF - USB Device HID Mouse Application. Atmel Microcontrollers. Application Note. Features. 1 Introduction

Application Note. Atmel ATSHA204 Authentication Modes. Prerequisites. Overview. Introduction

AVR1318: Using the XMEGA built-in AES accelerator. 8-bit Microcontrollers. Application Note. Features. 1 Introduction

AT11805: Capacitive Touch Long Slider Design with PTC. Introduction. Features. Touch Solutions APPLICATION NOTE

AVR305: Half Duplex Compact Software UART. 8-bit Microcontrollers. Application Note. Features. 1 Introduction

8-bit. Application Note. Microcontrollers. AVR282: USB Firmware Upgrade for AT90USB

General Porting Considerations. Memory EEPROM XRAM

AVR1510: Xplain training - XMEGA USART. 8-bit Microcontrollers. Application Note. Prerequisites. 1 Introduction

AVR1922: Xplain Board Controller Firmware. 8-bit Microcontrollers. Application Note. Features. 1 Introduction

Introducing a platform to facilitate reliable and highly productive embedded developments

APPLICATION NOTE. Atmel AT04389: Connecting SAMD20E to the AT86RF233 Transceiver. Atmel SAMD20. Description. Features

32-bit AVR UC3 Microcontrollers. 32-bit AtmelAVR Application Note. AVR32769: How to Compile the standalone AVR32 Software Framework in AVR32 Studio V2

Application Note. 8-bit Microcontrollers. AVR270: USB Mouse Demonstration

AVR319: Using the USI module for SPI communication. 8-bit Microcontrollers. Application Note. Features. Introduction

APPLICATION NOTE. Atmel LF-RFID Kits Overview. Atmel LF-RFID Kit. LF-RFID Kit Introduction

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-17: Memory organisation, and types of memory

AVR1600: Using the XMEGA Quadrature Decoder. 8-bit Microcontrollers. Application Note. Features. 1 Introduction. Sensors

USER GUIDE. ZigBit USB Stick User Guide. Introduction

AVR033: Getting Started with the CodeVisionAVR C Compiler. 8-bit Microcontrollers. Application Note. Features. 1 Introduction

AT91SAM ARM-based Flash MCU. Application Note

AVR32788: AVR 32 How to use the SSC in I2S mode. 32-bit Microcontrollers. Application Note. Features. 1 Introduction

AVR317: Using the Master SPI Mode of the USART module. 8-bit Microcontrollers. Application Note. Features. Introduction

Dell Statistica. Statistica Document Management System (SDMS) Requirements

APPLICATION NOTE. Atmel AVR443: Sensor-based Control of Three Phase Brushless DC Motor. Atmel AVR 8-bit Microcontrollers. Features.

AVR287: USB Host HID and Mass Storage Demonstration. 8-bit Microcontrollers. Application Note. Features. 1 Introduction

The new 32-bit MSP432 MCU platform from Texas

Application Note. C51 Bootloaders. C51 General Information about Bootloader and In System Programming. Overview. Abreviations

Capacitive Touch Technology Opens the Door to a New Generation of Automotive User Interfaces

Migrating Application Code from ARM Cortex-M4 to Cortex-M7 Processors

Architectures, Processors, and Devices

AVR127: Understanding ADC Parameters. Introduction. Features. Atmel 8-bit and 32-bit Microcontrollers APPLICATION NOTE

APPLICATION NOTE Atmel AT02509: In House Unit with Bluetooth Low Energy Module Hardware User Guide 8-bit Atmel Microcontroller Features Description

AN11008 Flash based non-volatile storage

Quest vworkspace Virtual Desktop Extensions for Linux

Atmel AVR1017: XMEGA - USB Hardware Design Recommendations. 8-bit Atmel Microcontrollers. Application Note. Features.

Intel 965 Express Chipset Family Memory Technology and Configuration Guide

Switch Fabric Implementation Using Shared Memory

Which ARM Cortex Core Is Right for Your Application: A, R or M?

Dell One Identity Cloud Access Manager How to Configure vworkspace Integration

Bandwidth Calculations for SA-1100 Processor LCD Displays

How To Design An Ism Band Antenna For 915Mhz/2.4Ghz Ism Bands On A Pbbb (Bcm) Board

SAMA5D2. Scope. Reference Documents. Atmel SMART ARM-based MPU ERRATA

Intel X38 Express Chipset Memory Technology and Configuration Guide

AVR125: ADC of tinyavr in Single Ended Mode. 8-bit Microcontrollers. Application Note. Features. 1 Introduction

QT1 Xplained Pro. Preface. Atmel QTouch USER GUIDE

AVR2006: Design and characterization of the Radio Controller Board's 2.4GHz PCB Antenna. Application Note. Features.

AN3998 Application note

The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage

Application Note. 8-bit Microcontrollers. AVR272: USB CDC Demonstration UART to USB Bridge

AVR131: Using the AVR s High-speed PWM. Introduction. Features. AVR 8-bit Microcontrollers APPLICATION NOTE

Family 10h AMD Phenom II Processor Product Data Sheet

Dell One Identity Manager Scalability and Performance

HG2 Series Product Brief

How to Deploy Models using Statistica SVB Nodes

Quest SQL Optimizer 6.5. for SQL Server. Installation Guide

Application Note. 1. Introduction. 2. Associated Documentation. 3. Gigabit Ethernet Implementation on SAMA5D3 Series. AT91SAM ARM-based Embedded MPU

Dell Spotlight on Active Directory Server Health Wizard Configuration Guide

Intel Q35/Q33, G35/G33/G31, P35/P31 Express Chipset Memory Technology and Configuration Guide

21152 PCI-to-PCI Bridge

formerly Help Desk Authority Quest Free Network Tools User Manual

8051 Flash Microcontroller. Application Note. A Digital Thermometer Using the Atmel AT89LP2052 Microcontroller

Family 12h AMD Athlon II Processor Product Data Sheet

Using the RS232 serial evaluation boards on a USB port

AT91 ARM Thumb Microcontrollers. AT91SAM CAN Bootloader. AT91SAM CAN Bootloader User Notes. 1. Description. 2. Key Features

Developing Embedded Applications with ARM Cortex TM -M1 Processors in Actel IGLOO and Fusion FPGAs. White Paper

AN LPC24XX external memory bus example. Document information

TECHNOLOGY BRIEF. Compaq RAID on a Chip Technology EXECUTIVE SUMMARY CONTENTS

APPLICATION NOTE. RF System Architecture Considerations ATAN0014. Description

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

AVR1321: Using the Atmel AVR XMEGA 32-bit Real Time Counter and Battery Backup System. 8-bit Microcontrollers. Application Note.

AVR030: Getting Started with IAR Embedded Workbench for Atmel AVR. 8-bit Microcontrollers. Application Note. Features.

Transcription:

Run Blazingly Fast Algorithms with Cortex-M7 Tightly Coupled Memories Authors: Jacko Wilbrink, Senior Director, Product Marketing, Atmel Lionel Perdigon, Product Marketing Manager, Atmel The ARM Cortex -M series is undisputedly the dominant processor family for embedded systems that require low power and cost effectiveness with moderate performance. The earliest implementations of the Cortex-M processor were on the small end of the spectrum: the Cortex-M0 for lowest cost, the Cortex-M0+ for best energy efficiency, the Cortex-M3 for best power/performance balance, and the Cortex-M4 for applications requiring digital signal processing (DSP) capabilities. Now the first implementations of the highest-performance member the Cortex-M7 are emerging. This processor includes capabilities required for more complex algorithms running at higher speeds while still maintaining modest energy consumption and cost profiles. In a 65nm embedded Flash process device, the Cortex-M7 can achieve a 1500 CoreMark score while running at 300 MHz, and its DSP performance is double that available in the Cortex-M4. A double-precision floating-point unit and a double-issue instruction pipeline further position the Cortex-M7 for speed.

Thanks to these features, the Cortex-M7 becomes a natural choice when serious algorithms for audio and video are needed for rich audio and visual capabilities in low-power, low-cost applications. The Cortex-M7 has found a home in the image processing of drones, and the audio, voice control, object recognition, and complex sensor fusion of automotive and higher-end Internet of Things (IoT) sensing. Tightly coupled Memories for Fast, Deterministic Code Execution It is often the case that such algorithms as FIR, FFT or Biquad need to run as quickly and deterministically as possible for real-time response or seamless audio and video performance. To meet system performance and latency goals, such routines should operate at the full clock rate, unencumbered by cache misses, interrupts, context swaps, and other execution surprises that work against deterministic timing. This means that you cannot simply run such code out of standard memory, and you cannot rely on standard memory for storing the data being processed. Typical memories like Flash are too slow to keep up with the Cortex-M7 core, and they require caching, causing long delays when a cache miss occurs. So the Cortex-M7 architecture provides a way to bypass the standard execution mechanism using tightly coupled memories, or TCM. The ARM core provides the TCM interface, and it is intended for single-cycle random TCM access running at the full processor speed. A 64-bit instruction memory port (ITCM) supports the dual-issue processor architecture, and two data ports (DTCM) enable two simultaneous, parallel 32-bit data accesses. The architecture does not, however, specify what type of memory or how much memory should be provided. Those decisions are left to designers implementing the Cortex-M7 core in a microcontroller (MCU) as opportunities for innovation and differentiation. Figure 1. The TCM interface provides a single 64-bit instruction port and two 32-bit data ports. 2

So, for instance, it is possible to attach embedded Flash memory to the TCM interface. But Flash cannot run at the processor clock rate, so caching would be required threatening the determinism that the TCM is intended to provide. Code shadowing in DRAM is theoretically possible, but it would be cost-prohibitive. The only type of memory that is fast enough for direct, uncached access is. s provide a number of benefits: an block can be connected directly to the TCM interface, technology is easily embedded on-chip, and s permit random accesses at the speed of the processor. The only drawback of is the cost per bit, which is higher than that of Flash and DRAM. So it is critical to keep the size of the TCM limited, even if 65- or 40-nanometer technology allows a fair amount of integration at acceptable cost. The TCM might be loaded from a number of sources, and these are also not specified by the architecture. So MCU implementations may vary in how these memories can be loaded. A direct-memory access (DMA) engine would certainly be involved, but whether it is a single DMA or one of several loading data from various streams like video or USB is left to the MCU designer. The code that will be executed out of the TCM is identified by the system programmer. When preparing a software build, the programmer will need to identify which code segments and data blocks should be allocated to the TCM. This is typically done by embedding pragmas into the software and by applying linker settings so that the build places code and memory appropriately. Multiple Ports for Faster Access While TCM provides a straightforward mechanism for quick, tight execution of key routines, it is often useful to have some amount of system as a general-purpose, high-speed memory for the processor and peripherals to use by means of a DMA. While such memory would logically be separate from the TCM, it is possible that a single block could share TCM and general-purpose duties while offering the possibility to tune the split between TCM and system RAM for each use case. Peripheral data buffers implemented in general-purpose system are typically loaded by DMA transfers from system peripherals. The ability to load from a number of possible sources, however, raises the possibility of unnecessary delays and conflicts by multiple DMAs trying to access the memory at the same time. In a typical example, we might have three different entities vying for DMA access to the : the processor (64-bit access, requesting 128 bits for this example) and two separate peripheral DMA requests (DMA0 and, 32-bit access each). Let us assume that the processor has priority over the DMAs and that DMA0 has priority over. This example is illustrated in Figure 2. The left shows the performance of a 64-bit wide single-banked memory; the right shows the same transactions in the same memory organized as four 32-bit banks. 3

One 64-bit bank Four 32-bit banks Cycle 1 Processor Processor Cycle 2 Processor Processor DMA0 Cycle 3 DMA0 DMA0 Cycle 4 DMA0 DMA0 Cycle 5 Processor DMA0 Processor Cycle 6 DMA0 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Cycle 11 DMA0 Figure 2. By organizing the into banks, multiple DMA bursts can occur simultaneously with minimal latency. With the single-bank memory, the processor would complete its access in two cycles, and then, in the next cycle, the DMA0 burst would start. would be blocked until DMA0 was finished. Any higher-priority processor requests would interrupt the DMA loads, adding further delays. As shown above, an instruction fetch in Cycle 5 adds another cycle of delay to both DMA operations. More efficient operation can be achieved by organizing the memory into banks with interleaved addresses, each of which can be independently accessed. In this manner, the DMA0 burst would begin when the processor was done with the first bank simultaneous to the processor accessing the higher banks. would no longer have to wait until the entire DMA0 transfer was complete; it could start one cycle after DMA0, when DMA0 was finished with the first bank. Having higher priority, the processor sees no latency difference between the two arrangements. But DMA0 s latency improves from two cycles to one, and s latency goes from seven cycles down to two. An instruction fetch during the DMA operations may even occur in parallel with the DMA operations, resulting in no additional latency. Lower latency improves performance, but it may also mean that you can make peripheral FIFOs smaller. Given these possible useful memory arrangements, it is left to the MCU designer to figure out how Flash memory, system, and TCM work together. Are they all separate? Do they occupy similar space? On- or off-chip? These decisions will differentiate competing Cortex-M7 implementations. 4

A Specific Cortex- M7 TCM Example One example of a Cortex-M7 MCU is implemented in the Atmel SMART SAM S70, SAM E70, and SAM V70/1 families. These MCUs have an organized into four banks that can be used as general (which Atmel calls System ) and for TCM. In the portion configured as TCM, instructions can run with deterministic, single-cycle access at the full core speed of 300 MHz, with no delays for fetches or caching. Similarly, data located in TCM can be accessed without caching penalties. Figure 3. The Atmel SMART SAM S70/E70 family uses that can serve as TCM and/or as System for high flexibility and utilization. The Instruction TCM (ITCM) and data TCM (DTCM), which must be sized alike, can be 32, 64, or 128 KB each. This sizing is a build-time setting that cannot be changed on the fly. The full block can be up to 384 KB in size. That means that both System and TCM can be used at the same time. The amount of memory configured as TCM reduces the amount of memory available for use as System RAM; for example, if configured for 128 KB of ITCM and 128 KB of DTCM, then the amount of System RAM is reduced to 128 KB. The System runs up to 150 MHz, or half the processor speed, and it is organized into four banks, reducing DMA latency. DMA access priorities, which can be fixed or round robin, are the same for system as they are for the internal bus matrix. 5

A register bit activates or deactivates the TCM, so the target of a given access will depend on TCM activation. A device is always booted with ITCM deactivated, allowing code to be loaded from the boot memory. After booting, the TCM can be activated as needed. The ITCM address space overlaps the boot memory. If TCM is deactivated, then instruction accesses go to the boot memory. If TCM is activated, then accesses within the TCM address space come from the TCM; accesses above the TCM space come out of Flash. It is possible to protect the various spaces with the MPU to ensure that there are no inadvertent accesses to the wrong memory. Cortex-M7 Is the Next ARM Wave The Cortex-M7 is coming into its own, taking on much more aggressive algorithms while still remaining gentle with power and cost. Tightly coupled memories, one of the new Cortex-M7 features, make possible the fast execution of tight code with deterministic execution at the full processor clock rate. Exactly how this performance will be realized on a given MCU will depend on system choices made by the designers integrating the Cortex-M7 into the MCU. The highest speeds will be afforded by using to implement the TCM. This can be shared with a more general system capability, and, by organizing the in banks, DMA latency can be minimized. These architectural features, while relatively simple, provide a dramatic boost to the performance available for implementing critical embedded functions in personal, industrial, and automotive devices. Atmel Corporation 1600 Technology Drive, San Jose, CA 95110 USA T: (+1)(408) 441.0311 F: (+1)(408) 436.4200 www.atmel.com 2015 Atmel Corporation. / Rev.: Atmel-45151A-Fast-Algorithms-with-Cortex-M7_Article_US_072015 Atmel, Atmel logo and combinations thereof, Enabling Unlimited Possibilities, and others are registered trademarks or trademarks of Atmel Corporation or its subsidiaries. ARM, ARM Connected logo and others are the registered trademarks or trademarks of ARM Ltd. Other terms and product names may be the trademarks of others. Disclaimer: The information in this document is provided in connection with Atmel products. No license, express or implied, by estoppel or otherwise, to any intellectual property right is granted by this document or in connection with the sale of Atmel products. EXCEPT AS SET FORTH IN THE ATMEL TERMS AND CONDITIONS OF SALES LOCATED ON THE ATMEL WEBSITE, ATMEL ASSUMES NO LIABILITY WHATSOEVER AND DISCLAIMS ANY EXPRESS, IMPLIED OR STATUTORY WARRANTY RELATING TO ITS PRODUCTS INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. IN NO EVENT SHALL ATMEL BE LIABLE FOR ANY DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE, SPECIAL OR INCIDENTAL DAMAGES (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS AND PROFITS, BUSINESS INTERRUPTION, OR LOSS OF INFORMATION) ARISING OUT OF THE USE OR INABILITY TO USE THIS DOCUMENT, EVEN IF ATMEL HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Atmel makes no representations or warranties with respect to the accuracy or completeness of the contents of this document and reserves the right to make changes to specifications and products descriptions at any time without notice. Atmel does not make any commitment to update the information contained herein. Unless specifically provided otherwise, Atmel products are not suitable for, and shall not be used in, automotive applications. Atmel products are not intended, authorized, or warranted for use as components in applications intended to support or sustain life. 6