Five Families of ARM Processor IP



Similar documents
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

ARM Microprocessor and ARM-Based Microcontrollers

The ARM Processor Architecture

ARM Webinar series. ARM Based SoC. Abey Thomas

Architecture and Implementation of the ARM Cortex -A8 Microprocessor

Application Note 195. ARM11 performance monitor unit. Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Which ARM Cortex Core Is Right for Your Application: A, R or M?

Migrating Application Code from ARM Cortex-M4 to Cortex-M7 Processors

A case study of mobile SoC architecture design based on transaction-level modeling


The ARM Architecture. With a focus on v7a and Cortex-A8

Lessons from the ARM Architecture. Richard Grisenthwaite Lead Architect and Fellow ARM

Performance evaluation

The ARM Cortex-A9 Processors

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

System Design Issues in Embedded Processing

EE361: Digital Computer Organization Course Syllabus

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Cortex-A9 MPCore Software Development

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

A Computer Vision System on a Chip: a case study from the automotive domain

Architectures, Processors, and Devices

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

Mobile Processors: Future Trends

Embedded Development Tools

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Cortex -A7 MPCore. Technical Reference Manual. Revision: r0p5. Copyright ARM. All rights reserved. ARM DDI 0464F (ID080315)

ARM Processors and the Internet of Things. Joseph Yiu Senior Embedded Technology Specialist, ARM

POSIX. RTOSes Part I. POSIX Versions. POSIX Versions (2)

Figure 1: Graphical example of a mergesort 1.

Software based Finite State Machine (FSM) with general purpose processors

Deeply Embedded Real-Time Hypervisors for the Automotive Domain Dr. Gary Morgan, ETAS/ESC

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Design Cycle for Microprocessors

VtRES Towards Hardware Embedded Virtualization Technology: Architectural Enhancements to an ARM SoC. ESRG Embedded Systems Research Group

Instruction Set Design

CFD Implementation with In-Socket FPGA Accelerators

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

White Paper. Intel Sandy Bridge Brings Many Benefits to the PC/104 Form Factor

"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012

A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications

Operating System Impact on SMT Architecture

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Developing Embedded Applications with ARM Cortex TM -M1 Processors in Actel IGLOO and Fusion FPGAs. White Paper

Types of Workloads. Raj Jain. Washington University in St. Louis

OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

Giving credit where credit is due

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

POWER8 Performance Analysis

Overview of the Cortex-M3

CS352H: Computer Systems Architecture

What is a System on a Chip?

Am186ER/Am188ER AMD Continues 16-bit Innovation

High Performance or Cycle Accuracy?

USB 3.0 Connectivity using the Cypress EZ-USB FX3 Controller

Computer Organization and Components

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Week 1 out-of-class notes, discussions and sample problems

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-17: Memory organisation, and types of memory

Architectures and Platforms

GETTING STARTED WITH ANDROID DEVELOPMENT FOR EMBEDDED SYSTEMS

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

İSTANBUL AYDIN UNIVERSITY

Low Power AMD Athlon 64 and AMD Opteron Processors

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

STM32 F-2 series High-performance Cortex-M3 MCUs

The new 32-bit MSP432 MCU platform from Texas

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

The Future of the ARM Processor in Military Operations

Trace Port Analysis for ARM7-ETM and ARM9-ETM Microprocessors

CSE597a - Cell Phone OS Security. Cellphone Hardware. William Enck Prof. Patrick McDaniel

7a. System-on-chip design and prototyping platforms

SOC architecture and design

on an system with an infinite number of processors. Calculate the speedup of

Cortex -A15. Technical Reference Manual. Revision: r2p0. Copyright 2011 ARM. All rights reserved. ARM DDI 0438C (ID102211)

GSM/GPRS PHYSICAL LAYER ON SANDBLASTER DSP

A Lab Course on Computer Architecture

Exception and Interrupt Handling in ARM

Linux Performance Optimizations for Big Data Environments

XtratuM hypervisor redesign for LEON4 multicore processor

Performance Comparison of RTOS

Soft processors for microcontroller programming education

MPSoC Virtual Platforms

LogiCORE IP AXI Performance Monitor v2.00.a

Intel StrongARM SA-110 Microprocessor

PikeOS: Multi-Core RTOS for IMA. Dr. Sergey Tverdyshev SYSGO AG , Moscow

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

big.little Technology: The Future of Mobile Making very high performance available in a mobile envelope without sacrificing energy efficiency

StrongARM** SA-110 Microprocessor Instruction Timing

Course on Advanced Computer Architectures

Binary search tree with SIMD bandwidth optimization using SSE

EMBEDDED X86 PROCESSOR PERFORMANCE RATING SYSTEM

Transcription:

ARM1026EJ-S Synthesizable ARM10E Family Processor Core Eric Schorn CPU Product Manager ARM Austin Design Center Five Families of ARM Processor IP Performance ARM preserves SW & HW investment through code and process portability ARM9 - Dual Caches - Hard-IP Design - Performance - 5 Stage Pipe - 3 product offerings ARM7 - Low Power - Area Efficient - Code Density - 3 Stage Pipe - 4 product offerings ARM9E - Config $/TCM - Soft-IP Design - DSP Instr. - Java in HW - 3 product offerings Time ARM10E - Multiple $/TCM - Float in HW - Dual 64bit I/O - 6 Stage Pipe - 2 product offerings -New:ARM1026EJ-S SecurCore - Secure Apps - Performance - Power, Area - Code Density - 4 product offerings Future This presentation - See New ARM Microarchitecture Implementing the ARMv6 Architecture EPF presentation 2

ARM1026EJ-S Objectives Significant new functionality and flexibility enhancements to the ARM10E family of processor cores Ideal extension of ARM9E based technology Microarchitecture performance optimizations based on detailed benchmark analysis, Dhrystone, OS Boot, key apps Maximize area efficiency through process-specific compiled SRAM memory arrays Configurable cache and TCM sizes on a case-by-case basis Soft-IP design delivery, with hard-ip available Rapid process migration and design flow compatibility 3 ARM1026EJ-S Overview Jazelle technology-enhanced ARM10E family processor for platform and embedded applications Convergence of proven ARM1020E and ARM926EJ-S technology ARMv5TEJ architecture: ARM, Thumb, DSP, and Java instructions MMU and MPU support Configurable caches and TCMs Extensive 64-bit internal bussing Dual 64-bit AHB I/O interfaces IEEE 754 FP support with VFP10 Real-time trace with ETM10RV * According to strict interpretation of Dhrystone v2.1 rules 266-325 MHz worst-case on 0.13μ 325-400+ Dhrystone 2.1 MIPS * 4

New Features in ARM1026EJ-S Jazelle technology Industry leading Java bytecode execution in HW Compatible with ARM7EJ-S, ARM926EJ-S, and future cores MMU and MPU support via shared logic Enables both platform OS and RTOS solutions on the same part Windows CE, Symbian OS, VxWorks, Linux, Direct-attach vector interrupt controller (VIC) port Supports more interrupt sources and vectors Significant reduction in interrupt handling latency Solution is compatible with future cores 5 ARM1026EJ-S in Context Industry-leading feature set ARM1026EJ-S ARM920T ARM926EJ-S Intel XScale ARM Architecture Core pipeline depth v5tej 6 v4t 5 v5tej 5 v5te 7, 8 Branch prediction Static Dynamic Memory management MMU & MPU MMU MMU MMU Real-time trace Java bytecodes in HW Yes (ETM10RV) Yes Yes (ETM9) Yes (ETM9) Yes - - IEEE floating-point in HW CPU Cache/TCM Vector interrupt support Yes (VFP10) Config Caches/TCMs Yes 16k/16k Caches Yes (VFP9) Config Caches/TCMs - 32k/32k Caches - Design delivery Portable Soft-IP Portable Hard-IP Portable Soft-IP ASSP CPU core I/O width(s) 2 x 64 AHB 1 x 32 AHB 2 x 32 AHB 1 x 32 Industry-leading support infrastructure 6

ARM1026EJ-S Performance Analysis and Tuning 7 Squeezing MIPS from a Dhrystone Dhrystone has a long history with well defined rules See http://www.arm.com for links to discussion, references, and repeatable code examples Yet, confusion appears prevalent even today ARM1020E is 1.23 MIPS/MHz according to the rules of v2.1 Auto-inlining improves the score to 1.45 MIPS/MHz File merging improves the score further to 1.71 MIPS/MHz Version 1.1 on an unreleased compiler improves the score further to 1.92 MIPS/MHz Better benchmarks are needed for accurate comparisons Ideally would also be more relevant to the real world Reliable, consistent, repeatable 8

ARM1020E Certified Benchmarks The ARM1020E processor was the starting point for ARM1026EJ-S definition and RTL development Completed certification of ARM1020E with Verilog RTL *with the exception of ConsumerMark on cycle model due to simulation time Base measurements are ECL certified and publicly available from Vendor Part # Instruction Issue Tool Chain NetMark AutoMark Developed a highly-accurate cycle model of ARM1020E in C Verified against RTL using Dhrystone and Enables what-if analysis for ARM1026EJ-S ISA L2 Cache Size ConsumerMark OAMark TeleMark ARM ARM1020E ARM v5te ne Single ARM ADS 1.2 5.42 120.5 19.3* 206.2 4.14 9 ARM1026EJ-S Branch Prediction ARM1020E branch predictor inspects only single entry of prefetch queue Predictor can miss a branch since code alignment, CPU stalls, and branch sequences can affect queue fills ARM1026EJ-S branch predictor operates on two queue entries branches missed by predictor 11% improvement on NetMark 3% improvement on Dhrystone 2.1 Instruction Cache Entry A Entry B Entry C Entry D Integer CPU Core 64b: 2 instr/cycle Prefetch Queue Observations Dramatic (~40%) benefit on NetMark relative to ARM926EJ-S Branch prediction on entire Telecom suite is over 82% correct Static predictor is now over 70% accurate on average New Instr+predec Branch Predictor Prediction 10

ARM1026EJ-S Memory Subsystem Dhrystone loops don t cause linefills but real code does Optimized BIU to further increase BW of 2x64bit AHB I/O Most improvement seen on larger NetMark packet flows 38% improvement on the first iteration of Dhrystone Tremendous improvement in OS boot cycle count over ARM9 Optimized write buffer for sequential stores Collapse sequential stores into AHB burst transaction Biggest change in any single component 68% in Consumer:CMY Observations All Net/Auto/Tele code fits in the ICache see code size stats Data cache hit rate for Net/Auto/Tele upwards of 95% 11 ARM1026EJ-S Return Stack R14 is our link register to hold return addresses New feature to predict MOV PC, R14 and BX R14 branch addresses common return from subroutine Longer ARM10E pipeline allows this logic to fit nicely Observations Improvement varied wildly across algorithms Automotive PWM shows over 10% improvement Networking Open Shortest Path First shows 0% improvement Various stack depths were investigated returns diminish quickly as complexity increases 12

ARM1026EJ-S MHz Optimization Migrating from Hard-IP to Soft-IP is non-trivial Objective is to maintain both MHz and IPC Previous IPC enhancements helped to open up headroom Extensive timing analysis was done to determine what features of the design were penalizing speed Objective is to analyze total performance tradeoff, on a feature by feature basis, as measured by, Dhrystone, and OS boots Several places to tradeoff against the uncommon case Unpredicted and mispredicted branches mostly hidden by improved branch prediction and return stack Data to LS address forwarding 2-5% impact Byte/half-word load forwarding 8% impact on TeleMark only 13 ARM1026EJ-S Design Summary ARM1026EJ-S builds upon ARM1020E, by adding: Soft-IP option with configurable caches and TCMs Jazelle, VIC port, and both MMU/MPU support Improves both performance and area efficiency 7% improvement on geometric mean of suites 5% improvement on Dhrystone v2.1 ARM1026EJ-S is the ideal extension of ARM9E family technology Additional pipeline stage to increase frequency Branch prediction, branch folding, and return stack to improve CPI Extensive 64-bit internal bussing and dual 64-bit AHB I/O for increased bandwidth ARM9E ARM1026EJ-S ARM1020E 14

ARM1026EJ-S System Overview 15 ARM1026EJ-S System Solution VFP10 FP Coprocessor L2 SRAM Arrays ARM1026EJ-S CPU ETM10RV Trace Macrocell Vector Interrupt Controller L2 Cache/TCM Controller AMBA AHB System Busses ARM1026EJ-S Integer CPU ARM, Thumb, DSP, and Java Configurable caches and TCMs Dual 64-bit AHB I/O 400+ Dhrystone 2.1 MIPS VFP10 FP Coprocessor IEEE 754 SP/DP, vector support ETM10RV Embedded Trace Macro Real-time, zero perf overhead 16

ARM1026EJ-S System Solution Optional L2 cache/tcm controller Multiple AHB ports in multiple clock domains up to CPU speed Configurable SRAM array sizes and latencies, set associativity, line lengths, write back/through modes, and line allocation policy Flexible cache and TCM combinations are supported Allows optimization for both performance and cost Vector Interrupt Controller Applies prioritized interrupt and associated vector to CPU Addresses significant latency overhead in systems with a large number of frequent interrupts 17 Conclusion ARM1026EJ-S is a large step along the ARM roadmap Further extending flexibility, functionality, and compatibility Performance tuning that reflects real-world applications Benchmarking remains a difficult and subjective art More and more data enables intelligent judgments t all important features and effects make it to the bottom line Final performance is the sum of a lot of small improvements ARM1026EJ-S demonstrates ARM s continued focus on the entire solution CPU performance AND flexibility, compatibility, functionality, area, power, integration, software, tools, support Minimizing total development and production cost 18