Lessons from the ARM Architecture. Richard Grisenthwaite Lead Architect and Fellow ARM

Similar documents

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Architecture and Implementation of the ARM Cortex -A8 Microprocessor

ARM Microprocessor and ARM-Based Microcontrollers

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Architectures, Processors, and Devices

Five Families of ARM Processor IP

Which ARM Cortex Core Is Right for Your Application: A, R or M?

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

The ARM Cortex-A9 Processors

Hardware accelerated Virtualization in the ARM Cortex Processors

The ARM Architecture. With a focus on v7a and Cortex-A8

ARM Cortex-R Architecture

Details of a New Cortex Processor Revealed. Cortex-A9. ARM Developers Conference October 2007

Cortex-A9 MPCore Software Development

ARM Webinar series. ARM Based SoC. Abey Thomas

ARM VIRTUALIZATION FOR THE MASSES. Christoffer Dall

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

ARM Architecture. ARM history. Why ARM? ARM Ltd developed by Acorn computers. Computer Organization and Assembly Languages Yung-Yu Chuang

ARMv8 Technology Preview. By Richard Grisenthwaite Lead Architect and Fellow. ARM

Cortex -A15. Technical Reference Manual. Revision: r2p0. Copyright 2011 ARM. All rights reserved. ARM DDI 0438C (ID102211)

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

StrongARM** SA-110 Microprocessor Instruction Timing

İSTANBUL AYDIN UNIVERSITY

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Instruction Set Design

Application Note 195. ARM11 performance monitor unit. Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007

WAR: Write After Read

CSE597a - Cell Phone OS Security. Cellphone Hardware. William Enck Prof. Patrick McDaniel

Cortex -A7 MPCore. Technical Reference Manual. Revision: r0p5. Copyright ARM. All rights reserved. ARM DDI 0464F (ID080315)

Instruction Set Architecture (ISA)

Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉志尉 National Chiao Tung University

In the Beginning The first ISA appears on the IBM System 360 In the good old days

Virtualization in the ARMv7 Architecture Lecture for the Embedded Systems Course CSD, University of Crete (May 20, 2014)

An Introduction to the ARM 7 Architecture

CISC, RISC, and DSP Microprocessors

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

SOC architecture and design

Cortex -A7 Floating-Point Unit

Intel 8086 architecture

CS352H: Computer Systems Architecture

"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012

High Performance or Cycle Accuracy?

CS 61C: Great Ideas in Computer Architecture Virtual Memory Cont.

Cortex -A Series. Programmer s Guide. Version: 2.0. Copyright 2011 ARM. All rights reserved. ARM DEN0013B (ID082411)

Compilation Tools. RealView. Developer Guide. Version 4.0. Copyright ARM. All rights reserved. ARM DUI 0203J (ID101213)

VLIW Processors. VLIW Processors

ARM Processors and the Internet of Things. Joseph Yiu Senior Embedded Technology Specialist, ARM

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

Computer Architecture TDTS10

LSN 2 Computer Processors

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

OC By Arsene Fansi T. POLIMI

Cortex -A Series ARM. Programmer s Guide. Version: 4.0. Copyright ARM. All rights reserved. ARM DEN0013D (ID012214)

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

Precise and Accurate Processor Simulation

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

Computer Organization and Architecture

Putting it all together: Intel Nehalem.

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

Intel Pentium 4 Processor on 90nm Technology

Intel StrongARM SA-110 Microprocessor

Performance evaluation

18-548/ Associativity 9/16/98. 7 Associativity / Memory System Architecture Philip Koopman September 16, 1998

ARM Caches: Giving you enough rope... to shoot yourself in the foot. Marc Zyngier KVM Forum 15

Instruction Set Architecture (ISA) Design. Classification Categories

Exception and Interrupt Handling in ARM

ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

VtRES Towards Hardware Embedded Virtualization Technology: Architectural Enhancements to an ARM SoC. ESRG Embedded Systems Research Group

ARM Accredited Engineer AAE Syllabus. Document number: ARM AEG 0052C Copyright ARM Limited 2012

Chapter 5 Instructor's Manual

CPU Performance Equation

Week 1 out-of-class notes, discussions and sample problems

EE361: Digital Computer Organization Course Syllabus

Migrating Application Code from ARM Cortex-M4 to Cortex-M7 Processors

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Generations of the computer. processors.

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Design Cycle for Microprocessors

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

Next Generation GPU Architecture Code-named Fermi

big.little Technology: The Future of Mobile Making very high performance available in a mobile envelope without sacrificing energy efficiency

Embedded Parallel Computing

A Unified View of Virtual Machines

CPU Organization and Assembly Language

SPARC64 VIIIfx: CPU for the K computer

on an system with an infinite number of processors. Calculate the speedup of

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Thread level parallelism

x64 Servers: Do you want 64 or 32 bit apps with that server?

Architecture of the Kernel-based Virtual Machine (KVM)

Transcription:

Lessons from the ARM Architecture Richard Grisenthwaite Lead Architect and Fellow ARM 1

ARM Processor Applications 2

Overview Introduction to the ARM architecture Definition of Architecture History & evolution Key points of the basic architecture Examples of ARM implementations Something to the micro-architects interested My Lessons on Architecture design or what I wish I d known 15 years ago 3

Definition of Architecture The Architecture is the contract between the Hardware and the Software Confers rights and responsibilities to both the Hardware and the Software MUCH more than just the instruction set The architecture distinguishes between: Architected behaviors: Must be obeyed May be just the limits of behavior rather than specific behaviors Implementation specific behaviors that expose the micro-architecture Certain areas are declared implementation specific. E.g.: Power-down Cache and TLB Lockdown Details of the Performance Monitors Code obeying the architected behaviors is portable across implementations Reliance on implementation specific behaviors gives no such guarantee Architecture is different from Micro-architecture What vs How 4

History ARM has quite a lot of history First ARM core (ARM1) ran code in April 1985 3 stage pipeline very simple RISC-style processor Original processor was designed for the Acorn Microcomputer Replacing a 6502-based design ARM Ltd formed in 1990 as an Intellectual Property company Taking the 3 stage pipeline as the main building block This 3 stage pipeline evolved into the ARM7TDMI Still the mainstay of ARM s volume Code compatibility with ARM7TDMI remains very important Especially at the applications level The ARM architecture has features which derive from ARM1 Strong applications level compatibility focus in the ARM products 5

Evolution of the ARM Architecture Original ARM architecture: 32-bit RISC architecture focussed on core instruction set 16 Registers - 1 being the Program counter generally accessible Conditional execution on all instructions Load/Store Multiple operations - Good for Code Density Shifts available on data processing and address generation Original architecture had 26-bit address space Augmented by a 32-bit address space early in the evolution Thumb instruction set was the next big step ARMv4T architecture (ARM7TDMI) Introduced a 16-bit instruction set alongside the 32-bit instruction set Different execution states for different instruction sets Switching ISA as part of a branch or exception Not a full instruction set ARM still essential ARMv4 architecture was still focused on the Core instruction set only 6

Evolution of the Architecture (2) ARMv5TEJ (ARM926EJ-S) introduced: Better interworking between ARM and Thumb Bottom bit of the address used to determine the ISA DSP-focussed additional instructions Jazelle-DBX for Java byte code interpretation in hardware Some architecting of the virtual memory system ARMv6K (ARM1136JF-S) introduced: Media processing SIMD within the integer datapath Enhanced exception handling Overhaul of the memory system architecture to be fully architected Supported only 1 level of cache ARMv7 rolled in a number of substantive changes: Thumb-2* - variable length instruction set TrustZone* Jazelle-RCT Neon 7

Extensions to ARMv7 MPE Multiprocessing Extensions Added Cache and TLB Maintenance Broadcast for efficient MP VE - Virtualization Extensions Adds hardware support for virtualization: 2 stages of translation in the memory system New mode and privilege level for holding an Hypervisor With associated traps on many system relevant instructions Support for interrupt virtualization Combines with a System MMU LPAE Large Physical Address Extensions Adds ability to address up to 40-bits of physical address space 8

VFP ARM s Floating-point solution VFP Vector Floating-point Vector functionality has been deprecated in favour of Neon Described as a coprocessor Originally a tightly-coupled coprocessor Executed instructions from ARM instruction stream via dedicated interface Now more tightly integrated into the CPU Single and Double precision floating-point Fully IEEE compliant Until VFPv3, implementations required support code for denorms Alternative Flush to Zero handling of denorms also supported Recent VFP versions: VFPv3 adding more DP registers (32 DP registers) VFPv4 adds Fused MAC and Half-precision support (IEEE754-2008) 9

ARM Architecture versions and products Key architecture revisions and products: ARMv1-ARMv3: largely lost in the mists of time ARMv4T: ARM7TDMI first Thumb processor ARMv5TEJ(+VFPv2): ARM926EJ-S ARMv6K(+VFPv2): ARM1136JF-S, ARM1176JFZ-S, ARM11MPCore first Multiprocessing Core ARMv7-A+VFPv3 Cortex-A8 ARMv7-A+MPE+VFPv3: Cortex-A5, Cortex-A9 ARMv7-A+MPE+VE+LPAE+VFPv4 Cortex-A15 ARMv7-R : ARMv6-M ARMv7-M: Cortex-R4, Cortex-R5 Cortex M0 Cortex-M3, Cortex-M4 10

ARMv7TDMI Simple 3 stage pipeline Fetch, Decode, Execute Multiple cycles in execute stage for Loads/Stores Simple core Roll your own memory system 11

ARM926EJ-S 5 stage pipeline single issue core Fetch, Decode, Execute, Memory, Writeback Most common instructions take 1 cycle in each pipeline stage Split Instruction/Data Level1 caches Virtually tagged MMU hardware page table walk based Java Decode Stack Management Java Decode Register Read Instruction Fetch Thumb Decode Register Decode Register Read Compute Partial Products Shift + ALU Sum/Accumulate & Saturation Memory Access Register Write Instruction Stream ARM Decode Register Decode Register Read FETCH DECODE EXECUTE MEMORY WRITEBACK 12

ARM1176JZF-S 8 stage pipeline single issue Split Instruction/Data Level1 caches Physically tagged Two cycle memory latency MMU hardware page table walk based Hardware branch prediction PF1 PF2 I-Cache Access + Dynamic Branch Prediction DE Decode + StaticB P RStack ISS SH ALU SAT Instr Issue + Regist er Read ALU and MAC Pipeline MAC1 MAC2 MAC3 WB ex LS add LSU Pipeline DC1 DC2 WB LS 13

Cortex-A8 Dual Issue, in-order 10 stage pipeline (+ Neon Engine) 2 levels of cache L1 I/D split, L2 unified Aggressive Branch Prediction F0 F1 F2 D0 D1 D2 D3 D4 E0 E1 E2 E3 E4 E5 M0 M1 M2 M3 N1 N2 N3 N4 N5 N6 Branch mispredict penalty Replay penalty Instruction Execute and Load/Store Integer register writeback NEON NEON register writeback ALU pipe 0 Integer ALU pipe Instruction Fetch Instruction Decode MUL pipe 0 ALU pipe 0 NEON Instruction Decode Integer MUL pipe Integer shift pipe Non-IEEE FP ADD pipe LS pipe 0 or 1 Non-IEEE FP MUL pipe L1 instruction cache miss L1 data cache miss L2 data L1 data Load and store data queue IEEE FP engine LS permute pipe BIU pipeline NEON store data L1 L2 L3 L4 L5 L6 L7 L8 L2 Tag Array L2 Data Array L9 Embedded Trace Macrocell T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 L3 memory system External trace port 14

Cortex-A9 Dual Issue, out-of-order core MP capable delivered as clusters of 1 to 4 CPUs MESI based coherency scheme across L1 data caches Shared L2 cache (PL310) Integrated interrupt controller 15

Cortex-A15 Just Announced - Core Detail 2.5Ghz in 28 HP process 12 stage in-order, 3-12 stage OoO pipeline 3.5 DMIPS/Mhz ~ 8750 DMIPS @ 2.5GHz ARMv7A with 40-bit PA Dynamic repartitioning Virtualization Fast state save and restore Move execution between cores/clusters 128-bit AMBA 4 ACE bus Supports system coherency ECC on L1 and L2 caches Eagle Pipeline Fetch Decode 12 Stage In-order pipeline (Fetch, 3 Instruction Decode, Rename) Rename 64-bit/128bit AMBA4 interface Simple Cluster 0 Simple Cluster 1 Multiply Accumulate Complex Complex Load & Store 0 Load & Store 1 3-12 Stage out-of-order pipeline (capable of 8 issue) 16

Just some basic lessons from experience Architecture is part art, part science There is no one right way to solve any particular problem.though there are plenty of wrong ways Decision involves balancing many different criteria Some quantitative, some more subjective Weighting those criteria is inherently subjective Inevitably architectures have fingerprints of their architects Hennessy & Patterson quantitative approach is excellent But it is only a framework for analysis a great toolkit Computer science we experiment using benchmarks/apps but the set of benchmarks/killer applications is not constant Engineering is all about technical compromise, and balancing factors The art of Good Enough 17

First Lesson It s all about Compatibility Customers absolutely expect compatibility Customers buy your roadmap, not just your products The software they write is just expected to work Trying to obsolete features is a long-term task Issues from the real world: Nobody actually really knows what their code uses and they ve often lost the sources/knowledge People don t use features as the architect intended Bright software engineers come up with clever solutions The clever solutions find inconvenient truths Compatibility is with what the hardware actually does Not with how you wanted it to be used There is a thing called de facto architecture 18

Second lesson orthogonality Classic computer texts tell you orthogonality is good Beware false orthogonalities ARM architecture R15 being the program counter Orthogonality says you can do lots of wacky things using the PC On a simple implementation, the apparent orthogonality is cheap ARM architecture has shifts with all data processing Orthogonality from original ARM1 pipeline But the behaviour has to be maintained into the future Not all useful control configurations come in powers of 2 Fear the words It just falls out of the design True for today s microarchitecture but what about the next 20 years? Try to only architect the functionality you think will actually be useful Avoid less useful functionality that emerges from the micro-architecture 19

Third Lesson Microarchitecture led features Few successful architectures started as architectures Code runs on implementations, not on architectures People buy implementations, not architectures IP licensing notwithstanding Most architectures have micro-architecture led features it just fell out Optimisations based on first target micro-architecture MIPS delayed branch slots ARM the PC offset of 2 instructions Made the first implementation cheaper/easier than the pure approach but becomes more expensive on subsequent implementations Surprisingly difficult trade-off Short-term/long-term balance Meeting a real need sustainably vs overburdening early implementations 20

Fourth Lesson: New Features Successful architectures get pulled by market forces Success in a particular market adds features for that market Different points of success pull successively over time Solutions don t necessarily fit together perfectly Unsuccessful architectures don t get same pressures which is probably why they appear so clean! Be very careful adding new features: Easy to add, difficult to remove Especially for user code Trap and Emulate is an illusion of compatibility Performance differential is too great for most applications 21

Lessons on New Features New features rarely stay in the application space you expect. Or want architects are depressingly powerless Customers will exploit whatever they find So worry about the general applicability of that feature Assume it has to go everywhere in the roadmap If a feature requires a combination of hardware and specific software be afraid development timescales are different Be very afraid of language specific features All new languages appear to be different.but very rarely are Avoid solving this years problem in next year s architecture... Next year s problem may be very different Point solutions often become warts all architectures have them If the feature has a shelf-life, plan for obsolence Example: Jazelle-DBX in the ARM architecture 22

Thoughts on instruction set design What is the difference between an instruction and a micro-op? RISC principles said they were the same Very few PURE RISC architectures exist today ARM doesn t pretend to be hard-core RISC I ll claim some RISC credentials Choice of Micro-ops is micro-architecture dependent An architecture should be micro-architecturally independent Therefore mapping of instructions to micro-ops is inherently risky Splitting instructions easier than fusing instructions If an instruction can plausibly be done in one block, might be right to express it Even if some micro-architectures are forced to split the instruction But, remember legacy lasts for a very long time Imagine explaining the instructions in 5 years time Avoid having instructions that provide 2 ways to do much the same thing Everyone will ask you which is better - lots of times If it feels a bit clunky when you first design it. it won t improve over time 23

Final point Architecture is not enough Not Enough to ensure perfect write once, run everywhere Customers expectations of compatibility go beyond architected behaviour People don t write always code to the architecture and they certainly can t easily test it to the architecture ARM is developing tools to help address this Architecture Envelope Models a suite of badly behaved legal implementations The architecture defines whether they are software bugs or hardware incompatibilities allows you to assign blame (and fix the problem consistently) Beware significant performance anomalies between architectural compliant cores If I buy a faster core, I want it to go reliably faster without recompiling Multi-processing introduce huge scope for functional differences from timing Especially in badly synchronised code Concurrency errors BUT THE ARCHITECTURE IS A STRATEGIC ASSET 24

History the Passage of Time 25

Microprocessor Forum 1992 26

Count the Architectures (11) PA ARM MIPS 29K N32x16 88xxx NVAX Alpha SPARC x86 x86 x86 i960 x86 27

The Survivors At Most 2 ARM MIPS SPARC x86 28

What Changed? Some business/market effects Some simple scale of economy effects Some technology effects Adoption and Ecosystem effects It wasn t all technology not all of the disappearing architectures were bad Not all survivors are good! Not all good science succeeds in the market! Never forget Good enough 29

Thank you Questions 30