Real-Time and Embedded Systems, Lecture 6: ARM Cores, Part 2


Real-Time and Embedded Systems, Lecture 6: ARM Cores, Part 2. Wrocław 2013

Pipelining

What Is A Pipeline? Pipelining is used by virtually all modern microprocessors to enhance performance by overlapping the execution of instructions. In terms of a pipeline within a CPU, each instruction is broken up into different stages.

What Is A Pipeline? Ideally, if each stage is balanced (all stages are ready to start at the same time and take an equal amount of time to execute), the time taken per instruction (pipelined) is defined as: Time per instruction (unpipelined) / Number of stages

What is a pipeline If the stages of a pipeline are not balanced and one stage is slower than another, the entire throughput of the pipeline is affected.

What is a pipeline In terms of a CPU, the implementation of pipelining has the effect of reducing the average instruction time, therefore reducing the average CPI (Clocks per Instruction). Example: If each instruction in a microprocessor takes 5 clock cycles (unpipelined) and we have a 4 stage pipeline, the ideal average CPI with the pipeline will be 1.25.

Classical 5-stage pipeline Usually we have 5 cycle deep pipeline: Instruction Fetch Cycle Instruction Decode/Register Fetch Cycle Execution Cycle Memory Access Cycle Write-Back Cycle

Instruction Fetch (IF) Cycle During the IF cycle the instruction is fetched from instruction memory

Instruction Decode (ID)/Register Fetch Cycle Decoding the instruction and at the same time reading in the values of the registers involved. As the registers are being read, an equality test is performed in case the instruction decodes as a branch or jump. The instruction can be decoded in parallel with reading the registers because the register specifiers are at fixed positions within the instruction.

Execution (EX)/Effective Address Cycle If a branch or jump did not occur in the previous cycle, the arithmetic logic unit (ALU) can execute the instruction. At this point the instruction falls into three different types: Memory Reference: the ALU adds the base register and the offset to form the effective address. Register-Register: the ALU performs the arithmetic or logical operation specified by the opcode. Register-Immediate: the ALU performs the operation on the register and the immediate value (sign extended).

Memory Access (MEM) Cycle If a load, the effective address computed from the previous cycle is referenced and the memory is read. The actual data transfer to the register does not occur until the next cycle. If a store, the data from the register is written to the effective address in memory.

Write-Back (WB) Cycle Occurs with Register-Register ALU instructions or load instructions. A simple operation: whether it is a register-register ALU operation or a memory load operation, the resulting data is written to the appropriate register.

Problems With The Previous Figure The memory is accessed twice during each clock cycle. This problem is avoided by using separate data and instruction caches. It is important to note that if the clock period is the same for a pipelined processor and a non-pipelined processor, the memory must work five times faster. Another problem that we can observe is that the registers are accessed twice every clock cycle. To try to avoid a resource conflict we perform the register write in the first half of the clock cycle and the register read in the second half.

Problems With The Previous Figure (cont'd) We write in the first half so that a value written by one instruction can be read in the same cycle by another instruction further down the pipeline. A third problem arises with the interaction of the pipeline with the PC. We use an adder to increment the PC by the end of IF. Within ID we may branch and modify the PC. How does this affect the pipeline?

Pipeline Hazards The performance gain from using pipelining occurs because we can start the execution of a new instruction each clock cycle. In a real implementation this is not always possible. Another important note is that in a pipelined processor, a particular instruction still takes at least as long to execute as in a non-pipelined one. Pipeline hazards prevent the next instruction from executing during its designated clock cycle.

Types Of Hazards There are three types of hazards in a pipeline: Structural Hazards: created when the data path hardware in the pipeline cannot support all of the overlapped instructions in the pipeline. Data Hazards: arise when an instruction in the pipeline affects the result of another instruction in the pipeline. Control Hazards: arise from the pipelining of branches and other instructions that change the PC.

A Hazard Will Cause A Pipeline Stall We can look at pipeline performance in terms of a faster clock cycle time as well:
Speedup = (CPI unpipelined / CPI pipelined) x (Clock cycle time unpipelined / Clock cycle time pipelined)
Clock cycle time pipelined = Clock cycle time unpipelined / Pipeline depth
Speedup = (1 / (1 + Pipeline stall cycles per instruction)) x Pipeline depth

Dealing With Structural Hazards Structural hazards result from the CPU data path not having enough resources to service all of the overlapping instructions. Suppose a processor can only perform a single register access (read or write) in one clock cycle. This would cause a problem during the ID and WB stages. Assume that there are no separate instruction and data caches, and only one memory access can occur during one clock cycle. A hazard would be caused during the IF and MEM cycles.

Dealing With Structural Hazards A structural hazard is dealt with by inserting a stall or pipeline bubble into the pipeline. This means that for that clock cycle, nothing happens for that instruction. This slides that instruction, and all subsequent instructions, back by one clock cycle, which effectively increases the average CPI.

Dealing With Structural Hazards (cont'd)
Speedup = (CPI no hazard / CPI hazard) x (Clock cycle time no hazard / Clock cycle time hazard)
Speedup = (1 / (1 + 0.4 x 1)) x (1 / (1/1.05)) = 1.05 / 1.4 = 0.75

Dealing With Structural Hazards (cont'd) We can see that even though the clock speed of the processor with the hazard is a little faster, the speedup is still less than 1. Therefore the hazard has quite an effect on the performance. Sometimes computer architects will opt to design a processor that exhibits a structural hazard. Why? A: The improvement to the processor data path is too costly. B: The hazard occurs rarely enough so that the processor will still perform to specifications.

Data Hazards (A Programming Problem?) We haven't looked at assembly programming in detail at this point. Consider the following operations: DADD R1, R2, R3 DSUB R4, R1, R5 AND R6, R1, R7 OR R8, R1, R9 XOR R10, R1, R11

Pipeline Registers What are the problems?

Data Hazard Avoidance In this trivial example, the programmer cannot be expected to reorder his/her operations, assuming this is the only code we want to execute. Data forwarding can be used to solve this problem. To implement data forwarding we need to bypass the pipeline register flow: output from the EX/MEM and MEM/WB stages must be fed back into the ALU input. We need routing hardware that detects when the next instruction depends on the write of a previous instruction.
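The routing hardware just described can be sketched in C. This is a minimal illustrative model of the classic forwarding decision for one ALU source operand in the 5-stage pipeline; the structure and field names (regwrite, rd, and so on) are assumptions made for this sketch, not part of the lecture material or of any real interface.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the forwarding (bypass) decision for one ALU source
 * operand in the classic 5-stage pipeline. Structure and field names are
 * invented for this sketch. */

typedef struct {
    bool    regwrite;   /* instruction in this stage will write a register */
    uint8_t rd;         /* destination register number                     */
} PipeStageInfo;

typedef enum {
    FWD_NONE,           /* use the value read from the register file          */
    FWD_FROM_EXMEM,     /* forward the ALU result sitting in the EX/MEM latch */
    FWD_FROM_MEMWB      /* forward the value sitting in the MEM/WB latch      */
} ForwardSel;

/* Decide where the ALU should take its source operand 'rs' from. The EX/MEM
 * result is preferred because it is the most recent write of that register. */
ForwardSel forward_select(uint8_t rs, PipeStageInfo exmem, PipeStageInfo memwb)
{
    if (exmem.regwrite && exmem.rd != 0 && exmem.rd == rs)
        return FWD_FROM_EXMEM;
    if (memwb.regwrite && memwb.rd != 0 && memwb.rd == rs)
        return FWD_FROM_MEMWB;
    return FWD_NONE;
}

For the DADD/DSUB sequence shown earlier, when DSUB reaches EX the DADD result is sitting in the EX/MEM latch, so the check returns FWD_FROM_EXMEM for R1 and the ALU uses the forwarded value instead of the stale register-file copy.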

General Data Forwarding It is easy to see how data forwarding can be used by drawing out the pipelined execution of each instruction. Now consider the following instructions: DADD R1, R2, R3 LD R4, 0(R1) SD R4, 12(R1)

Problems Can data forwarding prevent all data hazards? NO! The following operations will still cause a data hazard. This happens because the further down the pipeline we get, the less we can use forwarding. LD R1, 0(R2) DSUB R4, R1, R5 AND R6, R1, R7 OR R8, R1, R9

Problems We can avoid the hazard by using a pipeline interlock. The pipeline interlock will detect when data forwarding will not be able to get the data to the next instruction in time. A stall is introduced until the instruction can get the appropriate data from the previous instruction.
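A minimal sketch of the interlock check, under the same assumptions as the forwarding sketch above (invented field names, classic 5-stage pipeline): it detects the load-use case that forwarding cannot cover and requests a one-cycle stall.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of a load-use interlock check: forwarding cannot deliver
 * a loaded value to the immediately following instruction in time, so the
 * pipeline must insert one stall (bubble). Field names are invented. */

typedef struct {
    bool    memread;    /* instruction in EX is a load              */
    uint8_t rt;         /* register the load will write             */
} IdExInfo;

typedef struct {
    uint8_t rs, rt;     /* source registers of the instruction in ID */
} IfIdInfo;

/* Returns true when the instruction currently being decoded needs the result
 * of a load that is still in the EX stage - the case forwarding cannot cover. */
bool must_stall(IdExInfo idex, IfIdInfo ifid)
{
    return idex.memread &&
           (idex.rt == ifid.rs || idex.rt == ifid.rt);
}

In the LD R1, 0(R2) / DSUB R4, R1, R5 sequence above this check fires, DSUB is held for one cycle, and forwarding can then deliver R1 from the MEM/WB latch.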

Control Hazards Control hazards are caused by branches in the code. During the IF stage remember that the PC is incremented by 4 in preparation for the next IF cycle of the next instruction. What happens if there is a branch and we aren't simply incrementing the PC by 4? The easiest way to deal with the occurrence of a branch is to perform the IF stage again once the branch occurs.

Performing IF Twice We take a big performance hit by performing the instruction fetch again whenever a branch occurs. Note that this happens whether the branch is taken or not. This guarantees that the PC will get the correct value.
instruction:       IF ID EX MEM WB
branch:               IF ID EX MEM WB
next instruction:        IF IF ID EX MEM WB

Performing IF Twice This method will work, but as always in computer architecture we should try to make the most common operation fast and efficient. By performing IF twice we will encounter a performance hit of 10%-30%. Next we will look at some other methods for dealing with control hazards.

Control Hazards (other solutions) What if every branch is treated as not taken? Then not only are the registers read during ID, but an equality test is also performed to determine whether the branch should be taken or not. The performance can be improved by assuming that the branch will not be taken. The complexity arises when the branch evaluates as taken and we end up needing to flush the instructions fetched from the not-taken path.

Control Hazards (other solutions) If the branch is actually taken, then the pipeline needs to be cleared of any instructions fetched from the not-taken path. Likewise, it can be assumed that the branch is always taken.

Control Hazards (other solutions) The next method for dealing with a control hazard is to implement a delayed branch scheme. In this scheme an instruction is inserted into the pipeline that is useful and not dependent on whether the branch is taken or not. It is the job of the compiler to determine the delayed branch instruction.

How To Implement a Pipeline

Multi-clock Operations Sometimes operations require more than one clock cycle to complete. Examples are: Floating Point Multiply Floating Point Divide Floating Point Add

Dependences and Hazards Types of data hazards: RAW: read after write WAW: write after write WAR: write after read RAW hazard was already shown. WAW hazards occur due to output dependence. WAR hazards do not usually occur because of the amount of time between the read cycle and write cycle in a pipeline.

Dynamic Scheduling In the statically scheduled pipeline the instructions are fetched and then issued. If the user's code has a data dependence / control dependence, it is hidden by forwarding. If the dependence cannot be hidden, a stall occurs. Dynamic Scheduling is an important technique in which the hardware rearranges instruction execution to reduce stalls while the dataflow and exception behavior of the program are maintained.

Dynamic Scheduling (continued) Data dependence can cause stalling in a pipeline that has long execution times for instructions with dependences. EX: Let's consider this code (.D is floating point): DIV.D F0,F2,F4 ADD.D F10,F0,F8 SUB.D F12,F8,F14 Here ADD.D must wait for the long-latency DIV.D, and the independent SUB.D stalls behind it in an in-order pipeline.

Dynamic Scheduling (continued) Longer execution times of certain floating point operations give the possibility of WAW and WAR hazards. EX: DIV.D F0, F2, F4 ADD.D F6, F0, F8 SUB.D F8, F10, F14 MUL.D F6, F10, F8 Here SUB.D writes F8, which ADD.D reads (WAR), and MUL.D writes F6, which ADD.D also writes (WAW).

Dynamic Scheduling (continued) If we want to execute instructions out of order in hardware (if they are not dependent, etc.) we need to modify the ID stage of the 5-stage pipeline. Split ID into the following stages: Issue: Decode instructions, check for structural hazards. Read Operands: Wait until no data hazards, then read operands. IF still precedes ID and will store the instruction into a register or queue.

Branch Prediction In Hardware While data hazards can be overcome by dynamic hardware scheduling, control hazards also need to be addressed. Branch prediction is extremely useful for repetitive branches, such as loops. A simple branch predictor can be implemented using a small amount of memory indexed by the lower-order bits of the address of the branch instruction. The memory only needs to contain one bit, representing whether the branch was taken or not.

Branch Prediction In Hardware If the branch is taken the bit is set to 1. The next time the branch instruction is fetched we will know that the branch occurred and we can assume that the branch will be taken. This scheme adds some history to our previous discussion on branch taken and branch not taken control hazard avoidance.

2-bit Prediction Scheme This method is more reliable than using a single bit to represent whether the branch was recently taken or not. The use of a 2-bit predictor will allow branches that favor taken (or not taken) to be mispredicted less often than the one-bit case.
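A small sketch of the scheme described on this and the previous slides: a table of 2-bit saturating counters indexed by the low-order bits of the branch address. The table size and all names here are illustrative choices, not taken from the lecture or from any particular processor.

#include <stdbool.h>
#include <stdint.h>

/* Sketch of a 2-bit saturating-counter branch predictor indexed by the
 * low-order bits of the branch instruction address.
 * Counter values: 0,1 = predict not taken; 2,3 = predict taken. */

#define BPRED_ENTRIES 1024u

static uint8_t bpred_table[BPRED_ENTRIES];   /* starts as "strongly not taken" */

static uint32_t bpred_index(uint32_t branch_addr)
{
    /* word-aligned instructions: drop the two low bits, then mask */
    return (branch_addr >> 2) & (BPRED_ENTRIES - 1u);
}

bool bpred_predict(uint32_t branch_addr)
{
    return bpred_table[bpred_index(branch_addr)] >= 2;
}

/* Update after the branch resolves: move the counter one step towards the
 * actual outcome, saturating at 0 and 3. */
void bpred_update(uint32_t branch_addr, bool taken)
{
    uint8_t *ctr = &bpred_table[bpred_index(branch_addr)];
    if (taken && *ctr < 3)
        (*ctr)++;
    else if (!taken && *ctr > 0)
        (*ctr)--;
}

Because the counter has to be wrong twice before the prediction flips, a loop branch that is taken many times and then falls through once is still predicted taken on the next loop entry, which is exactly the advantage over the 1-bit scheme.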

Branch Predictors The size of a branch predictor memory will only increase its effectiveness so much. We also need to address the effectiveness of the scheme used. Just increasing the number of bits in the predictor doesn't do very much either. Some other predictors include: Correlating Predictors Tournament Predictors

Branch Predictors Correlating predictors use the history of a local branch AND some global information on how other branches have behaved to decide whether the branch will be taken or not. Tournament Predictors are even more sophisticated in that they use multiple predictors (local and global) and combine them with a selector to improve accuracy.

ARM cores, part 2

Plan ARM9 AMBA Cortex-M Cortex-R

ARM9

Source: [2]

ARM9 Source: [2]

ARM9 - features Over 5 Billion ARM9 processors have been shipped so far The ARM9 family is the most popular ARM processor family ever 250+ silicon licensees 100+ licensees of the ARM926EJ-S processor ARM9 processors continue to be successfully deployed across a wide range of products and applications. The ARM9 family offers proven, low risk and easy to use designs which reduce costs and enable rapid time to market. The ARM9 family consists of three processors - ARM926EJ-S, ARM946E-S and ARM968E-S

ARM9 family features Main features Based on ARMv5TE architecture Efficient 5-stage pipeline for faster throughput and system performance Fetch/Decode/Execute/Memory/Writeback Supports both ARM and Thumb instruction sets Efficient ARM-Thumb interworking allows optimal mix of performance and code density

ARM9 family features Main features Harvard architecture - Separate Instruction & Data memory interfaces Increased available memory bandwidth Simultaneous access to I & D memory Improved performance 31 x 32-bit registers 32-bit ALU & barrel shifter Enhanced 32-bit MAC block

ARM9 DSP enhancements Single cycle 32x16 multiplier implementation Speeds up all multiply instructions Pipelined design allows one 16x16 or 32x16 to start each cycle New 32x16 and 16x16 multiply instructions Allow independent access to 16-bit halves of registers

ARM9 DSP enhancements Gives efficient use of 32-bit bandwidth for packed 16-bit operands ARM ISA provides 32x32 multiply instructions Efficient fractional saturating arithmetic QADD, QSUB, QDADD, QDSUB Count leading zeros instruction CLZ for faster normalisation and division
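A hedged illustration of two of the instructions named above, written as GCC-style inline assembly; the function names are invented for this sketch and it assumes a GCC-compatible compiler generating ARM-state code for ARMv5TE or later. Compiler intrinsics may also be available, but inline assembly keeps the mapping to the instructions explicit.

#include <stdint.h>

/* QADD: signed saturating add - on overflow the result clamps to
 * INT32_MAX / INT32_MIN instead of wrapping, and the Q flag is set. */
static inline int32_t saturating_add(int32_t a, int32_t b)
{
    int32_t result;
    __asm__ ("qadd %0, %1, %2" : "=r"(result) : "r"(a), "r"(b));
    return result;
}

/* CLZ: count leading zeros - used to normalise values quickly, e.g. as the
 * first step of a software division or fixed-point normalisation. */
static inline uint32_t leading_zeros(uint32_t x)
{
    uint32_t result;
    __asm__ ("clz %0, %1" : "=r"(result) : "r"(x));
    return result;
}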

ARM9 features comparison Source: [3]

AMBA

AMBA - Advanced Microcontroller Bus Architecture AMBA is an on-chip communications standard for designing high-performance embedded microcontrollers, introduced by ARM in 1996. A few versions: AHB (Advanced High-performance Bus) ASB (Advanced System Bus) APB (Advanced Peripheral Bus) AXI (Advanced eXtensible Interface)

AMBA first specification Buses defined : Advanced System Bus (ASB) Advanced Peripheral Bus (APB)

AMBA 2 specification Buses defined : Advanced High-performance Bus (AHB) - widely used on ARM7, ARM9 and ARM Cortex- M based designs Advanced System Bus (ASB) Advanced Peripheral Bus (APB2 or APB)

AMBA 3 specification Buses defined: Advanced extensible Interface (AXI3 or AXI v1.0) - widely used on ARM Cortex-A processors including Cortex-A9 Advanced High-performance Bus Lite (AHB- Lite v1.0) Advanced Peripheral Bus (APB3 v1.0) Advanced Trace Bus (ATB v1.0)

AMBA 4 specification Buses defined: AXI Coherency Extensions (ACE) - widely used on the latest ARM Cortex-A processors including Cortex-A7 and Cortex-A15 AXI Coherency Extensions Lite (ACE-Lite) Advanced eXtensible Interface 4 (AXI4) Advanced eXtensible Interface 4 Lite (AXI4-Lite) Advanced eXtensible Interface 4 Stream (AXI4-Stream v1.0) Advanced Trace Bus (ATB v1.1) Advanced Peripheral Bus (APB4 v2.0)

APB APB is designed for low-power system modules, for example register interfaces on system peripherals. It is optimized for minimal power consumption and reduced interface complexity to support peripheral functions. It has to support 32-bit and 66 MHz signals.

ASB ASB is designed for high-performance system modules. It is an alternative system bus suitable for use where the high-performance features of AHB are not required. It also supports the efficient connection of processors, on-chip memories and off-chip external memory interfaces with low-power peripheral macrocell functions.

AHB AHB is designed for high-performance, high clock frequency system modules. It acts as the high-performance system backbone bus and supports the efficient connection of processors, on-chip memories and off-chip external memory interfaces with low-power peripheral macrocell functions.

AHB Features: single edge clock protocol split transactions several bus masters burst transfers pipelined operations single-cycle bus master handover non-tristate implementation large bus-widths (64/128 bit).

AHB-Lite AHB-Lite is a subset of AHB. This subset simplifies the design for a bus with a single master.

AXI AXI is designed for high-performance, high clock frequency system modules with low latency. It enables high-frequency operation without using complex bridges, provides flexibility in the implementation of interconnect architectures and is backward-compatible with existing AHB and APB interfaces.

AXI Features: separate address/control and data phases support for unaligned data transfers using byte strobes burst based transactions with only start address issued issuing of multiple outstanding addresses with out of order responses easy addition of register stages to provide timing closure.

Typical AMBA system

Cortex-M

Source: [2]

Cortex family Currently the Cortex family is being strongly promoted on the market by ARM. The Cortex family consists of three subfamilies: Cortex-M cores for microcontrollers and cost-sensitive applications; Thumb-2 instructions supported

Cortex family Cortex family consists of three subfamilies: Cortex-R cores for real-time applications; ARM, Thumb and Thumb-2 instructions supported Cortex-A the most complex and the most powerful cores, for multimedia devices and application processors; ARM, Thumb and Thumb-2 instructions supported

Cortex-M Source: [4]

Cortex-M

Cortex-M Main features: 32-bit processor 3-stage pipelining Thumb-2 instruction set concise and efficient code Many power-saving modes and domains Nested Vectored Interrupt Controller well-defined interrupt invocation times and mechanisms RTOS support Debugger support (JTAG, SWD Serial Wire Debug)

Cortex-M0/M0+ Source: [2]

Cortex-M0/M0+ Source: [2]

Cortex-M0 Main features: The smallest version of the ARM cores The most power-saving version of the ARM cores only 85 µW/MHz Upward compatibility with Cortex-M3 Only 12000 gates Only 56 C-optimized instructions Support for low power wireless communication: Bluetooth Low Energy (BLE), ZigBee, etc. Performance 0.9 DMIPS/MHz Single cycle 32x32 multiply instructions Interrupt execution delay: 16 cycles

Cortex-M0 Source: [2]

Cortex-M0 Processor modes: Thread mode: Used to execute application software. The processor enters Thread mode when it comes out of reset. Handler mode: Used to handle exceptions. The processor returns to Thread mode when it has finished all exception processing.

Cortex-M0 core registers

Cortex-M0 memory map

Cortex-M0 vector table

Cortex-M0 register stacking

Cortex-M1 Source: [2]

Cortex-M1 Main features: Core intended for FPGA applications Support for Actel, Altera and Xilinx chips Easy migration from FPGA (development) to ASIC (production) Source: [2]

Cortex-M1 Main features: A general-purpose 32-bit microprocessor, which executes the ARMv6-M subset of the Thumb-2 instruction set and offers high performance operation and small size in FPGAs. It has: a three-stage pipeline a three-cycle hardware multiplier little-endian format for accessing all memory. A system control block containing memory-mapped control registers. Source: [2]

Cortex-M1 Main features: Integrated Operating System (OS) extensions, including a system timer. An integrated Nested Vectored Interrupt Controller (NVIC) for low-latency interrupt processing. A memory model that supports accesses to both memory and peripheral registers. Integrated and configurable Tightly Coupled Memories (TCMs) Optional debug support. Source: [2]

Cortex-M1 Source: [2]

Cortex-M1 Processor modes as in Cortex-M0 Source: [2]

Cortex-M1 Memory Map Source: [2]

Cortex-M3 Source: [2]

Cortex-M3 Main features: Introduced to the market in 2004 Intended for the most demanding microcontroller applications High performance and many additional features Low power consumption (12.5 DMIPS/mW) Up to 240 interrupt sources Support for many serial protocols

Cortex-M3 Main features: Performance of 1.25 DMIPS/MHz Support for bit operations Single cycle 32x32-bit multiply; 2-12 cycle division Three-stage pipelining with branch prediction Memory Protection Unit (MPU) Max speed: up to 275 MHz / 340 DMIPS

Cortex-M3

Cortex-M3 Core features: Thumb instruction set (ARMv7) Banked Stack Pointer Hardware integer divide instructions Automatic processor state saving and restoration for low latency Interrupt Service Routine (ISR) entry and exit.

Cortex-M3 NVIC (Nested Vectored Interrupt Controller) features: External interrupts, configurable from 1 to 240. Bits of priority, configurable from 3 to 8. Dynamic reprioritization of interrupts. Priority grouping - selection of preempting and non-preempting interrupt levels. Support for tail-chaining and late arrival of interrupts. This enables back-to-back interrupt processing without the overhead of state saving and restoration between interrupts. Processor state automatically saved on interrupt entry, and restored on interrupt exit, with no instruction overhead. Optional Wake-up Interrupt Controller (WIC), providing ultra-low power sleep mode support.
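For orientation, a minimal sketch of how application code typically drives the NVIC through the generic CMSIS-Core functions on a Cortex-M3 part. The device header name, the TIMER0_IRQn interrupt number and the handler name are vendor-specific placeholders invented for this sketch; NVIC_SetPriority and NVIC_EnableIRQ are the standard CMSIS-Core calls.

#include "device.h"   /* placeholder: vendor device header providing IRQn_Type and CMSIS-Core */

void timer_interrupt_setup(void)
{
    /* Lower numeric value = higher priority; how many priority bits are
     * implemented (3 to 8) depends on the silicon vendor, as noted above. */
    NVIC_SetPriority(TIMER0_IRQn, 2);
    NVIC_EnableIRQ(TIMER0_IRQn);
}

/* The handler name must match the device's vector table entry; the processor
 * enters Handler mode here, and register stacking/unstacking is done in
 * hardware, so a plain C function can serve as the ISR. */
void TIMER0_IRQHandler(void)
{
    /* clear the peripheral's interrupt flag, then do the minimal work */
}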

Cortex-M3 MPU features: Eight memory regions. Sub Region Disable (SRD), enabling efficient use of memory regions. The ability to enable a background region that implements the default memory map attributes.

Cortex-M3 Bus interfaces: Three Advanced High-performance Bus-Lite (AHB-Lite) interfaces: ICode, DCode, and System bus interfaces. Private Peripheral Bus (PPB) based on Advanced Peripheral Bus (APB) interface. Bit-band support that includes atomic bit-band write and read operations. Memory access alignment. Write buffer for buffering of write data. Exclusive access transfers for multiprocessor systems.

Cortex-M3 The processor supports two modes of operation, Thread mode and Handler mode: The processor enters Thread mode on Reset, or as a result of an exception return. Privileged and Unprivileged code can run in Thread mode. The processor enters Handler mode as a result of an exception. All code is privileged in Handler mode.

Cortex-M3 The processor can operate in one of two operating states: Thumb state. This is normal execution running 16-bit and 32-bit halfword aligned Thumb instructions. Debug State. This is the state when the processor is in halting debug.

Cortex-M3

Cortex-M3 bit band mapping
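The bit-band mapping can be expressed as a simple address computation. In the Cortex-M3 memory map the lowest 1 MB of the SRAM region (from 0x20000000) and of the peripheral region (from 0x40000000) each have a bit-band alias region (0x22000000 and 0x42000000) in which every 32-bit word corresponds to a single bit of the underlying region: alias word address = alias base + (byte offset x 32) + (bit number x 4). The macro below is a common way to use this from C; the macro and variable names are my own, not CMSIS or ARM names, and the example assumes the variable is placed in the bit-band SRAM region.

#include <stdint.h>

#define BITBAND_SRAM_BASE    0x20000000u
#define BITBAND_SRAM_ALIAS   0x22000000u

/* alias word address = alias base + (byte offset * 32) + (bit number * 4);
 * a single load/store through this address reads or writes one bit atomically */
#define BITBAND_SRAM(addr, bit) \
    (*(volatile uint32_t *)(BITBAND_SRAM_ALIAS + \
        (((uint32_t)(addr) - BITBAND_SRAM_BASE) * 32u) + ((bit) * 4u)))

/* Example: set bit 3 of a flag byte with a single store, with no
 * read-modify-write sequence in software. */
static volatile uint8_t flags_at_sram;   /* assumed to be linked into the SRAM bit-band region */

void set_flag_bit3(void)
{
    BITBAND_SRAM(&flags_at_sram, 3) = 1;
}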

Cortex-M4 Source: [2]

Cortex-M4 Main features: The richest version of the Cortex-M subfamily Intended for low-power digital signal processing applications Integrated 32-bit CPU and DSP Single-precision FPU Other features as in the Cortex-M3 DSP instructions Max speed: up to 300 MHz / 375 DMIPS

Cortex-M4

Cortex-M4 FPU features: 32-bit instructions for single-precision (C float) data-processing operations. Combined Multiply and Accumulate instructions for increased precision (Fused MAC). Hardware support for conversion, addition, subtraction, multiplication with optional accumulate, division, and square-root. Hardware support for denormals and all IEEE rounding modes. 32 dedicated 32-bit single-precision registers, also addressable as 16 double-word registers. Decoupled three-stage pipeline.

Cortex-M4 - FPU FPU registers: sixteen 64-bit doubleword registers, D0-D15, or thirty-two 32-bit single-word registers, S0-S31

Cortex-R4 Source: [2]

Cortex-R4 Main features: A mid-range processor for use in deeply-embedded, real-time systems Includes Thumb-2 technology for optimum code density and processing throughput Integrated 32b CPU and DSP Single precision FPU unit (in versions R4F) ARM and Thumb instructions Tightly-Coupled Memory (TCM) ports for low-latency and deterministic accesses to local RAM, in addition to caches for higher performance to general memory

Cortex-R4 Main features: High-speed Advanced Microprocessor Bus Architecture (AMBA) Advanced eXtensible Interface (AXI) for master and slave interfaces Dynamic branch prediction with a global history buffer, and a 4-entry return stack The ability to implement and use redundant core logic, for example, in fault detection ECC (Error Correcting Codes) - optional single-bit error correction and two-bit error detection for cache and/or TCM memories with ECC bits

Cortex-R4 Main features: A Harvard L1 memory system with: optional Tightly-Coupled Memory (TCM) interfaces with support for error correction or parity checking memories optional caches with support for optional error correction schemes optional ARMv7-R architecture Memory Protection Unit (MPU) optional parity and Error Checking and Correction (ECC) on all RAM blocks. An L2 memory interface: single 64-bit master AXI interface 64-bit slave AXI interface to TCM RAM blocks and cache RAM blocks.

Cortex-R4 Operating modes: User (USR) mode - the usual mode for the execution of ARM or Thumb programs. Fast interrupt (FIQ) mode entered on taking a fast interrupt. Interrupt (IRQ) mode entered on taking a normal interrupt. Supervisor (SVC) mode is a protected mode for the operating system entered on taking a Supervisor Call (SVC), formerly SWI. Abort (ABT) mode entered after a data or instruction abort. System (SYS) mode is a privileged user mode for the operating system.

Cortex-R4 register set

Cortex-R4 status register

Cortex-R5 Source: [2]

Cortex-R5 Main features: Improved (extended) version of the Cortex-R4 processor Added hardware Accelerator Coherency Port (ACP) to reduce the requirement for slow software cache maintenance operations when sharing memory with other masters Added Vector Floating-Point v3 Added Multiprocessing Extensions for multiprocessing functionality Added Low Latency Peripheral Port for integration of latency-sensitive peripherals with the processor

Cortex-R5 Implementation example:

Cortex-R5 VFPv3-D16: The FPU fully supports single-precision and double-precision add, subtract, multiply, divide, multiply-and-accumulate, and square-root operations; provides conversions between fixed-point and floating-point data formats, and floating-point constant instructions; includes 16 double-precision registers

Cortex-R5 Vector instructions:

Cortex-R7 Source: [2]

Cortex-R7 Main features: The highest performing Cortex-R processor On a 40 nm G process the Cortex-R7 processor can be implemented to run at well over 1 GHz, when it delivers over 2700 Dhrystone MIPS of performance On a 28 nm process the performance is estimated to reach 4600 Dhrystone MIPS

Cortex-R7 Main features: Eleven-stage pipeline with instruction prefetch, branch prediction, superscalar and out-of-order execution, and out-of-order completion of divide and floating-point operations 2.53 Dhrystone MIPS/MHz Added LLRAM, a 64-bit low-latency memory port designed specifically to connect to local memory

Thank you for your attention

References
[1] ARM7TDMI core documentation; www.arm.com
[2] www.arm.com
[3] ARM9 family documentation; www.arm.com
[4] Cortex family documentation; www.arm.com
[5] Pipelining: Basic and Intermediate Concepts; http://www.engr.mun.ca/~venky/pipelining.ppt