A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey



Similar documents
ARM Microprocessor and ARM-Based Microcontrollers

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

ARM Cortex-A9 MPCore Multicore Processor Hierarchical Implementation with IC Compiler

The ARM Cortex-A9 Processors

SOC architecture and design

Which ARM Cortex Core Is Right for Your Application: A, R or M?

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

Application Performance Analysis of the Cortex-A9 MPCore

Cortex-A9 MPCore Software Development

The ARM Architecture. With a focus on v7a and Cortex-A8

Chapter 1 Computer System Overview

The Future of the ARM Processor in Military Operations

Design Cycle for Microprocessors

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

ARM Processors and the Internet of Things. Joseph Yiu Senior Embedded Technology Specialist, ARM

Applied Micro development platform. ZT Systems (ST based) HP Redstone platform. Mitac Dell Copper platform. ARM in Servers

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

ZigBee Technology Overview

All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

What is a System on a Chip?

CISC, RISC, and DSP Microprocessors

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

High Performance or Cycle Accuracy?

Industry First X86-based Single Board Computer JaguarBoard Released

MPSoC Designs: Driving Memory and Storage Management IP to Critical Importance

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

7a. System-on-chip design and prototyping platforms

İSTANBUL AYDIN UNIVERSITY

I vantaggi dell?utilizzo di JAVA nella strategia M2M

Generations of the computer. processors.

ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture

Architecture and Implementation of the ARM Cortex -A8 Microprocessor

SPARC64 VIIIfx: CPU for the K computer

GPU Architecture. Michael Doggett ATI

Getting Started with RemoteFX in Windows Embedded Compact 7

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

CSE597a - Cell Phone OS Security. Cellphone Hardware. William Enck Prof. Patrick McDaniel

Processor Architectures

Lesson 7: SYSTEM-ON. SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY. Chapter-1L07: "Embedded Systems - ", Raj Kamal, Publs.: McGraw-Hill Education

FLIX: Fast Relief for Performance-Hungry Embedded Applications

Mobile Processors: Future Trends

NVIDIA Tegra 4 Family CPU Architecture

ARM Processor Evolution

Architectures and Platforms

How To Understand The Design Of A Microprocessor

Five Families of ARM Processor IP

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

AMD PhenomII. Architecture for Multimedia System Prof. Cristina Silvano. Group Member: Nazanin Vahabi Kosar Tayebani

Instruction Set Design

Rethinking SIMD Vectorization for In-Memory Databases

Week 1 out-of-class notes, discussions and sample problems

Low Power AMD Athlon 64 and AMD Opteron Processors

ARM Architecture. ARM history. Why ARM? ARM Ltd developed by Acorn computers. Computer Organization and Assembly Languages Yung-Yu Chuang

Choosing the Right DSP for High-Resolution Imaging in Mobile and Wearable Applications

OC By Arsene Fansi T. POLIMI

Introducción. Diseño de sistemas digitales.1


Driving Embedded Innovation with ARM Ecosystem

ELEC 5260/6260/6266 Embedded Computing Systems

How To Improve Performance On A P4080 Processor

picojava TM : A Hardware Implementation of the Java Virtual Machine

Introduction to Cloud Computing

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Introduction to Silicon Labs. November 2015

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Power Reduction Techniques in the SoC Clock Network. Clock Power

Intel Application Software Development Tool Suite 2.2 for Intel Atom processor. In-Depth

System Design Issues in Embedded Processing

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Secured Embedded Many-Core Accelerator for Big Data Processing

Java and Real Time Storage Applications

ARM Webinar series. ARM Based SoC. Abey Thomas

VDI Clients. Delivering Tomorrow's Virtual Desktop Today

Introduction to Microprocessors

Performance of Host Identity Protocol on Nokia Internet Tablet

Computer Architectures

big.little Technology Moves Towards Fully Heterogeneous Global Task Scheduling Improving Energy Efficiency and Performance in Mobile Devices

PC Solutions That Mean Business

Feb.2012 Benefits of the big.little Architecture

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

QUESTIONS & ANSWERS. ItB tender 72-09: IT Equipment. Elections Project

QorIQ T4 Family of Processors. Our highest performance processor family. freescale.com

Software based Finite State Machine (FSM) with general purpose processors

ARM Cortex-R Architecture

GPUs for Scientific Computing

NVIDIA GeForce GTX 580 GPU Datasheet

Intel Labs at ISSCC Copyright Intel Corporation 2012

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Standardization with ARM on COM Qseven. Zeljko Loncaric, Marketing engineer congatec

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Next Generation GPU Architecture Code-named Fermi

Transcription:

A Survey on ARM Cortex A Processors Wei Wang Tanima Dey 1

Overview of ARM Processors Focusing on Cortex A9 & Cortex A15 ARM ships no processors but only IP cores For SoC integration Targeting markets: Netbooks, tablets, smart phones, game console Digital Home Entertainment Home and Web 2.0 Servers Wireless Infrastructure Design Goals Performance, Power, Easy Synthesis 2

ARM Cortex A9/A15 1-4 Cores Out-of-Order Superscalar Branch predicator 32KB L1 I/D caches ~4MB L2 caches with Coherency NEON(SIMD) & FPU 32/28nm (A15) 45nm (A9) 3

Texas Instrument OMAP5 4

Comparison of ARM, Atom, i7 Cortex A15 (no L2, 32nm) Cortex A9 (no L2, 40nm ) Atom N270 (45nm) I7 960 (45nm) Number of Cores 2 (4 maximum) 2 (4 maximum) 1 Core, 2 HT threads 4 Cores, 8 HT threads Frequency 1Ghz 2.5 Ghz 800Mhz (Po) 2Ghz (Per) 1.6 Ghz 3.2 Ghz Out-of-Order? Yes Yes No Yes L1 cache size 32KB I/D 32KB I/D 32KB I/D 32KB I/D L2 cache size N/A N/A 512KB 1MB + 8MB L3 Issue Width 4 4 2 4? Pipeline Stages? 8 16 14 ~ 24 (?) Supply Voltage? 1.05V (Per) 0.9 1.1625 V 0.8-1.375 V Transistor Count? 26,00,000? 47,000,000 731,000,000 Die size? 4.6 mm2 (Po) 6.7 mm2 (Per) 26 mm2 263 mm2 Power Consumption? 0.5 W (Po) 1.9 W (Per) 2.5W (TDP) 130W (TDP) 5

Comparison of ARM SoC, Atom, i7 TI OMAP5 (28nm) Nvidia Tegra 2 (40nm) Atom N450 (45nm) I7 2600S (32nm) CPU Cores 2 x A15 2 x M4 2 x A9 1 Core, 2 HT threads 4 Cores, 8 HT threads CPU Freq. 2Ghz (A15) 1Ghz 1.66Ghz 2.6Ghz GPUs ASICs Video, Audio, Encryption, Display, 2D/3D 8x GPUs, Audio, Video, ISP 1 GPU 1 GPU L2? 1MB 512KB 1MB+8MB Die Size? 49mm2 66mm2? Transistors? 260,000,000 123,000,000? Package Size 17 x 17 mm2 23 x 23 mm2 22 x 22 mm2 37.5 x 37.5 mm2 Power Consumption? 150~500mW? 5.5W (TDP) 65W (TDP) 6

Power/Performance Optimization as a SoC Application-specific SoC design Integrate different ASICs Customize Cortex Processors Reduced memory bandwidth & frequency Mixing High Vt / Low Vt transistors Twisting floorplan, routing, clock tree design Power gating/clock gating/dvfs Four modes: Run, Standby, Dormant, Shutdown Fine-grained pipeline shutdown Faster register save and restore (state save/restore) Power domains & voltage domains 7

Power Saving as SoC: Power Gating Different power domains Cores NEON/VFP Debug Interface L2 cache tags (per bank) L2 cache control Interrupt Controllers Impact of power gating 3% reduction in performance 2% increase in area 4% increase in dynamic power 95% decrease in power when turned off 8

Power/Performance as a CPU Performance Enhancement (power hungry techniques) Dynamic issue design 4-way superscalar Complex Branch predictor Large L1/L2 caches Power savings Accurate branch prediction Micro TLB RISC SIMD, Jazzelle RCT etc. 9

ARM Instruction Set Architecture ARM processor architecture supports 32-bit ARM and 16-bit Thumb ISAs ARM architecture -- RISC architecture Large uniform register file Load/store architecture Simple addressing modes Auto-increment and auto-decrement addressing modes Load and Store multiple instructions Instructions can also be "conditionalised" based on condition code in Application Program Status Register 10

ARM Instruction Set Architecture Thumb Extension to the 32-bit ARM architecture Features a subset of the most commonly used 32-bit ARM instructions compressed into 16-bit opcodes Excellent code-density for minimal system memory size, reduced cost and power efficiency Designers have the flexibility to emphasize performance or code size "Thumb-aware" core is a standard ARM processor fitted with a Thumb decompressor in the instruction pipeline ARM uses the Universal Assembly Language 11

ISA extension DSP Features: new instructions to load and store pairs of registers, 2-3 x DSP performance improvement over ARM7 Eliminates the need for additional hardware accelerators Provides high performance solution with low power consumption Reuses existing OS and application code Supports including servo motor control, Voice over IP (VOIP) and video & audio codecs 12

SIMD 75% higher performance for multimedia processing in embedded devices Near zero" increase in power consumption Simultaneous computation of 2x16-bit or 4x8-bit operands Offers single tool-chain and processing device, transparent of OS 13

NEON Cleanly architected and works seamlessly with its own independent pipeline and register file Large NEON register file with its dual 128-bit/64- bit views enables efficient handling of data Minimizes access to memory, enhancing data throughput Designed for autovectorizing compilers and hand coding Provides flexible and powerful acceleration for consumer multimedia applications Supports the widest range of multimedia codecs used for internet applications 14

NEON 15

Vector Floating Point Architecture Coprocessor extension to the ARM architecture Supports floating point operations in half-, single- and double-precision floating point arithmetic Fully IEEE 754 compliant with full software library support Supports execution of short vector instructions but these operate on each vector element sequentially Three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications 16

Jazzelle Combined hardware and software solution for accelerating execution Software -- fully featured multi-tasking JVM Hardware -- coprocessor CP14 provides support for the hardware acceleration Jazelle DBX technology for direct bytecode execution Direct interpretation bytecode to machine code Jazelle RCT technology supports efficient AOT and JIT compilation with and beyond Java 17

Jazzelle Jazelle DBX and RCT are cache and memory efficient, maintaining low power Jazelle DBX is a robust and proven solution and easy to integrate Jazelle RCT provides an excellent target for any runtime compilation technology Developers Flexibility Resource constraint device: Jazelle DBX only On high-end platforms, Jazelle RCT alone with JIT and AOT 18

Conclusion Aggressive power hungry design targeting at high single thread performance Out-of-Order Execution Wide superscalar Large caches with coherency protocols Power saving techniques for ARM CPUs RISC ISA Optimization: Thumb, Thumb2, ThumbEE Application-Specific Components: SIMD, DSP, VFPUs, Jazzelle Power saving techniques for SoC chips Fine-grained power gating & clock gating & DVFS Fine-grained pipeline shutdown fast registers saving/restoring Customizable CPU components Mixing high Vt and low Vt transistors 19

Reading materials ARM Cortex-A9 Technical Reference Manual ARM Cortex-A9 MPCore Technical Reference Manual Keys to Silicon Realization of Gigahertz Performance and Low Power ARM Cortex-A15, Lamber A. et. al., ARM Technology Conference 2010 2GHz Capable Cortex-A9 Dual Core Processor Implementation, http://www.arm.com/files/downloads/osprey_analyst_presentation_v2a.pdf Circuit Design: High performance AND low power, the ARM way, http://www.arm.com/files/downloads/enabling_high_performance_cpu_implementation.pdf ARM MPCore Architecture Performance Enhancement, http://www.arm.com/files/downloads/mpf_2008_japan_-_arm_cortex-a9_final.pdf Cortex-A9 Processor Microarchitecture, http://www.arm.com/files/downloads/cortex- A9_Devcon_2007_Microarchitecture.pdf Details of a New Cortex Processor, Revealed,http://www.arm.com/files/downloads/Cortex- A9_Devcon-talk_Introduction_FINAL-02.pdf ARM Cortex-A9 Performance, http://www.arm.com/products/processors/cortex-a/cortex-a9.php 20