12 Embedded System Hardware - Processing - Peter Marwedel, Informatik 12, TU Dortmund, Germany. Graphics: Alexandra Nolte, Gesine Marwedel, 2003. November 15, 2010. These slides use Microsoft clip arts; Microsoft copyright restrictions apply.
Key idea of very long instruction word (VLIW) computers: Instructions are included in long instruction packets. Instruction packets are assumed to be executed in parallel. There is a fixed association of packet bits with functional units.
Very long instruction word (VLIW) architectures: A very long instruction word (instruction packet) contains several instructions, all of which are assumed to be executed in parallel. The compiler is assumed to generate these parallel packets. The complexity of finding parallelism is moved from the hardware (RISC/CISC processors) to the compiler; ideally, this avoids the overhead (silicon, energy, ...) of identifying parallelism at run-time. High expectations were placed on VLIW machines. Explicitly parallel instruction set computers (EPICs) are an extension of VLIW architectures: parallelism is detected by the compiler, but there is no need to encode the parallelism in a single word. A sketch of the "fixed slot per functional unit" idea follows below.
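To make the fixed association of packet slots with functional units concrete, here is a minimal C sketch of a hypothetical 4-slot VLIW packet; the unit names and issue functions are illustrative only and do not correspond to any real instruction set.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical 4-slot VLIW packet: each slot is permanently wired to one
     * functional unit, so no run-time logic is needed to discover parallelism. */
    typedef struct {
        uint32_t alu0_op;   /* slot 0: always executed by ALU 0               */
        uint32_t alu1_op;   /* slot 1: always executed by ALU 1               */
        uint32_t mul_op;    /* slot 2: always executed by the multiplier      */
        uint32_t ldst_op;   /* slot 3: always executed by the load/store unit */
    } vliw_packet_t;

    /* Stub "functional units" that just report what they were handed. */
    static void unit_exec(const char *unit, uint32_t op) {
        printf("%s executes opcode 0x%08x\n", unit, op);
    }

    /* One fetch delivers the whole packet; all four operations
     * (or NOPs in unused slots) start in the same cycle. */
    static void issue(const vliw_packet_t *p) {
        unit_exec("ALU0", p->alu0_op);
        unit_exec("ALU1", p->alu1_op);
        unit_exec("MUL ", p->mul_op);
        unit_exec("LDST", p->ldst_op);
    }

    int main(void) {
        vliw_packet_t packet = { 0x11, 0x22, 0x33, 0x44 };  /* dummy opcodes */
        issue(&packet);
        return 0;
    }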
EPIC: TMS 320C6xx as an example. A bit in each instruction encodes the end of parallel execution. Example: seven 32-bit instructions A..G with parallel bits 0, 1, 1, 0, 1, 1, 0 are executed as: cycle 1: A; cycle 2: B, C, D; cycle 3: E, F, G. Instructions B, C and D use disjoint functional units, cross paths and other data path resources; the same is also true for E, F and G. Parallel execution cannot span several packets.
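A simplified sketch of this grouping rule in C, assuming only the bottom bit of each instruction word matters (real C6x devices add further placement restrictions, e.g. a packet may not cross a fetch-packet boundary and must use disjoint units):

    #include <stdint.h>
    #include <stdio.h>

    /* Bit 0 of each 32-bit instruction word tells whether the NEXT instruction
     * belongs to the same execute packet (1 = parallel, 0 = packet ends here). */
    static void print_execute_packets(const uint32_t *instr, int n) {
        int cycle = 1;
        printf("cycle %d:", cycle);
        for (int i = 0; i < n; i++) {
            printf(" %c", 'A' + i);                  /* label instruction i */
            if ((instr[i] & 1u) == 0 && i + 1 < n)   /* bit 0 = 0: end of packet */
                printf("\ncycle %d:", ++cycle);
        }
        printf("\n");
    }

    int main(void) {
        /* parallel bits 0,1,1,0,1,1,0 as on the slide: A | B C D | E F G */
        uint32_t code[] = { 0, 1, 1, 0, 1, 1, 0 };
        print_execute_packets(code, 7);
        return 0;
    }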
Partitioned register files: Many register-file (memory) ports are required to supply enough operands per cycle, but memories with many ports are expensive. Therefore, the registers are partitioned into (typically two) sets, e.g. for the TI C6x.
More encoding flexibility with IA-64 Itanium. 3 instructions per bundle: a 128-bit bundle (bits 127..0) contains three instruction slots plus a template field carrying the instruction-grouping information. There are 5 instruction types: A: common ALU instructions; I: more special integer instructions (e.g. shifts); M: memory instructions; F: floating point instructions; B: branches. The following combinations can be encoded in templates: MII, MMI, MFI, MIB, MMB, MFB, MMF, MBB, BBB, MLX (with LX = move 64-bit immediate, encoded in 2 slots).
Templates and instruction types: The end of parallel execution is marked by stops, denoted by underscores. Example (sequence of bundles): MMI M_II MFI_ MII MMI MIB_ . The stops split this sequence into three instruction groups: group 1 = M M I M (up to the first stop), group 2 = I I M F I, group 3 = M I I M M I M I B. Only a very restricted placement of stops within a bundle is allowed. Parallel execution is possible within groups, and parallel execution can span several bundles.
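As a small illustration (a sketch of the grouping rule, not of real decode hardware), the following C fragment walks over the template mnemonics from the example above and prints the instruction groups delimited by the stops:

    #include <stdio.h>

    /* Stops ('_' in the template mnemonics) end an instruction group; everything
     * between two stops may execute in parallel, and groups can span bundles. */
    int main(void) {
        const char *bundles[] = {"MMI", "M_II", "MFI_", "MII", "MMI", "MIB_"};
        int group = 0, need_header = 1;
        for (unsigned b = 0; b < sizeof bundles / sizeof bundles[0]; b++) {
            for (const char *p = bundles[b]; *p; p++) {
                if (*p == '_') {            /* stop: the current group is complete */
                    need_header = 1;
                    continue;
                }
                if (need_header) {          /* first slot of a new group */
                    printf("%sgroup %d:", group ? "\n" : "", ++group);
                    need_header = 0;
                }
                printf(" %c", *p);          /* slot type (M/I/F/B) within the group */
            }
        }
        printf("\n");
        return 0;
    }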
Instruction types are mapped to functional unit types. There are 4 functional unit (FU) types: M: memory unit; I: integer unit; F: floating-point unit; B: branch unit. Each instruction type corresponds to the FU type of the same name, except type A, which maps to either I- or M-functional units.
Implementation example: Itanium 2 (2003). 410M transistors, 374 mm² die size, 6 MB on-die L3 cache, 1.5 GHz at 1.3 V. [ftp://download.intel.com/design/itanium2/download/madison_slides_r1.pdf] Intel, 2003.
Philips TriMedia processor: for multimedia applications, up to 5 instructions/cycle. http://www.nxp.com/acrobat/datasheets/pnx15xx_ser_n_3.pdf (incompatible with Firefox?) NXP
Large number of delay slots: a problem of VLIW processors. [figure: pipeline with the instructions add, sub, and, or, sub, mult, xor, div, ld, st, mv already in flight when the branch beq is resolved] The execution of many instructions has already been started before it is realized that a branch is required. Nullifying those instructions would waste compute power, so executing them is declared a feature, not a bug (branch delay slots). How can all delay slots be filled with useful instructions? Avoid branches wherever possible.
Predicated execution: implementing IF-statements branch-free. A conditional instruction [c] I consists of a condition c and an instruction I: c = true => I is executed; c = false => NOP.
Predicated execution: implementing IF-statements branch-free (TI C6x).
C source:
    if (c) { a = x + y; b = x + z; }
    else   { a = x - y; b = x - z; }
Conditional branch version (max. 12 cycles):
    [c] B L1
        NOP 5
        B L2
        NOP 4
        SUB x,y,a
        SUB x,z,b
    L1: ADD x,y,a
        ADD x,z,b
    L2:
Predicated execution version (1 cycle):
    [c]  ADD x,y,a
    [c]  ADD x,z,b
    [!c] SUB x,y,a
    [!c] SUB x,z,b
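At the C level, the same transformation can be sketched as follows; this is only an illustration of the idea (compilers for predicated ISAs perform it at the machine-code level), and the function names are made up:

    #include <stdio.h>

    /* Branchy version, corresponding to the conditional-branch code above. */
    void compute(int c, int x, int y, int z, int *a, int *b) {
        if (c) { *a = x + y; *b = x + z; }
        else   { *a = x - y; *b = x - z; }
    }

    /* Branch-free version: both alternatives are computed and the condition only
     * selects the results - exactly what the predicated [c]/[!c] instructions do. */
    void compute_predicated(int c, int x, int y, int z, int *a, int *b) {
        int a_add = x + y, b_add = x + z;   /* results of the [c]  ADDs */
        int a_sub = x - y, b_sub = x - z;   /* results of the [!c] SUBs */
        *a = c ? a_add : a_sub;
        *b = c ? b_add : b_sub;
    }

    int main(void) {
        int a, b;
        compute_predicated(1, 10, 3, 4, &a, &b);
        printf("a=%d b=%d\n", a, b);        /* prints a=13 b=14 */
        return 0;
    }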
Microcontrollers - MHS 80C51 as an example. Features for embedded systems:
8-bit CPU optimised for control applications;
extensive Boolean processing capabilities;
64 KB program memory address space;
64 KB data memory address space;
4 KB of on-chip program memory;
128 bytes of on-chip data RAM;
32 bi-directional and individually addressable I/O lines;
two 16-bit timers/counters;
full duplex UART;
6-source/5-vector interrupt structure with 2 priority levels;
on-chip clock oscillator.
A very popular CPU with many different variations.
Trend: multiprocessor systems-on-a-chip (MPSoCs) http://www.mpsoc-forum.org/2007/slides/hattori.pdf
Multiprocessor systems-on-a-chip (MPSoCs) (2) http://www.mpsoc-forum.org/2007/slides/hattori.pdf
Multiprocessor systems-on-a-chip (MPSoCs) (3) http://www.mpsoc-forum.org/2007/slides/hattori.pdf
Multiprocessor systems-on-a-chip (MPSoCs) (4): reaching roughly 50% of the inherent power efficiency of silicon. Hugo De Man, IMEC, 2007.
12 Embedded System Hardware - Reconfigurable Hardware - Peter Marwedel, Informatik 12, TU Dortmund, Germany. Graphics: Alexandra Nolte, Gesine Marwedel, 2003. June 12, 2010. These slides use Microsoft clip arts; Microsoft copyright restrictions apply.
Energy efficiency of FPGAs. [figure: energy efficiency compared to the inherent power efficiency of silicon] Hugo De Man, IMEC, Philips, 2007.
Reconfigurable logic: Full custom chips may be too expensive, software too slow. Combine the speed of HW with the flexibility of SW: HW with programmable functions and interconnect, i.e. configurable hardware; the most common form is field programmable gate arrays (FPGAs). Applications: bit-oriented algorithms like encryption, fast object recognition (medical and military), adapting mobile phones to different standards. Very popular devices come from XILINX (XILINX Virtex II are recent devices), Actel, Altera and others.
Floor-plan of Virtex II FPGAs. More recent: Virtex 5, but no floor-plan found for Virtex 5.
Virtex 5 Configurable Logic Block (CLB)
Virtex 5 Slice (simplified): The memories are typically used as look-up tables (LUTs) to implement any Boolean function of 6 variables.
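To illustrate what "any Boolean function of 6 variables" means, here is a small C model of a 6-input LUT (a sketch of the principle, not of the actual Xilinx circuit): the configuration provides a 64-bit truth table, and the six inputs simply form the index of the bit that is read out.

    #include <stdint.h>
    #include <stdio.h>

    /* 6-input LUT model: the truth table is the "configuration", the inputs
     * select one of its 64 bits. */
    static int lut6(uint64_t truth_table, unsigned a, unsigned b, unsigned c,
                    unsigned d, unsigned e, unsigned f) {
        unsigned idx = (a & 1) | (b & 1) << 1 | (c & 1) << 2 |
                       (d & 1) << 3 | (e & 1) << 4 | (f & 1) << 5;
        return (truth_table >> idx) & 1;
    }

    int main(void) {
        /* Example configuration: 6-input AND (only index 63 yields 1). */
        uint64_t and6 = 1ULL << 63;
        printf("%d %d\n", lut6(and6, 1, 1, 1, 1, 1, 1),   /* -> 1 */
                          lut6(and6, 1, 0, 1, 1, 1, 1));  /* -> 0 */
        return 0;
    }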
Virtex 5 SliceM: the SliceM variant additionally supports using its memories for storing data and as shift registers.
Resources available in Virtex 5 devices. [Source: Xilinx Inc.: Virtex 5 FPGA User Guide, May 2009, www.xilinx.com]
Interconnect for Virtex II: hierarchical routing resources. No routing plan found for Virtex 5.
Virtex II Pro devices include up to 4 PowerPC processor cores; Virtex 5 devices include up to 2 PowerPC processor cores. [Source: Xilinx Inc.: Virtex-II Pro Platform FPGAs: Functional Description, Sept. 2002, www.xilinx.com]
12 Memory Peter Marwedel Informatik 12 TU Dortmund Germany 2009/11/22
Memory. For the memory, efficiency is again a concern: speed (latency and throughput) and predictable timing; energy efficiency; size; cost; other attributes (volatile vs. persistent, etc.).
Access times and energy consumption increase with the size of the memory (example: CACTI model). "Currently, the size of some applications is doubling every 10 months" [STMicroelectronics, Medea+ Workshop, Stuttgart, Nov. 2003]
Access times and energy consumption for multi-ported register files. [plots: cycle time (ns), area (λ² x10⁶) and power (W) versus register file size (16, 32, 64, 128 registers), for configurations GP6M2 and GP6M3] Rixner et al.'s model [HPCA'00], 0.18 µm technology. Source: ... and H. Valero, 2001.
The memory system frequently consumes more than 50% of the energy used for processing. [pie charts: cache ($)-less monoprocessor: processor energy 29%, main memory energy 71%; multiprocessor with cache ($): energy split among processor, I-cache, D-cache and main memory, with shares of 51.9%, 28.1%, 14.8% and 5.2% as shown in the original charts] Average over 200 benchmarks analyzed by Verma (U. Dortmund). [M. Verma, P. Marwedel: Advanced Memory Optimization Techniques for Low-Power Embedded Processors, Springer, 2007]
Similar information according to other sources.
StrongARM [IEEE Journal of Solid-State Circuits, Nov. '96]: Icache 26%, Ibox 18%, Dcache 16%, Clock 10%, IMMU 9%, EBOX 8%, DMMU 8%, others 5%. [Based on a slide by Osman S. Unsal, Israel Koren, C. Mani Krishna, Csaba Andras Moritz, U. of Massachusetts, Amherst, 2001]
ARM9: I-Cache 25%, arm9 core 25%, D-Cache 19%, BIU 8%, D-MMU 5%, I-MMU 4%, Clocks 4%, other 4%, SysCtl 3%, CP15 2%, PATag RAM 1%. [Segars '01 according to Vahid@ISSS'01]
Energy consumption in mobile devices. [O. Vargas (Infineon Technologies): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design, September 2005] Thanks to Thorsten Koch (Nokia / Univ. Dortmund) for providing this source.
Trends for the speeds: the speed gap between processor and main DRAM increases. [plot: relative speed versus years; CPU performance grows by a factor of 1.5-2 per year (roughly 2x every 2 years), DRAM speed by only a factor of about 1.07 per year] Similar problems also exist for embedded systems & MPSoCs. In the future: memory access times >> processor cycle times ("memory wall" problem). [P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]
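As a rough back-of-the-envelope check of these growth rates: if CPU performance improves by about 1.5x per year while DRAM speed improves by about 1.07x per year, the gap widens by roughly 1.5 / 1.07 ≈ 1.4x per year, i.e. by about 1.4^10 ≈ 29x over a decade, which is why memory access times increasingly dominate processor cycle times.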
Set-associative cache (n-way cache; here n = 2). [figure: the address is split into a tag and an index; the index selects one set; each of the two ways stores tags and data blocks; the stored tags are compared in parallel with the address tag, and a multiplexer selects the data block of the matching way]
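The following C fragment is a minimal software model of such a 2-way lookup; the sizes and field widths are illustrative and not taken from any particular processor.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_SETS   256              /* 8 index bits  */
    #define BLOCK_SIZE 32               /* 5 offset bits */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    } cache_line_t;

    static cache_line_t cache[NUM_SETS][2];               /* 2 ways per set */

    /* Returns a pointer to the hit line, or NULL on a miss. */
    static cache_line_t *lookup(uint32_t addr) {
        uint32_t index = (addr / BLOCK_SIZE) % NUM_SETS;  /* selects the set        */
        uint32_t tag   =  addr / (BLOCK_SIZE * NUM_SETS); /* compared in each way   */
        for (int way = 0; way < 2; way++) {               /* one comparator per way */
            cache_line_t *line = &cache[index][way];
            if (line->valid && line->tag == tag)
                return line;                              /* mux would pick this way */
        }
        return NULL;
    }

    int main(void) {
        uint32_t addr = 0x1234ABCD;
        /* Simulate a fill of way 0 of the addressed set, then look the address up. */
        cache[(addr / BLOCK_SIZE) % NUM_SETS][0] =
            (cache_line_t){ .valid = true, .tag = addr / (BLOCK_SIZE * NUM_SETS) };
        printf("hit: %s\n", lookup(addr) ? "yes" : "no"); /* prints "hit: yes" */
        return 0;
    }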
Hierarchical memories using scratch pad memories (SPM). The SPM is a small, physically separate memory that is mapped into the address space. [figure: memory hierarchy with processor, SPM and main memory; the SPM occupies a small region (e.g. addresses 0 to 0xFFF..) of the address space; no tag memory is needed] The SPM is selected by an appropriate address decoder (simple!). Example: ARM7TDMI cores, well known for low power consumption.
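A sketch of this "simple address decoder" in C terms (SPM_BASE and SPM_SIZE are illustrative values, not taken from any datasheet): a single range compare decides whether an access goes to the SPM or to main memory, so no tags and no per-line comparators are needed.

    #include <stdint.h>
    #include <stdbool.h>

    #define SPM_BASE 0x04000000u          /* hypothetical base address of the SPM */
    #define SPM_SIZE 0x00001000u          /* hypothetical SPM size: 4 KiB         */

    /* The hardware decoder reduces to one address range check. */
    static inline bool select_spm(uint32_t addr) {
        return addr >= SPM_BASE && addr < SPM_BASE + SPM_SIZE;
    }

The compiler (or the programmer, e.g. via linker sections) then places the most frequently accessed code and data objects into this address range.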
Comparison of currents using measurements, e.g. an ATMEL board with ARM7TDMI and external SRAM; current drawn by a 32-bit load instruction (Thumb). [bar chart, values in mA for the configurations Prog Main/Data Main, Prog Main/Data SPM, Prog SPM/Data Main, Prog SPM/Data SPM: main memory current 116, 77.2, 82.2 and 1.16 mA; core+SPM current 48.2, 50.9, 44.4 and 53.1 mA]
Why not just use a cache? 2. Energy is required for the parallel access of the sets and in the comparators and multiplexers. [plot: energy per access (nJ) versus memory size (256 bytes to 16 KB) for a scratch pad and for 2-way set-associative caches covering 1 MB, 16 MB and 4 GB address spaces; the scratch pad needs significantly less energy per access] [R. Banakar, S. Steinke, B.-S. Lee, 2001]
Influence of the associativity. Parameters different from previous slides. [P. Marwedel et al., ASPDAC, 2004]
Summary. Processing: VLIW/EPIC processors, MPSoCs, FPGAs. Memories: small is beautiful (in terms of energy consumption, access times, size).