Embedded System Hardware - Processing (Part II)



12 Embedded System Hardware - Processing (Part II) Jian-Jia Chen (slides are based on Peter Marwedel) Informatik 12, TU Dortmund, Germany Springer, 2010 November 11, 2014 These slides use Microsoft clip arts; Microsoft copyright restrictions apply.

Embedded System Hardware Embedded system hardware is frequently used in a loop ("hardware in the loop") → cyber-physical systems - 2 -

Key requirement #2: Code-size efficiency Overview: http://www-perso.iro.umontreal.ca/~latendre/codecompression/codecompression/node1.html Compression techniques: key idea - 3 -

Code-size efficiency Compression techniques (continued): a 2nd instruction set, e.g. the ARM Thumb instruction set: the 16-bit Thumb instruction ADD Rd, #constant (major opcode 001, minor opcode 10, Rd is both source and destination) is dynamically decoded at run-time into the 32-bit ARM instruction 1110 001 01001 0 Rd 0 Rd 0000 Constant (constant zero-extended). Reduction to 65-70% of the original code size; 130% of ARM performance with 8/16-bit memory; 85% of ARM performance with 32-bit memory. Same approach for LSI TinyRisc; requires support by compiler, assembler etc. [ARM, R. Gupta] - 4 -

Dictionary approach, two-level control store (indirect addressing of instructions): Dictionary-based coding schemes cover a wide range of coders and compressors. Their common feature is that the methods use some kind of dictionary that contains parts of the input sequence which appear frequently. The encoded sequence in turn contains references to the dictionary elements rather than containing these over and over. [Á. Beszédes et al.: Survey of Code-Size Reduction Methods, ACM Computing Surveys, Vol. 35, Sept. 2003, pp. 223-267] - 5 -

Key idea (for d-bit instructions): for each instruction address, a table S (a entries of b bits each, with b « d) contains the address of the actual instruction in a small table of used instructions (the "dictionary", c entries of d bits). Uncompressed storage of a d-bit-wide instructions requires a × d bits; in compressed code, each instruction pattern is stored only once. Hopefully, a × b + c × d < a × d. Called nanoprogramming in the Motorola 68000. - 6 -

Key requirement #3: Run-time efficiency - domain-oriented architectures - Example: filtering in digital signal processing (DSP). [Figure: signal at t = t_s (sampling points)] - 7 -

Key requirement #4: Real-time capability. Timing behavior has to be predictable. Features that cause problems:
- Unpredictable access to shared resources
- Caches with difficult-to-predict replacement strategies
- Unified caches (conflicts between instructions and data)
- Pipelines with difficult-to-predict stall cycles ("bubbles")
- Unpredictable communication times for multiprocessors
- Branch prediction, speculative execution
- Interrupts that are possible at any time
- Memory refreshes that are possible at any time
- Instructions that have data-dependent execution times
→ Try to avoid as many of these as possible. [Dagstuhl workshop on predictability, Nov. 17-19, 2003] - 8 -

12 DSP and Multimedia Processors Springer, 2010

Saturating arithmetic: returns the largest/smallest number in case of over-/underflow. Example (4-bit unsigned): a = 0111, b = 1001; standard wrap-around arithmetic yields (1)0000, saturating arithmetic yields 1111. For (a+b)/2, the correct result is 1000; wrap-around arithmetic (shifted) gives 0000, saturating arithmetic (shifted) gives 0111, which is almost correct. Appropriate for DSP/multimedia applications: no timeliness of results if interrupts are generated for overflows; precise values are less important; wrap-around arithmetic would be worse. - 10 -

On-Site Exercise - 11 -

Fixed-point arithmetic Shifting required after multiplications and divisions in order to maintain binary point. - 12 -

On-Site Exercise - 13 -

Multimedia instructions, short vector extensions, streaming extensions, SIMD instructions: multimedia instructions exploit that many registers, adders etc. are quite wide (32/64 bit), whereas most multimedia data types are narrow → 2-8 values can be stored per register and added. E.g., two 16-bit values a1/a2 and b1/b2 packed into 32-bit registers are added to c1/c2: 2 additions per instruction, no carry at bit 16. A cheap way of using parallelism → SSE instruction set extensions, SIMD instructions - 14 -

Key idea of very long instruction word (VLIW) computers (1): Instructions are included in long instruction packets. Multiple instructions are assumed to be executed in parallel. Fixed association of packet bits with functional units. The compiler is assumed to generate these parallel packets. - 15 -

Key idea of very long instruction word (VLIW) computers (2): The complexity of finding parallelism is moved from the hardware (RISC/CISC processors) to the compiler; ideally, this avoids the overhead (silicon, energy, ...) of identifying parallelism at run-time → high expectations for VLIW machines. However, possibly low code efficiency due to many NOPs → explicitly parallel instruction set computers (EPICs) are an extension of VLIW architectures: parallelism is detected by the compiler, but there is no need to encode the parallelism in one word. - 16 -

EPIC: TMS320C6xxx as an example: 1 bit per instruction encodes the end of parallel execution. For instructions A..G with parallel bits 0 1 1 0 1 1 0, the resulting schedule is: cycle 1: A; cycle 2: B, C, D; cycle 3: E, F, G. Instructions B, C and D use disjoint functional units, cross paths and other data path resources; the same is true for E, F and G. - 17 -

Partitioned register files: Many memory ports are required to supply enough operands per cycle, but memories with many ports are expensive → registers are partitioned into (typically 2) sets, e.g. for the TI C6xxx: - 18 -

Microcontrollers - MHS 80C51 as an example:
- 8-bit CPU optimised for control applications
- Extensive Boolean processing capabilities
- 64 K program memory address space
- 64 K data memory address space
- 4 K bytes of on-chip program memory
- 128 bytes of on-chip data RAM
- 32 bi-directional and individually addressable I/O lines
- Two 16-bit timers/counters
- Full duplex UART
- 6-source/5-vector interrupt structure with 2 priority levels
- On-chip clock oscillator
- Very popular CPU with many different variations
- 19 -

Energy Efficiency of FPGAs [Hugo De Man, IMEC, Philips, 2007] - 20 -

Reconfigurable Logic: Custom HW may be too expensive, SW too slow. Combine the speed of HW with the flexibility of SW → HW with programmable functions and interconnect → use of configurable hardware; common form: field programmable gate arrays (FPGAs). Applications: algorithms like en-/decryption, pattern matching in bioinformatics, high-speed event filtering (high-energy physics), high-speed special-purpose hardware. Very popular devices from Xilinx, Actel, Altera and others - 21 -

Fine-Grained Reconfigurable Fabric: CLB: Configurable Logic Block; PSM: Programmable Switch Matrix. Additionally: I/O blocks, RAM blocks, multipliers, CPUs, ... (src: Xilinx data sheets) - 22 -

Fine-Grained Reconfigurable Fabric: The logic layer performs the computations; the configuration layer determines which computations to perform. Partial reconfiguration: only parts of the fabric are reconfigured. Configuration data is loaded from (off-chip) configuration memory into the configuration layer. (src: Kalenteridis et al.: A complete platform and toolset for system implementation on fine-grained reconfigurable hardware, Microprocessors and Microsystems, 2004) - 23 -

Coarse-Grained Reconfigurable Fabric: Pros: reduced area, higher frequency, reduced reconfiguration latency. Cons: reduced flexibility, area wastage in case computations do not suit the fabric (e.g. bit-level operations, bit shuffling). (src: PACT's XPP 64-A1 architecture) - 24 -

Reconfigurable Computing: an example. A runtime-reconfigurable processor combines a core processor (CISA) with reconfigurable HW: the FPGA is partitioned into containers, and special instructions (SIs) are executed on the core processor while a reconfiguration is in progress. Over time, the fabric is reconfigured to execute ME, then EE, then LF (LEGEND: ME = Motion Estimation, EE = Encoding Engine, LF = Loop Filter). + Efficient use of hardware: a high degree of parallelism and multiplexed use of the reconfigurable fabric => energy efficient. - Reconfiguration latency and energy. - 25 -

Summary: Processing: VLIW/EPIC processors, MPSoCs, FPGAs - 26 -

Spare Slides - 27 -

Filtering in digital signal processing - ADSP 2100:
-- outer loop over sampling times t_s
{ MR:=0; A1:=1; A2:=s-1; MX:=w[s]; MY:=a[0];
  for (k=0; k <= (n-1); k++)
  { MR:=MR + MX * MY; MX:=w[A2]; MY:=a[A1]; A1++; A2--; }
  x[s]:=MR;
}
Maps nicely onto the architecture. - 28 -

DSP processors: multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions. MR:=0; A1:=1; A2:=s-1; MX:=w[s]; MY:=a[0]; for (k:=0; k <= n-1) {MR:=MR+MX*MY; MY:=a[A1]; MX:=w[A2]; A1++; A2--}. The loop body is a single multiply/accumulate (MAC) instruction; a zero-overhead loop (ZOL) instruction precedes the MAC instruction, so loop testing is done in parallel to the MAC operations. - 29 -

Heterogeneous registers. Example (ADSP 210x): address registers A0, A1, A2, ... feed the address generation unit (AGU); in the data path, AX/AY feed the ALU (+,-,..) with results in AR/AF, and MX/MY feed the multiplier (*, +,-) with results in MR/MF. Different functionality of the registers An, AX, AY, AF, MX, MY, MF, MR. - 30 -

Separate address generation units (AGUs). Example (ADSP 210x): data memory can only be fetched with an address contained in A, but this can be done in parallel with an operation in the main data path (takes effectively 0 time). A := A ± 1 also takes 0 time, same for A := A ± M; A := <immediate in instruction> requires an extra instruction → minimize load immediates → optimization in the optimization chapter. - 31 -

Modulo addressing: Am++ performs Am := (Am+1) mod n (implements a ring or circular buffer in memory), providing a sliding window over the n most recent values w[t1-n+1], ..., w[t1-1], w[t1]. At t2 = t1+1, the new sample w[t1+1] simply overwrites the oldest entry w[t1-n+1]; all other entries stay in place. - 32 -