Chapter 2 Heterogeneous Multicore Architecture



Similar documents
Chapter 1 Computer System Overview

Discovering Computers Living in a Digital World

Chapter 2 Logic Gates and Introduction to Computer Architecture

Chapter 4 System Unit Components. Discovering Computers Your Interactive Guide to the Digital World

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

SOC architecture and design

OpenSPARC T1 Processor

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Next Generation GPU Architecture Code-named Fermi

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

Design of a High Speed Communications Link Using Field Programmable Gate Arrays

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Memory Architecture and Management in a NoC Platform

OC By Arsene Fansi T. POLIMI

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance

How To Design A Single Chip System Bus (Amba) For A Single Threaded Microprocessor (Mma) (I386) (Mmb) (Microprocessor) (Ai) (Bower) (Dmi) (Dual

UMBC. ISA is the oldest of all these and today s computers still have a ISA bus interface. in form of an ISA slot (connection) on the main board.

Microtronics technologies Mobile:

A case study of mobile SoC architecture design based on transaction-level modeling

SRAM Scaling Limit: Its Circuit & Architecture Solutions

What is a System on a Chip?

Computer System Design. System-on-Chip

SPARC64 VIIIfx: CPU for the K computer

Computer Systems Structure Main Memory Organization

Low-Overhead Hard Real-time Aware Interconnect Network Router

Module 2. Embedded Processors and Memory. Version 2 EE IIT, Kharagpur 1

How To Improve Performance On A P4080 Processor

Communicating with devices

On-Chip Interconnection Networks Low-Power Interconnect

- Nishad Nerurkar. - Aniket Mhatre

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

TECHNOLOGY BRIEF. Compaq RAID on a Chip Technology EXECUTIVE SUMMARY CONTENTS

Computer Systems Structure Input/Output

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Lecture 2: Computer Hardware and Ports.

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

ARM Microprocessor and ARM-Based Microcontrollers

Serial port interface for microcontroller embedded into integrated power meter

Motivation: Smartphone Market

SoC-Based Microcontroller Bus Design In High Bandwidth Embedded Applications

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

enabling Ultra-High Bandwidth Scalable SSDs with HLnand

Embedded System Hardware - Processing (Part II)

Chapter Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig I/O devices can be characterized by. I/O bus connections

Cloud-Based Apps Drive the Need for Frequency-Flexible Clock Generators in Converged Data Center Networks

Lesson 7: SYSTEM-ON. SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY. Chapter-1L07: "Embedded Systems - ", Raj Kamal, Publs.: McGraw-Hill Education

Continuous-Time Converter Architectures for Integrated Audio Processors: By Brian Trotter, Cirrus Logic, Inc. September 2008

Von der Hardware zur Software in FPGAs mit Embedded Prozessoren. Alexander Hahn Senior Field Application Engineer Lattice Semiconductor

How To Build A Cloud Computer

AMD PhenomII. Architecture for Multimedia System Prof. Cristina Silvano. Group Member: Nazanin Vahabi Kosar Tayebani

The Motherboard Chapter #5

Types Of Operating Systems

From Bus and Crossbar to Network-On-Chip. Arteris S.A.

Intel Labs at ISSCC Copyright Intel Corporation 2012

Input / Ouput devices. I/O Chapter 8. Goals & Constraints. Measures of Performance. Anatomy of a Disk Drive. Introduction - 8.1

8-Bit Flash Microcontroller for Smart Cards. AT89SCXXXXA Summary. Features. Description. Complete datasheet available under NDA

8051 hardware summary

Building Blocks for PRU Development

Parallel Programming Survey

Reconfigurable Computing. Reconfigurable Architectures. Chapter 3.2

Semiconductor Device Technology for Implementing System Solutions: Memory Modules

MICROPROCESSOR. Exclusive for IACE Students iacehyd.blogspot.in Ph: /422 Page 1

AMD Opteron Quad-Core

On-Chip Communications Network Report

PC Notebook Diagnostic Card

Introduction to System-on-Chip

Single Phase Two-Channel Interleaved PFC Operating in CrM

A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications

Performance Monitor Unit Design for an AXI-based Multi-Core SoC Platform

Implementing a Digital Answering Machine with a High-Speed 8-Bit Microcontroller

Chapter 2 Parallel Computer Architecture

Testing of Digital System-on- Chip (SoC)

ontroller LSI with Built-in High- Performance Graphic Functions for Automotive Applications

Single Phase Two-Channel Interleaved PFC Operating in CrM Using the MC56F82xxx Family of Digital Signal Controllers

8 Gbps CMOS interface for parallel fiber-optic interconnects

Software Programmable Data Allocation in Multi-Bank Memory of SIMD Processors

Architectures and Platforms

Hitachi Releases SuperH Mobile Application Processor SH-Mobile for optimum processing of multimedia applications for next-generation mobile phone

MirrorBit Technology: The Foundation for Value-Added Flash Memory Solutions FLASH FORWARD

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Chapter 3 ATM and Multimedia Traffic

85MIV2 / 85MIV2-L -- Components Locations

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Caching Mechanisms for Mobile and IOT Devices

DDR3 memory technology

How System Settings Impact PCIe SSD Performance

Accelerating Data Compression with Intel Multi-Core Processors

Operating Systems 4 th Class

Bus Data Acquisition and Remote Monitoring System Using Gsm & Can

Vorlesung Rechnerarchitektur 2 Seite 178 DASH

MPSoC Designs: Driving Memory and Storage Management IP to Critical Importance

Accelerate Cloud Computing with the Xilinx Zynq SoC

Low Power AMD Athlon 64 and AMD Opteron Processors

Transcription:

Chapter 2 Heterogeneous Multicore Architecture 2.1 Architecture Model In order to satisfy the high-performance and low-power requirements for advanced embedded systems with greater fl exibility, it is necessary to develop parallel processing on chips by taking advantage of the advances being made in semiconductor integration. Figure 2.1 illustrates the basic architecture of our heterogeneous multicore [1, 2 ]. Several low-power C cores and special purpose processor (SPP) cores, such as a digital signal processor, a media processor, and a dynamically recon fi gurable processor, are embedded on a chip. In the fi gure, the number of C cores is m. There are two types of SPP cores, SPP a and SPP b, on the chip. The values n and k represent the respective number of SPP a and SPP b cores. Each processor core includes a processing unit (), a local memory (), and a data transfer unit () as the main elements. The executes various kinds of operations. For example, in a C core, the includes arithmetic units, register fi les, a program counter, control logic, etc., and executes machine instructions. With some SPP cores like the dynamic recon fi gurable processor, the executes a large quantity of data in parallel using its array of arithmetic units. The is a small-size and low-latency memory and is mainly accessed by the in the same core during the s execution. Some cores may have caches as well as an or may only have caches without an. The is necessary to meet the real-time requirements of embedded systems. The access time to a cache is non-deterministic because of cache misses. On the other hand, the access to an is deterministic. By putting a program and data in the, we can accurately estimate the execution cycles of a program that has hard real-time requirements. A data transfer unit () is also embedded in the core to achieve parallel execution of internal operation in the core and data transfer operations between cores and memories. Each in a core processes the data on its or its cache, and the simultaneously executes memory-to-memory data transfer between cores. The is like a direct memory controller (DMAC) and executes a command that transfers data between several kinds of memories, then checks and waits for the end of the data transfer, etc. Some s are capable of K. Uchiyama et al., Heterogeneous Multicore Processor Technologies for Embedded Systems, DOI 10.1007/978-1-4614-0284-8_2, Springer Science+Business Media New York 2012 11

12 2 Heterogeneous Multicore Architecture Chip C #0 C #m SPPa #0 SPPa #n On-chip interconnect SPPb #0 SPPb #k On-chip shared memory (CSM) main memory Fig. 2.1 Heterogeneous multicore architecture command chaining, where multiple commands are executed in order. The frequency and voltage controller () connected to each core controls the frequency, voltage, and power supply of each core independently and reduces the total power consumption of the chip. If the frequencies or power supplies of the core s,, and can be independently controlled, the can vary their frequencies and power supplies individually. For example, the can stop the frequency of the and run the frequencies of the and when the core is executing only data transfers. The on-chip shared memory (CSM) is a medium-sized on-chip memory that is commonly used by cores. Each core is connected to the on-chip interconnect, which may be several types of buses or crossbar switches. The chip is also connected to the off-chip main memory, which has a large capacity but high latency. Figure 2.1 illustrates a typical model of a heterogeneous multicore architecture. A number of variations based on this architecture model are possible. Several variations of an structure are shown in Fig. 2.2. Case (a) is a hierarchical structure where the has two levels. 1 is a fi rst-level, small-size, low-latency local memory. 2 is a second-level, medium-sized, not-so-low-latency local memory. For example, the latency from the to 1 is one processor cycle, and the latency to 2 is a few processor cycles. Case (b) is a Harvard type. The is divided into an i that stores instructions and an d that stores data. The has an independent access path to each. This structure allows parallel accesses to instructions and data and enhances processing performance. Case (c) is a combination of (a) and (b). The i and d are fi rst-level local memories for instructions and data, respectively. 2 is a second-level local memory that stores both instructions and data. In each case, each is mapped on a different address area; that is, the accesses each with different addresses.

2.1 Architecture Model 13 Fig. 2.2 Structures of various local memories a b c 1 i d i d 2 Hierarchical Harvard 2 Hierarchical - Harvard C #0 C #m SPP #0 SPP #n On-chip bus (left) On-chip bus (right) main memory CSM l CSM r DMAC Fig. 2.3 Example of other heterogeneous multicore configurations In Fig. 2.3, we can see other con fi gurations of a, CSM,, and an on-chip interconnect. First, processor cores are divided into two clusters. The C cores, the CSM l, and the off-chip main memory are tightly connected in the left cluster. The SPP cores, CSM r, and the DMAC are nearly connected in the right cluster. Not every SPP core has a inside. Instead, the DMAC that has multiple channels is commonly used for data transfer between an and a memory outside an SPP core. For example, when data are transferred from an to the CSM r, the DMAC reads data in the via the right on-chip bus, and the data are written on the CSM r from the DMAC. We need two bus transactions for this data transfer. On the other hand, if a in a C core on the left cluster is used for the same transfer, data are read from an by the in the core, and the data are written on the CSM l via the on-chip bus by the. Only one transaction on the on-chip bus is necessary in this case, and the data transfer is more ef fi cient compared with the case using the off-core DMAC. Although each C core in the left cluster has an individual, the SPP cores in the right cluster share an. With this simpler con fi guration, all SPP cores operate at the same voltage and the same frequency, which are controlled simultaneously.

14 2 Heterogeneous Multicore Architecture Time C #0 Processing P1 Data Transfer T2 P8 W1 P11 C #1 P2 P5 P9 T1 T5 T8 SPPa #0 P3 T4 P7 T7 W2 SPPb #0 P4 T3 P6 T6 P10 Fig. 2.4 Parallel operation When a program is executed on a heterogeneous multicore, it is divided into small parts, and each is executed in parallel in the most suitable processor core, as shown in Fig. 2.4. Each core processes data on its or cache in a Pi period, and the of a core simultaneously executes a memory memory data transfer in a Ti period. For example, C #1 processes data on its at a P2 period, and its transfers processed data from the of C #1 to the of SPP b #0 at the T1 period. After the data transfer, SPP b #0 starts to process data on its at a P6 period. C #1 also starts a P5 process that overlaps with the T1 period. In the parallel operation of Fig. 2.4, there is a time slot like W1 when the corresponding core C #0 does not need to process or transfer data from the core. During this time slot, the frequencies of the and of C #0 can be slowed down or stopped, or their power supplies can be cut off by control of the connected. As there are no internal operations of SPP a #0 during the time slot W2, the power of SPP a #0 can be cut off during this time slot. This control reduces redundant power consumption of cores and can result in lowering the power consumption of a heterogeneous multicore chip. Here, we show an example of our architecture model applied to a heterogeneous multicore chip. Figure 2.5 is a photograph of the RP-X chip (see Sect. 4.4 ) [ 3 5 ]. Figure 2.6 depicts the internal block diagram. The chip includes eight C cores and seven three-type SPP cores. The C (see Sect. 3.1 ) includes a two-level as well as a 32-KB instruction cache and a 32-KB operand cache. The consists of a 16-KB ILRAM for instruction storage, a 16-KB OLRAM for data storage, and a 64-KB URAM for instruction and data storage. Each C has a local clock pulse generator () that corresponds to the and controls the C s clock frequency independently. The eight Cs are divided into two clusters. Each cluster of four Cs is connected to independent on-chip buses. Additionally, each cluster has a 256-KB CSM and a DDR3 port which is connected to off-chip DDR3 DRAMs.

2.1 Architecture Model 15 Fig. 2.5 Heterogeneous multicore chip C #3 C C #2 #2 C #1 #0 C #0 :16/16KB I/OLRAM:16/16KB DSM:64KB URAM:64KB CSM #0 256KB C #7 C C #6#2 C #1 #5 #0 C #4 :16/16KB I/OLRAM:16/16KB DSM:64KB URAM:64KB Local Clock Pulse Generator CSM #1 256KB DMA controller DMAC #1 DDR3 port#0 On-chip bus #0 V :300KB Video Processor Unit DDR3 DRAM DMAC #0 FE #0 :30KB On-chip bus #1 MX #0 :30KB DDR3 port #1 FE #1 MX #1 FE #2 FE #3 Matrix Processor Flexible Engine DDR3 DRAM Fig. 2.6 Block diagram of heterogeneous multicore chip Three types of SPPs are embedded on the chip. The fi rst SPP is a video processing unit (V, see Sect. 3.4 ) which is specialized for video processing such as MPEG-4 and H.264 codec. The V has a 300-KB and a built-in. The second and third SPPs are four fl exible engines (FEs, see Sect. 3.2 ), and two matrix processors (MXs, see Sect. 3.3 ), and they are included in another cluster. The FE is a dynamically recon fi gurable processor which is suitable for data-parallel processing such as

16 2 Heterogeneous Multicore Architecture digital signal processing. The FE has an internal 30-KB but does not have a. The on-chip DMA controller (DMAC) that can be used in common by on-chip units or a of another core is used to transfer data between the and other memories. The MX has 1,024-way single instruction multiple data (SIMD) architecture that is suitable for highly data-intensive processing such as video recognition. The MX has an internal 128-KB but does not have its, just as with the FE. In the chip photograph in Fig. 2.5, the upper-left island includes four Cs, and the lower-left island has the V with other blocks. The left cluster in Fig. 2.6 includes these left islands and a DDR3 port depicted at the lower-left side. The lower-right island in the photo in Fig. 2.5 includes another four Cs, the center-right island has four FEs, and the upper-right has two MXs. The right cluster in Fig. 2.6 includes these right islands and a DDR3 port depicted at the upper-right side. With these 15 on-chip heterogeneous cores, the chip can execute a wide variety of multimedia and digital-convergence applications at high-speed and low-power consumption. The details of the chip and its applications are described in Chaps. 4 6. 2.2 Address Space There are two types of address spaces de fi ned for a heterogeneous multicore chip. One is a public address space where all major memory resources on and off the chip are mapped and can be accessed by processor cores and DMA controllers in common. The other is a private address space where the addresses looked for from inside the processor core are de fi ned. The thread of a program on a processor core runs on the private address space of the processor core. The private address space of each processor core is de fi ned independently. Figure 2.7a shows a public address space of the heterogeneous multicore chip depicted in Fig. 2.1. The CSM, the s of C #0 to C #m, the s of SPP a #0 to SPP a #n, and the s of SPP b #0 to SPP b #k are mapped in the public address space, as well as the off-chip main memory. Each in each processor core can access the off-chip main memory, the CSM, and the s in the public address space and can transfer data between various kinds of memories. A private address space is independently de fi ned per processor core. The private addresses are generated by the of each processor core. For a C core, the address would be generated during the execution of a load or store instruction in the. Figure 2.7b, c shows examples of private address spaces of a C and SPP. The of the C core accesses data of the off-chip main memory, the CSM, and its own mapped on the private address space of Fig. 2.7b. If the of another processor core is not mapped on this private address space, the load/store instructions executed by the of the C core cannot access data on the other processor core s. Instead, the of the C core transfers data from the other processor core s to its own, the CSM, or the off-chip main memory using the public address space, and the accesses the data in its private address space. In the SPP example (Fig. 2.7c ), the of the SPP core can access only its own in this case. The data transfer

2.2 Address Space 17 Fig. 2.7 Public/private address spaces a b main memory main memory CSM (C #0) CSM (C #m) (SPPa #0) (SPPb #k) Other resources Public address space Private address space (C core) c Private address space (SPP core) Fig. 2.8 Private address space (Hierarchical Harvard) i d i d 2 2 Hierarchical-Harvard structure Private address space between its own and memories outside the core is done by its own on the public address space. The address mapping of a private address space varies according to the structure of the local memory. Figure 2.8 illustrates the case of the hierarchical Harvard structure of Fig. 2.2c. The i and d are fi rst-level local memories for instructions and data, respectively. The 2 is a second-level local memory that stores both instructions and data. The i, d, and 2 are mapped on different address areas in the private address space. The accesses each with different addresses. The size of the address spaces depends on the implementation of the heterogeneous multicore chip and its system. For example, a 40-bit address is assigned for a public address space, a 32-bit address for a C core s private address space, a 16-bit address for the SPP s private address space, and so on. In this case, the sizes of each space are 1 TB, 4 GB, and 64 KB, respectively. Concrete examples of this are described in Chaps. 3 and 4.

18 2 Heterogeneous Multicore Architecture References 1. Uchiyama K (2008) Power-ef fi cient heterogeneous parallelism for digital convergence, digest of technical papers of 2008 Symposium of VLSI circuits, Honolulu, USA, pp 6 9 2. Uchiyama K (2010) Power-ef fi cient heterogeneous multicore for digital convergence, Proceedings of 10th International Forum on Embedded MPSoC and Multicore, Gifu, Japan, pp 339 356 3. Yuyama Y, et al (2010) A 45 nm 37.3GOPS/W heterogeneous multi-core SoC, ISSCC Dig: 100 101 4. Nito T, et al (2010) A 45 nm heterogeneous multi-core SoC supporting an over 32-bits physical address space for digital appliance, COOL Chips XIII Proceedings, Session XI, no. 1 5. Arakawa F (2011) Low power multicore for embedded systems, CMOS Emerging Technology 2011, Session 5B, no. 1

http://www.springer.com/978-1-4614-0283-1