Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip




Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs)
Università di Verona, Dipartimento di Informatica

Outline
- MPSoCs: Multi-Processor Systems on Chip
- A simulation platform for a MPSoC architecture
- Power modeling
- Operating system
- Cache coherence in MPSoCs

Introduction
- High level of integration
- Very complex systems: Systems-on-Chip
- Need for high performance AND low energy consumption
- Need for analysis and exploration tools
  - Modeling and simulation accuracy is key for performance characterization
  - Impact of low-level architectural details
- The whole hardware and software architecture must be modelled
  - Software can make the difference

Multiprocessor Systems on Chip
- High level of integration
  - Increasing delays on clock distribution
  - A single clock domain is no longer feasible
- Use of third-party predesigned sub-systems (IP)
- High number of sub-systems that communicate with each other
  - Today: more than 100 processing elements on a chip
  - Tomorrow: more than 1000

A MPSoC Example: Nexperia DVP
- General-purpose scalable RISC processor (MIPS CPU PRxxxx, with D$ and I$)
  - 50 to 300+ MHz
  - 32-bit or 64-bit
- Library of device blocks
  - Image coprocessors, DSPs, UART, 1394, USB, and more
- Scalable VLIW media processor (TriMedia CPU TM-xxxx, with D$ and I$)
  - 100 to 300+ MHz
  - 32-bit or 64-bit
- Nexperia system buses: 32 to 128 bit
(Figure: Nexperia DVP block diagram. Recoverable labels: DVP system silicon, MIPS CPU, TriMedia CPU, PI, MMI, SDRAM, DVP memory.)

A New Paradigm: Network on Chip
- Communication becomes the key issue
- GALS: Globally Asynchronous, Locally Synchronous
- Many types of interconnects
  - Shared bus
  - Crossbar
  - Micro network

Traffic Modeling
- Stochastic traffic models
  - Analytical distributions
  - Easily parameterizable
- Trace-based models
  - Higher accuracy
  - Do not consider dynamic, traffic-dependent effects (e.g. inter-processor communication)
- Functional traffic
  - Traffic directly generated by running applications
  - May require OS support
(Figure: axis labels Complexity and Accuracy.)

Multiprocessor Simulation Platform
- Functional traffic needs to be simulated

Interconnections
- Shared bus
  - Low cost
  - Not scalable
  - Capacitance grows quickly
  - Energy consumption rises
- Crossbar
  - Parallel interconnections
  - High cost
- Micro network
  - Scalable
  - Complex

Interconnections: Shared Bus
- A single communication channel shared among all the devices
(Figure: devices 1 to 4 attached to one bus.)

Interconnections: Crossbar
- Many communication channels: simultaneous transfers are possible
(Figure: devices 1 to 4 connected through a crossbar.)

Interconnections: Micro Network
- High flexibility in topology and routing policy
(Figure: nodes 1 to 4 connected through a micro network.)
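The contrast between a shared bus and a crossbar can be made concrete with a small stochastic traffic model of the kind listed under Traffic Modeling. The sketch below is illustrative only and is not taken from the slides: the function name `simulate`, the Bernoulli request probability, the uniform choice of target, and the rule that ungranted requests are dropped rather than retried are all assumptions made to keep the example short.

```python
import random


def simulate(n_masters, n_slaves, p_request, cycles, seed=0):
    """Count transfers granted by a shared bus and by a crossbar.

    Assumed (hypothetical) traffic model: in every cycle each master issues
    a request with probability p_request to a uniformly chosen slave.
    Requests that are not granted are dropped, not queued.
    """
    rng = random.Random(seed)
    bus_granted = 0
    xbar_granted = 0
    for _ in range(cycles):
        # One target slave index per master that issues a request this cycle.
        targets = [rng.randrange(n_slaves)
                   for _ in range(n_masters)
                   if rng.random() < p_request]
        # Shared bus: a single channel, so at most one transfer per cycle.
        if targets:
            bus_granted += 1
        # Crossbar: one transfer per distinct slave; only requests that
        # collide on the same slave are serialized.
        xbar_granted += len(set(targets))
    return bus_granted, xbar_granted


if __name__ == "__main__":
    bus, xbar = simulate(n_masters=4, n_slaves=4, p_request=0.5, cycles=10_000)
    print(f"shared bus: {bus} transfers, crossbar: {xbar} transfers")
```

Raising `n_masters` in this toy model shows the bus saturating at one transfer per cycle while the crossbar keeps scaling until the masters collide on the same slave, which is the scalability argument the slides make qualitatively.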

Power Models
- Cycle-accurate simulations require cycle-accurate power models
- Each module must have its own power model
  - Processing elements (cores)
  - RAMs
  - Caches
  - Interconnect
  - Other specialized hardware

Power Models: Processing Elements
- Processing elements are modeled at instruction level
- ISS wrapped into a SystemC module
- Instruction-level power models
  - Energy consumption is evaluated at each cycle
  - Black-box approach
  - Based on foundry data

Power Models: RAMs and Caches
- Memories are arrays of transistors
- Data from the foundry is mandatory
- Extraction of a linear model by interpolation
  - E = A + B * Size
  - E = A + B * N_row + C * N_bit
- Coefficients depend on
  - Access type (READ/WRITE/IDLE)
  - Memory type (high-speed, low-power, etc.)

Power Models: Interconnection
- Interconnect is modeled at signal level
- Power modeling
  - From foundry data
  - By synthesizing and characterizing

Operating System
- Protection
  - Protect devices and critical memory areas from wrong usage
- Scheduling
  - Handle multitasking on each processing element
- Hardware masquerading
  - Offer a standard interface to the programmer

Operating System: Architecture
Layers, from top to bottom:
- Applications
- Application libraries: communication (MP), IO, synch, domain-specific computation
- Kernel services: process, communication, power management
- Device drivers: network interface, coprocessor and local memory management
- Hardware
Interfaces between layers: APIs, sys calls, HAL calls, memory accesses, special instructions
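The linear memory model E = A + B * Size described above can be extracted from a handful of characterization points with an ordinary least-squares fit. The following is a minimal sketch under stated assumptions: the function names and the sample points are hypothetical placeholders, not foundry data, and in practice one set of coefficients would be fitted per access type and per memory type.

```python
def fit_linear(sizes, energies):
    """Least-squares fit of E = A + B * size; returns (A, B)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(energies) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, energies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def access_energy(a, b, size):
    """Evaluate the fitted model for a memory of the given size."""
    return a + b * size


if __name__ == "__main__":
    # Hypothetical characterization points (memory size in bytes, energy per
    # READ access in arbitrary units). Placeholders for illustration only.
    sizes = [1024, 2048, 4096, 8192]
    energies = [12.048, 14.096, 18.192, 26.384]
    a, b = fit_linear(sizes, energies)
    print(f"A = {a:.3f}, B = {b:.6f}")
    print(f"estimated energy for a 16 KiB memory: {access_energy(a, b, 16384):.3f}")
```

The two-variable form E = A + B * N_row + C * N_bit can be fitted the same way with a multiple regression over rows and bits per row.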

Operating System: RTEMS
- Includes POSIX APIs
- Heterogeneous multi-processor support
- Multi-tasking support
- Inter-processor synchronization and communication primitives
  - rtems_message_queue_send
  - rtems_message_queue_receive

Cache Coherence 1
- Caches exploit software locality
  - Spatial locality
  - Temporal locality
- Cache-coherent architectures
  - Type of interconnect
    - Shared medium: snoop-based coherence
    - Non-shared interconnect: directory-based coherence
  - Cache policy
    - Write-invalidate
    - Write-update

Cache Coherence 2
- Handling shared data held by more than one processor
- What if processor 1 modifies X?
  - Processor 2 reads the data again
  - Processor 2 invalidates the cache line
  - Other?
(Figure: two caches, 1 and 2, each holding a copy of X, plus the copy of X in shared memory.)

Snoop Device
(Figure: snoop device interface. Recoverable labels: Invalidate/Update, Address and Data.)

Target Platform 1
- Configurable platform
  - Up to N cores
  - Shared and on-chip memories
  - Dedicated synchronization hardware
  - Different bus topologies
- Signal- and cycle-accurate simulations

Target Platform 2
- Real-time OS (RTEMS) ported
  - POSIX APIs
  - Multiprocessor support
  - Inter-processor synchronization and communication primitives
(Figure: recoverable labels SEM, Shared, INT.)
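The question raised on the cache coherence slides (what happens to the second copy of X when one processor modifies it) can be illustrated with a toy snoop-based, write-invalidate protocol. This is a simplified sketch and not the snoop device of the target platform: the class names `Bus` and `Cache`, the write-through policy, and the one-word cache lines are assumptions made for brevity.

```python
class Bus:
    """Shared medium: every cache snoops the invalidations placed on it."""

    def __init__(self, memory):
        self.memory = memory      # dict: address -> value (shared memory)
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_invalidate(self, source, addr):
        # Every cache except the writer drops its copy of the line.
        for cache in self.caches:
            if cache is not source:
                cache.snoop_invalidate(addr)


class Cache:
    """Write-through cache with one-word lines (assumed for simplicity)."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}           # address -> value; presence means valid
        self.misses = 0
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:
            # Miss: fetch the line from shared memory.
            self.misses += 1
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Write-invalidate: other copies are invalidated before the update.
        self.bus.broadcast_invalidate(self, addr)
        self.lines[addr] = value
        self.bus.memory[addr] = value   # write-through to shared memory

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)


if __name__ == "__main__":
    X = 0x10
    bus = Bus({X: 1})
    cache1, cache2 = Cache(bus), Cache(bus)
    cache1.read(X)
    cache2.read(X)                # both caches now hold X
    cache1.write(X, 7)            # processor 1 modifies X
    print(cache2.read(X), cache2.misses)
```

A write-update policy would differ only in `snoop_invalidate`: instead of dropping the line, the snooping cache would overwrite it with the new data taken from the bus.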

Target Platform 3
- I- and D-caches are modelled
- Hardware blocks for OS support: timers, IntCntrl
- ISS instantiated as a C++ class
  - No inter-process communication overhead
- Wrapper synchronizes the ISS with the system
  - The only core-specific block
(Figure: ARM7 core as a C++ class (SWARM) inside a SystemC module (wrapper) acting as bus master, with I$, D$, MMU, interrupt controller, timer and UART on a local bus.)

Energy Characterization
- To decide which optimization is likely to be most effective, it is essential to have quantitative data about the power breakdown over the various components
- Accuracy is affected by
  - Models
  - Chosen workload
    - Benchmarks vs. synthetic

The Benchmarks 1
- FILT5: a 5-processor digital filter
(Figure: FILT5 data flow over ARM-0 to ARM-4. Recoverable labels: in_sample 0, 2, 4 and in_sample 1, 3, 5 entering FFT stages; a FILTER stage; FFT^-1 stages producing out_sample 0, 2, 4 and out_sample 1, 3, 5.)

The Benchmarks 2
- FILT3-1: a 3+1-processor digital filter
(Figure: FILT3-1 data flow. Recoverable labels: in_sample 0, 1, 2 through FFT, FILTER and FFT^-1 on ARM-0, ARM-1 and ARM-2 to out_sample 0, 1, 2; in_sample 0, 1, 2 through FFT + FILTER + FFT^-1 on ARM-3 to out_sample 0, 1, 2.)

Results: Power Breakdown 1
(Figure: power breakdown for FILT3-1. Recoverable labels: 16, 2048.)

Results: Power Breakdown 2
(Figure: power breakdown for FILT5. Recoverable labels: 16, 2048; legend: ARM4-core, ARM4-cache, RAM5.)
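Both benchmarks distribute the same FFT, FILTER, FFT^-1 chain over the ARM cores. The sketch below runs that chain on a single block of samples to show what each stage computes. It is an illustration under assumptions: the slides do not say which filter is applied, so the ideal low-pass mask and the function names are hypothetical, and the recursive radix-2 FFT requires a power-of-two block length.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT (the FFT^-1 stage), including the 1/n scaling."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]


def lowpass(spectrum, keep):
    """FILTER stage (assumed): zero every bin above frequency index `keep`."""
    n = len(spectrum)
    return [v if (k <= keep or k >= n - keep) else 0j
            for k, v in enumerate(spectrum)]


def filter_block(samples, keep):
    """FFT -> FILTER -> FFT^-1 on one block of real samples."""
    return [v.real for v in ifft(lowpass(fft(samples), keep))]


if __name__ == "__main__":
    # A constant level of 1.0 plus a component alternating at the highest
    # representable frequency; the low-pass stage removes the latter.
    block = [1.0 + (-1.0) ** i for i in range(8)]
    print([round(v, 6) for v in filter_block(block, keep=1)])
```

In a pipelined mapping each of the three functions would run on its own core, with blocks handed from one stage to the next through inter-processor message queues.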

Results: Power Breakdown 3
(Figure: power in FILT5 with varying memory latency. Configurations: 1k - 1cyc, 1k - 4cyc; legend: ARM4-core, ARM4-cache, RAM5.)

Results: Power Breakdown 4
(Figure: power in FILT5 with varying cache size. Configurations: 1k - 1cyc, 2k - 1cyc, 4k - 1cyc; legend: ARM4-core, ARM4-cache, RAM5.)

Conclusions
- Caches are dominant
  - High cache-hit ratio due to software locality
  - Caches are high-speed memories
- Energy-aware applications
  - Require knowledge of the power breakdown
- The energy consumption depends on
  - Application features and operating conditions
  - System parameters
- A deep and accurate exploration is needed
  - Cycle-accurate simulations
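A power breakdown of the kind shown in the results can be collected by having every module charge its energy to a per-component counter at each simulated cycle, then normalizing at the end of the run. The sketch below shows only the bookkeeping. The class name `EnergyMeter`, the component names and the per-cycle costs are hypothetical placeholders in arbitrary units, not values from the presented experiments.

```python
from collections import defaultdict


class EnergyMeter:
    """Accumulates energy per component and reports the relative breakdown."""

    def __init__(self):
        self.energy = defaultdict(float)

    def charge(self, component, amount):
        # Called by a module's power model, typically once per cycle.
        self.energy[component] += amount

    def breakdown(self):
        # Fraction of the total energy spent in each component.
        total = sum(self.energy.values())
        if total == 0:
            return {}
        return {name: e / total for name, e in self.energy.items()}


if __name__ == "__main__":
    meter = EnergyMeter()
    # Hypothetical per-cycle costs in arbitrary units, for illustration only.
    for cycle in range(1000):
        meter.charge("core", 1.0)
        meter.charge("cache", 0.5)
        if cycle % 10 == 0:       # pretend every tenth cycle accesses RAM
            meter.charge("ram", 2.0)
    for name, share in meter.breakdown().items():
        print(f"{name}: {share:.1%}")
```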