Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs)
Università di Verona, Dipartimento di Informatica

Outline
- MPSoCs: Multi-Processor Systems on Chip
- A simulation platform for an MPSoC architecture
- Power modeling
- Operating system
- Cache coherence in MPSoCs

Introduction
- High level of integration
- Very complex systems: Systems-on-Chip
- Need for high performance AND low energy consumption
- Need for analysis and exploration tools
- Modeling and simulation accuracy is key for performance characterization
- Impact of low-level architectural details
- The whole hardware and software architecture must be modelled
- Software can make the difference

Multiprocessor Systems on Chip
- High level of integration
- Increasing delays on clock distribution
- A single clock domain is no longer feasible
- Use of third-party predesigned sub-systems
- High number of sub-systems that communicate
- Today: more than 100 processing elements on a chip
- Tomorrow: more than 1000

An MPSoC Example: Nexperia DVP
- General-purpose scalable RISC processor: 50 to 300+ MHz, 32-bit or 64-bit
- Library of device blocks: image coprocessors, DSPs, UART, 1394, USB, and more
- Scalable VLIW media processor: 100 to 300+ MHz, 32-bit or 64-bit
- Nexperia system buses: 32-128 bit
[Figure: DVP system silicon block diagram. Labels: MIPS CPU PRxxxx (D$, I$), TriMedia CPU TM-xxxx (D$, I$), PI buses, MMI, SDRAM, DVP memory.]

A New Paradigm: Network on Chip
- Communication becomes the key issue
- GALS: Globally Asynchronous Locally Synchronous
- Many types of interconnects
  - Shared bus
  - Crossbar
  - Micro network
Traffic Modeling
- Stochastic traffic models
  - Analytical distributions
  - Easily parameterizable
- Trace-based models
  - Higher accuracy
  - Do not consider dynamic traffic-dependent effects (e.g. inter-processor communication)
- Functional traffic
  - Traffic directly generated by running applications
  - May require OS support
- Need for simulation of functional traffic
[Figure axes: Complexity, Accuracy]

Multiprocessor Simulation Platform

Interconnections
- Shared Bus
  - Low cost
  - Not scalable
  - Capacitance grows quickly
  - Energy consumption rises
- Crossbar
  - Parallel interconnections
  - High cost
- Micro Network
  - Scalable
  - Complex

Interconnections: Shared Bus
- A single communication channel shared among all the devices
[Figure: devices 1 to 4 attached to one shared bus]

Interconnections: Crossbar
- Many communication channels: simultaneous transfers are possible
[Figure: crossbar connecting devices 1 to 4]

Interconnections: Micro Network
- High flexibility in topology and routing policy
[Figure: micro network connecting devices 1 to 4]
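The contrast between the shared bus and the crossbar in the interconnection slides can be illustrated with a small cycle count. This is only a sketch under stated assumptions, not the platform's interconnect model: every transfer is assumed to take one cycle, arbitration is assumed ideal, and the function names `shared_bus_cycles` and `crossbar_cycles` are hypothetical.

```python
from collections import Counter


def shared_bus_cycles(transfers):
    """Single shared channel: every transfer is serialized, one per cycle."""
    return len(transfers)


def crossbar_cycles(transfers):
    """Parallel channels: transfers overlap unless they share a port.

    Assumption: each source can issue one transfer per cycle and each
    destination can accept one per cycle, and the arbiter schedules
    ideally, so the busiest port sets the number of cycles.
    """
    if not transfers:
        return 0
    src_load = Counter(src for src, _ in transfers)
    dst_load = Counter(dst for _, dst in transfers)
    return max(max(src_load.values()), max(dst_load.values()))


# Two independent transfers: device 1 -> 3 and device 2 -> 4.
independent = [(1, 3), (2, 4)]
print(shared_bus_cycles(independent), crossbar_cycles(independent))  # prints: 2 1

# Three transfers that all target device 3: the crossbar gains nothing.
contended = [(1, 3), (2, 3), (4, 3)]
print(shared_bus_cycles(contended), crossbar_cycles(contended))  # prints: 3 3
```

The second case shows why the traffic pattern matters: parallel channels only help when transfers do not contend for the same endpoint.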
Power Models
- Cycle-accurate simulations require cycle-accurate power models
- Each module must have its power model
  - Processing elements (cores)
  - RAMs
  - Caches
  - Interconnect
  - Other specialized hardware

Power Models: Processing elements
- Processing elements are modeled at instruction level
- ISS wrapped into a SystemC module
- Instruction-level power models
  - Energy consumption is evaluated at each cycle
  - Black-box approach
  - Leverages foundry data

Power Models: RAMs and Caches
- Memories are arrays of transistors
- Data from the foundry is mandatory
- Extraction of a linear model by interpolation
  - $E = A + B \cdot \mathrm{Size}$
  - $E = A + B \cdot N_{row} + C \cdot N_{bit}$
- Coefficients depend on
  - Access type (READ/WRITE/IDLE)
  - Memory type (high-speed, low-power, etc.)

Power Models: Interconnection
- Interconnect is modeled at signal level
- Power modeling
  - From foundry data
  - Synthesizing and characterizing

Operating System
- Protection
  - Protect devices and critical memory areas from wrong usage
- Scheduling
  - Handle multitasking on each processing element
- Hardware masquerading
  - Offer a standard interface to the programmer

Operating System: Architecture
- Layers, top to bottom
  - Applications
  - Application libraries: communication (MP), IO, synch, domain-specific computation
  - Kernel services: process, communication, power management
  - Device drivers: network interface, coprocessor and local memory management
  - Hardware
- Interfaces between layers: APIs, sys calls, HAL calls, memory accesses, special instructions
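The extraction of the linear memory model $E = A + B \cdot \mathrm{Size}$ mentioned in the power model slides can be sketched as an ordinary least-squares fit, one per access type. The characterization points, the energy unit, and the names `fit_linear` and `access_energy` below are hypothetical placeholders, not foundry data or the platform's code.

```python
def fit_linear(sizes, energies):
    """Least-squares fit of E = A + B * size; returns (A, B)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(energies) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, energies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def access_energy(coeffs, access_type, size):
    """Evaluate the fitted model for one access of the given type."""
    a, b = coeffs[access_type]
    return a + b * size


# Hypothetical characterization points: (memory sizes in bytes, energy per access).
samples = {
    "READ": ([1024, 2048, 4096], [12.0, 14.0, 18.0]),
    "WRITE": ([1024, 2048, 4096], [15.0, 18.0, 24.0]),
}

# One (A, B) pair per access type, since the coefficients depend on it.
coeffs = {kind: fit_linear(sizes, energies)
          for kind, (sizes, energies) in samples.items()}

# Estimate the energy of a read in a memory size that was not characterized.
print(round(access_energy(coeffs, "READ", 8192), 3))
```

The two-variable form with $N_{row}$ and $N_{bit}$ follows the same idea with one more coefficient; a separate table per memory type would cover the high-speed and low-power variants.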
Operating System: RTEMS
- Includes POSIX APIs
- Heterogeneous multi-processor support
- Multi-tasking support
- Inter-processor synchronization and communication primitives
  - rtems_message_queue_send
  - rtems_message_queue_receive

Cache coherence - 1
- Exploits software locality
  - Spatial locality
  - Temporal locality
- Cache-coherent architectures
- Type of interconnect
  - Shared medium: snoop-based coherence
  - Non-shared interconnect: directory-based coherence
- Cache policy
  - Write-invalidate
  - Write-update

Cache coherence - 2
- Handling shared data owned by more than one processor
- What if processor 1 modifies X?
  - Processor 2 reads the data again
  - Processor 2 invalidates the cache line
  - Other?
[Figure: processors 1 and 2 each hold a cached copy of X; X also resides in shared memory]

Snoop Device
[Figure: snoop device interface. Labels: Invalidate/Update, Address and Data.]

Target platform 1
- Configurable platform
  - Up to N cores
  - Shared and on-chip memories
  - Dedicated synchronization hardware
  - Different bus topologies
- Signal- and cycle-accurate simulations

Target platform 2
- Real-Time OS (RTEMS) ported
  - POSIX APIs
  - Multiprocessor support
  - Interprocessor synchronization and communication primitives
[Figure labels: SEM, Shared, INT]
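The write-invalidate policy on a shared medium, as raised by the "what if processor 1 modifies X?" question in the cache coherence slides, can be sketched with two caches snooping one bus. This is a simplified model under loud assumptions: caches are write-through, each line holds a single value, and the classes `SnoopBus` and `Cache` are hypothetical names, not the platform's snoop device.

```python
class SnoopBus:
    """Shared medium: every cache sees the invalidations of the others."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_invalidate(self, origin, address):
        # Every cache except the writer snoops the address.
        for cache in self.caches:
            if cache is not origin:
                cache.snoop_invalidate(address)


class Cache:
    """Write-through cache with one value per line (assumption)."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        self.misses = 0
        bus.attach(self)

    def read(self, address):
        if address not in self.lines:
            # Miss: fetch the current value from shared memory.
            self.misses += 1
            self.lines[address] = self.bus.memory[address]
        return self.lines[address]

    def write(self, address, value):
        # Write-invalidate: other copies are dropped before the update.
        self.bus.broadcast_invalidate(self, address)
        self.lines[address] = value
        self.bus.memory[address] = value

    def snoop_invalidate(self, address):
        self.lines.pop(address, None)


bus = SnoopBus()
bus.memory["X"] = 1
cache1, cache2 = Cache(bus), Cache(bus)

cache1.read("X")
cache2.read("X")          # both processors now hold a copy of X
cache1.write("X", 2)      # processor 1 modifies X
print(cache2.read("X"))   # prints: 2
print(cache2.misses)      # prints: 2
```

Processor 2 pays one extra miss to see the new value. A write-update policy would instead push the new value into the other cache, trading that miss for more bus traffic on every write.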
Target platform 3
- I- and D-cache are modelled
- Hardware blocks for OS support: timers, IntCntrl
- ISS instantiated as a C++ class
  - No inter-process communication overhead
- Wrapper synchronizes the ISS with the system
  - The only core-specific block
[Figure: ARM7 core as a C++ class (SWARM) inside a SystemC module (wrapper) acting as bus master, with I$, D$, MMU, interrupt controller, timer, and UART on a local bus]

Energy Characterization
- To decide which optimization may be more effective, it is essential to have quantitative data about the power breakdown over the various components
- Accuracy is affected by
  - Models
  - Chosen workload: benchmarks vs. synthetic

The benchmarks 1
- FILT5: a 5-processor digital filter
[Figure: FILT5 dataflow. ARM-0 runs the FFT on in_sample 0, 2, 4 and ARM-1 runs it on in_sample 1, 3, 5; ARM-2 runs the FILTER; ARM-3 runs the inverse FFT producing out_sample 0, 2, 4 and ARM-4 runs it producing out_sample 1, 3, 5.]

The benchmarks 2
- FILT3-1: a 3+1-processor digital filter
[Figure: FILT3-1 dataflow. ARM-0 (FFT), ARM-1 (FILTER), and ARM-2 (inverse FFT) form a pipeline from in_sample 0, 1, 2 to out_sample 0, 1, 2; ARM-3 runs FFT + FILTER + inverse FFT on its own from in_sample 0, 1, 2 to out_sample 0, 1, 2.]

Results: Power Breakdown 1
[Figure: Power Breakdown for FILT3-1; axis from 16 to 2048]

Results: Power Breakdown 2
[Figure: Power Breakdown for FILT5; axis from 16 to 2048; legend includes ARM4-core, ARM4-cache, RAM5]
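The FFT + FILTER + inverse FFT chain that both benchmarks distribute across cores can be sketched on a single block of samples. The slides do not specify the filter or the transform implementation, so everything here is an assumption: a direct DFT stands in for the FFT, the filter is a hypothetical low-pass that zeroes high-frequency bins, and the names `dft`, `low_pass`, and `process_block` are illustrative.

```python
import cmath


def dft(samples, inverse=False):
    """Direct O(n^2) discrete Fourier transform, standing in for an FFT."""
    n = len(samples)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, x in enumerate(samples))
        # The inverse transform carries the 1/n normalization.
        out.append(acc / n if inverse else acc)
    return out


def low_pass(spectrum, keep):
    """Hypothetical filter: zero every bin whose frequency index exceeds keep."""
    n = len(spectrum)
    return [v if min(k, n - k) <= keep else 0 for k, v in enumerate(spectrum)]


def process_block(block, keep=1):
    """FFT -> FILTER -> inverse FFT on one block of real samples."""
    filtered = low_pass(dft(block), keep)
    return [v.real for v in dft(filtered, inverse=True)]


constant = [1.0] * 8          # only the zero-frequency bin is non-zero
alternating = [1.0, -1.0] * 4  # only the highest-frequency bin is non-zero

print([round(v, 6) for v in process_block(constant)])
print([abs(round(v, 6)) for v in process_block(alternating)])
```

In FILT3-1 the three stages would run on separate cores with blocks passed between them, while the fourth core runs `process_block` as a whole; FILT5 splits the transform stages across two cores each.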
Results: Power Breakdown 3
[Figure: Power in FILT5 varying memory latency; configurations 1k - 1cyc and 1k - 4cyc; legend includes ARM4-core, ARM4-cache, RAM5]

Results: Power Breakdown 4
[Figure: Power in FILT5 varying cache size; configurations 1k - 1cyc, 2k - 1cyc, 4k - 1cyc; legend includes ARM4-core, ARM4-cache, RAM5]

Conclusions
- Caches are dominant
  - High cache-hit ratio due to software locality
  - Caches are high-speed memories
- Energy-aware applications
  - Knowledge of the power breakdown
- The energy consumption depends on
  - Application features and operating conditions
  - System parameters
- Need for a deep and accurate exploration
  - Cycle-accurate simulations